Understanding Learning Rate in depth

I am trying to understand why a given learning rate does not work universally. I have two different data sets and have tested three learning rates: 0.001, 0.01 and 0.1. For the first data set, optimization with stochastic gradient descent converged for all three learning rates. For the second data set, the learning rate of 0.1 did not converge. I understand the logic behind it overshooting the minimum; however, I'm failing to understand why this …
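A minimal sketch of why a fixed step size can diverge, using a one-dimensional quadratic loss as a stand-in for the second data set (the function and its curvature are assumptions made for illustration, not the original data): gradient descent contracts toward the minimum only while |1 - lr * curvature| < 1, so the same learning rates can converge on one problem and blow up on another.

    import numpy as np

    def gd_on_quadratic(lr, steps=50, curvature=25.0):
        # Minimize f(w) = 0.5 * curvature * w**2; the gradient is curvature * w.
        w = 1.0
        for _ in range(steps):
            w -= lr * curvature * w  # each step multiplies w by (1 - lr * curvature)
        return w

    for lr in [0.001, 0.01, 0.1]:
        # With curvature 25, lr=0.1 gives a factor of 1 - 2.5 = -1.5 per step, so the
        # iterates oscillate with growing magnitude and diverge, while the two smaller
        # learning rates shrink w toward 0.
        print(lr, gd_on_quadratic(lr))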
Category: Data Science

Understanding SGD for Binary Cross-Entropy loss

I'm trying to describe mathematically how stochastic gradient descent can be used to minimize the binary cross-entropy loss. The typical description of SGD that I can find online is: $\theta = \theta - \eta \, \nabla_{\theta}J(\theta, x^{(i)}, y^{(i)})$ where $\theta$ is the parameter to optimize the objective function $J$ over, and $x$ and $y$ come from the training set; specifically, the superscript $(i)$ indicates the $i$-th observation from the training set. For the binary cross-entropy loss, I am using …
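For concreteness, a sketch of one such update under the assumption that the model is a single logistic unit, $p = \sigma(\theta^\top x)$ (an assumption; the question does not fix $f_\theta$). For the sigmoid link, the per-example gradient of the binary cross-entropy reduces to $(p - y)\,x$:

    import numpy as np

    def sgd_step_bce(theta, x_i, y_i, eta):
        # Model prediction: p = sigmoid(theta . x)
        p = 1.0 / (1.0 + np.exp(-np.dot(theta, x_i)))
        # Gradient of J = -[y log p + (1 - y) log(1 - p)] w.r.t. theta simplifies
        # to (p - y) * x when p is a sigmoid of theta . x.
        grad = (p - y_i) * x_i
        # One SGD update on the i-th observation.
        return theta - eta * grad

    theta = np.zeros(3)
    theta = sgd_step_bce(theta, x_i=np.array([1.0, 0.5, -0.2]), y_i=1.0, eta=0.1)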
Category: Data Science

ResNet: Derive the gradient matrices w.r.t. W1 and W2 and backprop equation in a Residual Network

How would I go about deriving, step by step, the gradient matrices (for stochastic gradient descent) w.r.t. $W_1$ and $W_2$ and the backpropagation equations for a residual block that is part of a larger ResNet, with forward propagation expressed as $$F(x) = W_2\, g_1(W_1 x)$$ $$y = g_2(F(x) + x)$$ where $g_1, g_2$ are component-wise non-linear activation functions?
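One way to set the derivation out, assuming a scalar loss $L(y)$ and introducing intermediate names $z_1 = W_1 x$, $a_1 = g_1(z_1)$, $z_2 = F(x) + x$ (notation added here for clarity; $\odot$ is the component-wise product):
$$
\begin{aligned}
\delta_2 &= \frac{\partial L}{\partial y} \odot g_2'(z_2), &
\frac{\partial L}{\partial W_2} &= \delta_2\, a_1^\top, \\
\delta_1 &= \bigl(W_2^\top \delta_2\bigr) \odot g_1'(z_1), &
\frac{\partial L}{\partial W_1} &= \delta_1\, x^\top, \\
\frac{\partial L}{\partial x} &= W_1^\top \delta_1 + \delta_2, &&
\end{aligned}
$$
where the extra $\delta_2$ term in $\partial L / \partial x$ is the gradient flowing through the identity skip connection.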
Category: Data Science

How does varying alpha change SGDRegressor behavior for outliers?

I am using SGDRegressor with a constant learning rate and the default loss function. I am curious how changing the alpha parameter from 0.0001 to 100 will change the regressor's behavior. Below is the sample code I have:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import SGDRegressor

    out = [(0, 2), (21, 13), (-23, -15), (22, 14), (23, 14)]
    alpha = [0.0001, 1, 100]
    N = len(out)
    plt.figure(figsize=(20, 15))
    j = 1
    for i in alpha:
        X = b * np.sin(phi)  # Since for every alpha we want to start with the original dataset, I included …
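To make the comparison self-contained, here is a sketch with a made-up one-dimensional data set containing a single outlier (the b/phi data from the question are not reconstructed): alpha is the L2 regularization strength in SGDRegressor, so larger values shrink the fitted coefficient toward zero regardless of the outlier, while very small values let the data, outlier included, dominate the fit.

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.RandomState(0)
    X = np.linspace(-3, 3, 50).reshape(-1, 1)
    y = 2.0 * X.ravel() + rng.normal(scale=0.1, size=50)
    y[-1] += 30.0  # one large outlier

    for a in [0.0001, 1, 100]:
        reg = SGDRegressor(alpha=a, learning_rate='constant', eta0=0.01,
                           max_iter=1000, random_state=0)
        reg.fit(X, y)
        # The coefficient shrinks as alpha grows; the outlier mostly affects
        # the weakly regularized fits.
        print(a, reg.coef_, reg.intercept_)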
Category: Data Science

Why does using gradient descent over stochastic gradient descent improve performance?

Currently, I'm running two types of logistic regression: logistic regression with SGD and logistic regression with GD, implemented as follows:

    SGD = SGDClassifier(loss="log", max_iter=1000, penalty='l1', alpha=0.001)
    logreg = LogisticRegression(solver='liblinear', max_iter=100, penalty='l1', C=0.1)

Never mind the hyperparameters, as I've used GridSearchCV and tried multiple combinations. When calculating accuracy, logistic regression with GD performs better than with SGD. I want to understand why this is the case: is using GD instead of SGD one way to mitigate an underfitting model?
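One way to make the comparison concrete is to fit both estimators on the same, identically scaled data (a sketch with synthetic data; scaling is included because SGDClassifier is sensitive to feature scale):

    from sklearn.datasets import make_classification
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import SGDClassifier, LogisticRegression

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X = StandardScaler().fit_transform(X)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # "log_loss" is the current name of the logistic loss ("log" in older scikit-learn).
    sgd = SGDClassifier(loss="log_loss", max_iter=1000, penalty='l1', alpha=0.001).fit(X_tr, y_tr)
    logreg = LogisticRegression(solver='liblinear', max_iter=100, penalty='l1', C=0.1).fit(X_tr, y_tr)

    # Both fit an l1-penalized logistic regression; they differ in the solver
    # (noisy per-sample updates vs. a batch method), so accuracy gaps usually come
    # from convergence and tuning rather than from one model having more capacity.
    print(sgd.score(X_te, y_te), logreg.score(X_te, y_te))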
Category: Data Science

How to compute constant c for PCA features before SGDClassifier as advised in Scikit documentation?

In the documentation for SGDClassifier here, it is stated: "If you apply SGD to features extracted using PCA we found that it is often wise to scale the feature values by some constant c such that the average L2 norm of the training data equals one." Given a dummy training dataset such as

    import numpy as np
    data = np.random.rand(3, 3)

how can I compute c and scale the feature values? I am using IncrementalPCA before SGDClassifier (loss=log). Should I …
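One reading of that advice (an interpretation, not something spelled out in the documentation) is to compute the mean L2 norm of the PCA-transformed training rows and take c as its reciprocal, so that the scaled rows have an average L2 norm of one:

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    data = np.random.rand(3, 3)

    pca = IncrementalPCA(n_components=2)
    features = pca.fit_transform(data)

    # Choose c so that mean(||c * x_i||_2) = c * mean(||x_i||_2) = 1.
    c = 1.0 / np.linalg.norm(features, axis=1).mean()
    scaled = c * features

    print(c, np.linalg.norm(scaled, axis=1).mean())  # the mean norm is now ~1.0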
Category: Data Science

Understanding the step of SGD for binary classification

I cannot understand the step of SGD for binary classification. For example, we have $y$, the true labels $\in \{0,1\}$, and $p = f_\theta(x)$, the predicted labels $\in [0,1]$. Then the SGD update step is the following: $\Theta' \leftarrow \Theta - \nu \frac{\partial L(y, f_\theta(x))}{\partial \Theta}$, where $L$ is the loss function. Then follows the replacement that I cannot understand: $\Theta' \leftarrow \Theta - \nu \left.\frac{\partial L(y,p)}{\partial p}\right|_{p=f_\theta(x)} \frac{\partial f_\theta(x)}{\partial \Theta}$. Why do we need to take the derivative with respect to $p$? Why …
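The replacement is just the chain rule: $L$ depends on $\Theta$ only through $p = f_\theta(x)$, so the derivative splits into "how the loss changes with the prediction" times "how the prediction changes with the parameters". As a concrete sketch, assuming the binary cross-entropy loss (an assumption, since the question does not fix $L$):
$$
\frac{\partial L(y, f_\theta(x))}{\partial \Theta}
= \left.\frac{\partial L(y,p)}{\partial p}\right|_{p=f_\theta(x)} \frac{\partial f_\theta(x)}{\partial \Theta},
\qquad
L(y,p) = -y\log p - (1-y)\log(1-p),
\qquad
\frac{\partial L}{\partial p} = \frac{p-y}{p(1-p)},
$$
and the remaining factor $\partial f_\theta(x) / \partial \Theta$ is whatever backpropagation through the model produces.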
Category: Data Science

Is the learning_rate linearly related to the time to converge when using Adam?

Say that both learning rates 1e-3 and 1e-4 lead to the same solution (neither too high nor too small). In terms of convergence measured in epochs, will optim.Adam(model.parameters(), lr=1e-4) take 10 times more epochs than optim.Adam(model.parameters(), lr=1e-3)? So if an optimizer with lr=1e-3 reached the solution at epoch 130, would an optimizer with lr=1e-4 theoretically get there at epoch 1300? I think my statement is true for vanilla SGD, but in Adam there's both momentum …
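One way to probe the claim empirically (a sketch with a toy linear model and an arbitrary loss threshold, both assumptions for illustration) is to count epochs until the loss drops below a fixed value for each learning rate:

    import torch

    def epochs_to_threshold(lr, threshold=1e-3, max_epochs=20000):
        torch.manual_seed(0)
        X = torch.randn(256, 10)
        y = X @ torch.randn(10, 1)
        model = torch.nn.Linear(10, 1)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for epoch in range(max_epochs):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(X), y)
            loss.backward()
            opt.step()
            if loss.item() < threshold:
                return epoch
        return max_epochs

    # If time-to-converge scaled exactly like 1/lr, the second count would be ~10x
    # the first; Adam's momentum and per-parameter scaling generally break that.
    print(epochs_to_threshold(1e-3), epochs_to_threshold(1e-4))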
Category: Data Science

Multiple models have extreme differences during evaluation

My dataset has about 100k entries, 6 features, and the label is simple binary classification (about 65% zeros, 35% ones). When I train my dataset on different models (random forest, decision tree, extra trees, k-nearest neighbors, logistic regression, SGD, dense neural networks, etc.), the evaluations differ GREATLY from model to model: tree classifiers get about 80% for both accuracy and precision; k-nearest neighbors gets 56% accuracy and 36% precision; linear SVM gets 65% accuracy with no positives predicted; SGD gets 63% accuracy and …
Category: Data Science

How exactly do you implement SGD with momentum?

I am looking up sources to implement SGD with momentum, but they are giving me different equations (beta is the momentum hyper-parameter, weights[l] is the matrix of weights for layer l, gradients[l] are the gradients for layer l, etc.). This source gives:

    v[l] = beta*v[l] - learning_rate*gradients[l]
    weights[l] = weights[l] + v[l]

But this source gives:

    v[l] = beta*v[l] + learning_rate*gradients[l]
    weights[l] = weights[l] - v[l]

Are they equivalent? Also, does it matter if beta + learning_rate != 1? (In …
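A minimal sketch of both variants side by side (plain NumPy arrays, names shortened from the question): starting from v = 0, the two recursions track velocities of opposite sign, so they produce identical weight trajectories for any beta and learning rate; in particular, nothing requires beta + learning_rate == 1.

    import numpy as np

    def step_a(v, w, g, beta, lr):
        # Form 1: velocity accumulates the negative gradient, then is added.
        v = beta * v - lr * g
        return v, w + v

    def step_b(v, w, g, beta, lr):
        # Form 2: velocity accumulates the positive gradient, then is subtracted.
        v = beta * v + lr * g
        return v, w - v

    w_a = w_b = np.array([1.0, -2.0])
    v_a = v_b = np.zeros(2)
    for g in [np.array([0.3, -0.1]), np.array([0.2, 0.4]), np.array([-0.5, 0.1])]:
        v_a, w_a = step_a(v_a, w_a, g, beta=0.9, lr=0.1)
        v_b, w_b = step_b(v_b, w_b, g, beta=0.9, lr=0.1)
    print(np.allclose(w_a, w_b))  # True: v_b stays equal to -v_a throughout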
Category: Data Science

Can't use the SGD optimizer

I am using the following code:

    from tensorflow.keras.regularizers import l2
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Add, Conv2D, MaxPooling2D, Dropout, Flatten, Dense, BatchNormalization, Activation
    from tensorflow.keras import activations

    CNN_model = Sequential()

    # The First Block
    CNN_model.add(Conv2D(128, kernel_size=2, kernel_initializer='he_uniform', kernel_regularizer=l2(0.0005), padding='same', input_shape=(700, 460, 3)))
    CNN_model.add(Activation(activations.relu))
    CNN_model.add(BatchNormalization())
    CNN_model.add(MaxPooling2D(2, 2))

    # The Second Block
    CNN_model.add(Conv2D(128, kernel_size=2, kernel_initializer='he_uniform', kernel_regularizer=l2(0.0005), padding='same'))
    CNN_model.add(Activation(activations.relu))
    CNN_model.add(BatchNormalization())
    CNN_model.add(MaxPooling2D(2, 2))

    # The Third Block
    CNN_model.add(Conv2D(128, kernel_size=2, kernel_initializer='he_uniform', kernel_regularizer=l2(0.0005), padding='same'))
    CNN_model.add(Activation(activations.relu))
    CNN_model.add(BatchNormalization())
    CNN_model.add(MaxPooling2D(2, 2))

    # The fourth Block
    CNN_model.add(Conv2D(128, kernel_size=2, kernel_initializer='he_uniform', …
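The question is cut off before the actual error, but for reference, a minimal sketch of where the SGD optimizer comes from in tf.keras and how it is passed to compile (the tiny stand-in model is only for illustration):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import SGD

    # Trivial stand-in model; the point is only the import path and the compile() call.
    model = Sequential([Dense(1, activation='sigmoid', input_shape=(10,))])
    model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])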
Category: Data Science

Estimating an RBF kernel SVM, followed by stochastic gradient descent

I want to estimate an RBF SVM to predict property prices. My data set has 11 features and roughly 57,000 rows. When I set C=10, R^2 is about 0.88, while MSE and RMSE are 0.1191 and 0.3451. The results are pretty good. Afterward, I estimate an SGD model using linear_model.SGDRegressor with loss='squared_epsilon_insensitive'. When I use the adaptive learning rate, R^2 drops to 0.75 while MSE and RMSE are 0.2441 and 0.4940, respectively. When I use the optimal learning rate, the results are even …
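For reference, a sketch of the two SGDRegressor configurations being compared, on synthetic data (the property-price data are not reproduced here):

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 11)
    y = X @ rng.randn(11) + rng.normal(scale=0.1, size=1000)

    # 'adaptive' keeps eta0 while the training loss keeps improving, then divides it by 5;
    # 'optimal' uses the heuristic schedule eta = 1 / (alpha * (t + t0)).
    adaptive = SGDRegressor(loss='squared_epsilon_insensitive',
                            learning_rate='adaptive', eta0=0.01).fit(X, y)
    optimal = SGDRegressor(loss='squared_epsilon_insensitive',
                           learning_rate='optimal').fit(X, y)
    print(adaptive.score(X, y), optimal.score(X, y))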
Topic: sgd rbf svm
Category: Data Science

Stochastic Gradient Region of Confusion

I have come across the following diagram, which explains the behavior of SGD graphically. Based on this graphical representation, the gradients of the individual data points tend to fluctuate more in direction when close to the optimum, whereas far away from the optimum they tend to point towards it. My question is: doesn't this depend on how we select the points randomly? For example, let's say we first find the gradient of the curve F3 and find that it …
Category: Data Science

Learning rate of 0 still changes weights in Keras

I just trained a model (SGD) with Keras and was wondering why the change in accuracy and loss from epoch to epoch doesn't really decrease that much when I lower the learning rate. So I tested what happens when I set the learning rate to 0 and, to my surprise, accuracy and loss still changed from epoch to epoch, and I can't find an explanation for that. Does anyone know why this could be happening?
Category: Data Science

Changing the batch size during training

The choice of batch size is, in some sense, a measure of stochasticity: on the one hand, smaller batch sizes make gradient descent more stochastic; SGD can deviate significantly from exact GD on the whole data, but this allows more exploration and performs, in some sense, a Bayesian inference. Larger batch sizes approximate the exact gradient better, but one is then more likely to overfit the data or get stuck in a local optimum. Processing …
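In Keras, for instance, the batch size is an argument of fit rather than of the model, so nothing prevents continuing the same training run with a different batch size (a sketch with a toy model and random data):

    import numpy as np
    from tensorflow import keras

    X = np.random.rand(1024, 8)
    y = np.random.rand(1024, 1)

    model = keras.Sequential([keras.layers.Dense(1, input_shape=(8,))])
    model.compile(optimizer='sgd', loss='mse')

    # Start with small, noisy batches (more stochastic, more exploration)...
    model.fit(X, y, batch_size=16, epochs=5, verbose=0)
    # ...then continue with large batches for a closer approximation of the full gradient.
    model.fit(X, y, batch_size=256, epochs=5, verbose=0)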
Category: Data Science

Input shape of a Keras Sequential model

I am new to neural networks with Keras. My training samples have input shape (150528, 1235) and output shape (154457, 1235), where 1235 is the number of training examples. How do I set the input shape? I tried the code below, but it gave me a ValueError:

    Data cardinality is ambiguous:
      x sizes: 150528
      y sizes: 154457
    Please provide data which shares the same first dimension.

Code:

    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Activation, …
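Keras expects the first axis of both x and y to be the sample axis, so with 1235 examples both arrays should have 1235 as their first dimension. A sketch assuming the arrays simply need transposing (small stand-in shapes; in the question they would be 150528, 154457 and 1235):

    import numpy as np
    from tensorflow import keras

    n_features_in, n_features_out, n_samples = 100, 120, 32

    x = np.random.rand(n_features_in, n_samples)   # (features, samples), as in the question
    y = np.random.rand(n_features_out, n_samples)

    # Transpose so that samples come first and both arrays share their first dimension.
    x, y = x.T, y.T   # x: (n_samples, n_features_in), y: (n_samples, n_features_out)

    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(n_features_in,)),
        keras.layers.Dense(n_features_out),
    ])
    model.compile(optimizer='sgd', loss='mse')
    model.fit(x, y, batch_size=8, epochs=1, verbose=0)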
Topic: sgd mse keras
Category: Data Science

Problem of multi-class classification (sklearn TfidfVectorizer and SGDClassifier)

I am doing (text) topic classification using TfidfVectorizer and SGDClassifier; essentially, I want to classify websites into categories (like Sport, Business, etc.). Now, the problem is that each website might fit into multiple categories, e.g. (IT and Eshop), (Sport and Eshop), etc. The question has two parts: one is how to do it technically (Python, scikit-learn), the other is theoretical. Let me explain the latter. As far as I understand how text classification works, it cannot …
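On the technical half of the question, one common scikit-learn approach (named here as a suggestion, not taken from the question) is to binarize the label sets and wrap the classifier in OneVsRestClassifier, so a separate SGDClassifier decides membership in each category. A minimal sketch with made-up documents and labels:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    docs = ["cheap football boots in our shop",
            "live scores and match reports",
            "enterprise cloud hosting plans"]
    labels = [["Sport", "Eshop"], ["Sport"], ["IT"]]

    # Turn the label sets into a binary indicator matrix, one column per category.
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)

    # One binary SGDClassifier per category: "belongs" vs. "does not belong".
    clf = make_pipeline(TfidfVectorizer(),
                        OneVsRestClassifier(SGDClassifier(loss="log_loss")))
    clf.fit(docs, Y)
    print(mlb.inverse_transform(clf.predict(["new football shirts for sale"])))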
Category: Data Science

Confused between optimizer and loss function

I always thought SGD was a loss function; then I read this in a notebook:

    model.compile(loss="sparse_categorical_crossentropy", optimizer=keras.optimizers.SGD(lr=1e-3), metrics=["accuracy"])

Now I am confused: what's the difference between a loss and an optimizer? Are they both used at the output layer to calculate the loss, or is the optimizer something used in each layer?
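One way to see the split is to write a single training step out by hand (a sketch of what fit does internally, with a toy model): the loss is the scalar computed from the outputs and targets, while the optimizer is the update rule applied to every trainable weight in every layer.

    import tensorflow as tf
    from tensorflow import keras

    model = keras.Sequential([keras.layers.Dense(10, activation="softmax", input_shape=(20,))])
    loss_fn = keras.losses.SparseCategoricalCrossentropy()  # the loss: a number measured at the output
    optimizer = keras.optimizers.SGD(learning_rate=1e-3)    # the optimizer: how all weights get updated

    x = tf.random.normal((32, 20))
    y = tf.random.uniform((32,), maxval=10, dtype=tf.int32)

    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x))
    grads = tape.gradient(loss, model.trainable_variables)        # gradients for every layer's weights
    optimizer.apply_gradients(zip(grads, model.trainable_variables))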
Category: Data Science

The central idea behind SGD

Prof. Hinton, in his popular course on Coursera, refers to the following fact: Rprop doesn't really work when we have very large datasets and need to perform mini-batch weight updates. Why doesn't it work with mini-batches? Well, people have tried it, but found it hard to make it work. The reason it doesn't work is that it violates the central idea behind stochastic gradient descent, which is that when we have a small enough learning rate, it averages the gradients over successive …
Category: Data Science
