Understanding Learning Rate in depth

I am trying to understand why a given learning rate does not work universally. I have two different data sets and have tested three learning rates: 0.001, 0.01 and 0.1. For the first data set, optimization with stochastic gradient descent converged for all three learning rates. For the second data set, the learning rate of 0.1 did not converge. I understand the logic behind it overshooting the minimum; however, I'm failing to understand why this …
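A minimal sketch of why a fixed step size can diverge, using a one-dimensional quadratic loss as a stand-in for the second data set (the function and its curvature are assumptions made for illustration, not the original data): gradient descent contracts toward the minimum only while |1 - lr * curvature| < 1, so the same learning rates can converge on one problem and blow up on another.

    import numpy as np

    def gd_on_quadratic(lr, steps=50, curvature=25.0):
        # Minimize f(w) = 0.5 * curvature * w**2; the gradient is curvature * w.
        w = 1.0
        for _ in range(steps):
            w -= lr * curvature * w  # each step multiplies w by (1 - lr * curvature)
        return w

    for lr in [0.001, 0.01, 0.1]:
        # With curvature 25, lr=0.1 gives a factor of 1 - 2.5 = -1.5 per step, so the
        # iterates oscillate with growing magnitude and diverge, while the two smaller
        # learning rates shrink w toward 0.
        print(lr, gd_on_quadratic(lr))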
Category: Data Science

Understanding SGD for Binary Cross-Entropy loss

I'm trying to describe mathematically how stochastic gradient descent can be used to minimize the binary cross-entropy loss. The typical description of SGD that I can find online is: $\theta = \theta - \eta \, \nabla_{\theta}J(\theta, x^{(i)}, y^{(i)})$ where $\theta$ is the parameter to optimize the objective function $J$ over, and $x$ and $y$ come from the training set; specifically, the superscript $(i)$ indicates the $i$-th observation from the training set. For the binary cross-entropy loss, I am using …
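For concreteness, a sketch of one such update under the assumption that the model is a single logistic unit, $p = \sigma(\theta^\top x)$ (an assumption; the question does not fix $f_\theta$). For the sigmoid link, the per-example gradient of the binary cross-entropy reduces to $(p - y)\,x$:

    import numpy as np

    def sgd_step_bce(theta, x_i, y_i, eta):
        # Model prediction: p = sigmoid(theta . x)
        p = 1.0 / (1.0 + np.exp(-np.dot(theta, x_i)))
        # Gradient of J = -[y log p + (1 - y) log(1 - p)] w.r.t. theta simplifies
        # to (p - y) * x when p is a sigmoid of theta . x.
        grad = (p - y_i) * x_i
        # One SGD update on the i-th observation.
        return theta - eta * grad

    theta = np.zeros(3)
    theta = sgd_step_bce(theta, x_i=np.array([1.0, 0.5, -0.2]), y_i=1.0, eta=0.1)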
Category: Data Science

ResNet: Derive the gradient matrices w.r.t. W1 and W2 and backprop equation in a Residual Network

How would I go about deriving, step by step, the gradient matrices (for stochastic gradient descent) w.r.t. $W_1$ and $W_2$ and the backpropagation equations for a residual block that is part of a larger ResNet, with forward propagation expressed as $$F(x) = W_2\, g_1(W_1 x)$$ $$y = g_2(F(x) + x)$$ where $g_1, g_2$ are component-wise non-linear activation functions?
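One way to set the derivation out, assuming a scalar loss $L(y)$ and introducing intermediate names $z_1 = W_1 x$, $a_1 = g_1(z_1)$, $z_2 = F(x) + x$ (notation added here for clarity; $\odot$ is the component-wise product):
$$
\begin{aligned}
\delta_2 &= \frac{\partial L}{\partial y} \odot g_2'(z_2), &
\frac{\partial L}{\partial W_2} &= \delta_2\, a_1^\top, \\
\delta_1 &= \bigl(W_2^\top \delta_2\bigr) \odot g_1'(z_1), &
\frac{\partial L}{\partial W_1} &= \delta_1\, x^\top, \\
\frac{\partial L}{\partial x} &= W_1^\top \delta_1 + \delta_2, &&
\end{aligned}
$$
where the extra $\delta_2$ term in $\partial L / \partial x$ is the gradient flowing through the identity skip connection.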
Category: Data Science

How does varying alpha change SGDRegressor behavior for outliers?

I am using SGDRegressor with a constant learning rate and the default loss function. I am curious how changing the alpha parameter from 0.0001 to 100 will change the regressor's behavior. Below is the sample code I have:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import SGDRegressor

    out = [(0, 2), (21, 13), (-23, -15), (22, 14), (23, 14)]
    alpha = [0.0001, 1, 100]
    N = len(out)
    plt.figure(figsize=(20, 15))
    j = 1
    for i in alpha:
        X = b * np.sin(phi)  # Since for every alpha we want to start with the original dataset, I included …
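To make the comparison self-contained, here is a sketch with a made-up one-dimensional data set containing a single outlier (the b/phi data from the question are not reconstructed): alpha is the L2 regularization strength in SGDRegressor, so larger values shrink the fitted coefficient toward zero regardless of the outlier, while very small values let the data, outlier included, dominate the fit.

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.RandomState(0)
    X = np.linspace(-3, 3, 50).reshape(-1, 1)
    y = 2.0 * X.ravel() + rng.normal(scale=0.1, size=50)
    y[-1] += 30.0  # one large outlier

    for a in [0.0001, 1, 100]:
        reg = SGDRegressor(alpha=a, learning_rate='constant', eta0=0.01,
                           max_iter=1000, random_state=0)
        reg.fit(X, y)
        # The coefficient shrinks as alpha grows; the outlier mostly affects
        # the weakly regularized fits.
        print(a, reg.coef_, reg.intercept_)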
Category: Data Science

Why does using gradient descent over stochastic gradient descent improve performance?

Currently, I'm running two types of logistic regression: logistic regression with SGD and logistic regression with GD, implemented as follows:

    SGD = SGDClassifier(loss="log", max_iter=1000, penalty='l1', alpha=0.001)
    logreg = LogisticRegression(solver='liblinear', max_iter=100, penalty='l1', C=0.1)

Never mind the hyperparameters, as I've used GridSearchCV and tried multiple combinations. When calculating accuracy, logistic regression with GD performs better than with SGD. I want to understand why this is the case: is using GD instead of SGD one way to mitigate an underfitting model?
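One way to make the comparison concrete is to fit both estimators on the same, identically scaled data (a sketch with synthetic data; scaling is included because SGDClassifier is sensitive to feature scale):

    from sklearn.datasets import make_classification
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import SGDClassifier, LogisticRegression

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X = StandardScaler().fit_transform(X)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # "log_loss" is the current name of the logistic loss ("log" in older scikit-learn).
    sgd = SGDClassifier(loss="log_loss", max_iter=1000, penalty='l1', alpha=0.001).fit(X_tr, y_tr)
    logreg = LogisticRegression(solver='liblinear', max_iter=100, penalty='l1', C=0.1).fit(X_tr, y_tr)

    # Both fit an l1-penalized logistic regression; they differ in the solver
    # (noisy per-sample updates vs. a batch method), so accuracy gaps usually come
    # from convergence and tuning rather than from one model having more capacity.
    print(sgd.score(X_te, y_te), logreg.score(X_te, y_te))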
Category: Data Science

How to compute constant c for PCA features before SGDClassifier as advised in Scikit documentation?

In the documentation for SGDClassifier here, it is stated: "If you apply SGD to features extracted using PCA we found that it is often wise to scale the feature values by some constant c such that the average L2 norm of the training data equals one." Given a dummy training dataset such as

    import numpy as np
    data = np.random.rand(3, 3)

how can I compute c and scale the feature values? I am using IncrementalPCA before SGDClassifier (loss=log). Should I …
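One reading of that advice (an interpretation, not something spelled out in the documentation) is to compute the mean L2 norm of the PCA-transformed training rows and take c as its reciprocal, so that the scaled rows have an average L2 norm of one:

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    data = np.random.rand(3, 3)

    pca = IncrementalPCA(n_components=2)
    features = pca.fit_transform(data)

    # Choose c so that mean(||c * x_i||_2) = c * mean(||x_i||_2) = 1.
    c = 1.0 / np.linalg.norm(features, axis=1).mean()
    scaled = c * features

    print(c, np.linalg.norm(scaled, axis=1).mean())  # the mean norm is now ~1.0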
Category: Data Science

Understanding the step of SGD for binary classification

I cannot understand the step of SGD for binary classification. For example, we have $y$, the true labels $\in \{0,1\}$, and $p = f_\theta(x)$, the predicted labels $\in [0,1]$. Then the SGD update step is the following: $\Theta' \leftarrow \Theta - \nu \frac{\partial L(y, f_\theta(x))}{\partial \Theta}$, where $L$ is the loss function. Then follows the replacement that I cannot understand: $\Theta' \leftarrow \Theta - \nu \left.\frac{\partial L(y,p)}{\partial p}\right|_{p=f_\theta(x)} \frac{\partial f_\theta(x)}{\partial \Theta}$. Why do we need to take the derivative with respect to $p$? Why …
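The replacement is just the chain rule: $L$ depends on $\Theta$ only through $p = f_\theta(x)$, so the derivative splits into "how the loss changes with the prediction" times "how the prediction changes with the parameters". As a concrete sketch, assuming the binary cross-entropy loss (an assumption, since the question does not fix $L$):
$$
\frac{\partial L(y, f_\theta(x))}{\partial \Theta}
= \left.\frac{\partial L(y,p)}{\partial p}\right|_{p=f_\theta(x)} \frac{\partial f_\theta(x)}{\partial \Theta},
\qquad
L(y,p) = -y\log p - (1-y)\log(1-p),
\qquad
\frac{\partial L}{\partial p} = \frac{p-y}{p(1-p)},
$$
and the remaining factor $\partial f_\theta(x) / \partial \Theta$ is whatever backpropagation through the model produces.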
Category: Data Science

Is the learning_rate linearly related to the time to converge when using Adam?

Say that both learning rates 1e-3 and 1e-4 lead to the same solution (neither too high nor too small). In terms of convergence measured in epochs, will optim.Adam(model.parameters(), lr=1e-4) take 10 times more epochs than optim.Adam(model.parameters(), lr=1e-3)? So if an optimizer with lr=1e-3 reached the solution at epoch 130, would an optimizer with lr=1e-4 theoretically get there at epoch 1300? I think my statement is true for vanilla SGD, but in Adam there's both momentum …
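One way to probe the claim empirically (a sketch with a toy linear model and an arbitrary loss threshold, both assumptions for illustration) is to count epochs until the loss drops below a fixed value for each learning rate:

    import torch

    def epochs_to_threshold(lr, threshold=1e-3, max_epochs=20000):
        torch.manual_seed(0)
        X = torch.randn(256, 10)
        y = X @ torch.randn(10, 1)
        model = torch.nn.Linear(10, 1)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for epoch in range(max_epochs):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(X), y)
            loss.backward()
            opt.step()
            if loss.item() < threshold:
                return epoch
        return max_epochs

    # If time-to-converge scaled exactly like 1/lr, the second count would be ~10x
    # the first; Adam's momentum and per-parameter scaling generally break that.
    print(epochs_to_threshold(1e-3), epochs_to_threshold(1e-4))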
Category: Data Science

Multiple models have extreme differences during evaluation

My dataset has about 100k entries, 6 features, and the label is simple binary classification (about 65% zeros, 35% ones). When I train my dataset on different models (random forest, decision tree, extra trees, k-nearest neighbors, logistic regression, SGD, dense neural networks, etc.), the evaluations differ GREATLY from model to model: tree classifiers get about 80% for both accuracy and precision; k-nearest neighbors gets 56% accuracy and 36% precision; linear SVM gets 65% accuracy with no positives predicted; SGD gets 63% accuracy and …
Category: Data Science

How exactly do you implement SGD with momentum?

I am looking up sources to implement SGD with momentum, but they are giving me different equations (beta is the momentum hyper-parameter, weights[l] is the matrix of weights for layer l, gradients[l] are the gradients for layer l, etc.). This source gives:

    v[l] = beta*v[l] - learning_rate*gradients[l]
    weights[l] = weights[l] + v[l]

But this source gives:

    v[l] = beta*v[l] + learning_rate*gradients[l]
    weights[l] = weights[l] - v[l]

Are they equivalent? Also, does it matter if beta + learning_rate != 1? (In …
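A minimal sketch of both variants side by side (plain NumPy arrays, names shortened from the question): starting from v = 0, the two recursions track velocities of opposite sign, so they produce identical weight trajectories for any beta and learning rate; in particular, nothing requires beta + learning_rate == 1.

    import numpy as np

    def step_a(v, w, g, beta, lr):
        # Form 1: velocity accumulates the negative gradient, then is added.
        v = beta * v - lr * g
        return v, w + v

    def step_b(v, w, g, beta, lr):
        # Form 2: velocity accumulates the positive gradient, then is subtracted.
        v = beta * v + lr * g
        return v, w - v

    w_a = w_b = np.array([1.0, -2.0])
    v_a = v_b = np.zeros(2)
    for g in [np.array([0.3, -0.1]), np.array([0.2, 0.4]), np.array([-0.5, 0.1])]:
        v_a, w_a = step_a(v_a, w_a, g, beta=0.9, lr=0.1)
        v_b, w_b = step_b(v_b, w_b, g, beta=0.9, lr=0.1)
    print(np.allclose(w_a, w_b))  # True: v_b stays equal to -v_a throughout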
Category: Data Science

Can't use the SGD optimizer

I am using the following code:

    from tensorflow.keras.regularizers import l2
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Add, Conv2D, MaxPooling2D, Dropout, Flatten, Dense, BatchNormalization, Activation
    from tensorflow.keras import activations

    CNN_model = Sequential()

    # The First Block
    CNN_model.add(Conv2D(128, kernel_size=2, kernel_initializer='he_uniform', kernel_regularizer=l2(0.0005), padding='same', input_shape=(700, 460, 3)))
    CNN_model.add(Activation(activations.relu))
    CNN_model.add(BatchNormalization())
    CNN_model.add(MaxPooling2D(2, 2))

    # The Second Block
    CNN_model.add(Conv2D(128, kernel_size=2, kernel_initializer='he_uniform', kernel_regularizer=l2(0.0005), padding='same'))
    CNN_model.add(Activation(activations.relu))
    CNN_model.add(BatchNormalization())
    CNN_model.add(MaxPooling2D(2, 2))

    # The Third Block
    CNN_model.add(Conv2D(128, kernel_size=2, kernel_initializer='he_uniform', kernel_regularizer=l2(0.0005), padding='same'))
    CNN_model.add(Activation(activations.relu))
    CNN_model.add(BatchNormalization())
    CNN_model.add(MaxPooling2D(2, 2))

    # The fourth Block
    CNN_model.add(Conv2D(128, kernel_size=2, kernel_initializer='he_uniform', …
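The question is cut off before the actual error, but for reference, a minimal sketch of where the SGD optimizer comes from in tf.keras and how it is passed to compile (the tiny stand-in model is only for illustration):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import SGD

    # Trivial stand-in model; the point is only the import path and the compile() call.
    model = Sequential([Dense(1, activation='sigmoid', input_shape=(10,))])
    model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])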
Category: Data Science

Estimating an RBF kernel SVM, followed by stochastic gradient descent

I want to estimate an RBF SVM to predict property prices. My data set has 11 features and roughly 57,000 rows. When I set C=10, R^2 is about 0.88, while MSE and RMSE are 0.1191 and 0.3451. The results are pretty good. Afterward, I estimate an SGD model using linear_model.SGDRegressor with loss='squared_epsilon_insensitive'. When I use the adaptive learning rate, R^2 drops to 0.75 while MSE and RMSE are 0.2441 and 0.4940, respectively. When I use the optimal learning rate, the results are even …
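For reference, a sketch of the two SGDRegressor configurations being compared, on synthetic data (the property-price data are not reproduced here):

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 11)
    y = X @ rng.randn(11) + rng.normal(scale=0.1, size=1000)

    # 'adaptive' keeps eta0 while the training loss keeps improving, then divides it by 5;
    # 'optimal' uses the heuristic schedule eta = 1 / (alpha * (t + t0)).
    adaptive = SGDRegressor(loss='squared_epsilon_insensitive',
                            learning_rate='adaptive', eta0=0.01).fit(X, y)
    optimal = SGDRegressor(loss='squared_epsilon_insensitive',
                           learning_rate='optimal').fit(X, y)
    print(adaptive.score(X, y), optimal.score(X, y))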
Topic: sgd rbf svm
Category: Data Science

Stochastic Gradient Region of Confusion

I have come across the following diagram, which explains the behavior of SGD graphically. Based on this graphical representation, the gradients of the individual data points tend to fluctuate more in direction when close to the optimum, whereas far away from the optimum they tend to point towards it. My question is: doesn't this depend on how we select the points randomly? For example, let's say we first find the gradient of the curve F3 and find that it …
Category: Data Science

Learning rate of 0 still changes weights in Keras

I just trained a model (SGD) with Keras and was wondering why the change in accuracy and loss from epoch to epoch doesn't really decrease that much when I lower the learning rate. So I tested what happens when I set the learning rate to 0 and, to my surprise, accuracy and loss still changed from epoch to epoch, and I can't find an explanation for that. Does anyone know why this could be happening?
Category: Data Science

Changing the batch size during training

The choice of batch size is, in some sense, a measure of stochasticity: on the one hand, smaller batch sizes make gradient descent more stochastic; SGD can deviate significantly from exact GD on the whole data, but this allows more exploration and performs, in some sense, a Bayesian inference. Larger batch sizes approximate the exact gradient better, but one is then more likely to overfit the data or get stuck in a local optimum. Processing …
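In Keras, for instance, the batch size is an argument of fit rather than of the model, so nothing prevents continuing the same training run with a different batch size (a sketch with a toy model and random data):

    import numpy as np
    from tensorflow import keras

    X = np.random.rand(1024, 8)
    y = np.random.rand(1024, 1)

    model = keras.Sequential([keras.layers.Dense(1, input_shape=(8,))])
    model.compile(optimizer='sgd', loss='mse')

    # Start with small, noisy batches (more stochastic, more exploration)...
    model.fit(X, y, batch_size=16, epochs=5, verbose=0)
    # ...then continue with large batches for a closer approximation of the full gradient.
    model.fit(X, y, batch_size=256, epochs=5, verbose=0)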
Category: Data Science

Input shape of a Keras Sequential model

I am new to neural networks with Keras. My training samples have input shape (150528, 1235) and output shape (154457, 1235), where 1235 is the number of training examples. How do I set the input shape? I tried the code below, but it gave me a ValueError:

    Data cardinality is ambiguous:
      x sizes: 150528
      y sizes: 154457
    Please provide data which shares the same first dimension.

Code:

    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Activation, …
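Keras expects the first axis of both x and y to be the sample axis, so with 1235 examples both arrays should have 1235 as their first dimension. A sketch assuming the arrays simply need transposing (small stand-in shapes; in the question they would be 150528, 154457 and 1235):

    import numpy as np
    from tensorflow import keras

    n_features_in, n_features_out, n_samples = 100, 120, 32

    x = np.random.rand(n_features_in, n_samples)   # (features, samples), as in the question
    y = np.random.rand(n_features_out, n_samples)

    # Transpose so that samples come first and both arrays share their first dimension.
    x, y = x.T, y.T   # x: (n_samples, n_features_in), y: (n_samples, n_features_out)

    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(n_features_in,)),
        keras.layers.Dense(n_features_out),
    ])
    model.compile(optimizer='sgd', loss='mse')
    model.fit(x, y, batch_size=8, epochs=1, verbose=0)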
Topic: sgd mse keras
Category: Data Science

Problem of multi-class classification (sklearn TfidfVectorizer and SGDClassifier)

I am doing (text) topic classification using TfidfVectorizer and SGDClassifier; essentially, I want to classify websites into categories (like Sport, Business, etc.). Now, the problem is that each website might fit into multiple categories, e.g. (IT and Eshop), (Sport and Eshop), etc. The question has two parts: one is how to do it technically (Python, scikit-learn), the other is theoretical. Let me explain the latter. As far as I understand how text classification works, it cannot …
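On the technical half of the question, one common scikit-learn approach (named here as a suggestion, not taken from the question) is to binarize the label sets and wrap the classifier in OneVsRestClassifier, so a separate SGDClassifier decides membership in each category. A minimal sketch with made-up documents and labels:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    docs = ["cheap football boots in our shop",
            "live scores and match reports",
            "enterprise cloud hosting plans"]
    labels = [["Sport", "Eshop"], ["Sport"], ["IT"]]

    # Turn the label sets into a binary indicator matrix, one column per category.
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)

    # One binary SGDClassifier per category: "belongs" vs. "does not belong".
    clf = make_pipeline(TfidfVectorizer(),
                        OneVsRestClassifier(SGDClassifier(loss="log_loss")))
    clf.fit(docs, Y)
    print(mlb.inverse_transform(clf.predict(["new football shirts for sale"])))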
Category: Data Science

Confused between optimizer and loss function

I always thought SGD was a loss function; then I read this in a notebook:

    model.compile(loss="sparse_categorical_crossentropy", optimizer=keras.optimizers.SGD(lr=1e-3), metrics=["accuracy"])

Now I am confused: what's the difference between a loss and an optimizer? Are they both used at the output layer to calculate the loss, or is the optimizer something used in each layer?
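One way to see the split is to write a single training step out by hand (a sketch of what fit does internally, with a toy model): the loss is the scalar computed from the outputs and targets, while the optimizer is the update rule applied to every trainable weight in every layer.

    import tensorflow as tf
    from tensorflow import keras

    model = keras.Sequential([keras.layers.Dense(10, activation="softmax", input_shape=(20,))])
    loss_fn = keras.losses.SparseCategoricalCrossentropy()  # the loss: a number measured at the output
    optimizer = keras.optimizers.SGD(learning_rate=1e-3)    # the optimizer: how all weights get updated

    x = tf.random.normal((32, 20))
    y = tf.random.uniform((32,), maxval=10, dtype=tf.int32)

    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x))
    grads = tape.gradient(loss, model.trainable_variables)        # gradients for every layer's weights
    optimizer.apply_gradients(zip(grads, model.trainable_variables))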
Category: Data Science

The central idea behind SGD

Prof. Hinton, in his popular course on Coursera, refers to the following fact: Rprop doesn't really work when we have very large datasets and need to perform mini-batch weight updates. Why doesn't it work with mini-batches? Well, people have tried it, but found it hard to make it work. The reason it doesn't work is that it violates the central idea behind stochastic gradient descent, which is that when we have a small enough learning rate, it averages the gradients over successive …
Category: Data Science
