gradient descent diverges extremely

I manually created a random data set around some mean value and tried to use gradient-descent linear regression to predict this simple mean value. I did exactly as in the manual, yet for some reason my predictor coefficients go to infinity, even though the same approach worked for another case. Why, in this case, can it not predict a simple value of 1.4? clear all; n=10000; t=1.4; sigma_R = t*0.001; min_value_t = t-sigma_R; max_value_t = t+sigma_R; y_data …
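A common cause of this kind of blow-up is stepping with the *summed* per-sample gradient instead of the *averaged* one, which inflates the effective step size by a factor of n. A minimal numpy sketch, assuming a model that predicts a single constant with squared loss (the data values echo the snippet, but the update rule here is my assumption, not the asker's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 10_000, 1.4
y = rng.uniform(t - t * 0.001, t + t * 0.001, n)   # data clustered around 1.4

def fit(lr, average=True):
    theta = 0.0
    for _ in range(100):
        grad = 2 * (theta - y)                 # per-sample gradient of (theta - y)^2
        g = grad.mean() if average else grad.sum()
        theta -= lr * g
    return theta

fit(0.1)                  # averaged gradient: settles near 1.4
fit(0.1, average=False)   # summed gradient: step is n times too large, blows up
```

With the averaged gradient the update contracts toward the mean; with the sum, the effective step factor exceeds 1 in magnitude and the iterates diverge to infinity (eventually NaN).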
Category: Data Science

Understanding Learning Rate in depth

I am trying to understand why a learning rate does not work universally. I have two different data sets and have tested three learning rates: 0.001, 0.01 and 0.1. For the first data set, optimization with stochastic gradient descent converged for all three learning rates. For the second data set, the learning rate 0.1 did not converge. I understand the logic behind it overshooting the gradients; however, I'm failing to understand why this …
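The usual explanation is that the largest safe learning rate depends on the curvature of the loss, which differs between data sets. A toy sketch on a 1-D quadratic (the curvature values are assumptions standing in for the two data sets): gradient descent on f(x) = ½ax² multiplies the error by (1 − lr·a) each step, so it converges only when lr < 2/a.

```python
def gd(a, lr, x0=1.0, steps=100):
    # gradient descent on f(x) = 0.5 * a * x**2, so f'(x) = a * x
    x = x0
    for _ in range(steps):
        x -= lr * a * x
    return x

# "data set 1": gentle curvature a=1 -> all three rates shrink the error
# "data set 2": steep curvature a=30 -> lr=0.1 gives |1 - 0.1*30| = 2 > 1, diverges
for lr in (0.001, 0.01, 0.1):
    print(lr, gd(a=1, lr=lr), gd(a=30, lr=lr))
```

The same rate that is harmlessly small for one loss surface can overshoot on a steeper one, which is why no single learning rate works universally.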
Category: Data Science

Why is each successive tree in GBM fit on the negative gradient of the loss function?

Page 359 of The Elements of Statistical Learning (2nd edition) says the below. Can someone explain the intuition & simplify it in layman's terms? Questions: What is the reason/intuition & math behind fitting each successive tree in GBM on the negative gradient of the loss function? Is it done to make GBM generalize better on an unseen test dataset? If so, how does fitting on the negative gradient achieve this generalization on test data?
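For squared loss the negative gradient −∂L/∂F of L = ½(y − F)² is just the residual y − F, so "fit the next tree to the negative gradient" literally means "fit the next tree to what the current ensemble still gets wrong." A minimal sketch with an assumed toy data set and a hard-coded depth-1 "tree" (the split point is an assumption, not a fitted tree):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 300)
y = np.where(x < 0.5, 1.0, 3.0) + rng.normal(0, 0.1, 300)

def stump(x, target):
    # toy regression "tree": one fixed split, each leaf predicts the mean target
    left = x < 0.5
    return np.where(left, target[left].mean(), target[~left].mean())

F = np.full_like(y, y.mean())    # initial model: a constant
nu = 0.3                         # shrinkage / learning rate
for m in range(20):
    neg_grad = y - F             # -dL/dF for L = 0.5*(y - F)^2: the residual
    F += nu * stump(x, neg_grad) # each tree moves F downhill in function space

final_mse = np.mean((y - F) ** 2)   # shrinks toward the noise floor
```

This is gradient descent in function space: each tree is one (approximate) descent step on the training loss. The generalization behaviour comes from the separate regularization knobs (shrinkage, tree depth, number of trees), not from the negative gradient itself.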
Category: Data Science

Vanishing gradient problem even with the ReLU function?

Let's say I have a deep neural network with 50 hidden layers, and at each neuron of every hidden layer the ReLU activation function is used. My question is: is it possible for the vanishing gradient problem to occur during backpropagation for the weight updates even with ReLU? Or can we say that the vanishing gradient problem will never occur when all the activation functions are ReLU?
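ReLU removes the saturation of sigmoid/tanh, but gradients can still vanish: the backward pass multiplies by the weights at every layer, and a "dead" ReLU (pre-activation ≤ 0) zeroes the gradient entirely. A scalar 50-layer sketch (a deliberately tiny stand-in for a real network; the weight value is an assumption):

```python
relu = lambda z: max(z, 0.0)

def forward_backward(w, depth=50, x=1.0):
    # forward through `depth` layers of y = relu(w * x), caching pre-activations
    pre = []
    for _ in range(depth):
        z = w * x
        pre.append(z)
        x = relu(z)
    # backward: d(out)/d(in) = product over layers of w * 1[z > 0]
    grad = 1.0
    for z in reversed(pre):
        grad *= w * (1.0 if z > 0 else 0.0)
    return grad

forward_backward(w=0.5)    # ~0.5**50: vanishes even though every ReLU stays active
forward_backward(w=-0.5)   # first layer outputs 0 -> dead ReLU, gradient exactly 0
```

So ReLU mitigates but does not rule out vanishing gradients; with small weights the product of 50 factors below 1 still shrinks to nearly nothing, and dead units kill the gradient outright.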
Category: Data Science

calculating gradient descent

When using mini-batch gradient descent, we perform backpropagation after each batch, i.e. we calculate the gradient after each batch. We also capture y-hat after each sample in the batch and finally calculate the loss function over the whole batch, and we use this latter to calculate the gradient, correct? Now, as the chain rule states, we calculate the gradient this way for the below neural network. The question is: if we calculate the gradient after passing …
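That reading is correct, and the sequence "per-sample ŷ → one batch loss → one gradient → one update" can be sketched with a plain linear model (the data and shapes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.01, 100)

w, lr, batch = np.zeros(3), 0.1, 20
for epoch in range(200):
    for i in range(0, len(X), batch):
        Xb, yb = X[i:i+batch], y[i:i+batch]
        y_hat = Xb @ w                            # y-hat for every sample in the batch
        loss = np.mean((y_hat - yb) ** 2)         # ONE loss value over the whole batch
        grad = 2 * Xb.T @ (y_hat - yb) / len(yb)  # gradient of that batch loss
        w -= lr * grad                            # one update ("backprop") per batch
```

Because the batch loss is a mean of per-sample losses, its gradient is the mean of the per-sample gradients, so "loss over the batch, then one gradient" and "average of per-sample gradients" are the same computation.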
Category: Data Science

Is it beneficial to use a batch size > 1 even when all computing power can be used?

In regards to training a neural network, it is often said that increasing the batch size decreases the network's ability to generalize, as alluded to here. This is due to the fact that training on large batches causes the network to converge to sharp minima, as opposed to wide ones, as explained here. This raises the question: in situations where all available computing power can be used by training on a batch size of one, is there a benefit to …
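One quantitative piece of the sharp-vs-wide story is gradient noise: the mini-batch gradient is an unbiased estimate of the full-batch gradient whose variance shrinks roughly like 1/B, and that noise is what is argued to steer SGD away from sharp minima. A sketch measuring this directly at a fixed parameter point (the data and model are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.ones(5) + rng.normal(0, 1.0, 10_000)
w = np.zeros(5)                       # measure gradient noise at this fixed point

def batch_grad(idx):
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

full = batch_grad(np.arange(len(X)))  # the full-batch ("true") gradient
def noise(B, reps=300):
    est = [batch_grad(rng.integers(0, len(X), B)) for _ in range(reps)]
    return float(np.mean([np.sum((g - full) ** 2) for g in est]))

for B in (1, 16, 256):
    print(B, noise(B))                # squared deviation shrinks roughly like 1/B
```

So batch size 1 maximizes gradient noise per step; whether that extra noise helps generalization in a given problem is the empirical question the cited papers debate.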
Category: Data Science

how to calculate the loss function?

I hope you are doing well. I want to ask a question regarding the loss function in a neural network. I know that the loss function is calculated for each data point in the training set, and then backpropagation is done depending on whether we are using batch gradient descent (backpropagation is done after all the data points are passed), mini-batch gradient descent (backpropagation is done after each batch), or stochastic gradient descent (backpropagation is done after each data point). …
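The three schedules differ only in how many data points feed each update, which shows up directly as the number of updates per epoch. A small sketch with an assumed noise-free linear problem (60 training points, 50 epochs):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X @ np.array([2.0, -1.0])

def train(batch_size, lr=0.05, epochs=50):
    w, updates = np.zeros(2), 0
    for _ in range(epochs):
        for i in range(0, len(X), batch_size):
            Xb, yb = X[i:i+batch_size], y[i:i+batch_size]
            # per-point losses are averaged over the batch, then ONE update
            w -= lr * 2 * Xb.T @ (Xb @ w - yb) / len(yb)
            updates += 1
    return w, updates

for bs, name in [(60, "batch GD"), (10, "mini-batch"), (1, "SGD")]:
    w, n = train(bs)
    print(name, n, w)   # 50, 300 and 3000 updates per 50 epochs, respectively
```

Batch GD does 1 update per epoch, mini-batch does ⌈60/10⌉ = 6, and SGD does 60; the per-point loss definition never changes, only how many per-point gradients are averaged before each step.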
Category: Data Science

'Solvers' in Machine Learning

What role do 'solvers' play in optimization problems? Surprisingly, I could not find any definition for 'solvers' online. All the sources I've referred to just explain the types of solvers & the conditions under which each one is supposed to be used. Examples of solvers: ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
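Those names (from scikit-learn's LogisticRegression) are simply the numerical algorithms used to minimize the training loss; the model and loss stay the same, only the minimization routine changes. A sketch contrasting two solver families on an L2-regularized logistic loss, in plain numpy (the data, regularization strength, and step counts are assumptions): a first-order solver (plain gradient descent, like sag/saga in spirit) versus a second-order one (Newton's method, like newton-cg/lbfgs in spirit).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -1.0]) + rng.normal(0, 0.5, 200) > 0).astype(float)
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lam = 0.1                                       # L2 penalty keeps the optimum finite

def grad_descent(steps=3000, lr=0.2):           # first-order: gradient only
    w = np.zeros(2)
    for _ in range(steps):
        g = X.T @ (sigmoid(X @ w) - y) / len(y) + lam * w
        w -= lr * g
    return w

def newton(steps=15):                           # second-order: uses the Hessian
    w = np.zeros(2)
    for _ in range(steps):
        p = sigmoid(X @ w)
        g = X.T @ (p - y) / len(y) + lam * w
        H = X.T @ (X * (p * (1 - p))[:, None]) / len(y) + lam * np.eye(2)
        w -= np.linalg.solve(H, g)
    return w
```

Both reach the same minimizer; they differ in per-iteration cost and iteration count, which is exactly the trade-off behind the "which solver when" advice.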
Category: Data Science

Understanding SGD for Binary Cross-Entropy loss

I'm trying to describe mathematically how stochastic gradient descent can be used to minimize the binary cross-entropy loss. The typical description of SGD that I can find online is: $\theta = \theta - \eta \,\nabla_{\theta}J(\theta,x^{(i)},y^{(i)})$ where $\theta$ is the parameter to optimize the objective function $J$ over, and $x$ and $y$ come from the training set. Specifically, the $(i)$ indicates that it is the $i$-th observation from the training set. For binary cross-entropy loss, I am using …
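For a logistic model $p = \sigma(\theta^\top x^{(i)})$, the per-observation BCE is $J_i = -(y^{(i)}\log p + (1-y^{(i)})\log(1-p))$ and its gradient simplifies to $\nabla_\theta J_i = (p - y^{(i)})\,x^{(i)}$, which is what the SGD rule above plugs in. A numpy sketch (the synthetic data and the decaying step size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

sigmoid = lambda z: 1 / (1 + np.exp(-z))

theta = np.zeros(3)
for epoch in range(40):
    eta = 0.5 / (1 + epoch)                    # decaying learning rate
    for i in rng.permutation(len(X)):          # one (x^(i), y^(i)) at a time
        p = sigmoid(X[i] @ theta)
        # BCE gradient for one observation: (p - y_i) * x^(i)
        theta -= eta * (p - y[i]) * X[i]
```

Each update uses exactly one observation's gradient, which is the defining feature of SGD as written in the formula above.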
Category: Data Science

Verifying my understanding of MLE & Gradient Descent in Logistic Regression

Here is my understanding of the relation between MLE & gradient descent in logistic regression. Please correct me if I'm wrong: 1) MLE estimates the optimal parameters by taking the partial derivative of the log-likelihood function w.r.t. each parameter & equating it to 0. Gradient descent, just like MLE, gives us the optimal parameters by taking the partial derivative of the loss function w.r.t. each parameter. GD also uses hyperparameters like the learning rate & step size in the process of obtaining …
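The distinction in point 1 can be made concrete on a case where the "set the derivative to 0" step has a closed form: the MLE of a Gaussian mean. Solving the first-order condition analytically and running gradient descent on the same negative log-likelihood land on the same number (the data here are an assumption for illustration; in logistic regression no closed form exists, which is why GD is used there):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(5.0, 2.0, 1000)

# MLE: d/dmu [ sum (y_i - mu)^2 ] = 0  =>  mu = mean(y), solved in closed form
mu_mle = y.mean()

# Gradient descent: iterate downhill on the same (scaled) negative log-likelihood
mu, lr = 0.0, 0.1
for _ in range(200):
    grad = -2 * np.mean(y - mu)
    mu -= lr * grad
```

So MLE defines *what* to optimize (the likelihood); setting derivatives to zero and gradient descent are two *ways* to find that optimum, one analytic and one iterative.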
Category: Data Science

Why do we only care about convex functions when doing Gradient Descent/SGD?

I mean, I know why we specifically care about convex functions: their local minima are also global, so you just have to "follow a path which goes down" to find the minima of the function. However, there are also functions which are not convex but for which local minima are also global minima, for example a function which looks like this: Isn't there a way to characterize every function which "works well" with gradient descent? Something like …
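Such characterizations exist (quasi-convexity, or the Polyak–Łojasiewicz condition), but they come with caveats that convexity avoids. A sketch with the non-convex function f(x) = 1 − exp(−x²) (an assumed example of "non-convex but every local minimum is global"): gradient descent finds the unique minimum from a moderate start, yet the nearly-flat tails starve it of gradient signal.

```python
import math

def gd(x0, lr=0.5, steps=200):
    # f(x) = 1 - exp(-x^2): non-convex, single global minimum at x = 0
    x = x0
    for _ in range(steps):
        grad = 2 * x * math.exp(-x * x)
        x -= lr * grad
    return x

gd(0.5)   # converges to the global minimum at 0
gd(4.0)   # gradient ~ 8e-16 out here: the iterate barely moves
```

This is why "local minima are global" alone is not the whole story: convexity also rules out the flat plateaus and bad conditioning that can stall gradient descent in practice.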
Category: Data Science

How do I deal with non-IID data in gradient boosted random forest (for stock market)?

I am working on a stock market decision system. I have currently settled on gradient boosting as the likely best machine-learning solution for the problem. However, I have two fundamental issues with my data that stem from it coming from the stock market and therefore not being IID. First, because of the duration of the averages some indicators use, some data points are highly correlated. For example, the 2-year trailing return of a stock is not very different …
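One standard mitigation for overlapping-window correlation is walk-forward validation with an embargo gap, so a trailing-return window computed in the training rows can never overlap the test rows. A sketch of such a splitter (the window sizes are assumptions; this is hand-rolled, not a library API):

```python
def walk_forward_splits(n, train_size, test_size, embargo):
    """Yield (train_idx, test_idx) windows that respect time order.

    `embargo` rows between train and test stop overlapping-window
    indicators (e.g. 2-year trailing returns) leaking across the split.
    """
    start = 0
    while start + train_size + embargo + test_size <= n:
        train = list(range(start, start + train_size))
        test_start = start + train_size + embargo
        test = list(range(test_start, test_start + test_size))
        yield train, test
        start += test_size

for tr, te in walk_forward_splits(n=20, train_size=8, test_size=4, embargo=2):
    print(tr[0], tr[-1], "->", te[0], te[-1])
```

This does not make the data IID, but it keeps the evaluation honest: the model is always scored on strictly later, non-overlapping observations.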
Category: Data Science

How do you find the eigenvalues of the matrix for the following momentum gradient descent?

The following question is based purely on material available on MIT's OpenCourseWare YouTube channel (https://www.youtube.com/watch?v=wrEcHhoJxjM). In it, Professor Gilbert Strang explains the general formulation of the momentum gradient descent problem and ultimately arrives at optimal values (40:05 in the video) for the variables $s$ and $\beta$. $\textbf{Background}$ Let's begin with standard gradient descent, not covered in this video. The equation for this is: $x_{k+1}=x_{k}-s \nabla f(x_k)$ where $s$ is the step size, $f(x_k)$ is the value …
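For a quadratic whose Hessian has eigenvalue $\lambda$, heavy-ball momentum $x_{k+1} = x_k - s\lambda x_k + \beta(x_k - x_{k-1})$ makes the state $(x_{k+1}, x_k)$ evolve by a fixed $2\times 2$ matrix, and the eigenvalues Strang analyzes are the eigenvalues of that matrix. A numpy sketch checking them numerically (the condition-number values $m=1$, $L=100$ are assumptions for illustration):

```python
import numpy as np

def momentum_matrix(lam, s, beta):
    # state update: (x_{k+1}, x_k) = M @ (x_k, x_{k-1})
    return np.array([[1 + beta - s * lam, -beta],
                     [1.0,                0.0]])

m, L = 1.0, 100.0                                  # extreme Hessian eigenvalues
s = 4 / (np.sqrt(L) + np.sqrt(m)) ** 2             # the optimal step size
beta = ((np.sqrt(L) - np.sqrt(m)) / (np.sqrt(L) + np.sqrt(m))) ** 2

for lam in (m, 10.0, L):
    eig = np.linalg.eigvals(momentum_matrix(lam, s, beta))
    print(lam, np.abs(eig))   # modulus sqrt(beta) = (sqrt(L)-sqrt(m))/(sqrt(L)+sqrt(m))
```

With the optimal $s$ and $\beta$, every $\lambda \in [m, L]$ gives eigenvalues of modulus exactly $\sqrt{\beta}$, which is the convergence rate derived at 40:05 in the lecture.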
Category: Data Science

Learning parameters when loss is a piecewise function

I have a network to generate a single number $T$. I know in advance: a property of the loss function is that, when $T \in [a_1, a_2]$, the loss has the same value $L_1$; when $T \in [a_2, a_3]$, the loss has another value $L_2$; etc. The loss function resembles a piecewise function. A concrete, simplified example of this problem is perhaps something like object classification. I have a set of objects, and their distances to a category $C$ that …
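The core difficulty with such a loss is that a piecewise-constant function has zero gradient almost everywhere, so backpropagation gives the network no learning signal; the usual remedy is a smooth surrogate. A sketch demonstrating the zero-gradient problem (the interval edges $a_i$ and loss levels $L_i$ here are hypothetical values, not the asker's):

```python
import numpy as np

edges = np.array([0.0, 1.0, 2.0, 3.0])    # assumed interval boundaries a_1..a_4
levels = np.array([3.0, 2.0, 1.0])        # assumed loss value L_i on each interval

def loss(T):
    # piecewise-constant loss: same value everywhere inside an interval
    return levels[np.searchsorted(edges, T) - 1]

def num_grad(T, eps=1e-5):
    return (loss(T + eps) - loss(T - eps)) / (2 * eps)

print(num_grad(0.5), num_grad(1.5))   # 0.0 inside every interval: no descent signal
```

Since the gradient is zero inside each interval (and undefined at the edges), gradient-based training needs a differentiable approximation of this loss, e.g. blending the levels with smooth transition functions.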
Category: Data Science

ResNet: Derive the gradient matrices w.r.t. W1 and W2 and backprop equation in a Residual Network

How would I go about deriving, step by step, the stochastic gradient matrices w.r.t. $W_1$ and $W_2$ and the backpropagation equation in a residual block that is part of a larger ResNet network, with forward propagation expressed as: $$ F(x) = W_{2}\, g_{1}(W_{1}x) $$ $$ y = g_{2}(F(x) + x) $$ where $g_{1}, g_{2}$ are component-wise non-linear activation functions.
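A sketch of the derivation under the stated forward pass, writing $z_1 = W_1 x$, $a_1 = g_1(z_1)$, $z_2 = W_2 a_1 + x$ (so $y = g_2(z_2)$), with $\mathcal{L}$ the loss and $\odot$ the element-wise product:

$$\frac{\partial \mathcal{L}}{\partial z_2} = \frac{\partial \mathcal{L}}{\partial y} \odot g_2'(z_2)$$

$$\frac{\partial \mathcal{L}}{\partial W_2} = \frac{\partial \mathcal{L}}{\partial z_2}\, a_1^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial W_1} = \left(W_2^{\top}\frac{\partial \mathcal{L}}{\partial z_2} \odot g_1'(z_1)\right) x^{\top}$$

$$\frac{\partial \mathcal{L}}{\partial x} = W_1^{\top}\!\left(W_2^{\top}\frac{\partial \mathcal{L}}{\partial z_2} \odot g_1'(z_1)\right) + \frac{\partial \mathcal{L}}{\partial z_2}$$

The trailing $+\,\partial \mathcal{L}/\partial z_2$ in the last line is the skip connection's contribution: because $\partial z_2/\partial x = W_2\,\mathrm{diag}(g_1'(z_1))\,W_1 + I$, the identity term lets the gradient flow back unattenuated, which is the usual explanation of why ResNets train well at depth. The "stochastic" part changes nothing in these matrices; it only means $\partial \mathcal{L}/\partial y$ is evaluated on a single sampled example (or mini-batch).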
Category: Data Science

From what function do come the gradients that I use to adjust weights?

I have a question about the loss function and the gradient. I'm following the fastai course (https://github.com/fastai/fastbook) and at the end of the 4th chapter I got to wondering: from what function do the gradients that I use to adjust the weights come? I understand that the loss function is being differentiated. But which one? Can I see it? Or is it under the hood of PyTorch? Code of the step function:

def step(self):
    self.w.data -= self.w.grad.data * self.lr
    self.b.data -= self.b.grad.data …
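The gradients in `.grad` come from whatever scalar you last called `.backward()` on; in that chapter it is a mean-squared-error loss defined in the notebook itself, and PyTorch's autograd differentiates it for you. A numpy sketch that writes out by hand the same derivatives autograd would compute for an MSE loss on a linear model (the data here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * x + 2.0

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w * x + b
    # loss = mean((pred - y)**2); its partial derivatives are:
    grad_w = 2 * np.mean((pred - y) * x)   # what autograd stores in w.grad
    grad_b = 2 * np.mean(pred - y)         # what autograd stores in b.grad
    w -= lr * grad_w   # same role as: self.w.data -= self.w.grad.data * self.lr
    b -= lr * grad_b
```

So the function being differentiated is visible: it is the loss you wrote; only the differentiation itself is "under the hood" of PyTorch.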
Category: Data Science

Difference between OLS and Gradient Descent in Linear Regression

I understand what Ordinary Least Squares and Gradient Descent do, but I am confused about the difference between them. The only differences I can think of are: Gradient Descent is iterative, while OLS isn't. Gradient Descent uses a learning rate to reach the point of minima, while OLS just finds the minima of the equation using partial differentiation. Both methods are very useful in Linear Regression, and they both give us the same results: the best possible values …
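The "same minimum, different route" point can be verified directly: OLS solves the normal equations in one linear-algebra step, while gradient descent iterates toward the same minimizer of the squared loss. A numpy sketch with assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])   # intercept + 1 feature
y = X @ np.array([1.0, 2.5]) + rng.normal(0, 0.1, 200)

# OLS: one shot -- solve the normal equations (X'X) beta = X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: many small steps down the same squared-loss surface
beta_gd, lr = np.zeros(2), 0.1
for _ in range(2000):
    beta_gd -= lr * 2 * X.T @ (X @ beta_gd - y) / len(y)
```

Both arrive at the same coefficients; gradient descent earns its keep when the closed-form solve is infeasible (huge feature counts, streaming data, or losses with no closed form).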
Category: Data Science

Need help to understand the formula of gradient descent with multiple features

I am trying to implement gradient descent with multiple features after listening to Andrew Ng's Coursera lecture on gradient descent for multiple features. So, for example, when calculating for theta 1, part of the formula requires you to subtract the real y value from the predicted y value (to calculate the error), and at the end of the formula you multiply it by the value of feature 1 in the i-th training example, as denoted by the x superscript (i) subscript …
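The per-parameter formula from the lecture, $\theta_j := \theta_j - \alpha \frac{1}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)})\,x_j^{(i)}$, vectorizes so that all the $\theta_j$ update simultaneously in one line. A numpy sketch with assumed data (two features plus the $x_0 = 1$ intercept term):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])   # x_0=1, x_1, x_2
y = X @ np.array([4.0, 1.0, -3.0])

theta, alpha = np.zeros(3), 0.1
for _ in range(1000):
    error = X @ theta - y        # (h_theta(x^(i)) - y^(i)) for every example i
    # for each theta_j: multiply each error by feature j of example i,
    # average over the m examples -- exactly the summation in the lecture
    theta -= alpha * (X.T @ error) / m
```

`X.T @ error` computes, for every $j$ at once, the sum over $i$ of error times $x_j^{(i)}$, which is why all parameters can be updated simultaneously as the lecture requires.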
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.