The central idea behind SGD

Prof. Hinton, in his popular course on Coursera, mentions the following:

Rprop doesn’t really work when we have very large datasets and need to perform mini-batch weight updates. Why doesn’t it work with mini-batches? Well, people have tried it, but found it hard to make it work. The reason it doesn’t work is that it violates the central idea behind stochastic gradient descent, which is that when we have a small enough learning rate, it averages the gradients over successive mini-batches. Consider a weight that gets a gradient of 0.1 on nine mini-batches, and a gradient of -0.9 on the tenth mini-batch. What we’d like is for those gradients to roughly cancel each other out, so that the weight stays approximately the same. But that’s not what happens with rprop. With rprop, we increment the weight nine times and decrement it only once, so the weight grows much larger.
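
To make the arithmetic in that example concrete, here is a minimal Python sketch; the fixed `step` size is an illustrative assumption, and the update direction follows the quote's sign convention:

```python
# Per-mini-batch gradients for one weight, as in the quote:
grads = [0.1] * 9 + [-0.9]

# With a small learning rate, SGD's net movement is proportional to the
# mean gradient, which is (up to floating point) zero here.
mean_grad = sum(grads) / len(grads)
print(mean_grad)  # ~0.0: the nine small gradients cancel the one large one

# Rprop ignores magnitudes and takes a fixed-size step per mini-batch based
# only on the gradient's sign, so nine steps one way and one step back
# leave a large net change in the weight.
step = 0.01  # illustrative fixed rprop step size (an assumption)
net_change = step * sum(1 if g > 0 else -1 for g in grads)
print(net_change)  # 0.08: the weight drifts instead of staying put
```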

As you can see, the central idea behind SGD is that successive mini-batch gradients should be averaged. Does anyone have a valid formal source for this? Is there any justification? I haven't encountered any proof so far.

Topic: sgd, deep-learning, neural-network, machine-learning

Category: Data Science


Influence of the data generating distribution

To see this, first note that neither batch gradient descent (using the whole dataset to compute the gradient) nor mini-batch gradient descent computes the true (exact) value of the gradient.

To compute the true value of the gradient, we would have to use the set of all possible values of the features $x$ (and thereby the outputs $y$).

More formally, referring to the quantity that we want to minimize as the expected value of the per-example loss function $J(x,y,\theta)$ (where $\theta$ are the parameters) with respect to all possible $x,y$ values, the true gradient $g$ is given by: $$g = \frac{\partial}{\partial \theta}\mathbb{E}_{x,y\sim p_{data}}(J(x,y,\theta)) $$ And if we assume certain regularity conditions (so that differentiation and expectation can be interchanged), we have: $$g = \mathbb{E}_{x,y\sim p_{data}}\left(\frac{\partial}{\partial \theta}J(x,y,\theta)\right) $$

Where $p_{data}$ is the data generating distribution (the distribution from which the values of $x$ and $y$ are drawn). However, this data generating distribution is usually unknown; we only know the dataset that we are given.

Because of this, to update the parameters using all the information available (the training set), we instead use the empirical distribution defined by the training data, $\hat{p}_{data}$, which puts probability $1/m$ on each of the $m$ samples $(x^{(1)}, y^{(1)}), \,(x^{(2)}, y^{(2)}),\,...\,,(x^{(m)}, y^{(m)})$ of the dataset. So the gradient is approximated by: $$ \begin{aligned} \hat{g}&=\frac{\partial}{\partial \theta}\mathbb{E}_{x,y\sim \hat{p}_{data}}(J(x,y,\theta))\\&=\frac{\partial}{\partial \theta}\left(\sum_{i=1}^m \frac{1}{m}J(x^{(i)},y^{(i)},\theta)\right)\\ &= \frac{1}{m}\sum_{i=1}^m\frac{\partial }{\partial \theta}J(x^{(i)},y^{(i)},\theta) \end{aligned} $$ which is exactly the gradient used by batch gradient descent.
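
As a concrete illustration of this $1/m$ average, here is a minimal NumPy sketch, assuming a squared per-example loss on a synthetic linear-regression dataset (the loss and the data are illustrative choices, not something from the answer above):

```python
import numpy as np

# Toy setup: linear model with squared per-example loss
# J(x, y, theta) = 0.5 * (theta @ x - y)**2 (an illustrative choice).
rng = np.random.default_rng(0)
m, d = 100, 3
X = rng.normal(size=(m, d))                      # m samples x^{(i)}
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)
theta = np.zeros(d)

def per_example_grad(x_i, y_i, theta):
    # d/dtheta of 0.5 * (theta @ x_i - y_i)**2
    return (theta @ x_i - y_i) * x_i

# g_hat: the expectation under the empirical distribution, i.e. a uniform
# 1/m average of the per-example gradients over the whole training set.
g_hat = np.mean([per_example_grad(X[i], y[i], theta) for i in range(m)], axis=0)

# Matches the vectorized batch-gradient formula for this loss.
assert np.allclose(g_hat, X.T @ (X @ theta - y) / m)
```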

But what happens with mini-batches?

By using mini-batch updates, we are continuously seeing new data (assuming we compute just one epoch, so no sample is ever revisited). So in this case, using mini-batches, we are effectively sampling from the data generating distribution itself.

This means that on each mini-batch update, by sampling this data generating distribution, we end up with an estimate $\hat{g}$ of the true gradient $g$ that is unbiased, i.e. $\mathbb{E}_{x,y\sim p_{data}}(\hat{g})=g$. To see this, consider mini-batches of size $s$, where $g^{(i)}$ is the gradient of the per-example loss on the $i$-th sample of the mini-batch: $$\begin{aligned} \mathbb{E}_{x,y\sim p_{data}}(\hat{g})&=\mathbb{E}_{x,y\sim p_{data}}\left(\frac{g^{(1)}+...+g^{(s)}}{s}\right)\\ &=\frac{1}{s}\left(\mathbb{E}_{x,y\sim p_{data}}(g^{(1)})+...+\mathbb{E}_{x,y\sim p_{data}}(g^{(s)})\right)\\ &=\frac{1}{s}\,s\,g=g \end{aligned} $$ where the last step uses that each sample is drawn from $p_{data}$, so $\mathbb{E}_{x,y\sim p_{data}}(g^{(i)})=g$ for every $i$. Thereby, making successive mini-batch updates, we tend on average (as shown by $\mathbb{E}_{x,y\sim p_{data}}(\hat{g})$) to update our parameters with the true value of the gradient. And this is what I think the authors refer to in the quote of the question.
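
We cannot actually sample the unknown $p_{data}$, but the argument can be illustrated numerically by treating a finite dataset as the population and drawing mini-batches from it. A minimal sketch under the same toy squared-loss setup as above (all names and sizes are illustrative):

```python
import numpy as np

# Treat a finite dataset as the "population" p_data and check empirically
# that the mean of many s-sized mini-batch gradients matches the true one.
rng = np.random.default_rng(0)
m, d, s, n_trials = 100, 3, 10, 20000
X = rng.normal(size=(m, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)
theta = np.zeros(d)

def avg_grad(idx):
    # Average gradient of 0.5 * (theta @ x - y)**2 over the rows in idx.
    Xi, yi = X[idx], y[idx]
    return Xi.T @ (Xi @ theta - yi) / len(idx)

g = avg_grad(np.arange(m))              # "true" gradient over the population
g_hat_mean = np.mean(
    [avg_grad(rng.choice(m, size=s)) for _ in range(n_trials)], axis=0
)
print(np.linalg.norm(g_hat_mean - g))   # close to 0, i.e. E[g_hat] = g
```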


Great references:

Deep Learning book, Ian Goodfellow et al., Chapter 8.1
Answers from here


In a full gradient descent step, the loss function is defined as the average of the loss terms at the individual sample points. To minimize the loss function, we therefore average over the individual gradients.

In stochastic gradient descent, if there is no bias in selecting the batches, averaging over the batches results in an unbiased estimate of the full gradient.
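
As a minimal sketch of what "no bias in selecting the batches" can look like in practice, here is a toy mini-batch SGD loop; shuffling each epoch gives every sample the same chance of landing in any batch. The squared-loss regression setup, learning rate, and batch size are illustrative assumptions:

```python
import numpy as np

# Toy squared-loss linear regression trained with mini-batch SGD.
rng = np.random.default_rng(0)
m, d, s, lr = 100, 3, 10, 0.1
X = rng.normal(size=(m, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)
theta = np.zeros(d)

for epoch in range(50):
    perm = rng.permutation(m)                 # unbiased batch selection
    for start in range(0, m, s):
        idx = perm[start:start + s]
        # Mini-batch gradient: an unbiased estimate of the full gradient.
        g_hat = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
        theta -= lr * g_hat                   # descend along the estimate

print(theta)  # close to the generating weights [1.0, -2.0, 0.5]
```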

Please take a look at these lecture notes: http://www.stat.cmu.edu/~ryantibs/convexopt-F18/scribes/Lecture_24.pdf
