To answer the first question about why we need the learning rate even if we have momentum, let's consider an example in which we are not using the momentum term. The weight update is therefore:
$ \Delta w_{ij} = \frac{\partial E}{\partial w_{ij}} \cdot l $
where:
- $ \Delta w_{ij} $ is the weight update
- $ \frac{\partial E}{\partial w_{ij}} $ is the gradient of the error with respect to the weight
- $ l $ is the learning rate coefficient
Our weight update is determined solely by the gradient of the current error with respect to the weight at node $ ij $; prior weight deltas are not factored into the update at all. If we were to set the learning rate to zero, the weights would never update.
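To make this concrete, here is a minimal sketch of the plain update above (not from the original question; it assumes a toy 1-D quadratic error $ E(w) = (w - 3)^2 $). Setting the learning rate to zero turns the loop into a no-op:

```python
# Plain gradient descent on a toy 1-D quadratic error E(w) = (w - 3)^2.
# No momentum: each update depends only on the current gradient and the learning rate.
def gradient(w):
    return 2.0 * (w - 3.0)  # dE/dw

w = 0.0
lr = 0.1  # learning rate coefficient l

for step in range(50):
    delta_w = lr * gradient(w)  # weight update from the current gradient only
    w = w - delta_w             # descend the gradient

print(w)  # approaches 3.0; with lr = 0.0, delta_w is always 0 and w never moves
```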
Now let's consider an example using the momentum term in its derived form:
$ \Delta w^t_{ij} = (\frac{\partial E}{\partial w_{ij}} \cdot l) + (\mu \cdot \Delta w^{t-1}_{ij}) $
where:
- $ \mu $ is the momentum coefficient
- $ \Delta w^{t-1}_{ij} $ is the weight update of node $ ij $ from the previous epoch
Now we are factoring the previous weight delta into our weight update equation. In this form, it is easier to see that the learning rate and momentum are effectively independent terms. However, without a learning rate our weight delta would still be zero: the gradient term vanishes, and since every earlier delta would also have been zero, the momentum term has nothing to carry forward.
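Continuing the same toy example (again an illustrative sketch, not part of the original answer), the momentum form only adds a term that recycles the previous delta:

```python
# Gradient descent with momentum on the same toy error E(w) = (w - 3)^2.
def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0
lr = 0.1          # learning rate coefficient l
mu = 0.9          # momentum coefficient
prev_delta = 0.0  # previous weight delta, zero before the first update

for step in range(200):
    delta_w = lr * gradient(w) + mu * prev_delta  # gradient term + momentum term
    w = w - delta_w
    prev_delta = delta_w

print(w)  # converges to roughly 3.0
# With lr = 0.0 the gradient term vanishes, delta_w is 0 on the first step,
# prev_delta stays 0, and the momentum term never has anything to work with.
```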
Now you might ask: what if we remove the learning rate after getting an initial momentum value, so that momentum is the sole influence on the weight delta?
This destroys the backpropagation algorithm.
The objective of backprop is to optimize the weights to minimize error. We achieve this minimization by adjusting the weights according to the error gradient.
Momentum, on the other hand, aims to improve the rate of convergence and to help avoid local minima. The momentum term does not explicitly include the error gradient in its formula. Therefore, momentum by itself does not enable learning.
If you were to use only momentum after establishing an initial weight delta, the weight update equation would look like this:
$ \Delta w^t_{ij} = \mu \cdot \Delta w^{t-1}_{ij} = \mu^t \cdot \Delta w^{t=0}_{ij} $
and:
$ \lim_{t \to \infty} \Delta w^t_{ij} =
\begin{cases}
0 & \text{if } 0 \le \mu < 1 \\
\Delta w^{t=0}_{ij} & \text{if } \mu = 1 \\
\pm\infty & \text{if } \mu > 1 \space \land \space \Delta w^{t=0}_{ij} \neq 0
\end{cases}
$
Although there is a scenario in which the weight delta approaches zero, this decay is not driven by the error gradient; it is predetermined entirely by the momentum coefficient and the initial weight delta. Such a weight delta does nothing to minimize the error and is therefore useless.
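A small numeric sketch of this momentum-only regime (values chosen arbitrarily for illustration) shows that the error never enters the computation at all:

```python
# Momentum-only updates: delta_w_t = mu * delta_w_{t-1} = mu**t * delta_w_0.
mu = 0.9
delta_w = 0.5  # initial weight delta at t = 0

for t in range(1, 6):
    delta_w = mu * delta_w
    print(t, delta_w)  # ~0.45, 0.405, 0.3645, ... a geometric decay toward 0

# The sequence is fixed entirely by mu and the initial delta; the error E never
# appears, so the weights drift and then stall regardless of whether E is minimized.
```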
TL;DR:
The learning rate is critical for updating the weights to minimize error. Momentum is there to help the learning rate, not to replace it.