To answer the first question about why we need the learning rate even if we have momentum, let's consider an example in which we are not using the momentum term. The weight update is therefore:
$ \Delta w_{ij} = \frac{\partial E}{\partial w_{ij}} \cdot l $
where:
- $ \Delta w_{ij} $ is the weight update
- $ \frac{\partial E}{\partial w_{ij}} $ is the gradient of the error with respect to the weight
- $ l $ is the learning rate coefficient
Our weight update is determined solely by the gradient of the current error with respect to the weight at node $ ij $; prior weight deltas are not factored into the update at all. If we were to set the learning rate to zero, the weights would never update.
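To make this concrete, here is a minimal sketch of the plain update above (not from the original question; it assumes a toy 1-D quadratic error $ E(w) = (w - 3)^2 $). Setting the learning rate to zero turns the loop into a no-op:

```python
# Plain gradient descent on a toy 1-D quadratic error E(w) = (w - 3)^2.
# No momentum: each update depends only on the current gradient and the learning rate.
def gradient(w):
    return 2.0 * (w - 3.0)  # dE/dw

w = 0.0
lr = 0.1  # learning rate coefficient l

for step in range(50):
    delta_w = lr * gradient(w)  # weight update from the current gradient only
    w = w - delta_w             # descend the gradient

print(w)  # approaches 3.0; with lr = 0.0, delta_w is always 0 and w never moves
```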
Now let's consider an example using the momentum term in its derived form:
$ \Delta w^t_{ij} = (\frac{\partial E}{\partial w_{ij}} \cdot l) + (\mu \cdot \Delta w^{t-1}_{ij}) $
where:
- $ \mu $ is the momentum coefficient
- $ \Delta w^{t-1}_{ij} $ is the weight update of node $ ij $ from the previous epoch
Now we are factoring the previous weight delta into our weight update equation. In this form, it is easier to see that the learning rate and momentum are effectively independent terms. However, without a learning rate our weight delta would still be zero: the gradient term vanishes, and since every earlier delta would also have been zero, the momentum term has nothing to carry forward.
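Continuing the same toy example (again an illustrative sketch, not part of the original answer), the momentum form only adds a term that recycles the previous delta:

```python
# Gradient descent with momentum on the same toy error E(w) = (w - 3)^2.
def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0
lr = 0.1          # learning rate coefficient l
mu = 0.9          # momentum coefficient
prev_delta = 0.0  # previous weight delta, zero before the first update

for step in range(200):
    delta_w = lr * gradient(w) + mu * prev_delta  # gradient term + momentum term
    w = w - delta_w
    prev_delta = delta_w

print(w)  # converges to roughly 3.0
# With lr = 0.0 the gradient term vanishes, delta_w is 0 on the first step,
# prev_delta stays 0, and the momentum term never has anything to work with.
```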
Now you might ask: what if we remove the learning rate after getting an initial momentum value, so that momentum is the sole influence on the weight delta?
This destroys the backpropagation algorithm.
The objective of backprop is to optimize the weights to minimize error. We achieve this minimization by adjusting the weights according to the error gradient.
Momentum, on the other hand, aims to improve the rate of convergence and to help avoid local minima. The momentum term does not explicitly include the error gradient in its formula. Therefore, momentum by itself does not enable learning.
If you were to use only momentum after establishing an initial weight delta, the weight update equation would look like this:
$ \Delta w^t_{ij} = \mu \cdot \Delta w^{t-1}_{ij} = \mu^t \cdot \Delta w^{t=0}_{ij} $
and:
$ \lim_{t \to \infty} \Delta w^t_{ij} =
\begin{cases}
0 & \text{if } 0 \le \mu < 1 \\
\Delta w^{t=0}_{ij} & \text{if } \mu = 1 \\
\pm\infty & \text{if } \mu > 1 \space \land \space \Delta w^{t=0}_{ij} \neq 0
\end{cases}
$
Although there is a scenario in which the weight delta approaches zero, this decay is not driven by the error gradient; it is predetermined entirely by the momentum coefficient and the initial weight delta. Such a weight delta does nothing to minimize the error and is therefore useless.
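A small numeric sketch of this momentum-only regime (values chosen arbitrarily for illustration) shows that the error never enters the computation at all:

```python
# Momentum-only updates: delta_w_t = mu * delta_w_{t-1} = mu**t * delta_w_0.
mu = 0.9
delta_w = 0.5  # initial weight delta at t = 0

for t in range(1, 6):
    delta_w = mu * delta_w
    print(t, delta_w)  # ~0.45, 0.405, 0.3645, ... a geometric decay toward 0

# The sequence is fixed entirely by mu and the initial delta; the error E never
# appears, so the weights drift and then stall regardless of whether E is minimized.
```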
TL;DR:
The learning rate is critical for updating the weights to minimize error. Momentum is there to help the learning rate, not to replace it.