Is the usage of "momentum" significantly superior to the conventional weight update?
Momentum adds a fraction of the history of the previous weight updates to the current update, with the influence of older history diminishing over time (older momentum contributions get smaller). Is it significantly superior?
Weight update: $$ w_{i+1} = w_i + m_i $$
With momentum: $$ m_0 = 0 \\ m_1 = \Delta w_{1} + \beta m_0 = \Delta w_1 \\ m_2 = \Delta w_{2} + \beta m_1 = \Delta w_2 + \beta\Delta w_1 $$
So the momentum term already contains the current weight update as well as the momentum history. $\beta$, like $\alpha$, is a number between 0 and 1 ($\beta$ diminishes older momentum contributions).
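For concreteness, here is a minimal sketch of the two update rules in Python (not from the original post). It assumes $\Delta w_i$ is the plain gradient step $-\alpha \nabla f(w_i)$, and the toy 1D quadratic, the values of $\alpha$ and $\beta$, and the iteration count are arbitrary choices; it only shows the mechanics, it is not a benchmark of the two methods.

```python
# Toy objective f(w) = 0.5 * w^2 with gradient f'(w) = w; minimum at w = 0.
def grad(w):
    return w

alpha = 0.1   # learning rate
beta = 0.9    # momentum decay factor, between 0 and 1

# Plain update: w_{i+1} = w_i + dw_i, with dw_i = -alpha * grad(w_i)
w_plain = 5.0
for _ in range(100):
    w_plain += -alpha * grad(w_plain)

# Momentum update: m_i = dw_i + beta * m_{i-1}, then w_{i+1} = w_i + m_i
w_mom, m = 5.0, 0.0
for _ in range(100):
    delta_w = -alpha * grad(w_mom)
    m = delta_w + beta * m
    w_mom += m

print(f"plain gradient descent: w = {w_plain:.6f}")
print(f"with momentum:          w = {w_mom:.6f}")
```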
Is there a common consensus that using the momentum approach improves learning quality, in terms of both stability and speed?
Topic: momentum, backpropagation, gradient-descent, deep-learning, machine-learning
Category: Data Science