Is the usage of "momentum" significantly superior to the conventional weight update?

Momentum adds a portion of the history of previous weight updates to the current update, with diminishing weight on older history (older momentum shares get smaller). Is it significantly superior?

Weight update: $$ w_{i+1} = w_i + m_i $$

With momentum: $$ m_0 = 0 \\ m_1 = \Delta w_{1} + \beta m_0 = \Delta w_1 \\ m_2 = \Delta w_{2} + \beta m_1 = \Delta w_2 + \beta\Delta w_1 $$

So the momentum term already contains the current weight update plus the momentum history. Like alpha, beta is a number between 0 and 1 (beta diminishes older momentum terms).
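To make the recursion concrete, here is a minimal sketch in Python of one such update step. The helper name `momentum_step`, the learning rate `alpha`, and taking the plain update to be the negative gradient scaled by `alpha` are my assumptions, not part of the question:

```python
import numpy as np

def momentum_step(w, grad, m, alpha=0.01, beta=0.9):
    """One weight update with momentum (illustrative sketch).

    w     -- current weights (scalar or array)
    grad  -- gradient of the loss at w
    m     -- accumulated momentum, with m = 0 before the first step
    alpha -- learning rate (assumed; not defined in the question)
    beta  -- momentum factor, between 0 and 1
    """
    delta_w = -alpha * grad      # the plain update Delta w_i
    m = delta_w + beta * m       # m_i = Delta w_i + beta * m_{i-1}
    w = w + m                    # w_{i+1} = w_i + m_i
    return w, m

# Example: a few steps on f(w) = 0.5 * ||w||^2, whose gradient is simply w
w, m = np.array([5.0, -3.0]), 0.0
for _ in range(3):
    w, m = momentum_step(w, grad=w, m=m)
```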

Is there a common consensus that using the momentum approach improves learning quality, in terms of stability and speed?

Tags: momentum, backpropagation, gradient-descent, deep-learning, machine-learning

Category: Data Science


Using momentum is a noise reduction (noisy gradients) and signal amplification strategy.

Imagine a large hill with rough terrain, full of ups and downs. We are trying to navigate to the bottom of the hill using purely local information. A bad strategy is to course-correct every time we see a potential new direction with a steeper descent.

The momentum term makes it difficult to course-correct frequently by adding weight to the directions chosen in the past. It dampens (pun intended) the urge to enthusiastically switch directions.

The challenge with reaching a common consensus: the exact nature of the terrain changes from problem to problem (dataset to dataset), so your mileage with momentum-based updates may vary. But in general, for non-toy problems, momentum-based weight updates are a good first strategy.
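As an illustration of that dampening effect, here is a small experiment of my own construction (the noisy 1-D quadratic, the parameter values, and the flip counter are assumptions, not part of the answer). It counts how often the update direction reverses; the momentum version course-corrects far less often than plain gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(w):
    # Gradient of f(w) = 0.5 * w**2 plus noise, a stand-in for a rough loss surface
    return w + rng.normal(scale=2.0)

def direction_flips(use_momentum, steps=500, alpha=0.05, beta=0.9):
    w, m, prev_step, flips = 10.0, 0.0, 0.0, 0
    for _ in range(steps):
        delta_w = -alpha * noisy_grad(w)
        m = delta_w + beta * m if use_momentum else delta_w
        if prev_step * m < 0:        # sign change = a course correction
            flips += 1
        prev_step = m
        w += m
    return flips

print("course corrections, plain updates:", direction_flips(False))
print("course corrections, with momentum:", direction_flips(True))
```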
