How exactly do you implement SGD with momentum?

I am looking at different sources to implement SGD with momentum, but they give different update equations.

(Here beta is the momentum hyper-parameter, weights[l] is the weight matrix for layer l, gradients[l] holds the gradients for layer l, and so on.)

One source gives:

v[l] = beta*v[l] - learning_rate*gradients[l]
weights[l] = weights[l] + v[l]

But another source gives:

v[l] = beta*v[l] + learning_rate*gradients[l]
weights[l] = weights[l] - v[l]

Are they equivalent?
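To make the comparison concrete, here is a minimal sketch of both variants side by side on a scalar toy problem (the function names and the quadratic loss are just illustrative choices, not from either source):

```python
def sgd_momentum_a(w, v, grad, lr=0.1, beta=0.9):
    # Variant A: velocity accumulates the *negative* gradient step,
    # then is added to the weights.
    v = beta * v - lr * grad
    w = w + v
    return w, v

def sgd_momentum_b(w, v, grad, lr=0.1, beta=0.9):
    # Variant B: velocity accumulates the *positive* gradient step,
    # then is subtracted from the weights.
    v = beta * v + lr * grad
    w = w - v
    return w, v

# Run both on the same 1-D quadratic loss f(w) = w**2, so grad = 2*w.
wa, va = 5.0, 0.0
wb, vb = 5.0, 0.0
for _ in range(20):
    wa, va = sgd_momentum_a(wa, va, 2 * wa)
    wb, vb = sgd_momentum_b(wb, vb, 2 * wb)
    # The velocities are exact negatives of each other (vb == -va),
    # so the weight trajectories coincide.
    assert wa == wb
```

Starting from v = 0, substituting v' = -v into variant B recovers variant A exactly, which is what the loop above checks numerically.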

Also, does it matter if beta + learning_rate != 1? (In that case this would differ from the exponential moving average equation, where the coefficients sum to 1.)
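For concreteness, the EMA-style update I have in mind is something like the following (the function name is just for illustration):

```python
def ema_update(v, grad, beta=0.9):
    # Exponential moving average of the gradient: the two
    # coefficients, beta and (1 - beta), sum to 1, so v stays
    # on the same scale as the gradients themselves.
    return beta * v + (1 - beta) * grad

# With a constant gradient g, v approaches g itself:
v = 0.0
for _ in range(200):
    v = ema_update(v, 1.0)
```

In the momentum equations above, by contrast, the coefficients are beta and learning_rate, which do not sum to 1 in general, so v is not a plain average of the gradients.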
