How exactly do you implement SGD with momentum?
I am looking at sources to implement SGD with momentum, but they give different equations.
(Here beta is the momentum hyper-parameter, weights[l] is the weight matrix for layer l, gradients[l] are the gradients for layer l, etc.)
One source gives:
v[l] = beta*v[l] - learning_rate*gradients[l]
weights[l] = weights[l] + v[l]
But another source gives:
v[l] = beta*v[l] + learning_rate*gradients[l]
weights[l] = weights[l] - v[l]
Are they equivalent?
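For what it's worth, here is a minimal scalar sanity check of the two variants on the same made-up gradient sequence (toy values standing in for the per-layer matrices, not taken from either source):

```python
import numpy as np

# Toy comparison of the two momentum variants on one gradient sequence.
rng = np.random.default_rng(0)
grads = rng.normal(size=20)
beta, lr = 0.9, 0.1

w1, v1 = 1.0, 0.0  # variant 1: v = beta*v - lr*g;  w = w + v
w2, v2 = 1.0, 0.0  # variant 2: v = beta*v + lr*g;  w = w - v

for g in grads:
    v1 = beta * v1 - lr * g
    w1 = w1 + v1
    v2 = beta * v2 + lr * g
    w2 = w2 - v2

# With a constant learning rate, v2 == -v1 at every step,
# so the two weight trajectories coincide.
print(abs(w1 - w2) < 1e-12)
```

(Both velocities start at zero here; if the learning rate changed over time, the two forms would no longer track each other step for step.)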
Also, does it matter if beta + learning_rate != 1? (If so, this would differ from the exponential moving average equation, where the two coefficients sum to 1.)
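For reference, by the exponential moving average equation I mean the form where the two coefficients are a convex combination (my own toy sketch, scalar values):

```python
# Exponential moving average of a gradient sequence: the coefficients
# beta and (1 - beta) sum to 1, unlike beta and learning_rate above.
beta = 0.9
v = 0.0
for g in [1.0, 2.0, 3.0]:
    v = beta * v + (1 - beta) * g
print(v)
```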
Tags: sgd, implementation
Category: Data Science