Why does NAG cause unstable validation loss?

I'm building a neural network for a classification problem. When playing around with some hyperparameters, I was surprised to see that using Nesterov's Accelerated Gradient instead of vanilla SGD makes a huge difference in the optimization process.

When I use vanilla SGD, optimization is really smooth. Training and validation loss decrease at similar rates, and appear to converge properly after a sufficiently large number of epochs:

[plot: training and validation loss under vanilla SGD, both decreasing smoothly]

However, when I switch to NAG without changing any other hyperparameters, the validation loss suddenly fluctuates a lot and starts to increase after a small number of epochs:

[plot: training loss decreasing, validation loss oscillating and rising under NAG]

To my understanding, this type of behaviour is usually caused by overfitting. It seems like NAG converges to a region of parameter space that generalizes significantly worse, but I'm not sure why. I couldn't find any papers explaining why NAG generalizes poorly in some cases.
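For context, here is a minimal sketch of the two update rules being compared, on a toy quadratic loss (the problem, hyperparameters, and function names are illustrative only, not my actual network). The key difference is that NAG evaluates the gradient at a "look-ahead" point `w + mu * v` rather than at the current weights:

```python
import numpy as np

# Toy ill-conditioned quadratic loss f(w) = 0.5 * w^T A w, gradient A @ w.
A = np.diag([1.0, 10.0])
grad = lambda w: A @ w

def sgd(w, lr=0.05, steps=100):
    """Vanilla gradient descent: step against the gradient at w."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def nag(w, lr=0.05, mu=0.9, steps=100):
    """Nesterov's Accelerated Gradient: gradient at the look-ahead point."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w + mu * v)   # look-ahead gradient, the NAG-specific step
        v = mu * v - lr * g    # update velocity
        w = w + v              # apply velocity to the weights
    return w

w0 = np.array([1.0, 1.0])
print(np.linalg.norm(sgd(w0)), np.linalg.norm(nag(w0)))
```

On this noiseless quadratic, NAG reaches the minimum faster than vanilla SGD; with noisy mini-batch gradients, the same momentum term also amplifies gradient noise, which is one candidate explanation for the fluctuating validation curve.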

Tags: momentum, gradient-descent, neural-network, optimization, machine-learning

Category: Data Science
