Why does NAG cause unstable validation loss?
I'm building a neural network for a classification problem. While experimenting with hyperparameters, I was surprised to see that using Nesterov's Accelerated Gradient (NAG) instead of vanilla SGD makes a huge difference in the optimization process.
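For reference, here is a standard formulation of the two update rules I'm comparing (the learning rate $\epsilon$ and momentum coefficient $\mu$ are generic symbols, not my actual hyperparameter values):

$$
\text{Vanilla SGD:}\quad \theta_{t+1} = \theta_t - \epsilon\,\nabla f(\theta_t)
$$

$$
\text{NAG:}\quad v_{t+1} = \mu v_t - \epsilon\,\nabla f(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}
$$

Compared to classical momentum, the only change in NAG is that the gradient is evaluated at the look-ahead point $\theta_t + \mu v_t$ rather than at $\theta_t$.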
When I use vanilla SGD, optimization is really smooth. The training and validation losses decrease at similar rates and seem to converge properly after a sufficiently large number of epochs:
However, when I switch to NAG without changing any other hyperparameters, the validation loss suddenly fluctuates a lot and starts to increase after only a few epochs:
To my understanding, this type of behaviour is usually a sign of overfitting. It seems that NAG converges to a region of the loss landscape that generalizes significantly worse, but I'm not sure why. I couldn't find any papers that explain why NAG generalizes poorly in some cases.
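For concreteness, this is a minimal sketch of the kind of change I mean, assuming a PyTorch-style setup (the model, learning rate, and momentum value below are placeholders, not my real configuration):

```python
import torch

# Placeholder model and learning rate -- not my actual configuration.
model = torch.nn.Linear(20, 2)
lr = 0.01

# Vanilla SGD: plain gradient steps, no momentum term.
sgd = torch.optim.SGD(model.parameters(), lr=lr)

# NAG: same optimizer class, but with momentum and nesterov=True
# (PyTorch requires momentum > 0 and dampening == 0 when nesterov=True).
nag = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, nesterov=True)
```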
Topic momentum gradient-descent neural-network optimization machine-learning
Category Data Science