Adam Optimiser First Step
While plotting the paths taken across the cost surface by different gradient descent optimisers on a toy example, I found that the Adam algorithm does not initially move in the direction of steepest descent (whereas vanilla gradient descent does). Why might this be?
Later steps were affected by momentum etc., but I would have assumed those effects wouldn't come into play on the very first step.
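For reference, here is a minimal sketch of the kind of comparison I ran. The quadratic cost f(x, y) = x² + 10y², the starting point, and the hyperparameter values are just illustrative stand-ins for my actual setup; the Adam update itself follows the standard algorithm with zero-initialised moments and bias correction:

```python
import numpy as np

# Toy quadratic cost (a stand-in for my actual surface): f(x, y) = x^2 + 10 y^2
def grad(theta):
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

theta0 = np.array([1.0, 1.0])
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

# Vanilla gradient descent: the first step points exactly along -grad
g = grad(theta0)
gd_step = -lr * g

# Adam: first step with m_0 = v_0 = 0 and bias correction
m = (1 - beta1) * g            # m_1 = (1 - beta1) * g
v = (1 - beta2) * g**2         # v_1 = (1 - beta2) * g^2
m_hat = m / (1 - beta1)        # bias-corrected: m_hat_1 = g
v_hat = v / (1 - beta2)        # bias-corrected: v_hat_1 = g^2
adam_step = -lr * m_hat / (np.sqrt(v_hat) + eps)

print("gradient direction:  ", g / np.linalg.norm(g))
print("GD first step dir:   ", gd_step / np.linalg.norm(gd_step))
print("Adam first step dir: ", adam_step / np.linalg.norm(adam_step))
```

Even on this first step, before any momentum history has accumulated, the two printed directions disagree.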
Topic momentum gradient-descent neural-network optimization machine-learning
Category Data Science