SGD versus Adam Optimization Clarification
While reading the Adam paper, I ran into something I need clarification on.
The paper states that SGD updates all parameters with the same fixed learning rate, which does not change throughout training. Adam is described as different because its learning rate is adaptive: the effective step size is computed per parameter and can change during training.
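To make sure I'm reading the update rules correctly, here is a minimal NumPy sketch of my understanding (the function names and default values are my own, not taken from the paper):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    # SGD: every parameter is scaled by the same fixed learning rate.
    return theta - lr * grad

def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: keeps running averages of the gradient (m) and its square (v),
    # so the effective step size differs per parameter and changes over time.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)   # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```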
Is this adaptivity the primary reason why Adam performs better than SGD in most cases? The paper also states that Adam is computationally cheap; how can this be, given that it seems more complex than SGD?
I hope my questions are clear!
Topic neural-network optimization
Category Data Science