SGD versus Adam Optimization Clarification

Reading the Adam paper, I need some clarification.

It states that SGD optimization updates the parameters with the same learning rate (i.e. it does not change throughout training). They state that Adam is different: the learning rate is variable (adaptive) and can change during training.

Is this the primary reason why Adam performs better than SGD in most cases? Also, it states that Adam is computationally cheaper; how can that be, given that it seems more complex than SGD?

I hope my questions are clear!

In many applications I've seen (e.g. GANs), $\beta_1$ is set to $0$, so $m_t = g_t$, i.e. the numerator of the Adam update rule is the same raw gradient used by SGD (a minimal code sketch contrasting the two updates appears after the list below). This leaves two main differences, both related to the moving average (MA) of the second moment:

  1. $v_t$: the raw MA of the second moment serves as a gradient normalizer: the gradient is divided by the square root of the moving average of the squared gradients.
  2. $1-\beta_2^t$: to reduce bias, $\sqrt{v_t}$ is also divided by $\sqrt{1-\beta_2^t}$. This follows from the derivation of the expectation of the squared gradient, $\mathbf{E}\big[\big(\frac{\partial E}{\partial w_t}\big)^2\big]$, in Section 3 of the paper. Essentially, $\mathbf{E}[v_t] = (1-\beta^t_2)\,\mathbf{E}\big[\big(\frac{\partial E}{\partial w_t}\big)^2\big] + \zeta$, where $\zeta$ is the paper's approximation-error term, hence the expression (a short sketch of this derivation follows the list). Early in training the moving averages are initialized at $0$ and are therefore biased towards $0$; dividing by $1-\beta_2^t$ corrects for this.
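
For completeness, here is a compressed sketch of the Section 3 derivation, writing $g_i$ for $\frac{\partial E}{\partial w_i}$ (an abbreviation not used above) and assuming, as the paper does, that $\mathbf{E}[g_i^2]$ is approximately stationary so it can be factored out of the sum:

$$v_t = (1-\beta_2)\sum_{i=1}^{t}\beta_2^{\,t-i}\, g_i^2 \quad\Rightarrow\quad \mathbf{E}[v_t] = \mathbf{E}[g_t^2]\,(1-\beta_2)\sum_{i=1}^{t}\beta_2^{\,t-i} + \zeta = \mathbf{E}[g_t^2]\,(1-\beta_2^t) + \zeta,$$

so dividing $v_t$ by $1-\beta_2^t$ (equivalently, $\sqrt{v_t}$ by $\sqrt{1-\beta_2^t}$) yields an approximately unbiased estimate of the second moment of the gradient.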

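To make the contrast above concrete, here is a minimal NumPy sketch (with illustrative hyperparameter values, not the paper's reference implementation) of an SGD step versus an Adam step with $\beta_1 = 0$:

```python
# Minimal sketch contrasting one SGD step with one Adam step where beta1 = 0,
# so the Adam numerator is just the raw gradient g.
import numpy as np

def sgd_step(w, g, lr=0.01):
    """Vanilla SGD: every parameter is scaled by the same learning rate."""
    return w - lr * g

def adam_step(w, g, v, t, lr=0.001, beta2=0.999, eps=1e-8):
    """Adam step with beta1 = 0 (no first-moment moving average)."""
    v = beta2 * v + (1.0 - beta2) * g ** 2     # raw MA of the second moment (point 1)
    v_hat = v / (1.0 - beta2 ** t)             # bias correction by 1 - beta2^t (point 2)
    w = w - lr * g / (np.sqrt(v_hat) + eps)    # per-parameter adaptive step
    return w, v

# Toy usage: two parameters whose gradients differ by two orders of magnitude.
w_sgd = np.array([1.0, 1.0])
w_adam = np.array([1.0, 1.0])
v = np.zeros(2)
for t in range(1, 6):
    g = np.array([10.0, 0.1])                  # pretend (constant) gradients
    w_sgd = sgd_step(w_sgd, g)
    w_adam, v = adam_step(w_adam, g, v, t)

print("SGD:  step size scales with |g|:", w_sgd)    # first parameter moved 100x further
print("Adam: step size ~ lr per param: ", w_adam)   # both parameters moved ~5 * lr
```

With a constant gradient the bias-corrected $\hat{v}_t$ equals $g^2$ exactly, so every Adam step has magnitude close to the learning rate for every parameter, while the SGD step stays proportional to the gradient's magnitude.
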
In probability and statistics, moments are the uncentered expectations of the form $\mathbf{E}[X^k]$, which these moving averages estimate, hence the name. The normalization rescales each parameter's gradient by an estimate of its typical magnitude, which gives a better-scaled parameter update.
