Adam Optimiser First Step

While plotting the paths taken by different gradient descent optimisers on the cost surface of a toy example, I found that the Adam algorithm does not initially move in the direction of steepest descent (whereas vanilla gradient descent did). Why might this be?

Later steps were affected by momentum etc., but I would have assumed these effects wouldn't come into play for the first few steps.
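
Something like the following sketch reproduces the kind of comparison I mean; the quadratic bowl f(x, y) = x^2 + 10 y^2 is just a stand-in for my actual toy example, and the updates are hand-rolled rather than taken from a library:

```python
import numpy as np
import matplotlib.pyplot as plt

def grad(p):
    # Gradient of the toy cost f(x, y) = x**2 + 10 * y**2
    return np.array([2.0 * p[0], 20.0 * p[1]])

def gd_path(start, lr=0.05, steps=30):
    # Vanilla gradient descent: each step is -lr * gradient
    p, path = start.copy(), [start.copy()]
    for _ in range(steps):
        p = p - lr * grad(p)
        path.append(p.copy())
    return np.array(path)

def adam_path(start, lr=0.05, steps=30, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam with the usual defaults, written out explicitly
    p, path = start.copy(), [start.copy()]
    v, s = np.zeros_like(p), np.zeros_like(p)
    for t in range(1, steps + 1):
        g = grad(p)
        v = beta1 * v + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g**2
        v_hat, s_hat = v / (1 - beta1**t), s / (1 - beta2**t)
        p = p - lr * v_hat / (np.sqrt(s_hat) + eps)
        path.append(p.copy())
    return np.array(path)

start = np.array([1.5, 1.0])
gd, adam = gd_path(start), adam_path(start)

# Plot both paths on top of the cost-surface contours
X, Y = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-1.5, 1.5, 200))
plt.contour(X, Y, X**2 + 10 * Y**2, levels=20)
plt.plot(gd[:, 0], gd[:, 1], "o-", label="gradient descent")
plt.plot(adam[:, 0], adam[:, 1], "s-", label="Adam")
plt.legend()
plt.show()
```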

Topic: momentum gradient-descent neural-network optimization machine-learning

Category: Data Science


These are the equations of Adam [Ref - Dive Into Deep Learning]:

\begin{aligned} \mathbf{v}_t & \leftarrow \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t \\ \mathbf{s}_t & \leftarrow \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2 \end{aligned}

\begin{aligned} \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_1^t} \text{ and } \hat{\mathbf{s}}_t = \frac{\mathbf{s}_t}{1 - \beta_2^t} \end{aligned}

\begin{aligned} \mathbf{g}_t' = \frac{\eta \hat{\mathbf{v}}_t}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon} \end{aligned}

\begin{aligned} \mathbf{x}_t \leftarrow \mathbf{x}_{t-1} - \mathbf{g}_t' \end{aligned}

  • The first two equations accumulate the momentum (first moment) and the second moment of the gradient
  • The next two correct the initial bias
  • The last two perform the parameter update

The initial values are [Ref - Arxiv Paper]: \begin{aligned} \mathbf{v} = \mathbf{s} = 0;\ t = 1^{**};\ \beta_1 = 0.9;\ \beta_2 = 0.999;\ \epsilon = 10^{-8} \end{aligned} Note - ** - t is initialized to 0 but incremented in the loop before any other operation.
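
Putting the equations and these defaults together, here is a minimal sketch of a single Adam update in plain NumPy (the function and variable names are mine, not from any particular library):

```python
import numpy as np

def adam_step(x, v, s, t, grad_fn, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; call with t = 1, 2, 3, ... and v = s = 0 initially."""
    g = grad_fn(x)
    v = beta1 * v + (1 - beta1) * g              # first moment (momentum)
    s = beta2 * s + (1 - beta2) * g**2           # second moment
    v_hat = v / (1 - beta1**t)                   # bias correction
    s_hat = s / (1 - beta2**t)
    x = x - lr * v_hat / (np.sqrt(s_hat) + eps)  # parameter update
    return x, v, s
```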

With these defaults, at t = 1 the bias corrections cancel the (1 - \beta) factors, giving \begin{aligned} \hat{\mathbf{v}}_1 = \mathbf{g}_1, \quad \hat{\mathbf{s}}_1 = \mathbf{g}_1^2, \quad \mathbf{g}_1' = \frac{\eta\, \mathbf{g}_1}{|\mathbf{g}_1| + \epsilon} \approx \eta\, \mathrm{sign}(\mathbf{g}_1) \end{aligned} (elementwise, approximating for small \epsilon). So the first step has magnitude \eta in every coordinate: its direction is the sign of the gradient rather than the gradient itself, and the initial movement is not proportional to the gradient (and in general does not point along the steepest-descent direction).
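
A quick numeric check of the t = 1 case (defaults as above, an arbitrary example gradient):

```python
import numpy as np

g1 = np.array([2.0, 20.0])                  # an arbitrary first gradient
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

v_hat = (1 - beta1) * g1 / (1 - beta1)      # = g1 after bias correction
s_hat = (1 - beta2) * g1**2 / (1 - beta2)   # = g1**2 after bias correction
step = lr * v_hat / (np.sqrt(s_hat) + eps)  # ~ lr * sign(g1), elementwise

print(step)  # ~[0.1 0.1]: same magnitude in every coordinate, regardless of g1
```

So on the toy surface the very first Adam step already moves the same distance along every axis, whereas vanilla gradient descent moves further along the axes with larger gradient components.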
