Adam Optimiser First Step

While plotting the paths taken by different gradient descent optimisers on the cost surface of a toy example, I found that the Adam algorithm does not initially move in the direction of steepest descent (whereas vanilla gradient descent did). Why might this be?

Later steps were affected by momentum etc., but I would have assumed these effects wouldn't come into play for the first few steps.
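
Something like the following sketch reproduces the kind of comparison I mean; the quadratic bowl f(x, y) = x^2 + 10 y^2 is just a stand-in for my actual toy example, and the updates are hand-rolled rather than taken from a library:

```python
import numpy as np
import matplotlib.pyplot as plt

def grad(p):
    # Gradient of the toy cost f(x, y) = x**2 + 10 * y**2
    return np.array([2.0 * p[0], 20.0 * p[1]])

def gd_path(start, lr=0.05, steps=30):
    # Vanilla gradient descent: each step is -lr * gradient
    p, path = start.copy(), [start.copy()]
    for _ in range(steps):
        p = p - lr * grad(p)
        path.append(p.copy())
    return np.array(path)

def adam_path(start, lr=0.05, steps=30, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam with the usual defaults, written out explicitly
    p, path = start.copy(), [start.copy()]
    v, s = np.zeros_like(p), np.zeros_like(p)
    for t in range(1, steps + 1):
        g = grad(p)
        v = beta1 * v + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g**2
        v_hat, s_hat = v / (1 - beta1**t), s / (1 - beta2**t)
        p = p - lr * v_hat / (np.sqrt(s_hat) + eps)
        path.append(p.copy())
    return np.array(path)

start = np.array([1.5, 1.0])
gd, adam = gd_path(start), adam_path(start)

# Plot both paths on top of the cost-surface contours
X, Y = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-1.5, 1.5, 200))
plt.contour(X, Y, X**2 + 10 * Y**2, levels=20)
plt.plot(gd[:, 0], gd[:, 1], "o-", label="gradient descent")
plt.plot(adam[:, 0], adam[:, 1], "s-", label="Adam")
plt.legend()
plt.show()
```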

Topic: momentum gradient-descent neural-network optimization machine-learning

Category: Data Science


These are the equations of Adam [Ref - Dive Into Deep Learning]:

\begin{aligned} \mathbf{v}_t & \leftarrow \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t \\ \mathbf{s}_t & \leftarrow \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2 \end{aligned}

\begin{aligned} \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_1^t} \text{ and } \hat{\mathbf{s}}_t = \frac{\mathbf{s}_t}{1 - \beta_2^t} \end{aligned}

\begin{aligned} \mathbf{g}_t' = \frac{\eta \hat{\mathbf{v}}_t}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon} \end{aligned}

\begin{aligned} \mathbf{x}_t \leftarrow \mathbf{x}_{t-1} - \mathbf{g}_t' \end{aligned}

  • The first two equations accumulate the momentum (first moment) and the second moment of the gradient
  • The next two correct the initial bias
  • The last two perform the parameter update

The initial values are [Ref - Arxiv Paper]: \begin{aligned} \mathbf{v} = \mathbf{s} = 0;\ t = 1^{**};\ \beta_1 = 0.9;\ \beta_2 = 0.999;\ \epsilon = 10^{-8} \end{aligned} Note - ** - t is initialized to 0 but incremented in the loop before any other operation.
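
Putting the equations and these defaults together, here is a minimal sketch of a single Adam update in plain NumPy (the function and variable names are mine, not from any particular library):

```python
import numpy as np

def adam_step(x, v, s, t, grad_fn, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; call with t = 1, 2, 3, ... and v = s = 0 initially."""
    g = grad_fn(x)
    v = beta1 * v + (1 - beta1) * g              # first moment (momentum)
    s = beta2 * s + (1 - beta2) * g**2           # second moment
    v_hat = v / (1 - beta1**t)                   # bias correction
    s_hat = s / (1 - beta2**t)
    x = x - lr * v_hat / (np.sqrt(s_hat) + eps)  # parameter update
    return x, v, s
```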

With these defaults, at t = 1 the bias corrections cancel the (1 - \beta) factors, giving \begin{aligned} \hat{\mathbf{v}}_1 = \mathbf{g}_1, \quad \hat{\mathbf{s}}_1 = \mathbf{g}_1^2, \quad \mathbf{g}_1' = \frac{\eta\, \mathbf{g}_1}{|\mathbf{g}_1| + \epsilon} \approx \eta\, \mathrm{sign}(\mathbf{g}_1) \end{aligned} (elementwise, approximating for small \epsilon). So the first step has magnitude \eta in every coordinate: its direction is the sign of the gradient rather than the gradient itself, and the initial movement is not proportional to the gradient (and in general does not point along the steepest-descent direction).
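
A quick numeric check of the t = 1 case (defaults as above, an arbitrary example gradient):

```python
import numpy as np

g1 = np.array([2.0, 20.0])                  # an arbitrary first gradient
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

v_hat = (1 - beta1) * g1 / (1 - beta1)      # = g1 after bias correction
s_hat = (1 - beta2) * g1**2 / (1 - beta2)   # = g1**2 after bias correction
step = lr * v_hat / (np.sqrt(s_hat) + eps)  # ~ lr * sign(g1), elementwise

print(step)  # ~[0.1 0.1]: same magnitude in every coordinate, regardless of g1
```

So on the toy surface the very first Adam step already moves the same distance along every axis, whereas vanilla gradient descent moves further along the axes with larger gradient components.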
