Dissecting and understanding the Adam optimizer's update formula

The Adam optimizer has the following parameter update rule:

$$ \theta_{t+1} = \theta_{t} - \alpha \cdot \dfrac{m_t}{\sqrt{v_t} + \epsilon} $$

where $m_t$ is the first moment of the gradients and $v_t$ is the second moment of the gradients.
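For reference, here is a minimal NumPy sketch of how I read this update rule. The moment recurrences and the default values for $\alpha$, $\beta_1$, $\beta_2$, and $\epsilon$ are taken from the Kingma & Ba paper; the paper's bias-correction terms are omitted to match the formula above:

```python
import numpy as np

def adam_step(theta, grad, m, v,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, without the paper's bias correction."""
    # First moment: exponential moving average of the gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of the squared gradients.
    v = beta2 * v + (1 - beta2) * grad**2
    # Update: the step size is scaled per-parameter by 1/sqrt(v).
    theta = theta - alpha * m / (np.sqrt(v) + eps)
    return theta, m, v
```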

I have the following questions regarding the above formula:

  • What exactly are the first and second moments of the gradients? What is the intuition behind the formulas for the first and second moments?

  • I understand SGD with momentum and SGD with RMSprop, but here we are making use of both. Again, I don't understand the intuition behind dividing the first moment by the square root of the second moment (see the toy example after this list).
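To make my second question concrete, here is a toy run that tracks both moments for a single parameter. The gradient sequence is made up purely for illustration, and the $\beta$ values are the paper's defaults:

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
m, v = 0.0, 0.0

# A made-up gradient sequence: consistent sign, noisy magnitude.
for t, g in enumerate([1.0, 0.8, 1.2, 0.9, 1.1], start=1):
    m = beta1 * m + (1 - beta1) * g      # running mean of gradients
    v = beta2 * v + (1 - beta2) * g**2   # running mean of squared gradients
    print(f"t={t}: m={m:.4f}, v={v:.6f}, m/sqrt(v)={m / (np.sqrt(v) + eps):.4f}")
```

It is this final ratio $m_t / \sqrt{v_t}$ whose meaning I am trying to understand.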

I searched online and read various articles before coming here, but none of them provided helpful intuition. I also tried reading the original paper, but I found it hard to comprehend.

Tags: momentum, gradient-descent, optimization
