Dissecting and understanding the Adam optimizer's update formula
The Adam optimizer has the following parameter update rule:
$$\theta_{t+1} = \theta_t - \alpha \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}$$

where $m_t$ is the first moment of the gradients and $v_t$ is the second moment of the gradients.
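For completeness, the moment estimates themselves are defined in the paper as exponential moving averages of the gradient and of the squared gradient:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

where $g_t$ is the gradient at step $t$ and $\beta_1, \beta_2$ are decay rates (0.9 and 0.999 by default).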
I have the following questions with regards to the above formula:
What exactly are the first and second moments of the gradients? What is the intuition behind the formulas for the first and second moments?
I understand SGD with momentum and RMSprop individually, but here Adam makes use of both ideas. Again, I don't understand the intuition behind dividing the first moment by the square root of the second moment. My current understanding is sketched in the code below.
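To make the question concrete, here is my rough understanding of a single Adam step as a minimal NumPy sketch (the function and variable names are mine; the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ follow the paper):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; defaults follow Kingma & Ba (2015)."""
    # First moment: exponential moving average of gradients (momentum-like)
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients (RMSprop-like)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero initialization of m and v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Update: first moment divided by square root of second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: starting from m = v = 0 and t = 1, call adam_step once per gradient step.
```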
I searched online and read various articles before coming here, but none of them provided the intuition I was looking for. I also tried reading the original paper, but found it hard to comprehend.
Topic momentum gradient-descent optimization
Category Data Science