Dissecting and understanding the Adam optimizer's update formula
The Adam optimizer has the following parameter update rule:
$$\theta_{t+1} = \theta_t - \alpha \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}$$

where $m_t$ is the first moment of the gradients and $v_t$ is the second moment of the gradients.
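For completeness, the moment estimates themselves are defined in the paper as exponential moving averages of the gradient and of the squared gradient:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

where $g_t$ is the gradient at step $t$ and $\beta_1, \beta_2$ are decay rates (0.9 and 0.999 by default).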
I have the following questions with regards to the above formula:
What exactly are the first and second moments of the gradients? What is the intuition behind the formulas for the first and second moments?
I understand SGD with momentum and RMSprop individually, but here Adam makes use of both ideas. Again, I don't understand the intuition behind dividing the first moment by the square root of the second moment. My current understanding is sketched in the code below.
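To make the question concrete, here is my rough understanding of a single Adam step as a minimal NumPy sketch (the function and variable names are mine; the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ follow the paper):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; defaults follow Kingma & Ba (2015)."""
    # First moment: exponential moving average of gradients (momentum-like)
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients (RMSprop-like)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero initialization of m and v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Update: first moment divided by square root of second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: starting from m = v = 0 and t = 1, call adam_step once per gradient step.
```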
I searched online and read various articles before coming here, but none of them provided the intuition I was looking for. I also tried reading the original paper, but found it hard to comprehend.
Topic momentum gradient-descent optimization
Category Data Science