Maximum Entropy Policy Gradient Derivation
I am reading the paper "Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review" by Sergey Levine, and I am having difficulty understanding part of the derivation of the Maximum Entropy Policy Gradient (Section 4.1).
Note that in the derivation above, the term $\mathcal{H}\big(q_\theta(a_t \mid s_t)\big)$ should have been $\log q_\theta(a_t \mid s_t)$, where $\log$ denotes the natural logarithm. Likewise, in the first line of the gradient, the term should have been $r(s_t, a_t) - \log q_\theta(a_t \mid s_t)$.
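To make that note concrete, here is the maximum entropy objective as I read it (my own transcription; the paper's notation may differ slightly):

$$
J(\theta) \;=\; \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim q(s_t, a_t)}\!\Big[\, r(s_t, a_t) + \mathcal{H}\big(q_\theta(\,\cdot \mid s_t)\big) \Big].
$$

Since $\mathcal{H}\big(q_\theta(\,\cdot \mid s_t)\big) = \mathbb{E}_{a_t \sim q_\theta(\,\cdot \mid s_t)}\big[-\log q_\theta(a_t \mid s_t)\big]$, the quantity inside the expectation can be written as $r(s_t, a_t) - \log q_\theta(a_t \mid s_t)$, which is the corrected form mentioned above.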
In particular, I cannot understand how the second summation term, running from $t' = t$ to $t' = T$, arises in the second line of the derivation. I managed to derive the formula by expanding the definition of the expectation, but the result I obtained does not contain this second summation. Can anyone give me an idea of where this second summation term comes from mathematically?
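For completeness, the two gradient lines I am asking about look roughly as follows (again my own transcription from the paper; the exact placement of constants such as the $-1$ may not match the paper, and I believe such terms can be absorbed into a baseline):

$$
\begin{aligned}
\nabla_\theta J(\theta)
&= \sum_{t=1}^{T} \nabla_\theta\, \mathbb{E}_{(s_t, a_t) \sim q(s_t, a_t)}\big[\, r(s_t, a_t) - \log q_\theta(a_t \mid s_t) \big] \\
&= \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim q(s_t, a_t)}\!\left[ \nabla_\theta \log q_\theta(a_t \mid s_t) \left( \sum_{t'=t}^{T} \mathbb{E}_{(s_{t'}, a_{t'}) \sim q(s_{t'}, a_{t'})}\big[\, r(s_{t'}, a_{t'}) - \log q_\theta(a_{t'} \mid s_{t'}) \big] - 1 \right) \right].
\end{aligned}
$$

It is the inner sum over $t'$ in the second line that I cannot reproduce when I expand the expectation myself.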