Maximum Entropy Policy Gradient Derivation

I am reading through the paper "Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review" by Sergey Levine, and I am having difficulty understanding part of the derivation of Maximum Entropy Policy Gradients (Section 4.1).

Note that in the above derivation, the term $\mathcal{H}(q_\theta(a_t|s_t))$ should have been $\log q_\theta(a_t|s_t)$, where $\log$ denotes the natural logarithm. In the first line of the gradient, it should have been $r(s_t,a_t) - \log q_\theta(a_t|s_t)$.

In particular, I cannot understand how the second summation from $t'=t$ to $t'=T$ in the second line arises in the derivation. I managed to derive the formula by expanding the definition of expectation, but the result I got does not have the second summation term. Can anyone give me an idea of where this second summation comes from mathematically?



Writing the objective in a different way may be helpful.

If we focus only on the reward part and ignore the entropy part (since the question is about the second summation), the original objective is $$ J(\theta)=\sum_{t=1}^T\mathbb{E}_{(s_t,a_t)\sim q(s_t,a_t)}[r(s_t,a_t)] $$

If we look at it from the perspective of the full trajectory $\tau$, just like the ELBO in Eq. (19) of the paper, the objective is $$ J(\theta)=\mathbb{E}_{\tau\sim q_\theta(\tau)}\left[\sum_{t=1}^Tr(s_t,a_t)\right] $$
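The two forms agree by linearity of expectation, since the marginal of $(s_t,a_t)$ under the trajectory distribution $q_\theta(\tau)$ is $q(s_t,a_t)$: $$ \mathbb{E}_{\tau\sim q_\theta(\tau)}\left[\sum_{t=1}^T r(s_t,a_t)\right]=\sum_{t=1}^T\mathbb{E}_{(s_t,a_t)\sim q(s_t,a_t)}\left[r(s_t,a_t)\right] $$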

Then $$ \begin{aligned} \nabla_\theta J(\theta)&=\nabla_\theta\sum_\tau q_\theta(\tau)\sum_{t=1}^T r(s_t,a_t)\\ &=\sum_\tau \nabla_\theta q_\theta(\tau)\sum_{t=1}^T r(s_t,a_t)\\ &=\sum_\tau q_\theta(\tau)\nabla_\theta \log q_\theta(\tau)\sum_{t=1}^T r(s_t,a_t)\\ &=\mathbb{E}_{\tau\sim q_\theta(\tau)}\left[\nabla_\theta \log q_\theta(\tau)\sum_{t=1}^Tr(s_t,a_t)\right] \end{aligned} $$
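The step from the second line to the third is the log-derivative (likelihood-ratio) trick, which is just the chain rule $\nabla_\theta \log q_\theta(\tau)=\nabla_\theta q_\theta(\tau)/q_\theta(\tau)$ rearranged: $$ \nabla_\theta q_\theta(\tau)=q_\theta(\tau)\,\nabla_\theta \log q_\theta(\tau) $$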

Since $q_\theta(\tau)=q(s_1)\prod_{t=1}^T q(s_{t+1}|s_t,a_t)\, q_\theta(a_t|s_t)$, $$ \nabla_\theta \log q_\theta(\tau)=\sum_{t=1}^T \nabla_\theta \log q_\theta(a_t|s_t) $$
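Spelling this out, only the policy factors depend on $\theta$, so the initial-state and dynamics terms vanish under the gradient: $$ \nabla_\theta\log q_\theta(\tau)=\nabla_\theta\left[\log q(s_1)+\sum_{t=1}^T\log q(s_{t+1}|s_t,a_t)+\sum_{t=1}^T\log q_\theta(a_t|s_t)\right]=\sum_{t=1}^T\nabla_\theta\log q_\theta(a_t|s_t) $$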

Then $$ \nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim q_\theta(\tau)}\left[\sum_{t=1}^T \nabla_\theta \log q_\theta(a_t|s_t) \sum_{t'=1}^Tr(s_{t'},a_{t'})\right] $$ (the index of the second sum is renamed to $t'$ to keep it distinct from the first).
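As a concrete illustration, here is a minimal Monte Carlo sketch of this estimator, assuming a tabular softmax policy and trajectories stored as lists of $(s_t,a_t,r_t)$ tuples. The helper names (`policy`, `grad_log_policy`, `gradient_estimate`) are assumptions for the sketch, not from the paper, and the entropy term is still omitted:

```python
# Minimal sketch: Monte Carlo estimate of the policy gradient above,
# assuming a tabular softmax policy q_theta(a|s).
import numpy as np

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # policy parameters

def policy(s):
    """Softmax policy q_theta(.|s) over actions for state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_policy(s, a):
    """grad_theta log q_theta(a|s) for the tabular softmax parameterization."""
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

def gradient_estimate(trajectories):
    """Each trajectory is a list of (s_t, a_t, r_t) tuples.
    Every score term grad log q_theta(a_t|s_t) is multiplied by the
    *full* return sum_t r(s_t, a_t) -- the product of the two sums."""
    grad = np.zeros_like(theta)
    for traj in trajectories:
        total_return = sum(r for _, _, r in traj)
        for s, a, _ in traj:
            grad += grad_log_policy(s, a) * total_return
    return grad / len(trajectories)
```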

That is where the second summation comes from.

Since the policy at time $t'$ cannot affect the reward at time $t$ when $t<t'$ (causality), $$ \begin{aligned} \nabla_\theta J(\theta)&=\mathbb{E}_{\tau\sim q_\theta(\tau)}\left[\sum_{t=1}^T \nabla_\theta \log q_\theta(a_t|s_t) \sum_{t'=t}^Tr(s_{t'},a_{t'})\right]\\ &=\sum_{t=1}^T \mathbb{E}_{(s_t,a_t)\sim q(s_t,a_t)}\left[\nabla_\theta \log q_\theta(a_t|s_t) \sum_{t'=t}^Tr(s_{t'},a_{t'})\right] \end{aligned} $$
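Continuing the same sketch (reusing `theta` and `grad_log_policy` from above), the causality argument only changes the weight on each score term from the full return to the reward-to-go:

```python
def gradient_estimate_reward_to_go(trajectories):
    """Same estimator, but the score at time t is weighted only by the
    rewards from t onwards: sum_{t'=t}^T r(s_t', a_t')."""
    grad = np.zeros_like(theta)
    for traj in trajectories:
        rewards = np.array([r for _, _, r in traj], dtype=float)
        # reward_to_go[t] = r_t + r_{t+1} + ... + r_T (reversed cumulative sum)
        reward_to_go = np.cumsum(rewards[::-1])[::-1]
        for t, (s, a, _) in enumerate(traj):
            grad += grad_log_policy(s, a) * reward_to_go[t]
    return grad / len(trajectories)
```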
