Reinforcement learning policy gradient derivation

Question

Reinforcement learning policy gradient derivation

endeavor

2021年12月30日 10:06

I was reading a document about Reinforcement Learning policy gradient http://web.stanford.edu/class/cs234/CS234Win2019/slides/lnotes8.pdf when I encountered this expression

$ \nabla_{\theta} \mathbb{E_{\pi_{\theta}}}[r_{t^{t}}] = \mathbb{E_{\pi_{\theta}}} \left[ r_{t^{'}} \sum_{t = 0}^{t^{'}} \nabla_{\theta} \log \pi_{\theta} (a_t|s_t) \right] $

which is on page 6 just below (11). The problem is I have no idea how is this expression derived. The document says that it can be derived the same way as (11) but I do not understand how. Any pointers or hints would be appreciated.

Topic reward policy-gradients mathematics reinforcement-learning neural-network

Category Data Science

Reinforcement learning policy gradient derivation

About