Can the proof that subtracting a baseline doesn't influence the gradient be used to show that no gradient exists at all?

I am using David Silver's course on RL to help me write my thesis. However, I am baffled by the proof given in lecture 7, slide 29: slideshow

\begin{align}
\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\,B(s)\right] &= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \sum_{a} \nabla_\theta \pi_\theta(s,a)\,B(s)\\
&= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s)\,B(s)\,\nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(s,a)\\
&= 0
\end{align}

Consider replacing $B(s)$ in this proof with the critic's value estimate $Q_w(s,a)$ (see the previous slide(s)). How does this proof not also show that the gradient of the objective function, $\nabla_\theta J(\theta)$, must be $0$? Does this have to do with the second summation changing from being over $a$ to being over $a \in \mathcal{A}$?

Thank you.

Tags: policy-gradients, actor-critic, reinforcement-learning



The crucial point here is that the baseline is state dependent, hence the notation $B(s)$. If you use the estimate $Q_w(s,a)$, you get a weight that depends on both the state and the action, essentially $B(s,a)$.

You have already figured out why the proof no longer works in that case.
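
To spell out the failing step in the slide's notation: the baseline can be pulled out of the inner sum only because it is constant in $a$,
\begin{align}
\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(s,a)\,B(s) = B(s)\,\nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(s,a) = B(s)\,\nabla_\theta 1 = 0,
\end{align}
whereas $\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(s,a)\,Q_w(s,a)$ admits no such factorization, because every term of the sum carries a different weight $Q_w(s,a)$. The argument therefore never reaches the $\nabla_\theta 1 = 0$ step, and $\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\,Q_w(s,a)\right]$ is in general non-zero.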


After thinking about this, I've realized that $Q_w(s,a)$ depends on the action and thus cannot be pulled out of the inner sum the way $B(s)$ can. I'm leaving this up for anyone interested in the same thing.
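
If it helps anyone, here is a small numerical sanity check of the same point, using a toy softmax policy; everything here (the features, theta, B, Q and the state distribution d) is made up for the example, it is just a sketch of the two expectations:

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, n_params = 4, 3, 5

    # Toy softmax policy pi_theta(a|s) over random state-action features,
    # and an arbitrary state distribution d(s).
    features = rng.normal(size=(n_states, n_actions, n_params))
    theta = rng.normal(size=n_params)
    d = rng.dirichlet(np.ones(n_states))

    def policy(s):
        logits = features[s] @ theta
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def grad_log_pi(s, a):
        # For a softmax policy: grad log pi(a|s) = phi(s,a) - sum_b pi(b|s) phi(s,b)
        p = policy(s)
        return features[s, a] - p @ features[s]

    # A state-only baseline B(s) and an action-dependent critic Q(s,a), both arbitrary.
    B = rng.normal(size=n_states)
    Q = rng.normal(size=(n_states, n_actions))

    def expectation(weight):
        """E_{s~d, a~pi}[ grad log pi(a|s) * weight(s,a) ]."""
        total = np.zeros(n_params)
        for s in range(n_states):
            p = policy(s)
            for a in range(n_actions):
                total += d[s] * p[a] * weight(s, a) * grad_log_pi(s, a)
        return total

    print("with B(s):   ", expectation(lambda s, a: B[s]))     # ~ zero vector
    print("with Q(s,a): ", expectation(lambda s, a: Q[s, a]))  # generally non-zero

The first expectation comes out as (numerically) zero for any choice of B, exactly as the slide's proof says, while the second is non-zero because the weight varies with the action.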
