Can the proof that subtracting a baseline doesn't influence the gradient be used to show that no gradient exists at all?
I am using David Silver's RL course to help me write my thesis. However, I am baffled by the proof given in Lecture 7, slide 29 (slideshow):
\begin{align}
\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\,B(s)\right] &= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \sum_{a} \nabla_\theta \pi_\theta(s,a)\,B(s)\\
&= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s)\,B(s)\,\nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(s,a)\\
&= 0
\end{align}
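If I follow the last step correctly, the sum vanishes because the action probabilities are normalized, so for every state

$$\nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(s,a) = \nabla_\theta 1 = 0.$$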
Now consider replacing $B(s)$ in this proof with the critic's action-value estimate $Q_w(s,a)$ (see the previous slides), so that the expectation becomes $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\,Q_w(s,a)]$. Why does the same argument not show that the gradient of the objective function, $\nabla_\theta J(\theta)$, is also $0$? Does this have to do with the second summation changing from being over $a$ to being over $a \in \mathcal{A}$?
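For what it's worth, here is a quick numerical sanity check I put together (a minimal sketch with a made-up random linear-softmax policy; the features `phi`, state distribution `d`, baseline `B`, and critic values `Q` are arbitrary stand-ins, not anything from the slides). It computes both expectations exactly by summing over states and actions: the baseline term comes out as $0$, while the $Q_w(s,a)$-weighted term does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_params = 3, 4, 5

# Hypothetical linear-softmax policy: pi_theta(a|s) ∝ exp(theta · phi(s,a))
phi = rng.normal(size=(n_states, n_actions, n_params))
theta = rng.normal(size=n_params)

def policy(theta):
    logits = phi @ theta                          # shape (S, A)
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def grad_log_pi(theta):
    # For linear softmax: grad log pi(a|s) = phi(s,a) - E_{a'~pi}[phi(s,a')]
    p = policy(theta)
    mean_feat = np.einsum('sa,sap->sp', p, phi)
    return phi - mean_feat[:, None, :]            # shape (S, A, P)

d = rng.dirichlet(np.ones(n_states))              # arbitrary state distribution
B = rng.normal(size=n_states)                     # action-independent baseline B(s)
Q = rng.normal(size=(n_states, n_actions))        # action-dependent Q(s,a)

p, g = policy(theta), grad_log_pi(theta)

# E[grad log pi * B(s)]   = sum_s d(s) sum_a pi(a|s) grad log pi(a|s) B(s)
e_baseline = np.einsum('s,sa,sap,s->p', d, p, g, B)
# E[grad log pi * Q(s,a)] = sum_s d(s) sum_a pi(a|s) grad log pi(a|s) Q(s,a)
e_q = np.einsum('s,sa,sap,sa->p', d, p, g, Q)

print(np.allclose(e_baseline, 0.0))  # True: the baseline term vanishes
print(np.allclose(e_q, 0.0))         # False: the Q-weighted term does not
```

So numerically the two expressions clearly behave differently, but I don't see which step of the proof above fails once $B(s)$ is replaced by $Q_w(s,a)$.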
Thank you.
Topics: policy-gradients, actor-critic, reinforcement-learning
Category: Data Science