Can the proof that subtracting a baseline doesn't influence the gradient be used to show that no gradient exists at all?

I am using David Silver's course on RL to help me write my thesis. However, I am baffled by the proof given in lecture 7, slide 29: slideshow

\begin{align}
\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\,B(s)\right] &= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \sum_{a} \nabla_\theta \pi_\theta(s,a)\,B(s)\\
&= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s)\,B(s)\,\nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(s,a)\\
&= 0
\end{align}

Consider replacing $B(s)$ in this proof with the critic's value estimate $Q_w(s,a)$ (see the previous slide(s)). How does this proof not also show that the gradient of the objective function, $\nabla_\theta J(\theta)$, must be $0$? Does this have to do with the second summation changing from being over $a$ to being over $a \in \mathcal{A}$?

Thank you.

Tags: policy-gradients, actor-critic, reinforcement-learning



The crucial point here is that the baseline is state dependent, hence the notation $B(s)$. If you use the estimate $Q_w(s,a)$, you get a weight that depends on both the state and the action, essentially $B(s,a)$.

You have already figured out why the proof no longer works in that case.
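
To spell out the failing step in the slide's notation: the baseline can be pulled out of the inner sum only because it is constant in $a$,
\begin{align}
\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(s,a)\,B(s) = B(s)\,\nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(s,a) = B(s)\,\nabla_\theta 1 = 0,
\end{align}
whereas $\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(s,a)\,Q_w(s,a)$ admits no such factorization, because every term of the sum carries a different weight $Q_w(s,a)$. The argument therefore never reaches the $\nabla_\theta 1 = 0$ step, and $\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\,Q_w(s,a)\right]$ is in general non-zero.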


After thinking about this, I've realized that $Q_w(s,a)$ depends on the action and thus cannot be pulled out of the inner sum the way $B(s)$ can. I'm leaving this up for anyone interested in the same thing.
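
If it helps anyone, here is a small numerical sanity check of the same point, using a toy softmax policy; everything here (the features, theta, B, Q and the state distribution d) is made up for the example, it is just a sketch of the two expectations:

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, n_params = 4, 3, 5

    # Toy softmax policy pi_theta(a|s) over random state-action features,
    # and an arbitrary state distribution d(s).
    features = rng.normal(size=(n_states, n_actions, n_params))
    theta = rng.normal(size=n_params)
    d = rng.dirichlet(np.ones(n_states))

    def policy(s):
        logits = features[s] @ theta
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def grad_log_pi(s, a):
        # For a softmax policy: grad log pi(a|s) = phi(s,a) - sum_b pi(b|s) phi(s,b)
        p = policy(s)
        return features[s, a] - p @ features[s]

    # A state-only baseline B(s) and an action-dependent critic Q(s,a), both arbitrary.
    B = rng.normal(size=n_states)
    Q = rng.normal(size=(n_states, n_actions))

    def expectation(weight):
        """E_{s~d, a~pi}[ grad log pi(a|s) * weight(s,a) ]."""
        total = np.zeros(n_params)
        for s in range(n_states):
            p = policy(s)
            for a in range(n_actions):
                total += d[s] * p[a] * weight(s, a) * grad_log_pi(s, a)
        return total

    print("with B(s):   ", expectation(lambda s, a: B[s]))     # ~ zero vector
    print("with Q(s,a): ", expectation(lambda s, a: Q[s, a]))  # generally non-zero

The first expectation comes out as (numerically) zero for any choice of B, exactly as the slide's proof says, while the second is non-zero because the weight varies with the action.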
