Why can't a Policy Gradient algorithm be seen as an Actor-Critic method?

During the derivation of a policy gradient algorithm (e.g., REINFORCE), we are actually working with the expectation of the total reward, which we try to maximize:

$$\overline{R_\theta}=E_{\tau\sim\pi_\theta}[R(\tau)]$$

Can't this be seen as an Actor-Critic method, since we are using $V(s)$ as a Critic to guide the update of the Actor $\pi$? (Here we have already introduced an approximation by sampling $N$ trajectories.) $$\nabla \overline{R_\theta} \approx \frac{1}{N}\sum_{n=1}^N R(\tau^{(n)}) \nabla \log p_\theta(\tau^{(n)})$$ If not, what is the precise definition of the Actor and the Critic in an Actor-Critic algorithm?
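For concreteness, here is a minimal sketch of that sample-based gradient estimate, assuming a linear-softmax policy over a discrete action space. The trajectory format `(states, actions, rewards)` and all function names are illustrative assumptions, not part of the original question.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) for a linear-softmax policy.

    theta has shape (n_actions, n_features); s is a feature vector.
    """
    probs = softmax(theta @ s)       # pi_theta(.|s)
    grad = -np.outer(probs, s)       # -pi_theta(a'|s) * s for every action a'
    grad[a] += s                     # + s for the action actually taken
    return grad

def reinforce_gradient(theta, trajectories):
    """Monte Carlo estimate: average of R(tau) * grad log p_theta(tau)."""
    total = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        R = sum(rewards)             # total return R(tau) of this trajectory
        for s, a in zip(states, actions):
            total += R * grad_log_pi(theta, s, a)
    return total / len(trajectories)
```

Note that the weight on each $\nabla \log \pi$ term is the sampled return $R(\tau)$, not a learned value estimate.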


In RL we have:

  • Actor-only methods such as REINFORCE, in which the output is a probability distribution over actions. REINFORCE is a policy gradient method but does not use a critic.
  • Critic-only methods such as Q-learning, in which the output is the expected return for every available action ($Q(s,a)\ \forall a \in A$).
  • Actor-Critic methods, which involve both an Actor and a Critic estimate. Examples are the popular DDPG and A3C algorithms; both are policy gradient methods. Reading those papers will give you a sense of why plain REINFORCE has high variance in its gradient estimates and how a critic can reduce it (see the sketch after this list).
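The following is a minimal sketch of a one-step actor-critic update with linear function approximation, to contrast with the REINFORCE estimate above. All names (`theta`, `w`, the step sizes) are illustrative assumptions, not taken from DDPG or A3C.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    return np.exp(z) / np.exp(z).sum()

def actor_critic_step(theta, w, s, a, r, s_next, done,
                      gamma=0.99, alpha_actor=1e-2, alpha_critic=1e-1):
    """One online update: the critic V_w(s) = w.s evaluates the actor pi_theta."""
    v = w @ s
    v_next = 0.0 if done else w @ s_next
    td_error = r + gamma * v_next - v          # critic's evaluation signal

    # Critic update: move V_w(s) toward the bootstrapped TD target.
    w = w + alpha_critic * td_error * s

    # Actor update: policy gradient step with the TD error
    # in place of the full Monte Carlo return R(tau).
    probs = softmax(theta @ s)
    grad_log_pi = -np.outer(probs, s)
    grad_log_pi[a] += s
    theta = theta + alpha_actor * td_error * grad_log_pi
    return theta, w
```

The key difference from REINFORCE is that the weight on $\nabla \log \pi$ comes from a learned value function rather than a sampled return, which lowers variance at the cost of some bias.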

Policy Gradient methods are based on the Policy Gradient theorem. A standard implementation is an Actor-Critic algorithm, which uses both an Actor (a probability distribution over actions) and a Critic (a value function) in order to trade off bias and variance in the gradient estimates.
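Written out, the policy gradient theorem gives

$$\nabla_\theta J(\theta) = E_{s,a\sim\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a\mid s)\, Q^{\pi_\theta}(s,a)\big]$$

REINFORCE plugs in the sampled return $R(\tau)$ for $Q^{\pi_\theta}(s,a)$ (unbiased, high variance), while an actor-critic method plugs in a learned estimate, e.g. the TD error $\delta = r + \gamma V_w(s') - V_w(s)$, which lowers variance but introduces bias whenever $V_w$ is imperfect.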
