Reinforcement Learning - PPO: Why do so many implementations calculate the returns using the GAE? (Mathematical reason)
There are so many PPO implementations that use GAE and do the following:
def compute_gae(next_value, rewards, masks, values, gamma=0.99, tau=0.95):
    values = values + [next_value]
    gae = 0
    returns = []
    for step in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1} (tau plays the role of lambda)
        gae = delta + gamma * tau * masks[step] * gae
        # the "returns" are the GAE advantage plus the value estimate
        returns.insert(0, gae + values[step])
    return returns
...
# advantage used in the PPO policy loss
advantage = returns - values
...
# the same "returns" also serve as the regression target for the value network
critic_loss = (returns - value).pow(2).mean()
Source: https://github.com/higgsfield/RL-Adventure-2/blob/master/3.ppo.ipynb; I made only slight changes for better readability.
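For comparison, this is what I would have expected as the target for the value network: plain discounted returns, bootstrapped with next_value at the rollout cut-off. This is only my own sketch for illustration (the name compute_discounted_returns is mine, not from the repository):

def compute_discounted_returns(next_value, rewards, masks, gamma=0.99):
    # plain discounted (Monte Carlo) returns, bootstrapped with V(s_T) at the cut-off
    R = next_value
    returns = []
    for step in reversed(range(len(rewards))):
        # R_t = r_t + gamma * R_{t+1}; masks zero out the bootstrap across episode ends
        R = rewards[step] + gamma * masks[step] * R
        returns.insert(0, R)
    return returns

Unless I am mistaken, compute_gae with tau=1 reproduces exactly these values, so my question is essentially why tau < 1 is also used for the value targets.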
- I understand why we use this function to calculate the advantage. But why do we use the same formula to also calculate the target values for the value network? The value network is supposed to output the value function, so wouldn't it be sufficient to use plain discounted rewards (as in the sketch above) as the target? In particular, why do we need $\lambda$ here? Shouldn't the target values be independent of $\lambda$?
- In the code they call it returns. What would be the mathematical symbol for it? I mean it is probably not $R_t$, right?
- Are the returns we are calculating here the same as:
$$\hat{V}_t^{GAE} = (1 - \lambda) \sum\limits_{N > 0} \lambda^{N - 1} \hat{V}_t^{(N)} \approx V^{\pi}(s_t)$$
(Source: the paper "What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study".) If that is the case, then we indeed have a different target than the usual one, right? Would it also work with just discounted rewards? And what is the reason why people use this calculation instead? (See my attempted expansion below.)
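To show my own attempt: if I unroll the loop in compute_gae (ignoring the masks / episode boundaries for simplicity), I get, unless I made a mistake,

$$\text{returns}_t = \hat{A}_t^{GAE} + V(s_t) = V(s_t) + \sum\limits_{l=0}^{T-t-1} (\gamma \lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$

which looks like the (truncated) $\lambda$-return, often written $G_t^{\lambda}$ or $\hat{V}_t^{\lambda}$.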
So my basic question is: what is the (deep) mathematical reason for calculating the target values for the value network in this way?
Thank you very much in advance for your help!