Reinforcement Learning - PPO: Why do so many implementations calculate the returns using the GAE? (Mathematical reason)
There are so many PPO implementations that use GAE and do the following:
def compute_gae(next_value, rewards, masks, values, gamma=0.99, tau=0.95):
    values = values + [next_value]
    gae = 0
    returns = []
    for step in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1} (tau plays the role of lambda)
        gae = delta + gamma * tau * masks[step] * gae
        # the "returns" are the GAE advantage plus the value estimate
        returns.insert(0, gae + values[step])
    return returns
...
# advantage used in the PPO policy loss
advantage = returns - values
...
# the same "returns" also serve as the regression target for the value network
critic_loss = (returns - value).pow(2).mean()
Source: https://github.com/higgsfield/RL-Adventure-2/blob/master/3.ppo.ipynb; I made only slight changes for better readability.
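For comparison, this is what I would have expected as the target for the value network: plain discounted returns, bootstrapped with next_value at the rollout cut-off. This is only my own sketch for illustration (the name compute_discounted_returns is mine, not from the repository):

def compute_discounted_returns(next_value, rewards, masks, gamma=0.99):
    # plain discounted (Monte Carlo) returns, bootstrapped with V(s_T) at the cut-off
    R = next_value
    returns = []
    for step in reversed(range(len(rewards))):
        # R_t = r_t + gamma * R_{t+1}; masks zero out the bootstrap across episode ends
        R = rewards[step] + gamma * masks[step] * R
        returns.insert(0, R)
    return returns

Unless I am mistaken, compute_gae with tau=1 reproduces exactly these values, so my question is essentially why tau < 1 is also used for the value targets.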
- I understand why we use this function to calculate the advantage. But why do we use the same formula to also calculate the target values for the value network? The value network is supposed to output the value function, so wouldn't it be sufficient to use plain discounted rewards (as in the sketch above) as the target? In particular, why do we need $\lambda$ here? Shouldn't the target values be independent of $\lambda$?
- In the code they call it returns. What would be the mathematical symbol for it? I mean it is probably not $R_t$, right?
- Are the returns we are calculating here the same as:
$$\hat{V}_t^{GAE} = (1 - \lambda) \sum\limits_{N > 0} \lambda^{N - 1} \hat{V}_t^{(N)} \approx V^{\pi}(s_t)$$
(Source: the paper "What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study".) If that is the case, then we indeed have a different target than the usual one, right? Would it also work with just discounted rewards? And what is the reason why people use this calculation instead? (See my attempted expansion below.)
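To show my own attempt: if I unroll the loop in compute_gae (ignoring the masks / episode boundaries for simplicity), I get, unless I made a mistake,

$$\text{returns}_t = \hat{A}_t^{GAE} + V(s_t) = V(s_t) + \sum\limits_{l=0}^{T-t-1} (\gamma \lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$

which looks like the (truncated) $\lambda$-return, often written $G_t^{\lambda}$ or $\hat{V}_t^{\lambda}$.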
So my basic question is: what is the (deep) mathematical reason for calculating the target values for the value network in this way?
Thank you very much in advance for your help!