Identity between the TD(0) algorithm and policy evaluation in dynamic programming when alpha is equal to 1

The TD(0) algorithm is defined by the following iterative update:

$$ V(s) \leftarrow V(s) + \alpha \left( r + \gamma V(s') - V(s) \right) $$

Now, if we assume $\alpha$ to be equal to 1, we get the traditional policy evaluation formula from dynamic programming. Is that correct?
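For concreteness, substituting $\alpha = 1$ into the update above collapses it to

$$ V(s) \leftarrow r + \gamma V(s') $$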

Topic dynamic-programming reinforcement-learning

Category Data Science


No. Dynamic programming estimates the value of a state with a full-width backup: it averages the target $r + \gamma V(s')$ over *all* possible next states, weighted by their transition probabilities under the policy, i.e. $V(s) \leftarrow \mathbb{E}_\pi\left[\, r + \gamma V(s') \mid s \,\right]$. TD(0) estimates it with a sample backup: it updates from a *single* observed next state. So even with $\alpha = 1$ the two updates coincide only when the transitions are deterministic.
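As a minimal sketch of that difference (the two-state MDP, its transition probabilities, and rewards below are made-up assumptions for illustration), the two kinds of backup might look like:

```python
import random

# Hypothetical tabular MDP under a fixed policy.
# P[s] lists (probability, reward, next_state) for each possible transition.
P = {
    "A": [(0.7, 1.0, "B"), (0.3, 0.0, "A")],
    "B": [(1.0, 2.0, "A")],
}
gamma = 0.9
V = {"A": 0.0, "B": 0.0}

def dp_backup(s):
    # Dynamic programming policy evaluation: a full-width backup that
    # averages the target r + gamma * V(s') over ALL possible next states.
    return sum(p * (r + gamma * V[s2]) for p, r, s2 in P[s])

def td0_update(s, alpha=1.0):
    # TD(0): sample ONE transition and move V(s) a fraction alpha of the
    # way toward the single sampled target r + gamma * V(s').
    probs = [p for p, _, _ in P[s]]
    _, r, s2 = random.choices(P[s], weights=probs, k=1)[0]
    V[s] += alpha * (r + gamma * V[s2] - V[s])

# Even with alpha = 1, TD(0) replaces V(s) by r + gamma * V(s') for one
# sampled s', whereas the DP backup averages over every reachable s'.
V["A"] = dp_backup("A")
td0_update("B", alpha=1.0)
```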


$\alpha$ is independent of the type of RL algorithm. It is the learning rate, i.e. the rate at which you update a state value. You can set it to 1 or to something smaller.

Policy evaluation is a 'general principle'; temporal difference is one way to make it work. More precisely, TD determines how far into the future you look when accounting for the consequences of an action, while in your equation $\gamma$ determines how heavily that future is weighted.
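A tiny illustration of the role of $\alpha$, independent of the backup type (the numbers are hypothetical):

```python
def td0_step(v_s, target, alpha):
    # Move the current estimate a fraction alpha of the way toward the
    # bootstrapped target r + gamma * V(s').
    return v_s + alpha * (target - v_s)

v, target = 0.0, 10.0                   # target stands in for r + gamma * V(s')
print(td0_step(v, target, alpha=1.0))   # 10.0 -> estimate jumps fully to the target
print(td0_step(v, target, alpha=0.1))   # 1.0  -> estimate moves only 10% of the way
```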
