Identity between the TD(0) algorithm and policy evaluation in dynamic programming when alpha is equal to 1

The TD(0) algorithm is defined by the following iterative update:

$$ V(s) \leftarrow V(s) + \alpha \left( r + \gamma V(s') - V(s) \right) $$

Now, if we assume $\alpha$ to be equal to 1, we get the traditional policy evaluation formula from dynamic programming. Is that correct?
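For concreteness, substituting $\alpha = 1$ into the update above collapses it to

$$ V(s) \leftarrow r + \gamma V(s') $$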

Topic dynamic-programming reinforcement-learning

Category Data Science


No. Dynamic programming estimates the value of a state with a full-width backup: it averages the target $r + \gamma V(s')$ over *all* possible next states, weighted by their transition probabilities under the policy, i.e. $V(s) \leftarrow \mathbb{E}_\pi\left[\, r + \gamma V(s') \mid s \,\right]$. TD(0) estimates it with a sample backup: it updates from a *single* observed next state. So even with $\alpha = 1$ the two updates coincide only when the transitions are deterministic.
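As a minimal sketch of that difference (the two-state MDP, its transition probabilities, and rewards below are made-up assumptions for illustration), the two kinds of backup might look like:

```python
import random

# Hypothetical tabular MDP under a fixed policy.
# P[s] lists (probability, reward, next_state) for each possible transition.
P = {
    "A": [(0.7, 1.0, "B"), (0.3, 0.0, "A")],
    "B": [(1.0, 2.0, "A")],
}
gamma = 0.9
V = {"A": 0.0, "B": 0.0}

def dp_backup(s):
    # Dynamic programming policy evaluation: a full-width backup that
    # averages the target r + gamma * V(s') over ALL possible next states.
    return sum(p * (r + gamma * V[s2]) for p, r, s2 in P[s])

def td0_update(s, alpha=1.0):
    # TD(0): sample ONE transition and move V(s) a fraction alpha of the
    # way toward the single sampled target r + gamma * V(s').
    probs = [p for p, _, _ in P[s]]
    _, r, s2 = random.choices(P[s], weights=probs, k=1)[0]
    V[s] += alpha * (r + gamma * V[s2] - V[s])

# Even with alpha = 1, TD(0) replaces V(s) by r + gamma * V(s') for one
# sampled s', whereas the DP backup averages over every reachable s'.
V["A"] = dp_backup("A")
td0_update("B", alpha=1.0)
```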


$\alpha$ is independent of the type of RL algorithm. It is the learning rate, i.e. the rate at which you update a state value. You can set it to 1 or to something smaller.

Policy evaluation is a 'general principle'; temporal difference is one way to make it work. More precisely, TD determines how far into the future you look when accounting for the consequences of an action, while in your equation $\gamma$ determines how heavily that future is weighted.
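A tiny illustration of the role of $\alpha$, independent of the backup type (the numbers are hypothetical):

```python
def td0_step(v_s, target, alpha):
    # Move the current estimate a fraction alpha of the way toward the
    # bootstrapped target r + gamma * V(s').
    return v_s + alpha * (target - v_s)

v, target = 0.0, 10.0                   # target stands in for r + gamma * V(s')
print(td0_step(v, target, alpha=1.0))   # 10.0 -> estimate jumps fully to the target
print(td0_step(v, target, alpha=0.1))   # 1.0  -> estimate moves only 10% of the way
```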
