Reinforcement Learning: Why does acting greedily with the optimal value function give you the optimal policy?

David Silver's Reinforcement Learning course explains how to obtain the optimal policy from the optimal value function.

It seems very simple: you just have to act greedily, maximizing the value function at each step.

In the case of a small grid world, once you have applied the Policy Evaluation algorithm, you get, for example, the following matrix for the value function:

You start from the top-left corner, and the only available actions are the four classic moves (up, down, left, right).

If we act greedily, maximizing the value function at each step, we will go right, then down, right, up... never reaching the state with the best value (20).

So the question is simple: why don't we have to guess what the future rewards will be in order to get the optimal policy from the optimal value function? In other words, why is a one-step look-ahead enough to get the optimal policy?

A similar problem already appears with the rewards, and we solved that issue by weighting a sum of future expected rewards... Why don't we need to do the same here?
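
To be precise, by "weighting a sum of future expectations" I mean the discounted return and the value function defined from it, as in the course:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad V^{\pi}(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] $$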

Could you help me understand what's wrong with my interpretation?

Thanks a lot for your help!

Topic policy-gradients reinforcement-learning evaluation optimization

Category Data Science


It seems your value function is not optimal. After evaluating the value function, the policy must be improved in accordance with it, then the value function must be evaluated again, then the policy improved again, and so on, until the value function converges to the optimal one. Only then can the optimal policy be obtained greedily from the optimal value function.
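
To make this concrete, here is a minimal policy-iteration sketch in Python. It is not your grid, just a hypothetical 4x4 gridworld (reward -1 per step, a terminal goal in the bottom-right corner, gamma = 0.9). Evaluation and greedy improvement alternate until the policy stops changing, at which point V is V* and acting greedily on it is optimal:

```python
import numpy as np

# Hypothetical 4x4 gridworld: deterministic moves, reward -1 per step,
# a single terminal goal state in the bottom-right corner.
N, GAMMA = 4, 0.9
GOAL = (N - 1, N - 1)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic transition; bumping into a wall leaves the state unchanged."""
    if state == GOAL:                      # terminal: no further reward
        return state, 0.0
    r, c = state
    nr = min(max(r + action[0], 0), N - 1)
    nc = min(max(c + action[1], 0), N - 1)
    return (nr, nc), -1.0

def policy_evaluation(policy, V, theta=1e-6):
    """Iteratively compute V^pi for the current deterministic policy."""
    while True:
        delta = 0.0
        for r in range(N):
            for c in range(N):
                s = (r, c)
                if s == GOAL:
                    continue
                s2, reward = step(s, ACTIONS[policy[s]])
                v_new = reward + GAMMA * V[s2]
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
        if delta < theta:
            return V

def policy_improvement(V):
    """Greedy one-step look-ahead with respect to the current V."""
    return {
        (r, c): int(np.argmax([reward + GAMMA * V[s2]
                               for s2, reward in (step((r, c), a) for a in ACTIONS)]))
        for r in range(N) for c in range(N)
    }

# Policy iteration: evaluate, improve, repeat until the policy is stable.
V = np.zeros((N, N))
policy = {(r, c): 0 for r in range(N) for c in range(N)}
while True:
    V = policy_evaluation(policy, V)
    new_policy = policy_improvement(V)
    if new_policy == policy:
        break
    policy = new_policy

print(np.round(V, 1))  # this V is V*; acting greedily on it is optimal
```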


You gave an example of a value function, but not the optimal value function, which would satisfy:

$$ V^{*}(s) = \max_\pi V^{\pi}(s) $$

Such a value function allows a one-step look-ahead to yield the optimal policy, because the optimal value function is defined recursively and already takes the values of neighbouring states into account.
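
Concretely (this is the standard Bellman optimality form, not something specific to your grid), $V^{*}$ satisfies

$$ V^{*}(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^{*}(s') \right] $$

and the greedy policy simply picks the action attaining that maximum:

$$ \pi^{*}(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^{*}(s') \right] $$

Since $V^{*}(s')$ already summarizes all rewards obtainable after reaching $s'$, the one-step look-ahead loses nothing.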

In other words, the problem with your value function is that it is not optimal, notably because of the abrupt jump from 0 to 20. If it were optimal, the states next to the 20-valued state would have values close to 20, and the one-step look-ahead would follow that path.
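
As a quick illustration (again a hypothetical grid, not yours: a 4x4 grid with a +20 terminal state in one corner, no step cost, gamma = 0.9), a few sweeps of value iteration show how the optimal values rise smoothly toward the high-value state, so the greedy one-step look-ahead walks straight to it:

```python
import numpy as np

# Hypothetical 4x4 grid: reaching the terminal bottom-right state yields +20,
# every other move yields 0, moves are deterministic, gamma = 0.9.
N, GAMMA = 4, 0.9
GOAL = (N - 1, N - 1)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

V = np.zeros((N, N))
for _ in range(100):                 # value iteration: repeated Bellman backups
    for r in range(N):
        for c in range(N):
            if (r, c) == GOAL:       # terminal state keeps value 0
                continue
            candidates = []
            for dr, dc in ACTIONS:
                nr = min(max(r + dr, 0), N - 1)
                nc = min(max(c + dc, 0), N - 1)
                reward = 20.0 if (nr, nc) == GOAL else 0.0
                candidates.append(reward + GAMMA * V[nr, nc])
            V[r, c] = max(candidates)

print(np.round(V, 2))
# The values rise smoothly toward the goal (..., 16.2, 18.0, then 20.0 right
# next to it), so the greedy one-step look-ahead always points along an
# optimal path.
```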
