Reinforcement Learning: Why does acting greedily with the optimal value function give you the optimal policy?
David Silver's course on Reinforcement Learning explains how to get the optimal policy from the optimal value function.
It seems very simple: you just have to act greedily, maximizing the value function at each step.
In the case of a small grid world, once you have applied the Policy Evaluation algorithm, you get, for example, the following matrix for the value function:
You start from the top-left corner, and the only available actions are the four standard moves (up, down, left, right).
If we follow a greedy policy, maximizing the value function at each step, we go right, then down, right, up, ..., never reaching the state with the best value (20).
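To make my reading of "acting greedily" concrete, here is a small sketch of the rule I have in mind (the function name and coordinate conventions are my own, not from the course; `V` is any grid of state values):

```python
import numpy as np

def greedy_on_values_only(V, state):
    """The rule I have in mind: from `state`, move to whichever neighbouring
    cell has the highest value V, ignoring immediate rewards and the discount."""
    rows, cols = V.shape
    r, c = state
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    # Keep only the moves that stay inside the grid
    candidates = {a: (r + dr, c + dc) for a, (dr, dc) in moves.items()
                  if 0 <= r + dr < rows and 0 <= c + dc < cols}
    # Pick the neighbour with the highest value
    return max(candidates, key=lambda a: V[candidates[a]])

# e.g. greedy_on_values_only(V, (0, 0)) returns the chosen move from the top-left corner
```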
So the question is simple: why don't we have to anticipate future rewards to get the optimal policy from the optimal value function? In other words, why is a one-step look-ahead enough to get the optimal policy?
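For reference, this is how I understand the one-step look-ahead with a known model (the array layout `P[s, a, s']`, `R[s, a]` and the function name are my own assumptions, not from the course):

```python
import numpy as np

def greedy_policy_one_step_lookahead(V, P, R, gamma):
    """One-step look-ahead: for each state, pick the action maximising the
    expected immediate reward plus the discounted value of the successor state.

    P[s, a, s'] : transition probabilities (assumed known)
    R[s, a]     : expected immediate reward (assumed known)
    V[s]        : state-value function
    """
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        # Action values from a single backup of V
        q = np.array([R[s, a] + gamma * P[s, a] @ V for a in range(n_actions)])
        policy[s] = np.argmax(q)
    return policy
```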
A similar problem already appears with the rewards, and we solve it by taking a weighted (discounted) sum of future expected rewards... Why don't we need to do the same here?
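By "weighted sum of future expectations" I mean something like the discounted return behind the value function:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad v_\pi(s) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right]. $$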
Could you help me understand what's wrong with my interpretation?
Thanks a lot for your help!
Topic policy-gradients reinforcement-learning evaluation optimization
Category Data Science