Why not use max(returns) instead of average(returns) in off-policy Monte Carlo control?
As I understand it, off-policy Monte Carlo control in reinforcement learning estimates the state-action value function $Q(s,a)$ as a weighted average of the observed returns.
However, in Q-learning the value of $Q(s, a)$ is estimated as the maximum expected return.
Why is this not used in Monte Carlo control?
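To be explicit about what I am comparing, here is how I understand the two estimates. The notation is my own assumption: $G_i$ is the return observed after the $i$-th visit to $(s,a)$, $W_i$ is the corresponding importance-sampling weight, $\alpha$ is a step size and $\gamma$ a discount factor. I may be misstating either rule.

$$Q(s,a) \approx \frac{\sum_i W_i \, G_i}{\sum_i W_i} \qquad \text{(off-policy Monte Carlo, weighted average)}$$

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \qquad \text{(Q-learning)}$$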
Suppose I have a simple 2-dimensional bridge game, where the objective is to get from a to b. I can move left, right, up or down. Let's say a reward of +1 is given for reaching b and -1 otherwise.
Suppose the agent reaches the final x (using an epsilon-greedy policy, for instance) and then goes up, getting a reward of -1. In the next episode, it reaches the final x again and goes right, getting a reward of +1.
Why wouldn't I update all the steps leading to 'b' to +1, as I would do in Q-learning?
I'm assuming the environment is deterministic in this case, so I don't overestimate $Q(s,a)$ based on an unlikely event.
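To make the two episodes concrete, here is a minimal sketch of the numbers I have in mind for one of the earlier steps that both episodes pass through. It rests on assumptions that are mine and may not match a real implementation: the return of each episode is just its final reward (no discounting, no intermediate rewards), and both importance-sampling weights are 1. The helper `weighted_average` is hypothetical.

```python
def weighted_average(returns, weights):
    """Weighted average of observed returns (hypothetical helper)."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # no usable data yet
    return sum(w * g for w, g in zip(weights, returns)) / total_weight


# Returns seen for one shared earlier step (s, a) over the two episodes:
# episode 1 ended by going up (-1), episode 2 ended by going right (+1).
returns = [-1.0, 1.0]

# Assumption: both importance-sampling weights equal 1.
weights = [1.0, 1.0]

avg_estimate = weighted_average(returns, weights)  # what Monte Carlo averaging gives
max_estimate = max(returns)                        # what I am asking about

print("weighted average:", avg_estimate)
print("max of returns:", max_estimate)
```

Under these assumptions the weighted average for the shared step is 0, while taking the max gives +1, which is the value I would expect for a step that can lead to b.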