Why not use max(returns) instead of average(returns) in off-policy Monte Carlo control?

As I understand it, in reinforcement learning, off-policy Monte Carlo control estimates the state-action value function $Q(s,a)$ as a weighted average of the observed returns.
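To make that concrete, my understanding of the weighted importance-sampling form of this estimate (please correct me if I have it wrong) is

$$Q(s, a) \approx \frac{\sum_{k=1}^{n} \rho_k G_k}{\sum_{k=1}^{n} \rho_k},$$

where $G_k$ is the return observed after the $k$-th visit to $(s, a)$ and $\rho_k$ is the importance-sampling ratio between the target policy and the behaviour policy over the remainder of that episode.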

However, in Q-learning, $Q(s, a)$ is updated towards the maximum expected return.
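Concretely, I mean the standard one-step Q-learning update

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$

whose target bootstraps on the maximum over the next actions rather than on an average of sampled returns.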

Why is this not used in Monte Carlo control?

Suppose I have a simple 2-dimensional bridge game, where the objective is to get from a to b. I can move left, right, up or down. Let's say a reward of +1 is given for reaching b and -1 otherwise.

a|x|x|x|x|b

Suppose the agent reaches the final x (using an epsilon-greedy policy, for instance) and then goes up, receiving a reward of -1. Then, in the next episode, it reaches the final x again and goes right, receiving a reward of +1.

Why wouldn't I update all the steps leading to 'b' to +1, as I would do in Q-learning?

I'm assuming the environment is deterministic in this case, so I don't overestimate $Q(s,a)$ based on an unlikely event.
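To make the comparison concrete, here is a quick sketch I put together (my own code, not from any library; I'm assuming any move other than right falls off the bridge and ends the episode with -1, and I use a uniformly random behaviour policy, which is one valid off-policy setup). It maintains both a plain average-of-returns estimate and a max-of-returns estimate for each state-action pair:

```python
import random
from collections import defaultdict

# My reading of the bridge: states 0..5 are a, x, x, x, x, b. "right" advances;
# any other move falls off and ends the episode with -1; reaching b gives +1.
N_STATES = 6
GOAL = N_STATES - 1
ACTIONS = ["left", "right", "up", "down"]
GAMMA = 1.0

def step(state, action):
    if action != "right":
        return state, -1.0, True          # fell off the bridge
    if state + 1 == GOAL:
        return GOAL, +1.0, True           # reached b
    return state + 1, 0.0, False          # safe step along the bridge

def random_episode():
    """Uniform-random behaviour policy (an off-policy setting)."""
    state, done, traj = 0, False, []
    while not done:
        action = random.choice(ACTIONS)
        nxt, reward, done = step(state, action)
        traj.append((state, action, reward))
        state = nxt
    return traj

q_avg = defaultdict(float)                    # plain average of observed returns
visits = defaultdict(int)
q_max = defaultdict(lambda: float("-inf"))    # max of observed returns (the idea I'm asking about)

for _ in range(50_000):
    g = 0.0
    for state, action, reward in reversed(random_episode()):
        g = reward + GAMMA * g                # return following this step
        visits[(state, action)] += 1
        q_avg[(state, action)] += (g - q_avg[(state, action)]) / visits[(state, action)]
        q_max[(state, action)] = max(q_max[(state, action)], g)

for s in (0, 4):                              # the start square and the final x
    print(f"state {s}: " + ", ".join(
        f"{a}: avg={q_avg[(s, a)]:+.2f} max={q_max[(s, a)]:+.2f}" for a in ACTIONS))
```

Reasoning through it, the max-of-returns column should show +1 for right at every square once the goal has been reached at least once, while the plain average at the early squares is dragged towards -1 by exploratory falls, which is exactly why taking the max looks attractive to me here.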

Topic monte-carlo q-learning reinforcement-learning

Category Data Science


Monte Carlo methods can be incremental in an episode-by-episode sense, but not in a step-by-step (online) sense. The term Monte Carlo is often used more broadly for any estimation method whose operation involves a significant random component. Here it is used specifically for methods based on averaging complete returns.

An obvious way to estimate the value of a state from experience, then, is simply to average the returns observed after visits to that state. As more returns are observed, the average should converge to the expected value. This idea underlies all Monte Carlo methods.
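As a minimal sketch of that idea (the episode encoding and function name are my own, just for illustration), first-visit Monte Carlo prediction keeps a running average of the complete returns observed after the first visit to each state:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) by averaging the returns observed after the first visit
    to each state. Each episode is a list of (state, reward) pairs, where the
    reward is the one received on the transition out of that state."""
    value_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        returns = []                      # return following each step, built backwards
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()
        seen = set()
        for state, g in returns:          # only the first visit in the episode counts
            if state not in seen:
                seen.add(state)
                value_sum[state] += g
                visit_count[state] += 1
    return {s: value_sum[s] / visit_count[s] for s in value_sum}

# e.g. two episodes on the bridge above, written as (state, reward) pairs
episodes = [
    [(0, 0.0), (1, 0.0), (2, 0.0), (3, 0.0), (4, +1.0)],   # walked straight to b
    [(0, 0.0), (1, 0.0), (2, -1.0)],                        # fell off at the third square
]
print(first_visit_mc_prediction(episodes))
```

Control methods do the same thing for state-action pairs, and the off-policy variants additionally weight each return by an importance-sampling ratio before averaging.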
