Alternative approach for Q-Learning

I have a question about an alternative Q-learning approach. I'd like to know whether this already exists and I am simply not aware of it, or whether it doesn't exist because there are theoretical problems behind it.

Traditional Q-Learning
In traditional Q-learning, the Q-value is updated at every iteration. The agent is in state s, performs action a, reaches state s' and obtains reward r. The Q-value for that state-action pair is updated according to the Bellman equation. As an example, let's consider an agent exploring a grid world in which one specific cell gives a positive reward. Until the agent finds the cell leading to a positive reward, the agent gets small or no rewards. Then, after reaching the cell with the high reward, all the previous cells get updated iteratively with positive Q-values as they are explored again. This means that we need many iterations, and especially if the path leading to the positive reward is very long or requires a specific sequence of actions, it can take a significant number of episodes (or the path may never be fully discovered, even if it was explored previously).
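For concreteness, the single-step update described above can be sketched in tabular form roughly as follows (the hyperparameter values and the dictionary-based Q-table are illustrative assumptions, not anything specified in the question):

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)  # Q[(state, action)], defaults to 0

def q_update(s, a, r, s_next, actions):
    """Standard single-step Q-learning (Bellman) update."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```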

Alternative approach
My question is: why can't we update all the Q-values related to the agent at the end of the episode? That is, information is cached during the iterations, and when the agent terminates the episode, the Q-values get updated for all the cells that led to that reward (or did not lead to a reward). In this case, the agent doesn't need a huge number of episodes to update the Q-values along that path.
Is this approach legit? Which problem do we face following this?
The only drawback I can think of is a lack of exploration, which can be solved by choosing random actions according to a decreasing epsilon value.
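As I understand it, the proposal could be prototyped by caching the transitions of the current episode and applying the usual update in reverse order once the episode terminates; this is only an illustrative sketch of the idea, with hypothetical helper names and the same dictionary-style Q-table as above:

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)
episode = []  # cached (s, a, r, s_next) transitions for the current episode

def cache_step(s, a, r, s_next):
    episode.append((s, a, r, s_next))

def update_at_episode_end(actions):
    # Sweep the cached trajectory backwards so the reward found at the end
    # propagates along the whole path in a single pass.
    for s, a, r, s_next in reversed(episode):
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    episode.clear()
```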

EDIT

My problem is not computational efficiency; it is mainly that I am using a decreasing learning rate, $$\alpha = \frac{1}{1+\text{number\_visits}}$$ For this reason, my intuition says that updating the Q-values at the end of the episode should work better. Indeed, since the learning rate depends on the number of visits, the contribution of the early visits is much higher. Updating as soon as possible should reduce the negative impact of this aspect.
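For reference, the visit-dependent learning rate described in this edit could be tracked per state-action pair roughly like this (the count table and function names are mine, not from any standard library):

```python
from collections import defaultdict

visits = defaultdict(int)  # visits[(state, action)]

def learning_rate(s, a):
    """alpha = 1 / (1 + number_visits) for this state-action pair."""
    return 1.0 / (1 + visits[(s, a)])

def q_update_decaying(Q, s, a, r, s_next, actions, gamma=0.99):
    alpha = learning_rate(s, a)
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    visits[(s, a)] += 1
```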

Topic q-learning reinforcement-learning machine-learning

Category Data Science


The clear alternative is definitely Monte Carlo search, if you really have to choose an alternative. But if your only problem is updating the Q-values after the episode, then you can do that with a small variation of the regular Q-learning equation and algorithm. In fact, a very good example is the application to the Tic-Tac-Toe problem by Dwi H. et al. In that paper, the agent is trained by playing a full game, and only when the game is finished is the reward applied: positive if winning and negative if losing. Deep Q-learning could also work in some scenarios, but as @Neil suggested in his answer, you have to explore the possibilities, and if you have already developed a Q-learning method, the best approach is to modify it according to your needs and see whether it satisfies them before moving on to other options.


There are many variations of reinforcement learning methods. Some are tweaks to existing algorithms, some are radically different. The problems that you are concerned about with your analysis and suggestion appear to be:

  • Credit assignment. Whether or not a particular state or state-action pair earlier in a trajectory is important to receiving a reward later on, and thus should have its value updated due to new experience. This is a core problem in reinforcement learning (RL). All RL algorithms address it; the differences lie in how they cope with varying levels of difficulty.

  • Computational efficiency. Whether the algorithm makes best use of CPU cycles and doesn't waste its time calculating many updates of zero in an environment with sparse rewards.

  • Sample efficiency. Whether the algorithm makes best use of available data to make the most robust predictions of values it can.

Single-step Q learning does address all of these issues to at least some degree:

  • For credit assignment, the single-step bootstrapping process in Q-learning will back up estimates through connected time steps. It takes repetition: the chains of events leading to rewards are only updated after multiple passes through similar trajectories.

  • For computational efficiency, it is not always clear what the best approach would be. Single step Q learning updates are efficient to calculate, so even though each one only updates a small segment of a trajectory, this can be made up, in some environments, by running more updates in the same CPU time.

  • For sample efficiency, single-step updates on their own are not very good - they can leave many values untouched even though logically they could be updated based on new information. However, there are other approaches to address this - for instance, using an experience replay table or Dyna-Q planning can take advantage of previously seen states and actions to re-calculate Q-values without needing more experience (see the sketch after this list).
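As a rough illustration of the replay idea in the last bullet (buffer size, batch size and helper names are arbitrary choices, not part of any particular library):

```python
import random
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)
replay_buffer = []  # stores (s, a, r, s_next, actions) tuples

def store(transition, max_size=10_000):
    replay_buffer.append(transition)
    if len(replay_buffer) > max_size:
        replay_buffer.pop(0)

def replay(batch_size=32):
    # Re-apply the single-step update to previously seen transitions, so old
    # experience keeps improving the Q-table without gathering new samples.
    batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
    for s, a, r, s_next, actions in batch:
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```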

As usual with ML problems, it is not always clear what adjustments and variations of an algorithm may work best. So you need to experiment with your environment.

You also mention this:

Until the agent finds the cell leading to a positive reward, the agent gets small or no rewards.

This is somewhere that single-step updates can work well. In your grid world example, if you are training to minimise the number of time steps to complete an episode, then typically you will have a negative reward associated with each time step that does not complete an episode. In that case, updating your Q estimates immediately on each action can be a large benefit - it encourages exploration by progressively scoring repetition worse (as the initial agent blunders around randomly) - meaning that the agent tries to find new states and actions that are still scored at original default levels. You won't get that searching behaviour if you wait until the end of the first episode before updating the value tables.
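A tiny numerical illustration of that effect, assuming a per-step reward of -1, a Q-table that defaults to 0 and an immediate update after each action (the states, actions and hyperparameters are made up for the example):

```python
from collections import defaultdict

alpha, gamma = 0.5, 0.9
Q = defaultdict(float)  # every action starts at the default value of 0
actions = ["up", "down", "left", "right"]

# The agent takes "up" in state s0, receives the -1 step penalty and ends up
# back in s0 (a wasted move while blundering around the grid world).
s0, a, r, s_next = "s0", "up", -1.0, "s0"
best_next = max(Q[(s_next, a2)] for a2 in actions)  # still 0
Q[(s0, a)] += alpha * (r + gamma * best_next - Q[(s0, a)])

print(Q[(s0, "up")])    # -0.5: the repeated move now scores worse
print(Q[(s0, "down")])  #  0.0: untried actions keep the default, so a greedy
                        #  tie-break prefers them on the next visit
```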

Whether or not this is an issue will depend on the environment, but such time related goals are relatively common, so it is worth bearing in mind.

My question is: why can't we update all the Q-values related to the agent at the end of the episode?

This is a well-known, common approach in RL. The simplest variant is called Monte Carlo Control, and it uses trajectories from whole episodes to update the values of all state-action pairs seen in those trajectories.
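A minimal sketch of tabular, every-visit Monte Carlo control with incremental averaging of returns (the episode format and variable names are assumptions for illustration):

```python
from collections import defaultdict

gamma = 0.99
Q = defaultdict(float)
n_returns = defaultdict(int)

def mc_update(episode):
    """episode: list of (state, action, reward) tuples, in time order."""
    G = 0.0
    # Walk the trajectory backwards, accumulating the discounted return, and
    # move each state-action value towards the return observed from it.
    for s, a, r in reversed(episode):
        G = r + gamma * G
        n_returns[(s, a)] += 1
        Q[(s, a)] += (G - Q[(s, a)]) / n_returns[(s, a)]
```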

More sophisticated approaches blend between single-step Temporal Difference learning (of which Q-learning is one example) and whole-episode Monte Carlo methods. These are n-step methods (n-step Q-learning is one example) and TD($\lambda$) methods (Q($\lambda$) is one example). They are relatively popular as they combine the best features of, and avoid the compromises of, the two extremes of whole-episode versus single-step updates.
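For reference, the n-step target these methods bootstrap from looks roughly like this in standard notation (ignoring the off-policy corrections discussed further down):

$$G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^{n} \max_{a} Q(s_{t+n}, a)$$

with the update $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ G_{t:t+n} - Q(s_t, a_t) \right]$; setting $n=1$ recovers single-step Q-learning, while letting $n$ span the whole episode recovers the Monte Carlo return.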

Is this approach legit?

Yes, in fact as described above, variations of it are used routinely.

Which problem do we face following this?

When using whole episodes, you face the following problems:

  • High variance. If either your policy or your environment has randomness, then the final trajectory and reward of each episode will vary a lot. This means that to get reliable estimates of value you may need more samples, not fewer as you had hoped.

  • Off-policy learning is more complex. Q-learning in particular is an off-policy method, meaning it learns values of its best guess at an optimal policy (called the target policy) whilst still exploring using a non-optimal policy (called the behaviour policy). With single-step updates this difference can be ignored, because you always calculate the updated value estimate for an action you've taken (you don't care what policy generated it) using the maximising action for the next step (so you don't use the action actually taken). When you extend to calculating value updates over multiple steps, you need to adjust for differences between the behaviour and target policies. If you look at the Q($\lambda$) algorithm, you will see that it copes with this by resetting something called the eligibility trace (which you can think of as a vector that explicitly tracks credit assignment) whenever the agent takes a non-optimal action according to the current value estimates - a rough sketch of this follows below.
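A rough sketch of that trace-resetting behaviour in Watkins-style Q($\lambda$); the data structures and hyperparameters here are illustrative only:

```python
from collections import defaultdict

alpha, gamma, lam = 0.1, 0.99, 0.9
Q = defaultdict(float)
E = defaultdict(float)  # eligibility traces, one per (state, action) pair

def q_lambda_step(s, a, r, s_next, a_next, actions):
    """One step of Watkins's Q(lambda); a_next is the action the behaviour
    policy will actually take in s_next (e.g. chosen epsilon-greedily)."""
    a_star = max(actions, key=lambda b: Q[(s_next, b)])
    if Q[(s_next, a_next)] == Q[(s_next, a_star)]:
        a_star = a_next  # a tie counts as greedy
    delta = r + gamma * Q[(s_next, a_star)] - Q[(s, a)]
    E[(s, a)] += 1.0     # accumulate credit for the pair just visited
    greedy = a_next == a_star
    for key in list(E):
        Q[key] += alpha * delta * E[key]
        # Decay credit while the agent follows the greedy (target) policy;
        # cut the trace as soon as an exploratory action is taken.
        E[key] = gamma * lam * E[key] if greedy else 0.0
```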

There is also Off-policy Monte Carlo Control, which uses importance sampling to decide whether and how much to use particular updates - for learning optimal control typically this will only update from the ends of episodes after the last exploratory action that the agent took.

The only drawback I can think of is a lack of exploration, which can be solved by choosing random actions according to a decreasing epsilon value.

Usually you will want some initially high exploration that settles down closer to the predicted optimal policy later on. This, or something similar, should be standard practice in many environments. However, it would not solve the search problem you mention - for that you need an exploration system that is aware of when states and values repeat themselves over multiple steps within a single episode. There's more than one way to solve that, too, but the single-step Q-learning approach is quite nice when it works, because it is a natural consequence of the algorithm.
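A common way to implement that decaying exploration (the schedule and the constants here are arbitrary examples, not a recommendation for any specific environment):

```python
import random

def epsilon_greedy(Q, s, actions, episode_idx,
                   eps_start=1.0, eps_min=0.05, decay=0.995):
    """Pick an action epsilon-greedily, with epsilon decaying per episode."""
    eps = max(eps_min, eps_start * decay ** episode_idx)
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```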
