What is the difference between bootstrapping and sampling in reinforcement learning?

I have come across one of David Silver's slides that contains both the terms "bootstrapping" and "sampling". Is there a realistic example that would help me understand the concepts better?

Tags: difference, terminology, reinforcement-learning



  • Bootstrapping needs just a single transition, i.e. a single tuple (state, action, next_state, reward), to perform a value (Q-value) update; thus learning can occur without a full episode of transitions. This is what Q-learning-style recursions do. Since we are not waiting for a full episode to make an update, playing can be interleaved with learning. In this kind of learning we may follow a stochastic behaviour policy (say, epsilon-greedy): we use the maximum Q-value in the next state to update the current Q-value, but we do not have to take the action that actually maximizes the Q function when we get to that next state. For this reason Q-learning is called an off-policy method, since learning is not guided by the behaviour policy.

  • Sampling requires several transitions, or even a full episode (a sequence of transitions from an initial state to a terminating state), before performing the update. Since we hope to learn a policy, and the update is done over a whole episode, we must follow the same policy throughout the episode before we can improve (or update) it. This is called policy iteration; it is used in Monte Carlo learning, and it is on-policy because the policy guides the learning process. Because the episode must end before an update is made, learning would be very slow if episodes consist of many transitions, and the method cannot be used for non-episodic tasks (like games with infinite horizons). A minimal sketch contrasting the two update rules follows this list.
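Here is a minimal sketch in Python contrasting the two update rules. It assumes a toy tabular setting; the action list, hyperparameters and tuple layout are placeholders of my own (not from the slides), and only the shape of the two updates matters.

```python
# Minimal sketch: bootstrapped one-step update vs. full-episode Monte Carlo update.
# ACTIONS, ALPHA, GAMMA, EPSILON and the (state, action, reward, next_state) layout
# are hypothetical choices for illustration only.
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1
ACTIONS = [0, 1]
Q = defaultdict(float)          # Q[(state, action)] -> current value estimate

def epsilon_greedy(state):
    """Stochastic behaviour policy: explore with probability EPSILON."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

# --- Bootstrapping (Q-learning): update after every single transition -------
def q_learning_update(state, action, reward, next_state, done):
    # The target plugs in our *current estimate* of the best next value,
    # so we learn from one (s, a, r, s') tuple without finishing the episode.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
    target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

# --- Sampling a full episode (Monte Carlo): update only after it ends -------
def monte_carlo_update(episode):
    """episode is a list of (state, action, reward) tuples from start to terminal."""
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + GAMMA * G     # actual sampled return, no estimate of future Q
        Q[(state, action)] += ALPHA * (G - Q[(state, action)])
```

Note how `q_learning_update` touches a single transition and relies on the current estimate of the next state's best Q-value, while `monte_carlo_update` needs the finished episode so it can compute the actual return G.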


I will try to answer this question conceptually rather than technically, so that you get a grasp of the mechanisms in RL.

  • Bootstrapping: when you estimate something based on another estimate. In the case of Q-learning, for example, this is what happens when you form the update target by adding to the immediate reward $r_t$ the (discounted) correction term $\gamma \max_{a'} Q(s', a')$, the maximum of the action value over all actions in the next state. Essentially you are estimating your current action value $Q$ by using an estimate of the future $Q$. Neil has answered that in detail here.

  • Sampling: think of samples as realizations (different values) of a function. It is often very difficult to estimate, or to write down an analytical expression for, the underlying process that generated your observations. However, sampled values can help you determine many characteristics of the underlying generative mechanism and even make assumptions about its properties. Sampling comes in many forms in RL. For example, Q-learning is a sample-based point estimation of the optimal action-value function (the Bellman optimality equation). In a world about which your agent knows nothing, you cannot use Dynamic Programming to determine the expected reward from every state; instead, you need to sample "experience" from your world and estimate the expected reward from each state. In policy gradient methods, in order to determine the gradient of your expected reward, you need to sample trajectories of states and actions from a probability distribution, because you cannot compute it analytically. A rough sketch of this sample-based estimation follows below.
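As a rough illustration of that last point, here is a sketch of estimating a policy's expected return purely from sampled trajectories, which is what you fall back on when no transition model is available for Dynamic Programming. The environment interface (`reset()` / `step()`) and the policy signature are Gym-like placeholders I am assuming, not a specific library's API.

```python
# Sketch only: env.reset() returns a state, env.step(action) is assumed to
# return (next_state, reward, done). Both are hypothetical placeholders.
GAMMA = 0.99

def rollout(env, policy, max_steps=200):
    """Sample one trajectory by following `policy` and return its discounted return."""
    state = env.reset()
    G, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        G += discount * reward
        discount *= GAMMA
        if done:
            break
    return G

def estimate_expected_return(env, policy, n_samples=1000):
    """Monte Carlo estimate of the expected return: average sampled returns
    instead of taking an expectation over an (unknown) transition model."""
    returns = [rollout(env, policy) for _ in range(n_samples)]
    return sum(returns) / len(returns)
```

Policy gradient methods do the same kind of thing one level up: they average a quantity computed along sampled trajectories to approximate a gradient they cannot evaluate in closed form.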

Hope this helps!
