What is the difference between dynamic programming and Q-learning?
What is the difference between the DP-based algorithm and Q-learning?
Topic dynamic-programming q-learning reinforcement-learning
Category Data Science
Both Q-learning and Value Iteration (a DP technique) use similar update rules based on the Bellman optimality equations:
$$v_*(s) = \max_{a}\sum_{s',r} p(s',r|s,a)(r + \gamma v_*(s'))$$
$$q_*(s,a) = \sum_{s',r} p(s',r|s,a)(r + \gamma\max_{a'}q_*(s',a'))$$
The main difference is that DP uses an explicit model: it requires that you know $p(s',r|s,a)$. The DP update is literally the first Bellman equation turned into an iterative rule:
$$v_{k+1}(s) = \max_{a}\sum_{s',r} p(s',r|s,a)(r + \gamma v_{k}(s'))$$
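To make this concrete, below is a minimal Python sketch of Value Iteration for a tabular problem. The representation is an assumption for illustration: `p` is a hypothetical dictionary mapping each `(state, action)` pair to a list of `(next_state, reward, probability)` tuples, i.e. an explicit $p(s',r|s,a)$, and `value_iteration` is not a function from any particular library.

```python
# Minimal Value Iteration sketch. Assumed (hypothetical) model format:
#   p[(s, a)] = [(next_state, reward, probability), ...]
# i.e. an explicit, enumerable representation of p(s', r | s, a).
def value_iteration(states, actions, p, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}  # v_0(s) = 0 for all states
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: max over actions of the expected return
            best = max(
                sum(prob * (r + gamma * V[s2]) for s2, r, prob in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # stop once the largest value change is negligible
            return V
```

Note that the loop sweeps over every state and action of the model on each iteration; no interaction or simulation is involved.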
In comparison, Q-learning does not require knowing $p(s',r|s,a)$; it learns by sampling from experience. The update rule is driven by individual observed transitions, which in expectation yield the same values as if $p(s',r|s,a)$ had been used, without the model ever being known:
$$Q_{k+1}(S_t,A_t) = Q_{k}(S_t,A_t) + \alpha(R_{t+1} + \gamma\max_{a'}Q_k(S_{t+1},a') - Q_{k}(S_t,A_t))$$
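For comparison, here is a tabular Q-learning sketch in the same style. The environment interface is assumed for illustration: `env.reset()` returns a start state and `env.step(action)` returns a sampled `(next_state, reward, done)` transition; no part of the code touches $p(s',r|s,a)$.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch. `env` is a hypothetical environment with
# reset() -> state and step(action) -> (next_state, reward, done); only sampled
# transitions are used, never the transition model itself.
def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy over the current Q estimates
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # Sample-based target: uses the single observed (r, s') instead of
            # an expectation over p(s', r | s, a)
            target = r if done else r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```

The update inside the loop is exactly the equation above, applied to whatever transition the environment happened to produce.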
This is still an important difference even when both methods are run against an internal model or simulation. DP does not need to simulate anything; it iterates over the model directly. Q-learning, in contrast, works with sampled transitions - these may be simulated, but that is not the same as sweeping over all states as DP does. It is often easier to simulate the environment than to calculate $p(s',r|s,a)$ for the full model.
Which should you choose?
Choose Dynamic Programming when you have access to the full state-transition and reward model in a simple form (i.e. you have $p(s',r|s,a)$ or equivalent) and the state space is not too large - ideally small enough that the value table fits in memory. There are ways to apply DP to larger state spaces, for example by restricting which states it processes, so you can still use it on larger problems if you really want to.
Choose Q-learning when you don't have a model, or when the state space is too large to iterate over in full.