Reward dependent on (state, action) versus (state, action, successor state)
I am studying reinforcement learning and I am working methodically through Sutton and Barto's book plus David Silver's lectures.
I have noticed a minor difference in how the two sources define Markov Decision Processes (MDPs), which affects the formulation of the Bellman equations, and I wonder about the reasoning behind the difference and when I might prefer one convention over the other.
In Sutton and Barto, the expected reward function is written $R^a_{ss'}$, whilst in David Silver's lectures it is written $R^a_{s}$. In turn this leads to slightly different formulations of all the Bellman equations. For instance, in Sutton and Barto, the equation for policy evaluation is given by:
\begin{align} v_{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} P_{ss'}^a(R_{ss'}^a + \gamma v_{\pi}(s')) \end{align}
Whilst David Silver's lectures show:
\begin{align} v_{\pi}(s) = \sum_a \pi(a|s) \left(R_{s}^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') \right) \end{align}
In both cases:
- $\pi(a|s)$ is the policy function: the probability of choosing action $a$ in state $s$.
- $\gamma$ is the discount factor.
- $P_{ss'}^a$ is the transition function: the probability of moving to state $s'$ given $s$ and $a$.
I understand that $R_{ss'}^a$ and $R_{s}^a$ are related (via $P_{ss'}^a$), so the two sources are describing exactly the same thing. Note that the first equation can also be written as
\begin{align} v_{\pi}(s) &= \sum_a \pi(a|s) \sum_{s'} \left(P_{ss'}^a R_{ss'}^a + \gamma P_{ss'}^a v_{\pi}(s')\right)\\ &= \sum_a \pi(a|s) \left( \sum_{s'} P_{ss'}^a R_{ss'}^a + \sum_{s'} \gamma P_{ss'}^a v_{\pi}(s') \right) \\ &= \sum_a \pi(a|s) \left( \sum_{s'} P_{ss'}^a R_{ss'}^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') \right) \end{align}
Hence, it must be true that $R_{s}^a = \sum_{s'} P_{ss'}^a R_{ss'}^a$.
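For example, this identity is easy to check numerically on a small random MDP. This is just a sketch; the array layout `P[s, a, s']`, `R_sas[s, a, s']` and the variable names are my own, not notation from either source:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

# Random transition probabilities P[s, a, s'], normalised over s'
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)

# Random expected rewards R_sas[s, a, s'] (Sutton & Barto style)
R_sas = rng.random((n_states, n_actions, n_states))

# Marginalise out the successor state (Silver style):
# R_s^a = sum_{s'} P_{ss'}^a * R_{ss'}^a
R_sa = (P * R_sas).sum(axis=2)
print(R_sa.shape)  # (4, 2): one expected reward per (state, action) pair
```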
My question is whether there is any reason I should prefer to use one or the other notation?
I started with Sutton and Barto, and find that notation more intuitive: the reward may depend on the successor state, and this is explicit in the equations. However, in practice the notation used in the video lectures seems to describe more efficient calculations, because $R_{s}^a = \sum_{s'} P_{ss'}^a R_{ss'}^a$ is effectively cached if the formula is translated directly into code (see the sketch below). Is that all there is to it?
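To illustrate what I mean by caching, here is one sweep of iterative policy evaluation written in both forms. Again, this is only a sketch with my own variable names and a random MDP, not code from either source:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# Random MDP: transitions P[s, a, s'] and rewards R_sas[s, a, s']
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R_sas = rng.random((n_states, n_actions, n_states))

# Random policy pi[s, a] and an arbitrary current value estimate v[s]
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)
v = rng.random(n_states)

# Sutton & Barto form: the reward sits inside the sum over successor states
v_sb = np.array([
    sum(pi[s, a] * sum(P[s, a, s1] * (R_sas[s, a, s1] + gamma * v[s1])
                       for s1 in range(n_states))
        for a in range(n_actions))
    for s in range(n_states)
])

# Silver form: the expected reward R_sa[s, a] is precomputed ("cached") once
R_sa = (P * R_sas).sum(axis=2)
v_ds = np.array([
    sum(pi[s, a] * (R_sa[s, a] + gamma * sum(P[s, a, s1] * v[s1]
                                             for s1 in range(n_states)))
        for a in range(n_actions))
    for s in range(n_states)
])

print(np.allclose(v_sb, v_ds))  # True: both backups give the same values
```

The second version does the same backup but only touches the reward once per $(s, a)$ pair rather than once per $(s, a, s')$ triple.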
Topic markov-process reinforcement-learning
Category Data Science