Reward dependent on (state, action) versus (state, action, successor state)

I am studying reinforcement learning and I am working methodically through Sutton and Barto's book plus David Silver's lectures.

I have noticed a minor difference in how Markov Decision Processes (MDPs) are defined in those two sources, which affects the formulation of the Bellman equations, and I wonder about the reasoning behind the difference and when I might choose one or the other.

In Sutton and Barto, the expected reward function is written $R^a_{ss'}$, whilst in David Silver's lectures it is written $R^a_{s}$. In turn this leads to slightly different formulations of all the Bellman equations. For instance, in Sutton and Barto, the equation for policy evaluation is given by:

\begin{align} v_{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} P_{ss'}^a(R_{ss'}^a + \gamma v_{\pi}(s')) \end{align}

Whilst David Silver's lectures show:

\begin{align} v_{\pi}(s) = \sum_a \pi(a|s) \left(R_{s}^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') \right) \end{align}

In both cases:

  • $\pi(a|s)$ is the policy function: the probability of choosing action $a$ given state $s$.
  • $\gamma$ is the discount factor.
  • $P_{ss'}^a$ is the transition function: the probability of moving to state $s'$ given $s$ and $a$.

I understand that $R_{ss'}^a$ and $R_{s}^a$ are related (via $P_{ss'}^a$), so that these two sources are explaining the exact same thing. Note that the first equation can also be written as

\begin{align} v_{\pi}(s) &= \sum_a \pi(a|s) \sum_{s'} \left(P_{ss'}^a R_{ss'}^a + \gamma P_{ss'}^a v_{\pi}(s')\right)\\ &= \sum_a \pi(a|s) \left( \sum_{s'} P_{ss'}^a R_{ss'}^a + \sum_{s'} \gamma P_{ss'}^a v_{\pi}(s') \right) \\ &= \sum_a \pi(a|s) \left( \sum_{s'} P_{ss'}^a R_{ss'}^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') \right) \end{align}

Hence, it must be true that $R_{s}^a = \sum_{s'} P_{ss'}^a R_{ss'}^a$.

My question is: is there any reason I should prefer one notation over the other?

I started with Sutton and Barto, and find that notation more intuitive: the reward may depend on the eventual state, and this is explicit in the equations. However, it looks like, in practice, the notation used in the video lectures describes more efficient calculations (essentially $R_{s}^a = \sum_{s'} P_{ss'}^a R_{ss'}^a$ is cached if the formula is translated directly into code). Is that all there is to it?
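To make the caching point concrete, here is a minimal sketch on a toy MDP with random dynamics (the array names `P`, `R_sas`, `R_sa` are mine, with `P[s, a, s']` $= P_{ss'}^a$ and `R_sas[s, a, s']` $= R_{ss'}^a$): one policy-evaluation backup written both ways gives the same values, and the second form only needs $R_s^a$, which can be precomputed once.

```python
import numpy as np

# Toy MDP with hypothetical sizes and random dynamics, just to compare the two forms.
S, A = 4, 2
rng = np.random.default_rng(0)

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)        # P[s, a, s'] = P_ss'^a, sums to 1 over s'
R_sas = rng.random((S, A, S))            # R_ss'^a: reward depends on (s, a, s')
pi = np.full((S, A), 1.0 / A)            # uniform random policy pi(a|s)
gamma = 0.9
v = rng.random(S)                        # arbitrary current value estimate

# Sutton & Barto form: v(s) = sum_a pi(a|s) sum_s' P_ss'^a (R_ss'^a + gamma v(s'))
v_sb = np.einsum('sa,sat,sat->s', pi, P, R_sas + gamma * v[None, None, :])

# Silver form: precompute ("cache") R_s^a = sum_s' P_ss'^a R_ss'^a once ...
R_sa = (P * R_sas).sum(axis=2)
# ... then v(s) = sum_a pi(a|s) (R_s^a + gamma sum_s' P_ss'^a v(s'))
v_ds = (pi * (R_sa + gamma * np.einsum('sat,t->sa', P, v))).sum(axis=1)

print(np.allclose(v_sb, v_ds))           # True: both backups give identical values
```

In a repeated policy-evaluation loop, the first form recomputes the expected immediate reward inside every sweep, while the second folds it into $R_s^a$ up front, which is exactly the caching described above.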

Topic markov-process reinforcement-learning

Category Data Science


IMO, this notation is confusing, and it is better to think in terms of probability distributions. I will tackle the case of a finite MDP to simplify things.

The reward function of an MDP depends only on the current state and action:

$$r(s,a)=\mathbb{E}[R_{t+1}|s,a]=\sum_{r\in\mathcal{R}}p(r|s,a)\,r$$

If we now condition on the next state using the law of total probability, we get:

$$r(s,a)=\sum_{r\in\mathcal{R}}\sum_{s'\in\mathcal{S}}p(r|s,a,s')\,p(s'|s,a)\,r$$

Then we observe that $p(r,s'|s,a)=p(s'|s,a)p(r|s,a,s')$:

$$r(s,a)=\sum_{r\in\mathcal{R}}\sum_{s'\in\mathcal{S}}p(r,s'|s,a)\,r$$

In summary, you can work with the joint distribution by averaging over all next states weighted by their probabilities; that explains why $P_{ss'}^a$ appears as soon as you express the reward function in terms of next states.
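Here is a small numerical sketch of the same chain of equalities, using my own hypothetical tables for a single fixed $(s, a)$ pair with finite $\mathcal{R}$ and $\mathcal{S}$: the joint $p(r, s'|s,a)$, its factorisation $p(r|s,a,s')\,p(s'|s,a)$, and the marginal $p(r|s,a)$ all give the same $r(s,a)$.

```python
import numpy as np

# Hypothetical finite tables for one fixed (s, a): tabulate the joint p(r, s' | s, a).
rewards = np.array([-1.0, 0.0, 2.0])      # the finite reward set R
n_next = 4                                # number of next states s'

rng = np.random.default_rng(1)
joint = rng.random((len(rewards), n_next))
joint /= joint.sum()                      # p(r, s' | s, a), sums to 1

# Marginals / conditionals derived from the joint.
p_s_next = joint.sum(axis=0)              # p(s' | s, a)
p_r_given_s_next = joint / p_s_next       # p(r | s, a, s'), each column sums to 1
p_r = joint.sum(axis=1)                   # p(r | s, a)

# r(s, a) computed three equivalent ways.
r1 = (p_r * rewards).sum()                                    # sum_r p(r|s,a) r
r2 = (p_r_given_s_next * p_s_next * rewards[:, None]).sum()   # sum_r sum_s' p(r|s,a,s') p(s'|s,a) r
r3 = (joint * rewards[:, None]).sum()                         # sum_r sum_s' p(r,s'|s,a) r

print(np.allclose([r1, r2], r3))          # True
```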


Your intuition is correct. In the most general case (Sutton's definitions), the model of the environment consists of the state-transition distribution and the reward distribution. The latter is rarely considered in full, since the reward is often assigned by the modeller, depends only on the action selected from the current state, and is deterministic. As you mentioned, this simplifies the coding implementation a lot.
