About the time differences in the Bellman equation

I am trying to grasp the fundamental mathematics behind Reinforcement Learning, and so far I have understood how the Value Iteration and Policy Iteration algorithms converge (contractions, etc.).

I still have some problems understanding the Bellman Equation. The value function for a state $s$ under a policy $\pi$ is the expected discounted cumulative reward:

$$ V^\pi(s_0=s) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t,\pi(s_t)) |s_0=s\right].$$

During the derivation of the Bellman Equations, when the expected cumulative rewards are calculated over an infinite horizon, meaning $t \to \infty$, shouldn't they be equal for different time steps, e.g. $V^{\pi}(s_i=s) = V^{\pi}(s_{i+\Delta i}=s), \forall s$? Since the summation is an infinite series, the discount factor $\gamma$ always starts with $0$ in the exponent, and the state and reward distributions are stationary, the cumulative reward must be the same as long as the starting state is the same:

$$ V^\pi(s_i=s) = \mathbb{E}\left[\sum_{t=i}^{\infty}\gamma^{t-i} R(s_t,\pi(s_t)) |s_i=s\right]= \mathbb{E}\left[R(s,\pi(s)) + \gamma R(s_{i+1},\pi(s_{i+1})) + \gamma^2 R(s_{i+2},\pi(s_{i+2})) + \dots\right]$$

$$ V^\pi(s_{i+\Delta i}=s) = \mathbb{E}\left[\sum_{t=i+\Delta i}^{\infty}\gamma^{t-i-\Delta i} R(s_t,\pi(s_t)) |s_{i+\Delta i}=s\right]= \mathbb{E}\left[R(s,\pi(s)) + \gamma R(s_{i+\Delta i+1},\pi(s_{i+\Delta i+1})) + \gamma^2 R(s_{i+\Delta i+2},\pi(s_{i+\Delta i+2})) + \dots\right]$$
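As a quick sanity check, here is a minimal Monte Carlo sketch on a hypothetical two-state toy MDP of my own (the policy is already folded into the transition table `P` and rewards `R`, and the names `GAMMA`, `WINDOW`, `mc_value` are just my own); the average discounted return over visits to the same state at $t = 0$ and at $t = 5$ should agree up to sampling noise:

```python
import random

# Hypothetical 2-state toy MDP; the policy is already folded into P and R.
P = {0: [(0, 0.5), (1, 0.5)],   # state 0: (next state, probability) pairs
     1: [(0, 0.3), (1, 0.7)]}
R = {0: 1.0, 1: 0.0}            # reward R(s, pi(s)) collected in state s
GAMMA = 0.9
WINDOW = 200                    # gamma**200 ~ 7e-10, so truncating there is harmless

def step(s):
    """Sample the next state from P[s]."""
    u, acc = random.random(), 0.0
    for s_next, p in P[s]:
        acc += p
        if u <= acc:
            return s_next
    return P[s][-1][0]

def rollout(s0, length):
    """Generate a state trajectory of the given length starting from s0."""
    states = [s0]
    for _ in range(length):
        states.append(step(states[-1]))
    return states

def mc_value(target_state, start_time, episodes=5000):
    """Average discounted return over visits to target_state at start_time."""
    total, count = 0.0, 0
    for _ in range(episodes):
        traj = rollout(0, start_time + WINDOW)
        if traj[start_time] == target_state:            # condition on s_i = s
            total += sum(GAMMA ** k * R[traj[start_time + k]]
                         for k in range(WINDOW))
            count += 1
    return total / count

# Estimates of V^pi(s=0) from visits at t = 0 and at t = 5:
print(mc_value(target_state=0, start_time=0))
print(mc_value(target_state=0, start_time=5))
```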

Is this true for the Bellman Equation?

Topic dynamic-programming reinforcement-learning machine-learning

Category Data Science


During the derivation of the Bellman Equations, when the expected cumulative rewards are calculated over an infinite horizon, meaning $t \to \infty$, shouldn't they be equal for different time steps, e.g. $V^{\pi}(s_i=s) = V^{\pi}(s_{i+\Delta i}=s), \forall s$?

Yes. Provided the state $s$ has the Markov property, meaning it contains all the information relevant to the immediate state transition and reward, this is true. In fact I would read your equations as over-specified, and you will often see the value function written differently:

$$v_{\pi}(s) = \mathbb{E}[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} | S_t = s, A_t \sim \pi]$$

although there are several roughly equivalent ways to write value functions in RL (for instance, and trivially, you could write $\mathbb{E}_{\pi}$ for the expectation, which already implies the condition $A_t \sim \pi$). The form in this answer makes it clear that the value of the state $s$ is independent of the time step at which it is measured.
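As a minimal sketch of this time-independence, here is iterative policy evaluation on a hypothetical two-state chain (my own toy numbers, with the policy folded into the transition table `P` and rewards `R`): the Bellman expectation backup produces one number $v_\pi(s)$ per state, with no time index anywhere.

```python
# Hypothetical 2-state chain with the policy folded into P and R.
P = {0: [(0, 0.5), (1, 0.5)],   # state 0: (next state, probability) pairs
     1: [(0, 0.3), (1, 0.7)]}
R = {0: 1.0, 1: 0.0}            # reward R(s, pi(s)) collected in state s
GAMMA = 0.9

v = {s: 0.0 for s in P}         # arbitrary initial guess
for _ in range(1000):
    # Bellman expectation backup: v(s) <- R(s) + gamma * sum_s' P(s'|s) v(s')
    v = {s: R[s] + GAMMA * sum(p * v[s_next] for s_next, p in P[s])
         for s in P}

print(v)   # one value per state, roughly {0: 4.51, 1: 3.29} for these numbers
```

Because the backup is a $\gamma$-contraction, the iteration converges to the unique fixed point $v_\pi$ regardless of the initial guess.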

It is worth noting that although the expectation is always the same and independent of $t$, the actual returns observed can differ. If you know from some other context that you are seeing the first or last visit to a state, or can infer from the time step whether this is more or less likely under $\pi$, then that extra information may change the conditional expectation when loops are possible.
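As a sketch of that caveat, take a hypothetical looping episodic MDP of my own (not from the question): from state $s$ the process loops back to $s$ with probability $0.5$ or terminates with probability $0.5$, and every step taken in $s$ pays $+1$ with $\gamma = 0.9$. The return observed on the last visit to $s$ is always exactly $1$, while the pooled average over all visits still estimates $v_\pi(s)$:

```python
import random

GAMMA = 0.9

def episode():
    """Return the discounted return observed from each visit to s in one episode."""
    rewards = []
    while True:
        rewards.append(1.0)          # reward for the step taken in s
        if random.random() < 0.5:    # terminate with probability 0.5
            break
    returns, g = [], 0.0
    for r in reversed(rewards):      # discounted return from each visit onwards
        g = r + GAMMA * g
        returns.append(g)
    return list(reversed(returns))

all_returns, last_visit_returns, earlier_visit_returns = [], [], []
for _ in range(20000):
    rets = episode()
    all_returns.extend(rets)
    last_visit_returns.append(rets[-1])       # return seen on the final visit
    earlier_visit_returns.extend(rets[:-1])   # returns seen on earlier visits

def mean(xs):
    return sum(xs) / len(xs)

print(mean(all_returns))            # ~1.82 = v_pi(s) = 1 / (1 - 0.9 * 0.5)
print(mean(last_visit_returns))     # exactly 1.0: knowing it is the last visit matters
print(mean(earlier_visit_returns))  # ~2.64 = 1 + 0.9 * v_pi(s)
```

The unconditional expectation is the same for every visit; it is only the extra knowledge ("this is the last visit") that shifts the conditional expectation.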
