About the time differences in the Bellman equation
I am trying to grasp the fundamental mathematics behind Reinforcement Learning, and so far I have understood how the Value Iteration and Policy Iteration algorithms converge (contractions, etc.).
I still have some trouble understanding the Bellman Equation. The value function for a state $s$ under a policy $\pi$ is the expected discounted cumulative reward:
$$ V^\pi(s_0=s) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t,\pi(s_t)) \,\middle|\, s_0=s\right].$$
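To make sure I am reading this definition correctly, here is a minimal sketch of how I would estimate $V^\pi(s)$ by Monte Carlo rollouts on a toy stationary MDP (the two-state dynamics, reward values and policy below are made up for illustration and are not part of the question):

```python
import random

GAMMA = 0.9  # discount factor gamma

def step(s, a):
    """Assumed stationary dynamics: (next_state, reward) depend only on s, never on t."""
    if s == 0:
        s_next = random.choices([0, 1], weights=[0.7, 0.3])[0]
    else:
        s_next = random.choices([0, 1], weights=[0.4, 0.6])[0]
    reward = 1.0 if s == 0 else -0.5  # toy R(s, a); ignores a for brevity
    return s_next, reward

def pi(s):
    """A fixed (stationary) policy."""
    return 0

def mc_value(s0, episodes=5000, horizon=200):
    """Estimate V^pi(s0) = E[ sum_t gamma^t R(s_t, pi(s_t)) | s_0 = s0 ]."""
    total = 0.0
    for _ in range(episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):  # truncated horizon approximates t -> infinity
            s_next, r = step(s, pi(s))
            g += discount * r
            discount *= GAMMA
            s = s_next
        total += g
    return total / episodes

print(mc_value(0))  # approximate V^pi(0)
```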
When deriving the Bellman Equations, the expected cumulative rewards are computed over an infinite horizon, i.e. $t \to \infty$. Shouldn't they then be equal across different time steps, i.e. $V^{\pi}(s_i=s) = V^{\pi}(s_{i+\Delta i}=s), \forall s$? Since the summation is an infinite series, the discount factor $\gamma$ always starts with exponent $0$, and the state and reward distributions are stationary, so as long as the starting state is the same, the cumulative reward must also be the same:
$$ V^\pi(s_i=s) = \mathbb{E}\left[\sum_{t=i}^{\infty}\gamma^{t-i} R(s_t,\pi(s_t)) \,\middle|\, s_i=s\right] = \mathbb{E}\left[R(s,\pi(s)) + \gamma R(s_{i+1},\pi(s_{i+1})) + \gamma^2 R(s_{i+2},\pi(s_{i+2})) + \dots \,\middle|\, s_i=s\right]$$
$$ V^\pi(s_{i+\Delta i}=s) = \mathbb{E}\left[\sum_{t=i+\Delta i}^{\infty}\gamma^{t-i-\Delta i} R(s_t,\pi(s_t)) \,\middle|\, s_{i+\Delta i}=s\right] = \mathbb{E}\left[R(s,\pi(s)) + \gamma R(s_{i+\Delta i+1},\pi(s_{i+\Delta i+1})) + \gamma^2 R(s_{i+\Delta i+2},\pi(s_{i+\Delta i+2})) + \dots \,\middle|\, s_{i+\Delta i}=s\right]$$
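To spell out the re-indexing I have in mind (this is my own rewriting, assuming the transition kernel and reward function do not depend on $t$): substituting $k = t-i$ in the first sum and $k = t-i-\Delta i$ in the second gives

$$\begin{aligned}
V^\pi(s_i = s) &= \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k} R(s_{i+k},\pi(s_{i+k})) \,\middle|\, s_i = s\right],\\
V^\pi(s_{i+\Delta i} = s) &= \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k} R(s_{i+\Delta i+k},\pi(s_{i+\Delta i+k})) \,\middle|\, s_{i+\Delta i} = s\right],
\end{aligned}$$

so the two expressions seem to differ only in how the time axis is labelled, and under stationarity the conditional distribution of the trajectory given the starting state should be the same in both cases.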
Is this true for the Bellman Equation?
Topic dynamic-programming reinforcement-learning machine-learning
Category Data Science