In first-visit Monte Carlo, are we assuming the environment is the same over episodes?

I'm watching this video (11:30), which presents the simplest algorithm for reinforcement learning, Monte Carlo policy evaluation. In general it says:

The first time a state is visited in an episode:

  1. increment the visit counter: N(s) = N(s) + 1
  2. add the episode's return from that point onward to the state's total return: S(s) = S(s) + G_t

The state's value is estimated by the mean return over many episodes:

V(s) = S(s) / N(s)

By the law of large numbers, V(s) → V_true(s) as N(s) → ∞.
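To make the bookkeeping concrete, here is a minimal Python sketch of first-visit Monte Carlo evaluation as I understand it. The `env` object (with `reset()` and `step(action)` returning next state, reward, and a done flag) and the policy function `pi` are hypothetical stand-ins, not taken from the video; only the N(s) / S(s) updates and V(s) = S(s) / N(s) correspond to the algorithm above.

```python
from collections import defaultdict

def first_visit_mc(env, pi, num_episodes, gamma=1.0):
    # Assumed interfaces: env.reset() -> state, env.step(a) -> (state, reward, done),
    # pi(state) -> action. These are illustrative, not a specific library's API.
    N = defaultdict(int)      # N(s): number of first visits to s
    S = defaultdict(float)    # S(s): total return accumulated over first visits

    for _ in range(num_episodes):
        # Generate one episode following the policy pi
        episode = []          # list of (state, reward) pairs
        state = env.reset()
        done = False
        while not done:
            action = pi(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Walk backwards so G accumulates the return G_t from each time step
        G = 0.0
        returns_at = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_at.append((state, G))
        returns_at.reverse()  # restore forward time order

        # First-visit rule: only the first occurrence of a state in the episode counts
        seen = set()
        for state, G_t in returns_at:
            if state not in seen:
                seen.add(state)
                N[state] += 1
                S[state] += G_t

    # V(s) = S(s) / N(s): mean first-visit return per state
    return {s: S[s] / N[s] for s in N}
```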

My question is: should the environment always behave the same over different episodes, or can it change randomly and we still get the true value given a large enough number of episodes?

Tags: monte-carlo, reinforcement-learning



Yes, an assumption in the MDP formulation in RL is stationarity: the dynamics of the environment do not change over time (within or across episodes).

Note that this is almost always false in the real world, but the assumption simplifies matters considerably, which is why it is made.
