In first-visit Monte Carlo, are we assuming the environment is the same over episodes?
I'm watching this video (11:30), which presents the simplest algorithm for reinforcement learning, Monte Carlo policy evaluation. In general it says:
The first time a state is visited in an episode:
- increment the visit counter: N(s) = N(s) + 1
- increment the state's total return by the current episode's return from that point onward: S(s) = S(s) + G_t
The state's value is then estimated by the mean return over many episodes:
V(s) = S(s) / N(s)
By the law of large numbers, V(s) → V_true(s) as N(s) → ∞.
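For concreteness, here is a minimal sketch of that update loop. The toy random-walk environment, the fixed uniform policy, and all function names are my own assumptions for illustration; they are not from the video:

```python
import random
from collections import defaultdict

# Hypothetical toy environment: a random walk on states 0..4, starting at 2.
# The policy moves left or right uniformly at random; reaching state 4 gives
# reward 1, reaching state 0 gives reward 0, all other transitions give 0.
def generate_episode(rng):
    """Run one episode; return a list of (state, reward) pairs."""
    state = 2
    episode = []
    while state not in (0, 4):
        next_state = state + rng.choice([-1, 1])
        reward = 1.0 if next_state == 4 else 0.0
        episode.append((state, reward))
        state = next_state
    return episode

def first_visit_mc(num_episodes, gamma=1.0, seed=0):
    rng = random.Random(seed)
    N = defaultdict(int)    # visit counts N(s)
    S = defaultdict(float)  # accumulated returns S(s)
    for _ in range(num_episodes):
        episode = generate_episode(rng)
        # Compute the return G_t from each time step t, working backward.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()  # restore forward time order
        # Update statistics only at the FIRST visit to each state.
        seen = set()
        for state, G in returns:
            if state not in seen:
                seen.add(state)
                N[state] += 1
                S[state] += G
    # Estimated value: mean return, V(s) = S(s) / N(s)
    return {s: S[s] / N[s] for s in N}

V = first_visit_mc(10000)
```

For this chain the true values under the uniform policy are V(s) = s/4 (the probability of reaching state 4 before state 0), so the estimate for the start state should converge toward 0.5 as the episode count grows.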
My question is: must the environment behave the same (i.e., have fixed transition and reward dynamics) across episodes, or can it change randomly between episodes and still let us recover the true value given a large enough number of episodes?
Topic monte-carlo reinforcement-learning
Category Data Science