In first-visit Monte Carlo, are we assuming the environment is the same over episodes?

I'm watching this video (11:30), which presents the simplest algorithm for reinforcement learning, Monte Carlo policy evaluation. In general it says:

The first time a state is visited in an episode:

  1. increment the visit counter: N(s) = N(s) + 1
  2. add the episode's return from that point onward to the state's total return: S(s) = S(s) + G_t

The state's value is estimated by the mean return over many episodes:

V(s) = S(s) / N(s)

By the law of large numbers, V(s) → V_true(s) as N(s) → ∞.
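To make the bookkeeping concrete, here is a minimal Python sketch of first-visit Monte Carlo evaluation as I understand it. The `env` object (with `reset()` and `step(action)` returning next state, reward, and a done flag) and the policy function `pi` are hypothetical stand-ins, not taken from the video; only the N(s) / S(s) updates and V(s) = S(s) / N(s) correspond to the algorithm above.

```python
from collections import defaultdict

def first_visit_mc(env, pi, num_episodes, gamma=1.0):
    # Assumed interfaces: env.reset() -> state, env.step(a) -> (state, reward, done),
    # pi(state) -> action. These are illustrative, not a specific library's API.
    N = defaultdict(int)      # N(s): number of first visits to s
    S = defaultdict(float)    # S(s): total return accumulated over first visits

    for _ in range(num_episodes):
        # Generate one episode following the policy pi
        episode = []          # list of (state, reward) pairs
        state = env.reset()
        done = False
        while not done:
            action = pi(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Walk backwards so G accumulates the return G_t from each time step
        G = 0.0
        returns_at = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_at.append((state, G))
        returns_at.reverse()  # restore forward time order

        # First-visit rule: only the first occurrence of a state in the episode counts
        seen = set()
        for state, G_t in returns_at:
            if state not in seen:
                seen.add(state)
                N[state] += 1
                S[state] += G_t

    # V(s) = S(s) / N(s): mean first-visit return per state
    return {s: S[s] / N[s] for s in N}
```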

My question is: should the environment always behave the same over different episodes, or can it change randomly and we still get the true value given a large enough number of episodes?

Tags: monte-carlo, reinforcement-learning



Yes, an assumption in the MDP formulation in RL is stationarity: the dynamics of the environment do not change over time (within or across episodes).

Note that this is almost always false in the real world, but the assumption simplifies matters considerably, which is why it is made.
