RL Sutton book, initial estimate of q*(a) for 10 arm testbed

Question

mLstudent33

August 4, 2019 11:36

The Sutton book does not mention what the initial estimate of q*(a) is before the first reward is received. In this code repo that seems to go along with the book: Sutton code repo

The estimate is initialized to 0, per the snippet below:

def __init__(self, kArm=10, epsilon=0., initial=0., stepSize=0.1, sampleAverages=False, UCBParam=None,
                 gradient=False, gradientBaseline=False, trueReward=0.):

But the caption of Figure 2.1, which shows the distribution of rewards for the 10 arms of the bandit, says:

Figure 2.1: An example bandit problem from the 10-armed testbed. The true value q*(a) of each of the ten actions was selected according to a normal distribution with mean zero and unit variance, and then the actual rewards were selected according to a mean q*(a), unit-variance normal distribution, as suggested by these gray distributions.

So should I initialize instead with np.random.randn()?

Edit: The distribution

Topic randomized-algorithms reinforcement-learning

Category Data Science

Neil Slater · Accepted Answer · August 4, 2019 09:20

The description you quote explains how the true values will be set in the test when setting up a test run. This is necessary to fully state how the test works.
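To make the distinction concrete, here is a minimal sketch of how a single test run could be set up. The names `make_testbed`, `pull`, `q_true` and `q_est` are hypothetical and not taken from the book or the repo; the point is only that the true values are sampled by the environment, while the estimates belong to the agent and are initialised separately.

```python
import numpy as np


def make_testbed(k_arms=10, seed=0):
    """Sample the hidden true action values q*(a) ~ N(0, 1) for one test run."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(loc=0.0, scale=1.0, size=k_arms)
    return rng, q_true


def pull(rng, q_true, action):
    """Sample a reward for `action` from N(q*(action), 1)."""
    return float(rng.normal(loc=q_true[action], scale=1.0))


rng, q_true = make_testbed()

# The agent never sees q_true. Its estimates are a separate array,
# here initialised to 0 as in the snippet from the question.
q_est = np.zeros_like(q_true)

reward = pull(rng, q_true, action=3)
```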

Initialisation of your estimates is a different issue. If you know something about the distribution of the true action values, then it makes sense to use that. For instance, you could set all estimates to the expected true value, which is $0$ here. However, you may also use $0$ if you have no idea about the true values, as it is a simple arbitrary choice.
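As an illustration, the sketch below shows an epsilon-greedy agent that takes its initial estimates as an argument, so either choice can be plugged in. The function `run_bandit` and its parameters are assumptions made for this example, not the repo's API, and it uses sample-average updates rather than a fixed step size.

```python
import numpy as np


def run_bandit(initial_estimates, q_true, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent with sample-average updates (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    q_est = np.array(initial_estimates, dtype=float)  # copy, so the input is not mutated
    counts = np.zeros(len(q_true), dtype=int)
    rewards = np.empty(steps)
    for t in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(len(q_true)))  # explore: uniform random arm
        else:
            action = int(np.argmax(q_est))           # exploit: current best estimate
        reward = rng.normal(q_true[action], 1.0)     # reward ~ N(q*(a), 1)
        counts[action] += 1
        # Incremental sample average: Q <- Q + (R - Q) / n
        q_est[action] += (reward - q_est[action]) / counts[action]
        rewards[t] = reward
    return q_est, rewards


q_true = np.random.default_rng(1).normal(0.0, 1.0, size=10)
initial = np.zeros(10)  # all estimates start at the expected true value
q_est, rewards = run_bandit(initial, q_true)
```

With sample averages, the first pull of an arm replaces its initial estimate entirely (the step size is 1 when n = 1), so in this sketch the initial value only influences which arms are tried early on. With a constant step size the initial value fades more gradually.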

Setting the estimates by sampling from the same distribution is not unreasonable, as each sample is itself an unbiased estimate of the same mean. However, it does not really serve you well here, because it adds variance to the initial estimates (as well as the possibility of them being closer to the true values, they are equally likely to be worse), and on average it slows down some agent types slightly.
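The added variance can be checked numerically. The sketch below (variable names are my own) compares the mean squared error of the two initialisations against true values drawn from $N(0, 1)$. Analytically, an estimate of $0$ has expected squared error $1$, while an independent $N(0, 1)$ sample has expected squared error $1 + 1 = 2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, k_arms = 20_000, 10

# True action values for many independent test runs.
q_true = rng.normal(0.0, 1.0, size=(n_runs, k_arms))

# Two ways of initialising the estimates.
init_zero = np.zeros((n_runs, k_arms))
init_rand = rng.normal(0.0, 1.0, size=(n_runs, k_arms))  # like np.random.randn()

# Mean squared error of each initialisation before any reward is seen.
mse_zero = np.mean((init_zero - q_true) ** 2)  # expected value: 1
mse_rand = np.mean((init_rand - q_true) ** 2)  # expected value: 2

print(f"zero init MSE:   {mse_zero:.3f}")
print(f"random init MSE: {mse_rand:.3f}")
```

This only measures how far the starting estimates are from the true values; it does not by itself show how much any particular agent is slowed down.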
