RL Sutton book, initial estimate of q*(a) for 10 arm testbed
The Sutton book does not mention what the initial estimate is for q*(a) before the first reward is received. In this code repo that seems to go along with the book: Sutton code repo
They have initialized it with 0 per snippet below:
def __init__(self, kArm=10, epsilon=0., initial=0., stepSize=0.1, sampleAverages=False, UCBParam=None,
gradient=False, gradientBaseline=False, trueReward=0.):
But the explanation for Figure 2.1 that shows the distribution of rewards for the 10 arms of the bandit says,
Figure 2.1: An example bandit problem from the 10-armed testbed. The true value q ⇤ (a) of each of the ten actions was selected according to a normal distribution with mean zero and unit variance, and then the actual rewards were selected according to a mean q ⇤ (a) unit variance normal distribution, as suggested by these gray distributions.
So should I initialize instead with np.random.randn()
?
Edit: The distribution
Topic randomized-algorithms reinforcement-learning
Category Data Science