Find highest reward for epsilon-greedy bandit program

Question

vishak raj

June 3, 2022, 17:23

I have started learning reinforcement learning. The first example solves a multi-armed bandit problem with the epsilon-greedy method.

The example uses three bandit machines. The output is the estimated mean reward of each machine and the cumulative average reward for each epsilon value.

The code -

import numpy as np
import matplotlib.pyplot as plt


class Bandit:
  def __init__(self, m):
    self.m = m
    self.mean = 0
    self.N = 0

  def pull(self):
    return np.random.randn() + self.m

  def update(self, x):
    self.N += 1
    self.mean = (1 - 1.0/self.N)*self.mean + 1.0/self.N*x


def run_experiment(m1, m2, m3, eps, N):
  bandits = [Bandit(m1), Bandit(m2), Bandit(m3)]

  data = np.empty(N)

  for i in range(N):
    # epsilon greedy
    p = np.random.random()
    if p < eps:
      j = np.random.choice(3)
    else:
      j = np.argmax([b.mean for b in bandits])
    x = bandits[j].pull()
    bandits[j].update(x)


    data[i] = x
  cumulative_average = np.cumsum(data) / (np.arange(N) + 1)

  for b in bandits:
    print(b.mean)

  return cumulative_average

if __name__ == '__main__':
  c_1 = run_experiment(1.0, 2.0, 3.0, 0.1, 100000)
  c_05 = run_experiment(1.0, 2.0, 3.0, 0.05, 100000)
  c_01 = run_experiment(1.0, 2.0, 3.0, 0.01, 100000)

  # log scale plot
  plt.plot(c_1, label='eps = 0.1')
  plt.plot(c_05, label='eps = 0.05')
  plt.plot(c_01, label='eps = 0.01')
  plt.legend()
  plt.xscale('log')
  plt.show()
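
As a side note on `update`: the line `self.mean = (1 - 1.0/self.N)*self.mean + 1.0/self.N*x` is an incremental form of the sample mean, so each bandit's `mean` should equal the plain average of the rewards it has returned so far. A minimal sketch to check this, where `running_mean` is a hypothetical helper that is not part of the code above and the seed and sample size are arbitrary:

```python
import numpy as np


def running_mean(samples):
    """Update a mean one sample at a time, as Bandit.update does."""
    mean, n = 0.0, 0
    for x in samples:
        n += 1
        mean = (1 - 1.0 / n) * mean + (1.0 / n) * x
    return mean


# Rewards drawn the same way as Bandit.pull() with m = 3.0
rng = np.random.default_rng(0)
samples = rng.normal(size=1000) + 3.0

# The incremental estimate agrees with the batch mean up to rounding
assert np.isclose(running_mean(samples), samples.mean())
```

The incremental form avoids storing every past reward, which is why only `mean` and `N` are kept per machine.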

The output plot (cumulative average for the different epsilon values 0.1, 0.05 and 0.01) -

I understand from the output graph that the run with epsilon = 0.01 reached a higher cumulative average reward than the runs with epsilon = 0.1 and 0.05.

Here we are only comparing the different epsilon values.

But how do I decide which machine is the best, i.e. which one gives the highest reward?
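
For reference, one possible way to read this off is to compare the estimated means after the run and take the index of the largest one. This is only a sketch under assumptions: `estimate_means` is a hypothetical, self-contained rewrite of the loop in `run_experiment`, and the seed, step count and pull counts are additions that are not in the code above.

```python
import numpy as np


def estimate_means(true_means, eps, n_steps, seed=0):
    """Run epsilon-greedy and return per-machine estimates and pull counts."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    means = np.zeros(k)              # estimated mean reward per machine
    counts = np.zeros(k, dtype=int)  # number of pulls per machine
    for _ in range(n_steps):
        if rng.random() < eps:
            j = int(rng.integers(k))   # explore: pick a random machine
        else:
            j = int(np.argmax(means))  # exploit: pick the current best estimate
        x = rng.normal() + true_means[j]
        counts[j] += 1
        means[j] += (x - means[j]) / counts[j]  # incremental mean update
    return means, counts


means, counts = estimate_means([1.0, 2.0, 3.0], eps=0.1, n_steps=10000)
best = int(np.argmax(means))
print("estimated means:", means)
print("pull counts:", counts)
print("best machine index:", best)
```

The pull counts are printed as well because an estimate based on very few pulls is less reliable than one based on many.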

I am looking forward to learning. Thanks.

Topic implementation reinforcement-learning deep-learning machine-learning

Category Data Science
