Is the "training loop" used in AlphaGo Zero the same as an "epoch"?

I am confused about the training stage of AlphaGo Zero, which uses the data collected during the self-play stage.

According to an AlphaGo Zero Cheat Sheet I found, the training routine is:

  • Loop from 1 to 1,000:
    • Sample a mini-batch of 2048 episodes from the last 500,000 games
    • Use this mini-batch as input for training (minimize their loss function)
  • After this loop, compare the current network (after the training) with the old one (prior to the training)

However, after reading the article, I did not see any mention of how many epochs they used with those mini-batches.

Questions:

  1. Are those 1,000 training iterations the actual epochs of the algorithm? The Keras code would then loosely translate to:
network.fit(x_train, y_train, batch_size = 2048, epochs = 1000, ...)
  2. Or do they actually have a for loop around the training? The Keras code would then loosely translate to:
for _ in range(1000):
    x_train, y_train = sample_states_from_past_games(data_from_selfplay)
    network.fit(x_train, y_train, batch_size = ???, epochs = ???, ...)

If it is the second option, I would like to know how many batches and epochs they used.

Topic deepmind training keras tensorflow deep-learning

Category Data Science


I think they did the second option. If the network were fit to a single mini-batch of 2,048 states for 1,000 epochs, it would overfit to those 2,048 states, and the trained network would be less likely to beat the old one.

There are numerous candidate states to sample from. If we assume an average game length of 150 moves, the last 500,000 games contain about 500,000 × 150 = 75,000,000 states. Sampling a fresh mini-batch for each training iteration lets the network see many of these states.

In that case, batch_size is 2,048 and epochs is 1: one gradient step per sampled mini-batch. (In the paper, training was distributed across 64 GPU workers with a batch size of 32 per worker, for a total of 2,048.)
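To make the second option concrete, here is a minimal sketch of the sampling loop. The buffer contents and the sample_minibatch helper are hypothetical stand-ins; only the numbers (500,000 retained games, 2,048-state mini-batches, 1,000 iterations) come from the cheat sheet described above.

```python
import random

# Numbers from the cheat sheet; the buffer itself is a toy stand-in.
BUFFER_GAMES = 500_000       # window of most recent self-play games
AVG_MOVES_PER_GAME = 150     # rough average length of a Go game
BATCH_SIZE = 2_048
TRAIN_ITERATIONS = 1_000

def sample_minibatch(replay_buffer, batch_size=BATCH_SIZE):
    """Draw a fresh mini-batch without replacement; resampling on every
    iteration is what avoids overfitting to any single batch."""
    return random.sample(replay_buffer, batch_size)

# Toy replay buffer: integers standing in for (state, policy, value) tuples.
replay_buffer = list(range(100_000))

for _ in range(3):  # 1,000 in the real schedule; 3 here to keep the demo fast
    batch = sample_minibatch(replay_buffer)
    # A real implementation would do one gradient step here, e.g.
    # network.train_on_batch(...), i.e. batch_size=2048 and epochs=1.
    assert len(batch) == BATCH_SIZE
```

The key design point is that fit-style "epochs" never appear: each iteration draws a new 2,048-state sample from the pool of roughly 75,000,000 states, so no single batch is seen more than about once.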
