Is the "training loop" used in AlphaGo Zero the same as an "epoch"?
I am confused about the training stage of AlphaGo Zero, which uses the data collected from the self-play stage.
According to an AlphaGo Zero Cheat Sheet I found, the training routine is:
- Loop from 1 to 1,000:
- Sample a mini-batch of 2048 episodes from the last 500,000 games
- Use this mini-batch as input for training (minimize their loss function)
- After this loop, compare the current network (after the training) with the old one (prior to the training)
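To make the routine above concrete, here is a minimal sketch of how I read it, as a sampling loop rather than epochs over a fixed dataset. The names `sample_minibatch`, `replay_buffer`, and the commented-out `train_step` are my own illustrations, not from the paper, and the buffer contents are dummy data:

    import random

    def sample_minibatch(replay_buffer, batch_size=2048):
        # Pool all positions from the stored games, then sample uniformly.
        positions = [p for game in replay_buffer for p in game]
        return random.sample(positions, min(batch_size, len(positions)))

    # Illustrative replay buffer: 10 "games" of 100 "positions" each.
    replay_buffer = [[(g, p) for p in range(100)] for g in range(10)]

    for step in range(1000):              # the 1,000 training iterations
        batch = sample_minibatch(replay_buffer, batch_size=32)
        # train_step(network, batch)      # one gradient update per iteration

Under this reading, each of the 1,000 iterations draws a fresh mini-batch and performs a single update, so there is no separate "epoch" count at all.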
However, after reading the article, I did not see any mention of how many epochs they used with those mini-batches.
Questions:
- Are those 1,000 training iterations the actual epochs of the algorithm? The Keras code would then loosely translate to:
      network.fit(x_train, y_train, batch_size=2048, epochs=1000, ...)
- Or do they actually have a for loop for the training? The Keras code would then loosely translate to:
      for _ in range(1000):
          x_train, y_train = sample_states_from_past_games(data_from_selfplay)
          network.fit(x_train, y_train, batch_size=???, epochs=???, ...)
If it is the second option, I would like to know how many batches and epochs they used.
Topic deepmind training keras tensorflow deep-learning
Category Data Science