How to train the policy and value networks when implementing AlphaZero for chess
So, I'm trying to implement AlphaZero's logic for the game of chess. What I understand of the algorithm so far is:
- Load two models, one of which is the best model you have so far. Both models have a value network and a policy network and use MCTS to find the best move.
- Play n games between these two models and save the states, the moves, and who won each game.
- Train the new model on a sample of the positions collected from those games. This involves training both the policy and the value network of the model.
- Play m games between the two models again. If the new model wins more than 60% of them, it becomes the best model.
- Repeat until the goal I'm after is reached, which is that I can no longer beat the best model. (A rough sketch of how I picture this loop is below.)
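To show what I mean, here is a rough sketch of one iteration of that loop in Python. The helper names (`play_game`, `train_policy_and_value`, `play_game_and_score`) are placeholders I made up for parts I haven't figured out yet:

```python
import random

def training_iteration(best_model, new_model, n=100, m=40, threshold=0.60):
    # 1. Play n games between the two models, saving states, MCTS policies and outcomes
    replay_buffer = []
    for _ in range(n):
        states, mcts_policies, outcome = play_game(new_model, best_model)
        replay_buffer.extend(zip(states, mcts_policies, [outcome] * len(states)))

    # 2. Train the new model on a random sample of the saved positions
    batch = random.sample(replay_buffer, min(2048, len(replay_buffer)))
    train_policy_and_value(new_model, batch)

    # 3. Play m evaluation games; promote the new model if it wins more than 60%
    wins = sum(play_game_and_score(new_model, best_model) for _ in range(m))
    if wins / m > threshold:
        best_model = new_model
    return best_model
```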
I have two questions:
- Is the new model instantiated with totally random weights each time?
- How do I train the policy and value networks?
Let's say, for example, I wasn't playing chess but tic-tac-toe,
and the game state of a training example looked like this:
[1 0 -1],
[0 0 0],
[-1 0 1]
So train_X will be the state. The value network produced a value of 0.34 for this state, and the policy network output this array: [0.23, 0.24, 0.25, 0.33, 0.55, 0.6, 0.21, 0.1, 0.12].
What will train_y be for this instance? I'm guessing the value network's train_y will be 1 if the first player won the game, else 0. Is that correct? What is train_y for the policy network in this case, and how do I train the policy network?
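To make the question concrete, this is how I currently imagine assembling train_X and train_y for that position (just my guess: a game-outcome value target and normalized MCTS visit counts as the policy target; the visit counts below are made-up numbers):

```python
import numpy as np

# The board position from above: 1 = X, -1 = O, 0 = empty
state = np.array([[ 1, 0, -1],
                  [ 0, 0,  0],
                  [-1, 0,  1]], dtype=np.float32)
train_X = state.reshape(1, 3, 3, 1)          # one training example

# My guess at the value target: the final game result from the
# perspective of the player to move (+1 win, 0 draw, -1 loss)
value_target = np.array([[1.0]])

# My guess at the policy target: MCTS visit counts for the 9 squares,
# normalised to a probability distribution (counts are invented here)
visit_counts = np.array([0, 3, 5, 2, 40, 0, 1, 7, 2], dtype=np.float32)
policy_target = (visit_counts / visit_counts.sum()).reshape(1, 9)

train_y = {"value": value_target, "policy": policy_target}
# e.g. with a two-headed Keras model:
# model.fit(train_X, train_y)   # MSE on the value head, cross-entropy on the policy head
```

Is that roughly the right idea, or should the targets be something else entirely?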
P.S. I know that from the state alone it is not clear whose turn it is. Any ideas about how to solve that problem too? It is not my biggest concern at the moment, though.
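One idea I had for the P.S. is to add an extra input plane that encodes whose turn it is, something like this (the function name is just mine):

```python
import numpy as np

def encode_state(board, player_to_move):
    """Stack the 3x3 board with a constant plane marking whose turn it is.

    board: 3x3 array with 1 / -1 / 0 entries
    player_to_move: 1 or -1
    """
    turn_plane = np.full_like(board, player_to_move, dtype=np.float32)
    return np.stack([board.astype(np.float32), turn_plane], axis=-1)  # shape (3, 3, 2)

# usage: x = encode_state(state, player_to_move=1)
```

Alternatively, I suppose I could always present the board from the perspective of the player to move (multiply the board by player_to_move), but I'm not sure which is more common.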