How to train the policy and value networks when implementing AlphaZero for chess
So, I'm trying to implement AlphaZero's logic for the game of chess. What I understand of the algorithm so far is:
- Load two models, one of which is the best model you have so far. Both models have a value network and a policy network and use MCTS to find the best move.
- Play n games between these two models and save the states, the moves, and who won each game.
- Train the new model on a sample of the positions collected from those games. This involves training both the policy and the value network of the model.
- Play m games between the two models again. If the new model wins more than 60% of them, it becomes the best model.
- Repeat until the goal I'm after is reached, which is that I can no longer beat the best model. (A rough sketch of how I picture this loop is below.)
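To show what I mean, here is a rough sketch of one iteration of that loop in Python. The helper names (`play_game`, `train_policy_and_value`, `play_game_and_score`) are placeholders I made up for parts I haven't figured out yet:

```python
import random

def training_iteration(best_model, new_model, n=100, m=40, threshold=0.60):
    # 1. Play n games between the two models, saving states, MCTS policies and outcomes
    replay_buffer = []
    for _ in range(n):
        states, mcts_policies, outcome = play_game(new_model, best_model)
        replay_buffer.extend(zip(states, mcts_policies, [outcome] * len(states)))

    # 2. Train the new model on a random sample of the saved positions
    batch = random.sample(replay_buffer, min(2048, len(replay_buffer)))
    train_policy_and_value(new_model, batch)

    # 3. Play m evaluation games; promote the new model if it wins more than 60%
    wins = sum(play_game_and_score(new_model, best_model) for _ in range(m))
    if wins / m > threshold:
        best_model = new_model
    return best_model
```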
I have two questions:
- Is the new model instantiated with totally random weights each time?
- How do I train the policy and value networks?
Let's say, for example, I wasn't playing chess but tic-tac-toe,
and the game state of a training example looked like this:
[1 0 -1],
[0 0 0],
[-1 0 1]
So train_X will be the state. The value network produced a value of 0.34 for this state, and the policy network output this array: [0.23, 0.24, 0.25, 0.33, 0.55, 0.6, 0.21, 0.1, 0.12].
What will train_y be for this instance? I'm guessing the value network's train_y will be 1 if the first player won the game, else 0. Is that correct? What is train_y for the policy network in this case, and how do I train the policy network?
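To make the question concrete, this is how I currently imagine assembling train_X and train_y for that position (just my guess: a game-outcome value target and normalized MCTS visit counts as the policy target; the visit counts below are made-up numbers):

```python
import numpy as np

# The board position from above: 1 = X, -1 = O, 0 = empty
state = np.array([[ 1, 0, -1],
                  [ 0, 0,  0],
                  [-1, 0,  1]], dtype=np.float32)
train_X = state.reshape(1, 3, 3, 1)          # one training example

# My guess at the value target: the final game result from the
# perspective of the player to move (+1 win, 0 draw, -1 loss)
value_target = np.array([[1.0]])

# My guess at the policy target: MCTS visit counts for the 9 squares,
# normalised to a probability distribution (counts are invented here)
visit_counts = np.array([0, 3, 5, 2, 40, 0, 1, 7, 2], dtype=np.float32)
policy_target = (visit_counts / visit_counts.sum()).reshape(1, 9)

train_y = {"value": value_target, "policy": policy_target}
# e.g. with a two-headed Keras model:
# model.fit(train_X, train_y)   # MSE on the value head, cross-entropy on the policy head
```

Is that roughly the right idea, or should the targets be something else entirely?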
P.S. I know that from the state alone it is not clear whose turn it is. Any ideas about how to solve that problem too? It is not my biggest concern at the moment, though.
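One idea I had for the P.S. is to add an extra input plane that encodes whose turn it is, something like this (the function name is just mine):

```python
import numpy as np

def encode_state(board, player_to_move):
    """Stack the 3x3 board with a constant plane marking whose turn it is.

    board: 3x3 array with 1 / -1 / 0 entries
    player_to_move: 1 or -1
    """
    turn_plane = np.full_like(board, player_to_move, dtype=np.float32)
    return np.stack([board.astype(np.float32), turn_plane], axis=-1)  # shape (3, 3, 2)

# usage: x = encode_state(state, player_to_move=1)
```

Alternatively, I suppose I could always present the board from the perspective of the player to move (multiply the board by player_to_move), but I'm not sure which is more common.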