AlphaGo Zero loss function

As far as I understood from the AlphaGo Zero system:

  • During the self-play phase, the MCTS algorithm stores a tuple ($s$, $\pi$, $z$) for each position, where $s$ is the state, $\pi$ is the probability distribution over the actions in that state, and $z \in \{-1, +1\}$ indicates the eventual winner of the game that the state occurred in.
  • The network receives $s$ as input (a stack of matrices describing the state $s$) and outputs two values: $p$, a probability distribution over the actions, and $v \in [-1, 1]$, an estimate of which player is likely to win the game.
  • For training it uses the following loss function:

$$l = (z - v)^2 - \pi^T \log(p) + c \|\theta\|^2$$

  • Lastly, the new network is evaluated and the self-play phase starts again.
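The loss above can be sketched concretely. Below is a minimal NumPy version of the per-example loss; the function and variable names are my own illustration (DeepMind's implementation is not public), and `c` is the regularisation constant from the formula:

```python
import numpy as np

def alphago_zero_loss(z, v, pi, p, theta, c=1e-4):
    """Sketch of the AlphaGo Zero loss for a single training example.

    z     : game outcome from self-play, in {-1, +1} (the target for v)
    v     : scalar value predicted by the network, in [-1, 1]
    pi    : MCTS policy (target distribution over moves)
    p     : policy distribution predicted by the network
    theta : flattened network weights, for the L2 regularisation term
    c     : regularisation strength (illustrative default)
    """
    value_loss = (z - v) ** 2             # squared error on the game outcome
    policy_loss = -np.dot(pi, np.log(p))  # cross-entropy between pi and p
    reg = c * np.sum(theta ** 2)          # L2 penalty on the weights
    return value_loss + policy_loss + reg

# Toy example: 3 legal moves, MCTS prefers move 0, network roughly agrees.
pi = np.array([0.7, 0.2, 0.1])
p = np.array([0.6, 0.3, 0.1])
theta = np.zeros(10)
loss = alphago_zero_loss(z=1.0, v=0.8, pi=pi, p=p, theta=theta)
```

Note that the loss takes both the network outputs ($v$, $p$) and the MCTS targets ($z$, $\pi$), which is exactly the point of the questions below.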

My questions

  • If the network receives only the state $s$ (represented as matrices) as input, how can the loss function be computed when the values $\pi$ and $z$ are needed?

  • If these values are indeed passed as input to the network, do they pass through the convolutional (and other) layers of the network? If so, there is no mention of this in the article (unless I missed it).



The best way to understand that part is by looking at figure 1 in the AlphaGo Zero paper.

The neural network (NN) minimizes the difference between its own policy $p_t$ and the MCTS policy $\pi_t$. The value of $\pi_t$ is produced by the MCTS self-play, which in turn uses the NN from the previous iteration.

The same goes for $v_t$ and $z$. In each iteration the weights of the NN are adjusted to minimize the distance between $v_t$ (output of the NN) and $z$ (output of the MCTS) as defined by the loss function. $z$ does not have a time index here as the full self-play produces just a single value for $z$ each time it is conducted.

TLDR for your first question: both $\pi$ and $z$ are produced by the MCTS self-play and serve as training targets for the NN.

(The indexing in the paper is a bit confusing in my opinion so it is probably easiest to just look at it as stated above)

Now, by "input" I do not mean input at the input layer of the NN. As described in the appendix under "Neural network architecture", the input is a "19 x 19 x 17 image stack", which contains the following information:

  • The positions of player 1 for the latest 8 rounds (8 feature planes)
  • The positions of player 2 for the latest 8 rounds (8 feature planes)
  • A color feature indicating whose turn it is (1 feature plane)

And those 17 feature planes ($8+8+1$), combined with the $19\cdot19$ board, form the $19\cdot19\cdot17$ input the NN receives through its input layer. $\pi$ and $z$ are passed to the NN via the loss function only (i.e. they are the target values in this supervised learning problem!).
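To make the encoding concrete, here is a hedged sketch of how the 19 x 19 x 17 stack described in the appendix could be assembled; the function and argument names are my own, not from the paper:

```python
import numpy as np

BOARD = 19    # Go board size
HISTORY = 8   # number of past positions kept per player

def encode_state(p1_history, p2_history, current_player):
    """Build a (19, 19, 17) input stack from board history.

    p1_history, p2_history : lists of 8 binary 19x19 arrays, one per past
                             position, with 1 where that player has a stone
    current_player         : 0 or 1, broadcast to a constant colour plane
    """
    planes = p1_history + p2_history                  # 8 + 8 binary planes
    colour = np.full((BOARD, BOARD), current_player)  # 1 constant plane
    planes.append(colour)
    return np.stack(planes, axis=-1)                  # shape (19, 19, 17)

# Toy usage: empty board history, player 1 to move.
p1 = [np.zeros((BOARD, BOARD)) for _ in range(HISTORY)]
p2 = [np.zeros((BOARD, BOARD)) for _ in range(HISTORY)]
x = encode_state(p1, p2, current_player=1)
```

Only this stack `x` goes through the input layer; $\pi$ and $z$ never appear in it.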

TLDR for your second question: $\pi$ and $z$ are not fed to the NN through the input layer, only via the loss function as target values.
