AlphaGo Zero loss function
As far as I understood from the AlphaGo Zero system:
- During the self-play phase, the MCTS algorithm stores a tuple $(s, \pi, z)$, where $s$ is the state, $\pi$ is the probability distribution over the actions in that state (derived from the MCTS visit counts), and $z \in \{-1, +1\}$ is an integer indicating the winner of the game that state occurred in.
- The network receives $s$ as input (a stack of matrices describing the state $s$) and outputs two values: $p$ and $v$. $p$ is a probability distribution over the actions, and $v \in [-1, 1]$ is a value estimating which player is likely to win the game.
- For training, it uses the following loss function:
$$l = (z - v)^2 - \pi^\top \log(p) + c \, \|\theta\|^2$$
- Lastly, the new network is evaluated, and the self-play phase starts again.
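To make the loss concrete, here is a minimal NumPy sketch of computing it for a single training example. All values below (a 3-action toy game, the weight vector, the coefficient $c$) are made up for illustration; they are not from the paper.

```python
import numpy as np

# Stored by self-play (training targets, not network inputs):
pi = np.array([0.7, 0.2, 0.1])  # MCTS visit-count distribution over actions
z = 1.0                          # game outcome from the current player's view

# Produced by the network from the state s:
p = np.array([0.6, 0.3, 0.1])   # policy head output (softmax over actions)
v = 0.8                          # value head output in [-1, 1]

theta = np.array([0.5, -0.3])   # toy stand-in for the flattened network weights
c = 1e-4                         # L2 regularization coefficient (assumed value)

value_loss = (z - v) ** 2                 # (z - v)^2
policy_loss = -np.dot(pi, np.log(p))      # -pi^T log(p), a cross-entropy term
l2_penalty = c * np.sum(theta ** 2)       # c * ||theta||^2
loss = value_loss + policy_loss + l2_penalty
```

Note that $\pi$ and $z$ appear only in the loss, never as arguments to the network itself.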
My questions
If the network receives only the state $s$ (represented as matrices) as input, how can the loss function be computed, given that it also requires the values $\pi$ and $z$?
If these values are indeed passed as input to the network, do they go through the convolutional (and other) layers of the network? If so, I found no mention of this in the article (unless I missed it).
Topic deepmind keras tensorflow loss-function deep-learning
Category Data Science