Where Does the Normal Glorot Initialization Come from?
The famous Glorot initialization was first described in the paper Understanding the difficulty of training deep feedforward neural networks (Glorot & Bengio, 2010). In that paper, the authors derive the following uniform initialization, cf. Eq. (16) in their paper: \begin{equation} W \sim U\left[ -\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]. \tag{16}\end{equation}
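To make Eq. (16) concrete, here is a minimal sketch (the function name glorot_uniform and the layer sizes are my own placeholders) that samples a weight matrix with exactly those uniform bounds:

```python
import math
import torch

def glorot_uniform(n_in, n_out):
    # Bound from Eq. (16): sqrt(6) / sqrt(n_j + n_{j+1})
    limit = math.sqrt(6.0) / math.sqrt(n_in + n_out)
    # Weight matrix of shape (n_out, n_in), sampled from U[-limit, +limit]
    return torch.empty(n_out, n_in).uniform_(-limit, limit)

W = glorot_uniform(256, 128)  # placeholder layer sizes
print(W.abs().max().item() <= math.sqrt(6.0) / math.sqrt(256 + 128))  # True
```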
If we take a look at the PyTorch documentation for weight initialization, there are two Glorot (Xavier) initializations, namely torch.nn.init.xavier_uniform_(tensor, gain=1.0) and torch.nn.init.xavier_normal_(tensor, gain=1.0). According to the documentation, the latter initializes weights from the normal distribution $\mathcal N(0, \sigma^2)$, where the standard deviation is given by
$$ \sigma = \sqrt{\frac{2}{n_{j} + n_{j+1}}}.$$
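For context, here is a minimal usage sketch of the two initializers (the layer sizes are arbitrary placeholders); the last line simply compares the empirical standard deviation produced by the normal variant against the $\sigma$ from the documentation:

```python
import math
import torch
import torch.nn as nn

n_j, n_j1 = 256, 128                    # fan-in (n_j) and fan-out (n_{j+1})
W = torch.empty(n_j1, n_j)              # PyTorch infers fan-in/fan-out from the shape

nn.init.xavier_uniform_(W, gain=1.0)    # uniform Glorot, bounds as in Eq. (16)
nn.init.xavier_normal_(W, gain=1.0)     # normal Glorot, N(0, sigma^2)

sigma = math.sqrt(2.0 / (n_j + n_j1))   # sigma from the documentation formula
print(W.std().item(), sigma)            # empirical std of the normal init vs. sigma
```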
Questions:
1.) Why do we have a $2$ instead of a $6$ in the standard deviation of the normal Glorot initialization?
2.) Where does the normal Glorot initialization come from? In other words, was there a follow-up to the above-mentioned paper that demonstrated the superiority of the normal Glorot initialization over the uniform one?
Thanks!
Topic: weight-initialization, deep-learning, neural-network
Category: Data Science