Where Does the Normal Glorot Initialization Come from?
The famous Glorot initialization was first described in the paper Understanding the difficulty of training deep feedforward neural networks (Glorot & Bengio, 2010). In that paper, the authors derive the following uniform initialization, cf. Eq. (16) in their paper: \begin{equation} W \sim U\left[ -\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]. \tag{16}\end{equation}
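To make Eq. (16) concrete, here is a minimal sketch (the function name glorot_uniform and the layer sizes are my own placeholders) that samples a weight matrix with exactly those uniform bounds:

```python
import math
import torch

def glorot_uniform(n_in, n_out):
    # Bound from Eq. (16): sqrt(6) / sqrt(n_j + n_{j+1})
    limit = math.sqrt(6.0) / math.sqrt(n_in + n_out)
    # Weight matrix of shape (n_out, n_in), sampled from U[-limit, +limit]
    return torch.empty(n_out, n_in).uniform_(-limit, limit)

W = glorot_uniform(256, 128)  # placeholder layer sizes
print(W.abs().max().item() <= math.sqrt(6.0) / math.sqrt(256 + 128))  # True
```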
If we take a look at the PyTorch documentation for weight initialization, there are two Glorot (Xavier) initializations, namely torch.nn.init.xavier_uniform_(tensor, gain=1.0) and torch.nn.init.xavier_normal_(tensor, gain=1.0). According to the documentation, the latter initializes weights from the normal distribution $\mathcal N(0, \sigma^2)$, where the standard deviation is given by
$$ \sigma = \sqrt{\frac{2}{n_{j} + n_{j+1}}}.$$
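For context, here is a minimal usage sketch of the two initializers (the layer sizes are arbitrary placeholders); the last line simply compares the empirical standard deviation produced by the normal variant against the $\sigma$ from the documentation:

```python
import math
import torch
import torch.nn as nn

n_j, n_j1 = 256, 128                    # fan-in (n_j) and fan-out (n_{j+1})
W = torch.empty(n_j1, n_j)              # PyTorch infers fan-in/fan-out from the shape

nn.init.xavier_uniform_(W, gain=1.0)    # uniform Glorot, bounds as in Eq. (16)
nn.init.xavier_normal_(W, gain=1.0)     # normal Glorot, N(0, sigma^2)

sigma = math.sqrt(2.0 / (n_j + n_j1))   # sigma from the documentation formula
print(W.std().item(), sigma)            # empirical std of the normal init vs. sigma
```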
Questions:
1.) Why do we have a $2$ instead of a $6$ in the standard deviation of the normal Glorot initialization?
2.) Where does the normal Glorot initialization come from? In other words, was there a follow-up to the above-mentioned paper that demonstrated the superiority of the normal Glorot initialization over the uniform one?
Thanks!
Topic: weight-initialization, deep-learning, neural-network
Category: Data Science