Assuming fairly reasonable data normalization, the expectation of the weights should be zero or close to it. It might seem reasonable, then, to set all of the initial weights to zero, since a positive initial weight will have further to go if it should actually be a negative weight and vice versa. This, however, does not work. If all of the weights are the same, every neuron in a layer computes the same output and receives the same gradient update, so the neurons stay identical and the model will not learn anything - there is no source of asymmetry between the neurons.
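Here is a minimal NumPy sketch of the symmetry problem (the tiny two-layer network, the constant starting value of 0.5, and the learning rate are all just illustrative): every hidden unit starts out identical, receives an identical gradient on every step, and therefore stays identical no matter how long you train.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))      # 8 samples, 3 features
y = rng.normal(size=(8, 1))

W1 = np.full((3, 4), 0.5)        # every weight identical (the same happens with zeros)
W2 = np.full((4, 1), 0.5)

for _ in range(100):
    h = np.tanh(X @ W1)                            # all hidden columns are identical
    err = h @ W2 - y
    dW2 = h.T @ err                                # gradient of 0.5 * sum(err**2)
    dW1 = X.T @ ((err @ W2.T) * (1 - h ** 2))
    W2 -= 0.01 * dW2
    W1 -= 0.01 * dW1

print(np.allclose(W1, W1[:, :1]))  # True: the columns (hidden units) never diverge
```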
What we could do instead is to keep the weights very close to zero but make them different by initializing them to small, non-zero numbers. This is what is suggested in the tutorial that you linked. It has the same advantage as all-zero initialization in that it stays close to the 'best guess' expectation value, but the symmetry has also been broken enough for the algorithm to work. For example:
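This is a sketch of that idea; the layer sizes and the 0.01 scale factor are illustrative choices, not values taken from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 100, 50

W = 0.01 * rng.normal(size=(d_in, d_out))   # small, zero-mean, but all different
b = np.zeros(d_out)                         # biases can safely start at zero
```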
Small random initialization has its own problems, though. It is not necessarily true that smaller numbers will work better, especially if the neural network is deep. The gradients calculated in backpropagation are proportional to the weights; very small weights produce very small gradients, which can make the network take much, much longer to train or prevent it from ever converging.
Another potential issue is that the distribution of each neuron's output, when using random initialization values, has a variance that grows with the number of inputs. A common additional step is to normalize the neuron's output variance by dividing its weights by $\sqrt{d}$, where $d$ is the number of inputs to the neuron. If the weights are drawn uniformly from $[-1, 1]$, the resulting weights are uniformly distributed in $\left[-\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}\right]$.
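A short NumPy sketch of that scaling, with arbitrary layer sizes and assuming zero-mean, unit-variance inputs: without the $1/\sqrt{d}$ factor the pre-activation variance grows linearly with $d$; with it, the variance stays constant regardless of $d$ (about $1/3$ here, since $\mathrm{Uniform}(-1, 1)$ has variance $1/3$).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_neurons = 256, 64
x = rng.normal(size=(10_000, d))                      # normalized inputs

W_raw = rng.uniform(-1.0, 1.0, size=(d, n_neurons))   # unscaled weights
W = W_raw / np.sqrt(d)                                # values in [-1/sqrt(d), 1/sqrt(d)]

print((x @ W_raw).var())   # ~ d / 3: grows with the number of inputs
print((x @ W).var())       # ~ 1 / 3: independent of d
```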