There are several reasons. First, a general rule: initialize your weights near zero, but avoid values that are too small. If you normalize your inputs and initialize your weights this way, the cost function will be roughly round (symmetric) rather than elongated, which makes gradient descent converge faster.

Two common ways are to sample from a uniform distribution or from a Gaussian. The uniform picks every value in a range with equal probability, while the Gaussian concentrates probability near its mean, zero. For this reason, the Gaussian is often preferred.
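As a minimal numpy sketch (the shapes and scale 0.05 are illustrative assumptions, not values from the text), the two sampling schemes look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128  # hypothetical layer dimensions

# Uniform: every value in [-0.05, 0.05] is equally likely.
w_uniform = rng.uniform(-0.05, 0.05, size=(fan_in, fan_out))

# Gaussian: values near the mean (0) are drawn with higher probability.
w_gauss = rng.normal(loc=0.0, scale=0.05, size=(fan_in, fan_out))

# Both are centered near zero, but the uniform is hard-bounded
# while the Gaussian has tails beyond 0.05.
print(w_uniform.min(), w_uniform.max())
print(w_gauss.mean())
```

Note that the uniform draw can never exceed its bounds, whereas the Gaussian occasionally produces values well outside one standard deviation; the next paragraph addresses that.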

One drawback of the plain Gaussian is that large values can occasionally be drawn. Large networks have many parameters, so some weights are likely to come out large. Large weight values are undesirable: they can lead to overfitting and slow down training. A common remedy is to resample whenever a draw exceeds a specified threshold. This is called the truncated method, and it is typically applied to the Gaussian distribution.
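The resampling idea can be sketched as follows in numpy; the 2-sigma cutoff is a conventional choice, and the function name and parameters are illustrative:

```python
import numpy as np

def truncated_normal(shape, std=0.05, cutoff=2.0, rng=None):
    """Sample from N(0, std^2), resampling any draw whose magnitude
    exceeds cutoff * std, so all values land within the threshold."""
    rng = rng or np.random.default_rng()
    w = rng.normal(0.0, std, size=shape)
    mask = np.abs(w) > cutoff * std
    while mask.any():
        # Redraw only the out-of-range entries until all pass.
        w[mask] = rng.normal(0.0, std, size=mask.sum())
        mask = np.abs(w) > cutoff * std
    return w

w = truncated_normal((256, 128), std=0.05, rng=np.random.default_rng(0))
print(np.abs(w).max())  # guaranteed to be at most 2 * 0.05 = 0.1
```

Deep-learning frameworks provide this directly (for example, TensorFlow's truncated-normal initializer uses the same 2-standard-deviation cutoff), so in practice you rarely write the loop yourself.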


The goal of the initial weight distribution is to provide enough variance for learning to take place. Many distributions could work; Gaussian and uniform are the most commonly used.

The Cauchy distribution would not be a useful choice, for two reasons. The practical reason is that it is less common than the Gaussian or uniform distribution and thus not as well supported in programming environments. The deeper reason is that the Cauchy distribution's variance is undefined: its heavy tails routinely produce extreme values, which is exactly what we want to avoid in initial weights.
