There are several reasons. First, a general rule: initialize your weights near zero, but avoid values that are too small. If you normalize your inputs and initialize your weights this way, the cost function will be roughly round (symmetric) rather than elongated, which makes gradient descent converge faster.

Two common ways are to sample from a uniform distribution or from a Gaussian. The uniform picks every value in a range with equal probability, while the Gaussian concentrates probability near its mean, zero. For this reason, the Gaussian is often preferred.
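As a minimal numpy sketch (the shapes and scale 0.05 are illustrative assumptions, not values from the text), the two sampling schemes look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128  # hypothetical layer dimensions

# Uniform: every value in [-0.05, 0.05] is equally likely.
w_uniform = rng.uniform(-0.05, 0.05, size=(fan_in, fan_out))

# Gaussian: values near the mean (0) are drawn with higher probability.
w_gauss = rng.normal(loc=0.0, scale=0.05, size=(fan_in, fan_out))

# Both are centered near zero, but the uniform is hard-bounded
# while the Gaussian has tails beyond 0.05.
print(w_uniform.min(), w_uniform.max())
print(w_gauss.mean())
```

Note that the uniform draw can never exceed its bounds, whereas the Gaussian occasionally produces values well outside one standard deviation; the next paragraph addresses that.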

One drawback of the plain Gaussian is that large values can occasionally be drawn. Large networks have many parameters, so some weights are likely to come out large. Large weight values are undesirable: they can lead to overfitting and slow down training. A common remedy is to resample whenever a draw exceeds a specified threshold. This is called the truncated method, and it is typically applied to the Gaussian distribution.
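The resampling idea can be sketched as follows in numpy; the 2-sigma cutoff is a conventional choice, and the function name and parameters are illustrative:

```python
import numpy as np

def truncated_normal(shape, std=0.05, cutoff=2.0, rng=None):
    """Sample from N(0, std^2), resampling any draw whose magnitude
    exceeds cutoff * std, so all values land within the threshold."""
    rng = rng or np.random.default_rng()
    w = rng.normal(0.0, std, size=shape)
    mask = np.abs(w) > cutoff * std
    while mask.any():
        # Redraw only the out-of-range entries until all pass.
        w[mask] = rng.normal(0.0, std, size=mask.sum())
        mask = np.abs(w) > cutoff * std
    return w

w = truncated_normal((256, 128), std=0.05, rng=np.random.default_rng(0))
print(np.abs(w).max())  # guaranteed to be at most 2 * 0.05 = 0.1
```

Deep-learning frameworks provide this directly (for example, TensorFlow's truncated-normal initializer uses the same 2-standard-deviation cutoff), so in practice you rarely write the loop yourself.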


The goal of the initial weight distribution is to provide enough variance for learning to take place. Many distributions could work; Gaussian and uniform are the most commonly used.

The Cauchy distribution would not be a useful choice, for two reasons. The practical reason is that it is less common than the Gaussian or uniform distribution and thus not as well supported in programming environments. The deeper reason is that the Cauchy distribution's variance is undefined: its heavy tails routinely produce extreme values, which is exactly what we want to avoid in initial weights.
