Dropout does not actually remove neurons; it's just that those particular neurons don't play any role (don't get activated) for the given batch of data.
Example: suppose there is a road with 8 lanes. When trucks come, they pass through lanes 1, 2, 4, 6, 7; when cars come, they pass through lanes 2, 3, 4, 7, 8; and when bikes come, they pass through lanes 1, 2, 5, 8. So regardless of the vehicle, all lanes are there, but only some of them are used at any given time.
Similarly, all neurons remain in the model, but only a subset of them is activated for a particular batch of data. The model is not cut down afterwards; its complexity stays the same.
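To make the lane analogy concrete, here is a minimal sketch of the usual inverted-dropout formulation (using NumPy; the helper name dropout_forward is just illustrative): all weights stay in place, and only a random mask silences some outputs for the current batch.

```python
import numpy as np

def dropout_forward(activations, p_drop=0.5, training=True):
    """Inverted dropout: randomly zero a fraction of activations during
    training and rescale the survivors, so nothing changes at test time."""
    if not training or p_drop == 0.0:
        return activations  # all "lanes" stay open at inference
    # Bernoulli mask: 1 keeps a neuron's output, 0 silences it for this batch
    mask = (np.random.rand(*activations.shape) > p_drop).astype(activations.dtype)
    return activations * mask / (1.0 - p_drop)

# The same 8 "lanes" (neuron outputs); a different subset is used each pass
layer_out = np.ones(8)
print(dropout_forward(layer_out, p_drop=0.5))                   # training: some zeros
print(dropout_forward(layer_out, p_drop=0.5, training=False))   # inference: untouched
```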
Why use dropout?
As given in the Deep Learning book by Ian Goodfellow,
dropout is more effective than other
standard computationally inexpensive regularizers, such as weight decay, filter norm constraints and sparse activity regularization.
He also says-
One advantage of dropout is that it is very computationally cheap.
Another significant advantage of dropout is that it does not significantly limit the type of model or training procedure that can be used. It works well with nearly any model that uses a distributed representation and can be trained with stochastic gradient descent. This includes feedforward neural networks, probabilistic models such as restricted Boltzmann machines (Srivastava et al., 2014), and recurrent neural networks (Bayer and Osendorfer, 2014; Pascanu et al., 2014a).
This book says-
The core idea is that introducing noise in the output values of a layer can
break up happenstance patterns that aren’t significant, which the network will start memorizing if no noise is present.
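This idea of injecting noise into a layer's outputs during training only is exactly what dropout layers do in common frameworks. A minimal sketch, assuming PyTorch (layer sizes here are arbitrary placeholders), of where such a layer typically sits:

```python
import torch.nn as nn

# A small feedforward classifier with dropout after each hidden activation.
# nn.Dropout adds noise only in model.train() mode; model.eval() disables it,
# so the full network is used at inference, as described above.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly silence half of the 256 outputs per forward pass
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 10),
)
```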