How exactly does DropOut work with convolutional layers?

Dropout (paper, explanation) sets the output of some neurons to zero. So for an MLP, you could have the following architecture for the Iris flower dataset:

4 : 50 (tanh) : dropout (0.5) : 20 (tanh) : 3 (softmax)

It would work like this:

$$\text{softmax}(W_3 \cdot \tanh(W_2 \cdot \text{mask}(D, \tanh(W_1 \cdot input\_vector))))$$

with $input\_vector \in \mathbb{R}^{4 \times 1}$, $W_1 \in \mathbb{R}^{50 \times 4}$, $D \in \{0, 1\}^{50 \times 1}$, $W_2 \in \mathbb{R}^{20 \times 50}$, $W_3 \in \mathbb{R}^{3 \times 20}$ (ignoring biases for the sake of simplicity).

With $D = (d_{ij})$ and

$$d_{ij} \sim B(1, p=0.5)$$

where the $\text{mask}(D, M)$ operation multiplies $D$ point-wise with $M$ (see Hadamard product).

Hence we just sample the mask $D$ on each forward pass, and dropout amounts to multiplying a node's output by 0.
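As a minimal NumPy sketch of the formula above (the weights are random placeholders and biases are omitted, matching the dimensions given earlier):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights for the 4 : 50 : 20 : 3 architecture (biases omitted).
W1 = rng.standard_normal((50, 4))
W2 = rng.standard_normal((20, 50))
W3 = rng.standard_normal((3, 20))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_train(x, p=0.5):
    h1 = np.tanh(W1 @ x)                    # shape (50, 1)
    D = rng.binomial(1, p, size=h1.shape)   # dropout mask, d_ij ~ B(1, p)
    h1 = D * h1                             # mask(D, .) = Hadamard product
    h2 = np.tanh(W2 @ h1)                   # shape (20, 1)
    return softmax(W3 @ h2)                 # shape (3, 1)

x = rng.standard_normal((4, 1))
print(forward_train(x))
```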

But for CNNs, it is not clear to me what exactly is dropped out. I can see three possibilities:

  1. Dropping complete feature maps (hence a kernel)
  2. Dropping one element of a kernel (replacing an element of a kernel by 0)
  3. Dropping one element of a feature map

Please add a reference / quote to your answer.

My thoughts

I think Lasagne does (3) (see code). This is probably the simplest to implement. However, (1) might be closer to the original idea.

Caffe seems to do the same (see code). For TensorFlow, the user has to decide (code); I'm not sure what happens when noise_shape=None is passed.

How it should be

(2) and (3) don't make much sense, as they would cause the network to add invariance to spatial positions, which is probably not desired. Hence (1) is the only variant that makes sense. But I'm not sure what happens if you use the default implementation.
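For what it's worth, my reading of the TensorFlow docs is that noise_shape controls which of these variants you get: with noise_shape=None the mask has the same shape as the input (variant (3)), while broadcasting the mask over the spatial dimensions gives (1). A small sketch of that, assuming NHWC inputs:

```python
import tensorflow as tf

x = tf.random.normal([8, 28, 28, 16])  # NHWC: batch, height, width, feature maps

# (3) element-wise dropout: noise_shape=None samples an independent
#     keep/drop flag for every element of every feature map.
y_elementwise = tf.nn.dropout(x, rate=0.5)

# (1) whole-feature-map dropout: the mask has shape [batch, 1, 1, channels]
#     and is broadcast over height and width, so an entire feature map is
#     either kept or zeroed.
y_featuremap = tf.nn.dropout(x, rate=0.5, noise_shape=[8, 1, 1, 16])
```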



Dropout is used to improve the generalization performance of the model. Generalization is achieved by making the learned features independent and not heavily correlated.

Natural images are highly correlated (an image is a spatial data structure): each pixel is strongly correlated with its surrounding pixels, and corresponding positions are correlated across feature maps. The feature maps in CNNs therefore also exhibit strong correlation.

One way to counter this is to drop an entire feature map, picked at random with probability dropout_rate. For example, if a layer produces N feature maps, its output has shape (N x Height x Width); with a dropout rate of 0.5 you force the activations of roughly N/2 feature maps to "0". In this way the correlation across feature maps is alleviated, while the correlation within a single feature map, which is necessary for efficient object localization, is preserved.

Reference: https://arxiv.org/pdf/1411.4280.pdf (see page 3)
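A minimal NumPy sketch of this whole-feature-map dropout (the function name and shapes are only illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_dropout(feature_maps, rate=0.5):
    """Drop entire feature maps, as described above.

    feature_maps: array of shape (N, Height, Width).
    """
    n = feature_maps.shape[0]
    keep = rng.binomial(1, 1.0 - rate, size=(n, 1, 1))  # one flag per feature map
    # Broadcasting zeroes out whole (Height x Width) maps at once,
    # so the correlation within a kept map is untouched.
    return feature_maps * keep

maps = rng.standard_normal((4, 5, 5))          # N=4 feature maps of size 5x5
dropped = spatial_dropout(maps, rate=0.5)
print(dropped.reshape(4, -1).any(axis=1))      # which maps survived
```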


As you mentioned, a mask matrix is sampled and multiplied element-wise with the activations of the feature map at layer $l$; the dropped-out activations are then convolved with the filter of the next layer, $W^{(l+1)}$. This corresponds to option (3).

For more details, I think section 3 of this paper might help you out: Max-pooling & Convolutional dropout. Specifically section 3.2.

At test time you use all nodes of the network, but with the filter weights scaled by the retention probability, as explained in the paper.
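A small sketch of that test-time rule (a dense layer stands in for the convolution to keep it short; the names and shapes are just placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
p_retain = 0.5                      # probability of keeping a unit during training

W = rng.standard_normal((20, 50))   # hypothetical weight/filter matrix
h = rng.standard_normal((50, 1))    # activations feeding into this layer

# Training: activations are masked, weights are used unchanged.
mask = rng.binomial(1, p_retain, size=h.shape)
out_train = W @ (mask * h)

# Testing: no mask; the weights are scaled by the retention probability so the
# expected pre-activation matches what the layer saw during training.
out_test = (p_retain * W) @ h
```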

Please feel free to refine or correct my answer.

Hope this helps at least a little.
