How exactly does DropOut work with convolutional layers?
Dropout (paper, explanation) sets the output of some neurons to zero. So for an MLP, you could have the following architecture for the Iris flower dataset:
4 : 50 (tanh) : dropout (0.5) : 20 (tanh) : 3 (softmax)
It would work like this:
$$\text{softmax}(W_3 \cdot \tanh(W_2 \cdot \text{mask}(D, \tanh(W_1 \cdot input\_vector))))$$
with $input\_vector \in \mathbb{R}^{4 \times 1}$, $W_1 \in \mathbb{R}^{50 \times 4}$, $D \in \{0, 1\}^{50 \times 1}$, $W_2 \in \mathbb{R}^{20 \times 50}$, $W_3 \in \mathbb{R}^{3 \times 20}$ (ignoring biases for the sake of simplicity).
Here $D = (d_{ij})$ with
$$d_{ij} \sim B(1, p=0.5)$$
where the $\text{mask}(D, M)$ operation multiplies $D$ point-wise with $M$ (see Hadamard product).
Hence we sample a new matrix $D$ on each forward pass, so dropout amounts to multiplying a node's activation by 0.
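To make the MLP case concrete, here is a minimal NumPy sketch of this forward pass (the weights are random placeholders, biases are still omitted, and the usual rescaling of kept activations is left out to match the formula above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Placeholder weights for the 4 : 50 (tanh) : dropout(0.5) : 20 (tanh) : 3 (softmax) architecture
W1 = np.random.randn(50, 4)
W2 = np.random.randn(20, 50)
W3 = np.random.randn(3, 20)
p = 0.5                                   # keep probability

x = np.random.randn(4, 1)                 # input_vector
h1 = np.tanh(W1 @ x)                      # (50, 1)
D = np.random.binomial(1, p, h1.shape)    # d_ij ~ B(1, p=0.5)
h1 = D * h1                               # mask(D, .) = Hadamard product
h2 = np.tanh(W2 @ h1)                     # (20, 1)
y = softmax(W3 @ h2)                      # (3, 1) class probabilities
```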
But for CNNs, it is not clear to me what exactly is dropped out. I can see three possibilities (sketched in code after the list):
- Dropping complete feature maps (hence a kernel)
- Dropping one element of a kernel (replacing an element of a kernel by 0)
- Dropping one element of a feature map
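Here is a rough NumPy sketch of what the masks for the three variants would look like; the shapes and the Bernoulli sampling are just placeholders to illustrate the idea, not any library's actual implementation:

```python
import numpy as np

# Hypothetical output of a conv layer: [batch, channels, height, width]
batch, channels, height, width = 16, 8, 32, 32
feature_maps = np.random.randn(batch, channels, height, width)
kernels = np.random.randn(channels, 3, 3)  # one 3x3 kernel per channel, for simplicity
p = 0.5                                     # keep probability

# (1) Drop complete feature maps: one Bernoulli draw per channel,
#     broadcast over all spatial positions.
mask_1 = np.random.binomial(1, p, size=(batch, channels, 1, 1))
out_1 = feature_maps * mask_1

# (2) Drop single kernel elements: zero individual weights
#     before convolving (shown here only as masking the weights).
mask_2 = np.random.binomial(1, p, size=kernels.shape)
dropped_kernels = kernels * mask_2

# (3) Drop single feature-map elements: one Bernoulli draw per activation.
mask_3 = np.random.binomial(1, p, size=feature_maps.shape)
out_3 = feature_maps * mask_3
```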
Please add a reference / quote to your answer.
My thoughts
I think Lasagne does (3) (see code), which is probably the simplest to implement. However, (1) might be closer to the original idea.
Caffe seems to do something similar (see code). For TensorFlow, the user has to decide via noise_shape (see code); I'm not sure what happens when noise_shape=None is passed.
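As a sketch of what I mean for TensorFlow (using the current `tf.nn.dropout(x, rate, noise_shape=...)` signature; argument names have varied across versions, so take this as an assumption rather than the exact API of the linked code):

```python
import tensorflow as tf

# Hypothetical NHWC output of a conv layer: [batch, height, width, channels]
x = tf.random.normal([16, 32, 32, 8])

# Default (noise_shape=None): as far as I can tell, an independent mask per
# activation, i.e. variant (3). Kept values are scaled by 1/(1 - rate).
out_elementwise = tf.nn.dropout(x, rate=0.5)

# noise_shape=[16, 1, 1, 8]: one mask value per (example, channel), broadcast
# over height and width, so whole feature maps are dropped, i.e. variant (1).
out_featuremap = tf.nn.dropout(x, rate=0.5, noise_shape=[16, 1, 1, 8])
```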
How it should be
(2) and (3) don't make much sense to me, as they would cause the network to add invariance to spatial positions, which is probably not desired. Hence (1) is the only variant that makes sense. But I'm not sure what happens if you use the default implementation.