Convolutional layer dropout in Keras

According to the classic dropout paper

http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf

the dropout operation affects not only the training step but also the test step: at test time we need to multiply each neuron's outgoing weights by the retention probability p.

But in the Keras library I found the following implementation of the dropout operation:

retain_prob = 1. - level
...
random_tensor = rng.binomial(x.shape, p=retain_prob, dtype=x.dtype)
...
x *= random_tensor
x /= retain_prob
return x

(see https://github.com/fchollet/keras/blob/master/keras/backend/theano_backend.py)
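For readability, here is that excerpt assembled into a self-contained sketch of what the backend function does (a sketch only: the function name is made up, and NumPy's random generator stands in for the Theano rng object; level and retain_prob mean the same thing as in the snippet):

import numpy as np

def dropout_train_sketch(x, level, rng=np.random):
    # level = fraction of units to drop; retain_prob = probability a unit is kept
    retain_prob = 1. - level
    # Bernoulli mask: 1 with probability retain_prob, 0 otherwise
    random_tensor = rng.binomial(1, retain_prob, size=x.shape).astype(x.dtype)
    x = x * random_tensor   # randomly zero out units
    x = x / retain_prob     # rescale the survivors (this is the division in question)
    return x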

Why is x divided by retain_prob when it should be multiplied? Or am I just confused, and is multiplying the weights equivalent to dividing the output value?

Topic keras dropout theano neural-network machine-learning

Category Data Science


Let's clarify a few things about dropout. Credit goes to Neil Slater, whose comments helped formulate a clearer explanation.

First of all, dropout is a regularization method; it is usually applied only during training (although it can also be used at prediction time as an approximation to a Bayesian neural network, as explained in Yarin Gal's paper). As you might have understood, its goal is to limit overfitting, that is, to train a model that generalizes better to unseen data samples. So it has nothing to do with testing and everything to do with training.

Second, the reason you might have seen the output multiplied by $p$ at prediction time is a trick used with the most basic implementation of dropout, referred to as vanilla dropout. At prediction time (or test time, if you prefer that wording) there is no need to drop anymore, but the outputs do need to be scaled by $p$. The reason is that, because each unit was kept with probability $p$ during training, the outputs must be scaled by $p$ at prediction time so that their expected magnitude matches what the next layer saw during training.
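As a rough illustration (a sketch in NumPy, not any particular library's code; $p$ is the keep probability and the function names are made up):

import numpy as np

def vanilla_dropout_train(x, p, rng=np.random):
    # keep each unit with probability p, set it to 0 otherwise; no rescaling here
    mask = rng.binomial(1, p, size=x.shape)
    return x * mask

def vanilla_dropout_predict(x, p):
    # compensate for training-time dropping by scaling the outputs by p
    return x * p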

Third, inverted dropout (the dropout implementation used in all serious DL libraries) does not need to scale the output at prediction time, because the scaling by $p$ is already performed at training time (by dividing by $p$). Therefore, no trick is needed at prediction time!
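The corresponding sketch for inverted dropout (same caveats: illustrative NumPy code, $p$ is the keep probability):

import numpy as np

def inverted_dropout_train(x, p, rng=np.random):
    # drop units and immediately rescale the survivors by 1/p
    mask = rng.binomial(1, p, size=x.shape)
    return x * mask / p

def inverted_dropout_predict(x):
    # nothing to do at prediction time: expectations already match training
    return x

In both schemes the expected output a unit feeds to the next layer is the same at training and prediction time; inverted dropout simply moves the scaling into the training step.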

Finally, the concept is nicely explained in the videos from the Udacity Deep Learning course and in the Stanford course.

I hope this is clear enough. :)


You are looking at the Keras code that implements dropout for the training step.

In the Keras implementation, the output values are corrected during training (by dividing by the retention probability, in addition to randomly dropping values) instead of during testing (by multiplying by it). This is called "inverted dropout".

Inverted dropout is functionally equivalent to the original dropout (as described in Srivastava's paper that you linked), with the nice feature that the network does not have to do anything dropout-related at test and prediction time. This is explained a little in this Keras issue.
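If you want to convince yourself of the equivalence numerically, here is a quick illustrative check (NumPy, with made-up values): the average training-time output of vanilla dropout converges to $x \cdot p$, which is exactly what vanilla dropout outputs at test time, while for inverted dropout it converges to $x$, which needs no test-time correction at all.

import numpy as np

rng = np.random.RandomState(0)
x, p, n = 10.0, 0.8, 100000

masks = rng.binomial(1, p, size=n)   # one Bernoulli keep/drop draw per trial
print(np.mean(x * masks))            # vanilla training output, averages to ~8.0 = x * p
print(np.mean(x * masks / p))        # inverted training output, averages to ~10.0 = x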
