Dropout on the input vector vs. on the pre-activation vector?
For any layer in my neural net, should I apply dropout to the incoming vector, or to the pre-activation vector?
In other words:
$$\vec q = W \cdot \vec x$$ $$\vec h = \operatorname{activate}(\operatorname{drop}(\vec q))$$
or:
$$\vec q = W \cdot \operatorname{drop}(\vec x)$$ $$\vec h = \operatorname{activate}(\vec q)$$
I think the second variant is smoother: no component of $\vec q$ is zeroed outright; instead each component is a weighted sum over the surviving (partially dropped) inputs, so the effect seems softer.
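For concreteness, here is a minimal NumPy sketch of both variants. The names `drop`, `relu`, `W`, and `x` are illustrative, and it assumes standard inverted dropout with a keep probability of 0.8:

```python
import numpy as np

rng = np.random.default_rng(0)

def drop(v, keep_prob=0.8):
    """Inverted dropout (assumed): zero each component with probability
    1 - keep_prob, then rescale survivors so the expected value is unchanged."""
    mask = rng.random(v.shape) < keep_prob
    return v * mask / keep_prob

def relu(v):
    return np.maximum(v, 0.0)

W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

# Variant 1: dropout on the pre-activation q = Wx
# -> individual hidden units are zeroed outright.
h1 = relu(drop(W @ x))

# Variant 2: dropout on the input x before the weighting
# -> each component of q is a weighted sum of the surviving inputs,
#    so no component of q is forced to zero directly.
h2 = relu(W @ drop(x))
```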
Tags: mathematics, dropout
Category: Data Science