Does Batch Normalization make sense for a ReLU activation function?
Batch Normalization is described in this paper as a normalization of the input to an activation function, followed by learned scale and shift parameters $\gamma$ and $\beta$. The paper mainly discusses the sigmoid activation function, where this makes sense to me. However, it seems risky to feed the normalized distribution produced by Batch Normalization into a ReLU, $\max(0, x)$, unless $\beta$ learns to shift most of the inputs past 0 so that the ReLU isn't discarding them. That is, if the input to the ReLU were simply standard-normalized, we would lose roughly half of the information, namely everything below 0. Is there any guarantee, or any initialization of $\beta$, that ensures we don't lose this information? Or am I missing something about how the combination of BN and ReLU works?
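To make the concern concrete, here is a minimal sketch (assuming PyTorch; the specific values and layer sizes are just for illustration, not from the paper) showing that with the default $\beta = 0$ a standard-normalized input loses about half of its activations to the ReLU, while a positive $\beta$ shifts the distribution so that far fewer are clipped:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1024, 64)   # a batch of pre-activations (hypothetical sizes)

bn = nn.BatchNorm1d(64)     # initialized with gamma = 1, beta = 0
relu = nn.ReLU()

# With beta = 0 the normalized values are roughly standard normal,
# so ReLU zeroes out about half of them.
y = relu(bn(x))
print("fraction zeroed with beta = 0:", (y == 0).float().mean().item())  # ~0.5

# Manually set beta to a positive value (bn.bias is beta in the paper's notation);
# the distribution shifts right and far fewer activations are clipped.
with torch.no_grad():
    bn.bias.fill_(2.0)

y_shifted = relu(bn(x))
print("fraction zeroed with beta = 2:", (y_shifted == 0).float().mean().item())  # ~0.02
```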
Topic batch-normalization deep-learning neural-network machine-learning
Category Data Science