Does Batch Normalization make sense for a ReLU activation function?

Batch Normalization is described in this paper as a normalization of the input to an activation function with scale and shift variables $\gamma$ and $\beta$. The paper mainly discusses the sigmoid activation function, where this makes sense. However, it seems to me that feeding an input from the normalized distribution produced by batch normalization into a ReLU activation $\max(0, x)$ is risky: unless $\beta$ learns to shift most of the inputs past 0, the ReLU discards input information. That is, if the input to the ReLU were just standard-normalized, we would lose a lot of the information below 0. Is there any guarantee, or any initialization of $\beta$, that ensures we don't lose this information? Am I missing something about how BN and ReLU interact?
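For concreteness, here is a small NumPy sketch of the concern (roughly half of a standard-normal input is zeroed by the ReLU):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # a "standard normalized" pre-activation
relu_out = np.maximum(0.0, x)      # ReLU: max(0, x)

# Roughly half of the inputs are clipped to zero, so their values are lost.
print("fraction zeroed:", np.mean(relu_out == 0.0))   # ~0.5
```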

Topic batch-normalization deep-learning neural-network machine-learning

Category Data Science


In the first step of batch normalization the data is zero-centered by subtracting the mean. If there was a bias term before this point, it is now lost; the role of the $\beta$ variable is to recover it. All the risks you mention would still exist without batch normalization, as long as the layer has a bias term. A suggested approach for initializing a bias before a ReLU is to set it to a small positive value such as 0.01, which you can also try for $\beta$ here (though there is no guarantee it will help).

More on this: https://stackoverflow.com/questions/44883861/initial-bias-values-for-a-neural-network
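A minimal PyTorch sketch of that initialization, assuming an affine BatchNorm layer (the 0.01 value is just the heuristic above, not a guarantee):

```python
import torch.nn as nn

# Conv block with batch norm before the ReLU; the conv bias is redundant
# because BN subtracts the mean, so beta (bn.bias) takes over that role.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1, bias=False)
bn = nn.BatchNorm2d(num_features=16, affine=True)  # affine=True gives gamma/beta
relu = nn.ReLU()

# Heuristic: start beta slightly positive so fewer pre-activations sit below 0.
nn.init.constant_(bn.bias, 0.01)   # beta
nn.init.constant_(bn.weight, 1.0)  # gamma (1.0 is the default anyway)

block = nn.Sequential(conv, bn, relu)
```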

Putting the bias issue aside, having batch normalization before the ReLU can help keep it "alive". If large gradients are backpropagated, or a large learning rate is chosen, the weights receive large updates. By chance these updates can push the input of a ReLU so far into the negative range that the ReLU dies and is never able to fire again. Here batch normalization makes sense because it can push the center of the input distribution back in the positive direction, and the amount of push is learned through the $\beta$ and $\gamma$ parameters. Whether before or after the ReLU is the better position for batch normalization is another topic for discussion: putting it after the ReLU is also useful because it zero-centers the data, which has its own advantages, e.g. better convergence by allowing gradient updates in every direction. Both orderings are sketched below.
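For illustration, the two placements as small PyTorch blocks (a sketch of the ordering only, not a recommendation of either one; layer sizes are arbitrary):

```python
import torch.nn as nn

# BN before the activation: normalizes the pre-activations, and beta/gamma
# control how much of the distribution ends up above zero.
bn_before_relu = nn.Sequential(
    nn.Linear(128, 64, bias=False),  # bias is redundant when BN follows
    nn.BatchNorm1d(64),
    nn.ReLU(),
)

# BN after the activation: re-centers the non-negative ReLU output
# before it is fed to the next layer.
bn_after_relu = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.BatchNorm1d(64),
)
```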


I'd say BN goes after the ReLU, not before: in general it should be placed between two layers so as to normalize the distribution of one layer's output before it becomes the next layer's input.

A convolutional layer is composed of a linear step (the convolution operator) followed by a nonlinear one (e.g. ReLU), just like an artificial neuron. A sparsifying nonlinearity such as ReLU produces an output distribution that is non-negative as a result of the filtering, so applying BN before passing it to the next layer helps re-normalize it.
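A tiny NumPy illustration of that re-centering (just the normalization step of BN, without the learned $\gamma$/$\beta$):

```python
import numpy as np

rng = np.random.default_rng(0)
pre_act = rng.standard_normal((256, 64))   # a batch of pre-activations
relu_out = np.maximum(0.0, pre_act)        # non-negative output, mean > 0

# BN's normalization step: per-feature zero mean and unit variance.
mean = relu_out.mean(axis=0)
var = relu_out.var(axis=0)
normalized = (relu_out - mean) / np.sqrt(var + 1e-5)

print("mean before BN:", relu_out.mean())    # clearly positive
print("mean after BN:", normalized.mean())   # ~0, zero-centered again
```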


That is a known problem with the ReLU activation function, often called the "dying ReLU". Once the input is pushed below the zero boundary, the unit is almost always closed. A closed ReLU passes no gradient, so it cannot update its input parameters; a dead ReLU stays dead.

The solution is to use a variant of ReLU as the activation function, such as Leaky ReLU, Noisy ReLU, or ELU (see the sketch below).
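In PyTorch, for example, two of these variants are drop-in replacements (a sketch; the slope and alpha values shown are just the library defaults):

```python
import torch.nn as nn

# Leaky ReLU keeps a small slope for negative inputs, so the unit
# still receives gradient even when its pre-activation is negative.
leaky = nn.LeakyReLU(negative_slope=0.01)

# ELU saturates smoothly toward -alpha for large negative inputs instead of 0.
elu = nn.ELU(alpha=1.0)

block = nn.Sequential(nn.Linear(128, 64), leaky)
```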
