How does a batch normalization layer resolve the vanishing gradient problem?

According to this article:

https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484

  • The vanishing gradient problem occurs when using the sigmoid activation function because sigmoid maps a large input space into a small output space, so the gradient for inputs with large magnitude is close to zero.

  • The article suggests using a batch normalization layer.

I can't understand how this can work.

When using normalization, big values still end up as big values in another range (instead of [-inf, inf] they will be in [0, 1] or [-1, 1]), so in the same cases the values (before or after the normalization) will still be placed near the edges and the gradient will be close to zero. Am I right?

Topic gradient activation-function batch-normalization backpropagation deep-learning

Category Data Science


Batch Normalization (BN) does not prevent the vanishing or exploding gradient problem in the sense that these become impossible. Rather, it reduces the probability of them occurring. Accordingly, the original paper states:

In traditional deep networks, too-high learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima. Batch Normalization helps address these issues. By normalizing activations throughout the network, it prevents small changes to the parameters from amplifying into larger and suboptimal changes in activations in gradients; for instance, it prevents the training from getting stuck in the saturated regimes of nonlinearities.

If you have $m$ samples in your current batch during training, the corresponding $m$ pre-activation values of a unit are normalized together. Even if some of them still end up in the saturated border regions afterwards, many won't. That is, BN reduces the number of activations whose gradient might vanish or explode in this scenario.
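
To make that concrete, here is a minimal NumPy sketch (my own illustration, not from the article or the BN paper); the batch size of 256, the mean and spread of the raw pre-activations, and the 0.01 "near-zero" threshold are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1.0 - s)   # derivative of the sigmoid, at most 0.25 (at z = 0)

    # Hypothetical pre-activations of one unit for a batch of m = 256 samples,
    # drawn with a large mean and spread so that many of them are saturated.
    z = rng.normal(loc=4.0, scale=3.0, size=256)

    # Core BN transform (learned gamma/beta omitted): subtract the batch mean
    # and divide by the batch standard deviation.
    z_bn = (z - z.mean()) / np.sqrt(z.var() + 1e-5)

    for name, values in [("raw", z), ("batch-normalized", z_bn)]:
        frac_vanished = np.mean(sigmoid_grad(values) < 0.01)
        print(f"{name:17s} fraction of near-zero sigmoid gradients: {frac_vanished:.2f}")

With these arbitrary numbers, a noticeable fraction of the raw pre-activations sits in the saturated regime of the sigmoid, while essentially none of the normalized ones does.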

Moreover, BN does not clip the values to $[0,1]$ or $[-1,1]$. It normalizes the outputs of the linear transformations (i.e. usually not the activation values directly) by subtracting the mean of a batch and dividing by its standard deviation. (If you relate this to preprocessing, it is like standardization and not normalization by min-max scaling. So actually it's "Batch Standardization".)
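
For reference, the transform applied per activation over a mini-batch $B = \{x_1, \dots, x_m\}$ (as defined in the BN paper, with $\epsilon$ a small constant for numerical stability and $\gamma$, $\beta$ learned parameters) is

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2,$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta.$$

So the output is not confined to a fixed interval; it is re-centered and re-scaled, and the learned $\gamma$ and $\beta$ can even undo the normalization if that is what training prefers.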

On a side note: vanishing and exploding gradients are not only a problem for the sigmoid but also for the tanh activation function.
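
Both saturate in the same way, which is easy to see from their derivatives:

$$\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big) \le \tfrac{1}{4}, \qquad \tanh'(z) = 1 - \tanh^2(z) \le 1,$$

and both tend to $0$ as $|z| \to \infty$.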
