How does a batch normalization layer resolve the vanishing gradient problem?

According to this article:

https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484

  • The vanishing gradient problem occurs when using the sigmoid activation function because sigmoid maps a large input space into a small output space, so the gradient for inputs with large magnitude is close to zero.

  • The article suggests using a batch normalization layer.

I can't understand how this can work.

When using normalization, big values still end up as big values in another range (instead of [-inf, inf] they will be in [0, 1] or [-1, 1]), so in the same cases the values (before or after the normalization) will still be placed near the edges and the gradient will be close to zero. Am I right?

Topic gradient activation-function batch-normalization backpropagation deep-learning

Category Data Science


Batch Normalization (BN) does not prevent the vanishing or exploding gradient problem in the sense that these become impossible. Rather, it reduces the probability of them occurring. Accordingly, the original paper states:

In traditional deep networks, too-high learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima. Batch Normalization helps address these issues. By normalizing activations throughout the network, it prevents small changes to the parameters from amplifying into larger and suboptimal changes in activations in gradients; for instance, it prevents the training from getting stuck in the saturated regimes of nonlinearities.

If you have $m$ samples in your current batch during training, the corresponding $m$ pre-activation values of a unit are normalized together. Even if some of them still end up in the saturated border regions afterwards, many won't. That is, BN reduces the number of activations whose gradient might vanish or explode in this scenario.
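
To make that concrete, here is a minimal NumPy sketch (my own illustration, not from the article or the BN paper); the batch size of 256, the mean and spread of the raw pre-activations, and the 0.01 "near-zero" threshold are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1.0 - s)   # derivative of the sigmoid, at most 0.25 (at z = 0)

    # Hypothetical pre-activations of one unit for a batch of m = 256 samples,
    # drawn with a large mean and spread so that many of them are saturated.
    z = rng.normal(loc=4.0, scale=3.0, size=256)

    # Core BN transform (learned gamma/beta omitted): subtract the batch mean
    # and divide by the batch standard deviation.
    z_bn = (z - z.mean()) / np.sqrt(z.var() + 1e-5)

    for name, values in [("raw", z), ("batch-normalized", z_bn)]:
        frac_vanished = np.mean(sigmoid_grad(values) < 0.01)
        print(f"{name:17s} fraction of near-zero sigmoid gradients: {frac_vanished:.2f}")

With these arbitrary numbers, a noticeable fraction of the raw pre-activations sits in the saturated regime of the sigmoid, while essentially none of the normalized ones does.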

Moreover, BN does not clip the values to $[0,1]$ or $[-1,1]$. It normalizes the outputs of the linear transformations (i.e. usually not the activation values directly) by subtracting the mean of a batch and dividing by its standard deviation. (If you relate this to preprocessing, it is like standardization and not normalization by min-max scaling. So actually it's "Batch Standardization".)
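
For reference, the transform applied per activation over a mini-batch $B = \{x_1, \dots, x_m\}$ (as defined in the BN paper, with $\epsilon$ a small constant for numerical stability and $\gamma$, $\beta$ learned parameters) is

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2,$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta.$$

So the output is not confined to a fixed interval; it is re-centered and re-scaled, and the learned $\gamma$ and $\beta$ can even undo the normalization if that is what training prefers.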

On a side note: vanishing and exploding gradients are not only a problem for the sigmoid but also for the tanh activation function.
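
Both saturate in the same way, which is easy to see from their derivatives:

$$\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big) \le \tfrac{1}{4}, \qquad \tanh'(z) = 1 - \tanh^2(z) \le 1,$$

and both tend to $0$ as $|z| \to \infty$.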
