How does a batch normalization layer resolve the vanishing gradient problem?
According to this article:
https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484
The vanishing gradient problem occurs when using the sigmoid activation function, because sigmoid maps a large input space into a small output range, so the gradient for large input values is close to zero. The article suggests adding a batch normalization layer.
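For concreteness, here is a minimal sketch of the saturation effect I mean (plain NumPy, my own toy example, not from the article):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of the sigmoid: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_grad(xs))
# -> [0.25, ~0.105, ~0.0066, ~0.000045]
# The gradient collapses toward zero as |x| grows: that is the
# vanishing gradient effect the article describes.
```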
I can't understand how this works.
When using normalization, the biggest values are still the biggest values in the new range (instead of (-inf, inf) they fall into [0, 1] or [-1, 1]), so in the same cases the values (before or after normalization) will be placed near the edges of the sigmoid's saturated region and the gradient will still be close to zero. Am I right?
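To make the question concrete, here is how I tried to check this (my own toy experiment; `batch_norm` below is only the normalization step `(x - mean) / std` of a real BatchNorm layer, without the learnable scale and shift parameters gamma and beta):

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def batch_norm(x, eps=1e-5):
    # normalization step of batch norm: zero mean, unit variance
    # (omitting the learnable scale/shift parameters gamma and beta)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 10.0, size=256)  # pre-activations with a large spread

# mean sigmoid gradient on the raw, widely spread inputs
print(sigmoid_grad(x).mean())
# mean sigmoid gradient after normalizing to zero mean / unit variance
print(sigmoid_grad(batch_norm(x)).mean())
```

In my test the second number comes out much larger than the first, which seems to contradict my intuition above, and that is exactly what I'm confused about.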