Vanishing gradient problem even when using the ReLU function?

Let's say I have a deep neural network with 50 hidden layers, and every neuron in the hidden layers uses the ReLU activation function. My questions are:

  • Is it possible for the vanishing gradient problem to occur during backpropagation for the weight updates, even though ReLU is used?
  • Or can we say that the vanishing gradient problem will never occur when all the activation functions are ReLU?

Topic cnn gradient-descent deep-learning neural-network

Category Data Science


It can always happen. If the weights are really tiny numbers close to zero, the gradients suffer just the same: when the pre-activation at a neuron is positive, the gradient passed back is simply the upstream gradient multiplied by the weights of that layer, which can be small, and when it is negative the local gradient is exactly zero. So the answer to your question is yes, I think. The chances are of course much better than with something like sigmoid, but saying that it will never happen is, I think, totally wrong.
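To make this concrete, here is a minimal NumPy sketch (the layer count, width, and weight scale are my own choices for illustration, not from the answer above). It backpropagates through a stack of ReLU layers with small random weights, and the gradient that reaches the first layer comes out essentially zero:

    import numpy as np

    rng = np.random.default_rng(0)
    n_layers, width = 50, 64
    weights = [rng.normal(scale=0.05, size=(width, width)) for _ in range(n_layers)]

    # Forward pass, caching pre-activations for the backward pass.
    x = rng.normal(size=width)
    pre_acts = []
    for W in weights:
        z = W @ x
        pre_acts.append(z)
        x = np.maximum(z, 0.0)              # ReLU

    # Backward pass; upstream gradient taken as all ones for illustration.
    grad = np.ones(width)
    for W, z in zip(reversed(weights), reversed(pre_acts)):
        grad = grad * (z > 0)               # ReLU gradient: 1 where z > 0, else 0
        grad = W.T @ grad                   # chain rule through the linear layer

    print("gradient norm at the first layer:", np.linalg.norm(grad))
    # With scale=0.05 this prints something vanishingly small, despite ReLU.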


Are you talking about LeakyReLU by chance and not ReLU? Because ReLU is known for killing gradients, since any input less than zero is mapped to zero. This is true regardless of the number of layers. LeakyReLU, on the other hand, scales inputs less than zero by a small positive slope instead of zeroing them, so a gradient keeps flowing. This prevents vanishing gradients from occurring.

EDIT: LeakyReLU prevents dying ReLU from occurring, not vanishing gradients. PReLU prevents vanishing gradients from occurring.
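For reference, a rough sketch of how the three activations treat negative inputs (the alpha values here are illustrative, not canonical):

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)             # gradient is 0 for z < 0

    def leaky_relu(z, alpha=0.01):
        return np.where(z > 0, z, alpha * z)  # fixed small slope for z < 0

    def prelu(z, alpha):
        return np.where(z > 0, z, alpha * z)  # same form, but alpha is learned

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(z))                 # negative inputs zeroed: [0, 0, 0, 0.5, 2]
    print(leaky_relu(z))           # scaled by 0.01: [-0.02, -0.005, 0, 0.5, 2]
    print(prelu(z, alpha=0.25))    # scaled by learned alpha: [-0.5, -0.125, 0, 0.5, 2]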

EDIT 2: To answer the comments. VGG was proposed in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" and was one of the top-performing models in the ImageNet challenge at the time. Architecture-wise, VGG wasn't completely different from what had been done before; it was just much deeper. This is in part where the vanishing gradient becomes a problem. It does not really have to do with ReLU alone, but with the combined effect of every single layer.

Enter ResNet, which uses skip connections. These effectively make parts of the network shallower and make it easier for the network to learn both easy and difficult tasks (i.e. low and high frequencies in images). More difficult tasks require more learnable parameters, while easier tasks require fewer.
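As a rough illustration of why the skip connection helps (a fully-connected toy block of my own, not the actual convolutional ResNet block): the output is x + F(x), so the local Jacobian contains an identity term and the gradient can flow straight back through the addition even when F's own contribution is tiny.

    import numpy as np

    def residual_block(x, W1, W2):
        h = np.maximum(W1 @ x, 0.0)   # F(x): linear -> ReLU -> linear
        return x + W2 @ h             # skip connection: output = x + F(x)

    # Backward: d(output)/dx = I + dF/dx, so even if dF/dx is nearly zero,
    # the identity term passes the upstream gradient through unchanged.
    rng = np.random.default_rng(0)
    x = rng.normal(size=8)
    W1 = rng.normal(scale=0.05, size=(8, 8))
    W2 = rng.normal(scale=0.05, size=(8, 8))
    print(residual_block(x, W1, W2))  # close to x itself when the weights are small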

I believe PReLU, being a learnable activation function, can help deal with this.
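A small sketch of what "learnable" means here (my own derivation, for illustration only): PReLU's negative slope alpha receives its own gradient during backpropagation, so the network can open up the negative side wherever it needs gradient to flow.

    import numpy as np

    def prelu_grads(z, alpha):
        dz = np.where(z > 0, 1.0, alpha)   # gradient w.r.t. the input
        dalpha = np.where(z > 0, 0.0, z)   # gradient w.r.t. alpha itself
        return dz, dalpha

    z = np.array([-2.0, 0.5])
    print(prelu_grads(z, alpha=0.25))      # dz = [0.25, 1.0], dalpha = [-2.0, 0.0]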
