Vanishing gradient vs. exploding gradient: gradient clipping as an activation function?

ReLU is used as an activation function and serves two purposes:

  1. Introducing non-linearity into a DNN.
  2. Helping to mitigate the vanishing gradient problem.

For the exploding gradient problem, we use gradient clipping, where we set a maximum threshold on the gradient, similar to how ReLU clamps its output at a minimum of 0.
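To make the analogy concrete, here is a minimal sketch (assuming PyTorch; the values and the clipping threshold are made up): ReLU clamps an activation from below at 0, while gradient clipping (the clip-by-value variant) caps a gradient at a chosen threshold.

```python
import torch

# ReLU: clamp the *activation* from below at 0.
x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
relu_out = torch.clamp(x, min=0.0)                    # tensor([0., 0., 0., 1.5, 3.])

# Gradient clipping (clip-by-value): cap the *gradient* at a chosen threshold.
grad = torch.tensor([-8.0, 0.3, 5.0])
clipped_grad = torch.clamp(grad, min=-1.0, max=1.0)   # tensor([-1., 0.3, 1.])
```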

From what I have read so far, ReLU is considered an activation function. In a similar fashion, can we also use gradient clipping as an activation function? If yes, what are the pros and cons of doing so?



ReLU is considered an activation function; in a similar fashion, can we also use gradient clipping as an activation function?

No. ReLU is an activation function: it is applied to activations in the forward pass. Gradient clipping is a technique applied to the gradients during backpropagation to keep the problem of exploding gradients at bay, so it cannot take the place of an activation function.
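As a rough sketch of where each one lives (assuming PyTorch; the model, data, and clipping threshold below are made up for illustration), ReLU sits inside the network definition and acts during the forward pass, while gradient clipping is a single call on the parameters' gradients after backward() and before the optimizer step:

```python
import torch
import torch.nn as nn

# ReLU is part of the model: it transforms activations in the forward pass.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),                 # activation function: output = max(0, input)
    nn.Linear(64, 1),
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 20), torch.randn(32, 1)
loss = loss_fn(model(x), y)
loss.backward()

# Gradient clipping is part of the training loop: it rescales the gradients
# produced by backward() so that their global norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```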

I also want to stress that the best technique to control vanishing/exploding gradients at the moment is batch normalization. Dropout (a technique born to fight overfitting) also has a similar regularizing effect, by forcing the model to distribute weights more evenly across the units of a layer. That's why you don't see gradient clipping as often as you used to.
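For illustration, a minimal sketch (assuming PyTorch; the layer sizes and dropout rate are arbitrary) of where batch normalization and dropout typically go in a fully connected block:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalizes activations, keeping gradients in a healthy range
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly zeroes activations, spreading the load across units
    nn.Linear(64, 1),
)
```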


EDIT:

I forgot to mention that proper scaling of your variables and appropriate weight initialization make the vanishing/exploding gradient problem much less frequent. This is purely based on personal experience, but it's still very important to take into account.
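As an example of what that looks like in practice, here is a minimal sketch (assuming PyTorch; the architecture is made up) of He/Kaiming initialization, a common choice for ReLU networks. Input scaling itself is usually done on the data before it reaches the model, e.g. standardizing each feature to zero mean and unit variance.

```python
import torch.nn as nn

def init_weights(module):
    # Kaiming/He initialization keeps the variance of activations roughly
    # constant from layer to layer, which helps gradients stay well-behaved.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
model.apply(init_weights)
```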
