Vanishing gradient vs. exploding gradient: can gradient clipping be used as an activation function?
ReLU is used as an activation function that serves two purposes (see the sketch after this list):
- Breaking linearity in a DNN (i.e., introducing non-linearity).
- Helping to mitigate the vanishing gradient problem.
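For context, here is a minimal sketch of what I mean by ReLU acting as the non-linearity between layers (assuming PyTorch; the layer sizes are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# ReLU sits between the linear layers and clamps negative activations to 0
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)

x = torch.randn(4, 10)
print(model(x).shape)  # torch.Size([4, 1])
```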
For the exploding gradient problem, we use the gradient clipping approach, where we set a maximum threshold on the gradient, similar to how ReLU sets a minimum limit of 0. Below is a sketch of the kind of clipping I am referring to.
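A minimal sketch of gradient clipping inside a training step (again assuming PyTorch; the model, data, and max_norm=1.0 are arbitrary placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

x, y = torch.randn(4, 10), torch.randn(4, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()

# Cap the gradient norm after backprop, before the weight update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```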
From what I have read so far, ReLU is considered an activation function. In a similar fashion, can gradient clipping also be used as an activation function? If yes, what are the pros and cons of using it?
Topic gradient activation-function
Category Data Science