Are saddle points a cause of the vanishing gradient problem?
I am a beginner to neural networks and I am writing a report summarising the causes of and solutions to the vanishing gradient problem. From what I have read, the two main causes are the repeated multiplication of saturated activation-function derivatives and the repeated multiplication of small weights from bad initialisation. I tend to view both of them as consequences of poorly chosen network components, which then lead to numerical trouble during training.
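To make these two causes concrete for my report, I wrote the small NumPy sketch below (my own illustration with an arbitrary 30-layer chain and made-up weights, not taken from any reference). It multiplies the chain-rule factors of saturated sigmoid derivatives and small weights, and the backpropagated gradient shrinks towards zero:

```python
# Illustrative sketch: repeated multiplication of sigmoid derivatives
# (each at most 0.25) and small weights shrinks a backpropagated gradient.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, attained at x = 0

rng = np.random.default_rng(0)
n_layers = 30
grad = 1.0  # gradient arriving at the output layer

for layer in range(n_layers):
    pre_activation = rng.normal()     # hypothetical pre-activation value
    weight = rng.normal(scale=0.5)    # small weight from poor initialisation
    grad *= weight * sigmoid_derivative(pre_activation)  # one chain-rule factor

print(f"|gradient| after {n_layers} layers: {abs(grad):.3e}")
# prints a value many orders of magnitude smaller than 1
```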
Additionally, the proliferation of saddle points on the cost surface of high-dimensional problems is another potential source of near-zero gradients. However, I am not entirely sure whether I should include it as one of the causes of the vanishing gradient problem, because it seems to be an inherent property of the non-convex cost surface, which attracts the gradient descent trajectory, rather than a consequence of poor component choices.
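To illustrate what I mean, here is a toy NumPy example I put together (assuming the classic saddle f(x, y) = x^2 - y^2, which is my own choice and not from any source). The gradient is exactly zero at the origin even though the origin is not a minimum, so plain gradient descent stalls on the plateau around it before the negative-curvature direction eventually pulls it away:

```python
# Toy example: gradient descent near the saddle of f(x, y) = x^2 - y^2.
import numpy as np

def grad_f(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])  # analytic gradient of x^2 - y^2

point = np.array([0.5, 1e-8])  # start close to the saddle's stable axis
lr = 0.1

for step in range(61):
    g = grad_f(point)
    if step % 10 == 0:
        print(f"step {step:3d}: |grad| = {np.linalg.norm(g):.2e}")
    point -= lr * g

# The gradient norm collapses to roughly 1e-4 around step 40, yet the origin is
# not a minimum: the Hessian diag(2, -2) has eigenvalues of both signs (a saddle).
```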
It would be greatly appreciated if someone could offer some structured ideas on this topic. Thanks in advance.
Topic: weight-initialization, gradient-descent
Category: Data Science