Understanding the intuition behind the sigmoid curve in the context of backpropagation

I was trying to understand the significance of the S-shape of the sigmoid / logistic function. The slope/derivative of the sigmoid approaches zero for very large and very small input values, i.e. $\sigma'(z) \approx 0$ for $z > 10$ or $z < -10$. So the updates to the weights will be smaller there, whereas the updates will be bigger when $z$ is neither too big nor too small.
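
To make the saturation concrete, here is a minimal NumPy sketch (the sample values of $z$ are just illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 (value 0.25) and vanishes in both tails.
for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"z = {z:6.1f}   sigma(z) = {sigmoid(z):.5f}   sigma'(z) = {sigmoid_prime(z):.5f}")
```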

  1. I don't get why it's significant to have smaller updates when $z$ is too big or too small, and bigger updates otherwise. One reasoning I read is that it squashes outliers. But how do very large and very small values of $z = wx + b$ indicate that the corresponding $x$ are outliers?

  2. Also, I was not able to map the sigmoid derivative curve (in blue) to the gradient descent curve below. Do these two curves relate to each other in any way? Should very large and very small $z$ on the sigmoid curve coincide with the global minimum in the middle of the GD curve?

Tags: sigmoid, backpropagation, gradient-descent, logistic-regression



The value $z$ increases when $W$ increases ($z = W \cdot a + b$). So first, why does $W$ grow large?

  i) The data is not normalized, where applicable.
  ii) Improper initialization of the weights.
  iii) A bad network architecture, leading to exploding gradients.

If we want to increase the speed of learning, we regularize the $W$'s, use BatchNorm, or simply center the data by normalizing it before training. This is because the gradient is approximately zero when the $W$'s are too high or too low, so the updated weight, $W := W - \eta \frac{\partial L}{\partial W}$ (where $\eta$ is the learning rate), will be nearly equal to $W$, thereby stalling the learning.
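
As a minimal sketch of this effect (a single sigmoid neuron with squared loss; the bias is dropped and the learning rate and data values are hypothetical illustrations, not from the answer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, x, y, lr=0.5):
    """One gradient step for a single sigmoid neuron with squared loss
    L = 0.5 * (sigmoid(w * x) - y)^2 (bias dropped for brevity)."""
    a = sigmoid(w * x)
    grad = (a - y) * a * (1.0 - a) * x   # dL/dw, using sigma'(z) = a * (1 - a)
    return w - lr * grad

x, y = 1.0, 0.0
for w0 in [0.5, 10.0]:                   # moderate vs. saturated regime
    w1 = gradient_step(w0, x, y)
    print(f"w = {w0:5.1f} -> w' = {w1:.6f}  (change = {w1 - w0:+.6f})")
```

With $w = 0.5$ the step changes the weight noticeably, while with $w = 10$ (a saturated neuron) the change is on the order of $10^{-5}$: the update leaves $W$ nearly equal to $W$.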

Answering your questions:

  1. It is not desirable to have smaller updates; we don't want slower learning. This is why there are momentum algorithms that accelerate gradient descent (see the sketch after this list). However, choosing a proper activation function still plays a major role (e.g. tanh centers the data better than sigmoid because its mean is 0, unlike sigmoid's).
  2. You can refer to Back Propagation here, which shows the effect of the activation function on gradient descent.
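
As mentioned in point 1, momentum keeps learning moving even through flat regions. Here is a minimal sketch of the classical momentum update (the learning rate, $\beta$, and the constant gradient stream are hypothetical illustration values):

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One step of classical SGD with momentum: the velocity v accumulates
    past gradients, so progress continues even when each gradient is tiny."""
    v = beta * v + grad
    w = w - lr * v
    return w, v

# Hypothetical stream of small gradients, as in a saturated region.
w, v = 5.0, 0.0
for step in range(5):
    grad = 0.01              # plain SGD would move only lr * grad = 0.001
    w, v = momentum_step(w, v, grad)
    print(f"step {step}: w = {w:.5f}, velocity = {v:.5f}")
```

Because the velocity compounds across steps, the effective step size grows even though each raw gradient stays small.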
