Why is it valid to remove a constant factor from the derivative of an error function?
I was reading the book 'Make Your Own Neural Network' by Tariq Rashid. In his book (note: he is talking about ordinary feed-forward neural networks), he gives the slope of the error with respect to a weight as:

$$\frac{\partial E}{\partial W_{jk}} = -2\,(t_k - O_k)\cdot \text{sigmoid}\!\left(\sum_{j} W_{jk}\cdot O_j\right)\left(1 - \text{sigmoid}\!\left(\sum_{j} W_{jk}\cdot O_j\right)\right)\cdot O_j$$

Here $t_k$ is the target value at node $k$, $O_k$ is the predicted output at node $k$, $W_{jk}$ is the weight connecting node $j$ to node $k$, and $E$ is the error at node $k$.
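As I understand it, this comes from the chain rule applied to $E = (t_k - O_k)^2$ with $O_k = \text{sigmoid}\left(\sum_{j} W_{jk}\cdot O_j\right)$ (my own working, not quoted from the book):

$$\frac{\partial E}{\partial O_k} = -2\,(t_k - O_k), \qquad \frac{\partial O_k}{\partial W_{jk}} = \text{sigmoid}\!\left(\sum_{j} W_{jk}\cdot O_j\right)\left(1 - \text{sigmoid}\!\left(\sum_{j} W_{jk}\cdot O_j\right)\right)\cdot O_j$$

Multiplying the two factors gives the expression above.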
Then he says that we can remove the $2$ because we only care about the direction of the slope of the error function, and the $2$ is just a scaling factor. But if that is the argument, can't we also remove $\text{sigmoid}\left(\sum_{j} W_{jk}\cdot O_j\right)$, since we know it is always between $0$ and $1$ and so it would also just act as a scaling factor? Taking that further, we could remove everything after $(t_k - O_k)$, since we know that whole expression is between $0$ and $1$ and so it too would just act as a scaling factor. That would leave us with just:
$$t_k-O_k$$
which is definitely the wrong derivative.
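To make the question concrete, here is a small NumPy sketch of my own (the toy numbers are made up, not from the book) comparing the full derivative, the version with the $2$ removed, and the bare $(t_k - O_k)$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy numbers (made up): one output node k fed by three hidden nodes j.
O_j = np.array([0.2, 0.9, 0.5])    # hidden-node outputs O_j
W_jk = np.array([0.4, -0.3, 0.8])  # weights into node k
t_k = 1.0                          # target value at node k

O_k = sigmoid(W_jk @ O_j)          # predicted output at node k

# Full derivative dE/dW_jk for E = (t_k - O_k)^2, one component per weight j.
grad_full = -2 * (t_k - O_k) * O_k * (1 - O_k) * O_j

# With the constant 2 removed: every component is scaled by the same 1/2,
# so the ratio to the full derivative is the same constant for every weight.
grad_no_2 = grad_full / 2
print(grad_full / grad_no_2)       # [2. 2. 2.]

# With everything after (t_k - O_k) removed: the per-weight factor O_j is gone,
# so the ratio to the full derivative differs from weight to weight.
grad_bare = (t_k - O_k) * np.ones_like(W_jk)
print(grad_full / grad_bare)       # three different numbers
```

Removing the $2$ rescales every component of the gradient identically, while removing the rest rescales each weight's component by a different, input-dependent amount; that difference is what my question is about.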
If we can't remove that whole expression, then why did he remove the $2$, when both are just scaling factors?
Topic derivation deep-learning neural-network machine-learning
Category Data Science