How is the error represented in gradient descent after each forward pass?

In a Neural Network Multilayer Perceptron, I understand that the main difference between Stochastic Gradient Descent (SGD) and Gradient Descent (GD) lies in how many samples are used per training step. That is, SGD iteratively picks one sample, performs a forward pass, and then backpropagates to adjust the weights, as opposed to GD, where backpropagation starts only after the forward pass has been computed for all samples.

My question is: when Gradient Descent (or even mini-batch Gradient Descent) is the chosen approach, how do we represent the error of a single forward pass over the batch? Assuming my network has only a single output neuron, is the error represented by averaging the individual errors of each sample or by summing them? To me this sounds implementation-dependent, but I want to know whether there is a conventional choice.

Topic mini-batch-gradient-descent gradient-descent scikit-learn neural-network machine-learning

Category Data Science


To get the total error before backpropagating, it is common to take the average of all the forward-pass errors. This is also what is done in RNNs such as LSTMs.

In the case of linear regression and logistic regression, the traditional Mean Squared Error function produces such a value.

In essence, this value is represented by an average of errors: $Y(w) = \frac{1}{n}\sum_{i=1}^n Y_i(w)$
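For concreteness, here is a minimal NumPy sketch of computing that averaged error for one batch of a single-output network (the function and array names are illustrative, not taken from the question):

```python
import numpy as np

def batch_mse(y_true, y_pred):
    """Mean squared error over a batch: the per-sample squared
    errors are averaged, not summed."""
    per_sample_errors = (y_true - y_pred) ** 2   # one error per sample, Y_i(w)
    return per_sample_errors.mean()              # average -> Y(w)

# Example: a batch of 4 targets and predictions for a single output neuron
y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.8, 0.2, 0.6, 0.9])
print(batch_mse(y_true, y_pred))  # 0.0625
```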

Also, as a reminder of the actual weight update, from Wikipedia:

When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:

$$w:=w - {\alpha}\nabla Y(w) $$

which is basically

$$w := w - \frac{\alpha}{n}\sum_{i=1}^n \nabla Y_i(w) $$

Notice the $\frac{1}{n}$: combined with the $\sum_{i=1}^n$, it results in the average of all gradients.

:= means 'becomes equal to'

$\alpha$ is the learning rate
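As an illustration of that averaged-gradient update, here is a hedged NumPy sketch of one batch gradient-descent step for a linear model with MSE loss (function and variable names are my own, not from the Wikipedia excerpt):

```python
import numpy as np

def batch_gd_step(w, X, y, alpha):
    """One batch gradient-descent step:
    w := w - alpha * (1/n) * sum_i grad Y_i(w), with Y_i(w) = (x_i.w - y_i)^2."""
    n = X.shape[0]
    preds = X @ w                                      # forward pass over the whole batch
    per_sample_grads = 2 * (preds - y)[:, None] * X    # gradient of each Y_i(w)
    avg_grad = per_sample_grads.sum(axis=0) / n        # the sum divided by n = average
    return w - alpha * avg_grad

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
w = np.zeros(3)
for _ in range(500):
    w = batch_gd_step(w, X, y, alpha=0.1)
print(w)  # w approaches true_w = [1.0, -2.0, 0.5]
```

Summing instead of averaging would only rescale the gradient by $n$, which can be absorbed into the learning rate; averaging is the more common convention because it keeps $\alpha$ comparable across different batch sizes.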
