How to compute the backpropagation gradient according to the chain rule using vector/matrix differentials?
I have some problems computing the derivative of the sum-of-squares error in a backpropagation neural network. For example, we have a neural network as in the picture. For drawing simplicity, I've dropped the sample indexes.
Conventions:
- x - data set input.
- W - weight matrix.
- v - vector of the product W*x.
- F - activation function, applied elementwise (a vector of activations).
- y - vector of activated data.
- D - vector of answers.
- e - error signal.
- A lower index on a variable gives its dimensionality (e.g. NxN).
- An upper index in brackets is the layer number.
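To keep the notation concrete, here is a minimal NumPy sketch of the forward pass in this notation. It assumes N = 2 and a sigmoid for F; both are my assumptions for illustration, since the picture does not fix them.

```python
import numpy as np

# Minimal forward pass in the notation above (N = 2 assumed).
# The sigmoid for F is an assumption; the question does not fix an activation.
rng = np.random.default_rng(0)
N = 2

def F(v):                      # activation function, applied elementwise
    return 1.0 / (1.0 + np.exp(-v))

x  = rng.normal(size=(N, 1))   # data set input
W1 = rng.normal(size=(N, N))   # weight matrix of layer 1, W^[1]
W2 = rng.normal(size=(N, N))   # weight matrix of layer 2, W^[2]

v1 = W1 @ x                    # v^[1] = W^[1] x
y1 = F(v1)                     # y^[1] = F(v^[1])
v2 = W2 @ y1                   # v^[2] = W^[2] y^[1]
y2 = F(v2)                     # y^[2] = F(v^[2])

D = rng.normal(size=(N, 1))    # vector of answers
e = D - y2                     # error signal
```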
The Jacobian is defined as:
$$J = \begin{pmatrix}\frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_N}\\ \vdots & \ddots & \vdots\\ \frac{\partial f_M}{\partial x_1} & \dots & \frac{\partial f_M}{\partial x_N}\end{pmatrix}$$
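As a sanity check on this definition, a small finite-difference sketch (sigmoid assumed for F, as above) shows that the Jacobian of an elementwise activation is diagonal, with F'(v_k) on the diagonal:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def numerical_jacobian(f, v, eps=1e-6):
    """Central-difference Jacobian of a vector-valued f at the point v."""
    n = v.size
    J = np.zeros((n, n))
    for j in range(n):
        vp, vm = v.copy(), v.copy()
        vp[j] += eps
        vm[j] -= eps
        J[:, j] = (f(vp) - f(vm)) / (2 * eps)
    return J

v = np.array([0.3, -1.2])
J = numerical_jacobian(sigmoid, v)
print(J)                                        # off-diagonal entries ~ 0
print(np.diag(sigmoid(v) * (1 - sigmoid(v))))   # analytic F'(v) on the diagonal
```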
Let's have a look at the second layer and find the derivative according to the chain rule.
The optimization error is the sum of squared errors:
$$E = \sum_{k} e_k^2, \qquad e = D - y^{[2]}$$
The chain rule for the second layer is:
$$\frac{\partial E}{\partial W^{[2]}} = \frac{\partial E}{\partial e}\,\frac{\partial e}{\partial y^{[2]}}\,\frac{\partial y^{[2]}}{\partial v^{[2]}}\,\frac{\partial v^{[2]}}{\partial W^{[2]}}$$
I'm going to differentiate according to the rules found here: https://fmin.xyz/docs/theory/Matrix_calculus/
- $\frac{\partial e}{\partial y^{[2]}}$ - NxN matrix
- $\frac{\partial y^{[2]}}{\partial v^{[2]}}$ - NxN matrix (look at the Jacobian)
- $\frac{\partial v^{[2]}}{\partial W^{[2]}}$ - Nx2N tensor product (try yourself for 2x2)
So the differential is:
$$dE = \frac{\partial E}{\partial e}\,\frac{\partial e}{\partial y^{[2]}}\,\frac{\partial y^{[2]}}{\partial v^{[2]}}\,\frac{\partial v^{[2]}}{\partial W^{[2]}}\,dW^{[2]}$$
(You can verify the equations with this service, http://www.matrixcalculus.org/, or by hand in the 2x2 case.)
Let's look at the dimensions: (1xN) x (NxN) x (NxN) x (Nx2N) = (1x2N)
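To make the bookkeeping concrete, here is a NumPy sketch for N = 2 that builds each factor and checks the shape of the product. The sigmoid for F and the row-major vec(W) ordering of $\frac{\partial v}{\partial W}$ are my assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

W2 = rng.normal(size=(N, N))
y1 = rng.normal(size=(N, 1))
D  = rng.normal(size=(N, 1))

v2 = W2 @ y1
y2 = sigmoid(v2)
e  = D - y2

dE_de = 2 * e.T                               # 1 x N    (E = sum of e_k^2)
de_dy = -np.eye(N)                            # N x N    (e = D - y^[2])
dy_dv = np.diag((y2 * (1 - y2)).ravel())      # N x N    (Jacobian of the sigmoid)
dv_dW = np.kron(np.eye(N), y1.T)              # N x N^2  (v = W y, row-major vec(W))

grad = dE_de @ de_dy @ dy_dv @ dv_dW
print(grad.shape)                             # (1, 4), i.e. 1 x 2N for N = 2
print(W2.shape)                               # (2, 2), which does not match
```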
When we try to use gradient descent,
$$W^{[2]}_{new} = W^{[2]}_{old} - \eta\,\frac{\partial E}{\partial W^{[2]}},$$
- the dimensions will not match: (NxN) = (NxN) - (1x2N).
Therefore, we cannot update the coefficients this way.
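The mismatch shows up directly if you try the update numerically; this is just a sketch reusing the 1x2N gradient shape from above:

```python
import numpy as np

# Attempting the update W <- W - eta * dE/dW with the shapes above.
eta = 0.1
W2 = np.zeros((2, 2))        # N x N weight matrix
grad = np.zeros((1, 4))      # stands in for the 1 x 2N gradient computed above
try:
    W2 = W2 - eta * grad     # shapes (2, 2) and (1, 4) do not broadcast
except ValueError as err:
    print(err)
# In practice a vec-style gradient is reshaped back to W's shape
# (e.g. grad.reshape(2, 2)) before the update; that reshaping step is
# exactly what the dimension bookkeeping above leaves open.
```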
Where did I make a mistake? Any comments are welcome. Thanks!
Topic derivation matrix backpropagation neural-network machine-learning
Category Data Science