How does backpropagation work in the case of 2 hidden layers?

Imagine the following structure (for simplicity, there is no bias and no activation function such as sigmoid or ReLU, just weights). The input has two neurons, the two hidden layers have 3 neurons each, and the output layer has two neurons, so the cost ($\sum C$) has two subcosts ($C^0$, $C^1$).

(I'm new to machine learning and quite confused by the different notations, formats, and indices, so to clarify: for activations, the upper index shows the neuron's index within its layer, and the lower index shows the index of the layer it is in, so the third neuron in the second layer is written $a^2_1$ (0-starting). For weights, the upper index shows the index of the neuron the weight is coming from, the first lower index shows the index of the neuron it is going to, and the second lower index shows the index of the layer it is going to, so the weight of the third layer's first neuron that connects it with the second layer's first neuron is written $w^0_{0,2}$.)
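Just to check myself with one more example of the weight convention: the weight going from the second neuron of the third layer to the first neuron of the output layer would then be written $w^1_{0,3}$.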

Just for the example, I would like to propagate the first row's weights back. To illustrate:

a0-w1-a1-w2-a2-w3-a3-y0-C0
ax    ax    ax    ax
      ax    ax
   

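For reference, here is how I set this structure up to play with it numerically (a minimal NumPy sketch under my assumptions above: no biases, no activation functions; all variable names are my own):

```python
import numpy as np

layer_sizes = [2, 3, 3, 2]        # input, hidden, hidden, output
rng = np.random.default_rng(0)

# W[l][j, k] corresponds to w^k_{j,l}: the weight from neuron k in
# layer l-1 to neuron j in layer l (so W[3][0, 0] is w^0_{0,3}).
W = {l: rng.standard_normal((layer_sizes[l], layer_sizes[l - 1]))
     for l in range(1, len(layer_sizes))}

def forward(x, W):
    """Return all activations a[0] .. a[3] (no bias, no activation)."""
    a = [np.asarray(x, dtype=float)]
    for l in range(1, len(layer_sizes)):
        a.append(W[l] @ a[-1])
    return a

x = np.array([0.5, -1.0])         # the two input neurons a^0_0, a^1_0
y = np.array([1.0, 0.0])          # the targets y^0, y^1
a = forward(x, W)
costs = (a[-1] - y) ** 2          # C^0 and C^1; the total cost is costs.sum()
```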
So, to get the slope of the first weight of the output layer in the backpropagation process:

$$\frac{\partial\sum C}{\partial w^0_{0,3}} = \frac{\partial a^0_3}{\partial w^0_{0,3}}\frac{\partial\sum C}{\partial a^0_3}$$

As I learned, because $w^0_{0,3}$ doesn't affect $C^1$, just $C^0$, $\frac{\partial\sum C}{\partial a^0_3}$ is simply $2(a^0_3-y^0)$. I'm not sure how this can be generalized, though, so that it can be applied dynamically, without having to check whether each weight has a direct connection to all of the costs or not.
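To convince myself that this chain of derivatives is right, I checked it numerically (again a sketch under my assumptions: no biases or activations, squared-error costs, my own variable names). With no activation, $\frac{\partial a^0_3}{\partial w^0_{0,3}} = a^0_2$, so the analytic slope is $a^0_2 \cdot 2(a^0_3-y^0)$, and it matches a finite-difference estimate of $\frac{\partial\sum C}{\partial w^0_{0,3}}$:

```python
import numpy as np

layer_sizes = [2, 3, 3, 2]
rng = np.random.default_rng(1)
W = {l: rng.standard_normal((layer_sizes[l], layer_sizes[l - 1]))
     for l in range(1, len(layer_sizes))}
x = np.array([0.5, -1.0])
y = np.array([1.0, 0.0])

def total_cost(W):
    a = x
    for l in range(1, len(layer_sizes)):
        a = W[l] @ a
    return np.sum((a - y) ** 2)            # C^0 + C^1

# Analytic slope for w^0_{0,3} (stored as W[3][0, 0]):
# dSumC/dw^0_{0,3} = (da^0_3/dw^0_{0,3}) * 2(a^0_3 - y^0) = a^0_2 * 2(a^0_3 - y^0)
a1 = W[1] @ x
a2 = W[2] @ a1
a3 = W[3] @ a2
analytic = a2[0] * 2 * (a3[0] - y[0])

# Numeric slope: central finite difference on that single weight
eps = 1e-6
W[3][0, 0] += eps
c_plus = total_cost(W)
W[3][0, 0] -= 2 * eps
c_minus = total_cost(W)
W[3][0, 0] += eps                          # restore the original weight
numeric = (c_plus - c_minus) / (2 * eps)

print(analytic, numeric)                   # the two values should agree closely
```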

From what I glued together from the tutorials, calculating the first weight of the second layer (counting backwards):

$$\frac{\partial\sum C}{\partial w^0_{0,2}} = \frac{\partial a^0_2}{\partial w^0_{0,2}}\frac{\partial\sum C}{\partial a^0_2}$$

Here $w^0_{0,2}$ is connected to $a^0_2$, which is connected to all neurons of the output layer and therefore to both costs, so:

$$\frac{\partial\sum C}{\partial a^0_2}=\frac{\partial C^0}{\partial a^0_2}+\frac{\partial C^1}{\partial a^0_2}$$

Where: $$\frac{\partial C^0}{\partial a^0_2}=\frac{\partial a^0_3}{\partial a^0_2}\frac{\partial C^0}{\partial a^0_3}$$ and: $$\frac{\partial C^1}{\partial a^0_2}=\frac{\partial a^1_3}{\partial a^0_2}\frac{\partial C^1}{\partial a^1_3}$$
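If I spell these out under my assumptions (no activation, so the inner derivative is just the connecting weight, and squared-error costs), the two terms would come out as:

$$\frac{\partial\sum C}{\partial a^0_2}=w^0_{0,3}\,2(a^0_3-y^0)+w^0_{1,3}\,2(a^1_3-y^1)$$

at least if I'm applying my own index convention correctly.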

Clear so far. What I don't understand, and can't break down from this point, is: $$\frac{\partial\sum C}{\partial w^0_{0,1}} = \frac{\partial a^0_1}{\partial w^0_{0,1}}\frac{\partial\sum C}{\partial a^0_1}$$

Because $a^0_1$ has connections to three neurons in the next layer (the second layer counting backwards), in other words

$$\frac{\partial C^0}{\partial a^0_1}=\frac{\partial a^0_2}{\partial a^0_1}\frac{\partial C^0}{\partial a^0_2}$$

can't be true, as $a^0_1$ affects $C^0$ not just through $a^0_2$ but through $a^1_2$ and $a^2_2$ as well. So how can this be solved? Do I have to add them up, or does it need to be handled differently from the previous steps?
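If the same pattern as with the two costs applies here too, my guess would be that the three paths simply get summed:

$$\frac{\partial C^0}{\partial a^0_1}=\frac{\partial a^0_2}{\partial a^0_1}\frac{\partial C^0}{\partial a^0_2}+\frac{\partial a^1_2}{\partial a^0_1}\frac{\partial C^0}{\partial a^1_2}+\frac{\partial a^2_2}{\partial a^0_1}\frac{\partial C^0}{\partial a^2_2}$$

but I'm not sure this is correct, which is why I'm asking.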

Tags: mathematics, backpropagation, machine-learning
