Doubt in Derivation of Backpropagation

I was going through the derivation of the backpropagation algorithm provided in this document (attached just for reference), and I have a doubt at one specific point. The derivation goes as follows:

Notation:

  1. The subscript $k$ denotes the output layer
  2. The subscript $j$ denotes the hidden layer
  3. The subscript $i$ denotes the input layer
  4. $w_{kj}$ denotes a weight from the hidden to the output layer
  5. $w_{ji}$ denotes a weight from the input to the hidden layer
  6. $a$ denotes an activation value
  7. $t$ denotes a target value
  8. $net$ denotes the net input

The total error in the network is given by the following equation: $$E=\frac12 \sum_{k}(t_k-a_k)^2$$ We want to adjust the network’s weights to reduce this overall error: $$\Delta W \propto -\frac{\partial E}{\partial W}$$

We begin at the output layer with a particular weight: $$\Delta w_{kj} \propto -\frac{\partial E}{\partial w_{kj}}$$ However, the error is not directly a function of a weight, so we expand this using the chain rule: $$\Delta w_{kj} = -\epsilon \frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}}$$

The document then calculates the three partial derivatives above individually to arrive at the final weight-change rule for a hidden-to-output weight [Section 4.4 in the attached document]: $$\Delta w_{kj} = \epsilon (t_k-a_k)a_k(1-a_k)a_j$$ or, equivalently, $$\Delta w_{kj} = \epsilon \delta_k a_j$$ I followed the derivation up to this equation.
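For concreteness, here is a minimal numerical sketch of this output-layer rule (my own toy numbers, not from the document; it assumes sigmoid units, which is what the $a_k(1-a_k)$ factor implies, and uses NumPy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

eps  = 0.5                                  # learning rate (epsilon)
a_i  = np.array([1.0, 0.0, 0.5])            # input activations, shape (I,)
W_ji = np.array([[ 0.1, -0.2,  0.4],
                 [ 0.3,  0.8, -0.5]])       # input -> hidden weights, shape (J, I)
W_kj = np.array([[ 0.7, -0.3],
                 [-0.1,  0.6]])             # hidden -> output weights, shape (K, J)
t_k  = np.array([1.0, 0.0])                 # targets, shape (K,)

# Forward pass
a_j = sigmoid(W_ji @ a_i)                   # hidden activations
a_k = sigmoid(W_kj @ a_j)                   # output activations

# Output-layer rule: delta_w_kj = eps * (t_k - a_k) * a_k * (1 - a_k) * a_j
delta_k = (t_k - a_k) * a_k * (1 - a_k)
dW_kj   = eps * np.outer(delta_k, a_j)      # one row per output k, one column per hidden j
```

(The update is applied only after the input-to-hidden gradient has been computed as well; see the continuation of this sketch further down.)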

Now, the document continues the derivation as follows (I am quoting the exact wording from the document) [Section 4.5 in the attached document]:

Weight change rule for an input to hidden weight

Now we have to determine the appropriate weight change for an input to hidden weight. This is more complicated because it depends on the error at all of the nodes this weighted connection can lead to. $$\Delta w_{ji} \propto -[\sum_{k}\frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial net_k} \frac{\partial net_k}{\partial a_j}]\frac{\partial a_j}{\partial net_j}\frac{\partial net_j}{\partial w_{ji}}$$

I couldn't follow why there is this extra $\sum_{k}$ in the above equation, since we already accounted for the sum over all $k$ when we wrote the error as $E=\frac12 \sum_{k}(t_k-a_k)^2$. Could anyone shed some light on this?
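For reference, here is what the quoted rule computes in code, continuing the toy sketch above (again my own numbers; substituting the same sigmoid derivatives as in the output-layer rule, the bracketed sum becomes $\sum_k \delta_k w_{kj}$):

```python
# Input -> hidden rule: the sum over k is the backpropagation of every
# output delta_k through the weight w_kj that connects a_j to output k.
delta_j = a_j * (1 - a_j) * (W_kj.T @ delta_k)   # the matrix product performs the sum over k
dW_ji   = eps * np.outer(delta_j, a_i)           # delta_w_ji = eps * delta_j * a_i

# Apply both updates only after all gradients have been computed
W_kj += dW_kj
W_ji += dW_ji
```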



For a while I thought you were right, but the equation is correct.

If you think about it conceptually, I think you will see that it has to be correct. When updating $w_{ji}$, you have to make the change that improves the total error $E$, and not just the error $E_k$ of a single output node $k$, right? So you have to look for the weights that give you the lowest $E$, that is, where $\partial E / \partial w_{ji}$ is 0.

If you write out the sum over $k$ in the equation for $E$, you get one term for $a_1$, one for $a_2$, and so on. The sum over $k$ in the equation that confuses you just makes sure to pick up the $a_1$ term from $E$ first, then the $a_2$ term, and so on. It first differentiates $E$ with respect to $a_1$, which gives $a_1 - t_1$ if I did that correctly, and gets the correction for that term. Then it does the same for $a_2$, and so on until the end, and finally the sum over $k$ adds up the corrections from all terms.
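To spell that out in the document's notation (this is just the multivariate chain rule, assuming the usual $net_k = \sum_j w_{kj} a_j$): a hidden activation $a_j$ feeds into every $net_k$, so $$\frac{\partial E}{\partial a_j} = \sum_k \frac{\partial E}{\partial a_k}\frac{\partial a_k}{\partial net_k}\frac{\partial net_k}{\partial a_j} = \sum_k -(t_k-a_k)\,a_k(1-a_k)\,w_{kj} = -\sum_k \delta_k w_{kj}$$ and therefore $$\Delta w_{ji} \propto -\frac{\partial E}{\partial a_j}\frac{\partial a_j}{\partial net_j}\frac{\partial net_j}{\partial w_{ji}} = \Big(\sum_k \delta_k w_{kj}\Big)\,a_j(1-a_j)\,a_i$$ For the output weight $w_{kj}$, by contrast, only the single term $\frac12(t_k-a_k)^2$ of $E$ depends on it, so all the other terms of the sum have zero derivative and drop out; that is why no $\sum_k$ appeared in the earlier rule and why it reappears here.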
