Doubt in Derivation of Backpropagation
I was going through the derivation of the backpropagation algorithm provided in this document (attached just for reference), and I have a doubt about one specific point in the derivation. The derivation goes as follows:
Notation:
- The subscript $k$ denotes the output layer
- The subscript $j$ denotes the hidden layer
- The subscript $i$ denotes the input layer
- $w_{kj}$ denotes a weight from the hidden to the output layer
- $w_{ji}$ denotes a weight from the input to the hidden layer
- $a$ denotes an activation value
- $t$ denotes a target value
- $net$ denotes the net input
The total error in the network is given by
$$E=\frac12 \sum_{k}(t_k-a_k)^2$$
We want to adjust the network's weights to reduce this overall error:
$$\Delta W \propto -\frac{\partial E}{\partial W}$$
We begin at the output layer with a particular weight:
$$\Delta w_{kj} \propto -\frac{\partial E}{\partial w_{kj}}$$
However, the error is not directly a function of a weight, so we expand this using the chain rule:
$$\Delta w_{kj} = -\epsilon \frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}}$$
The document then evaluates each of these three partial derivatives to arrive at the final weight change rule for a hidden-to-output weight [Section 4.4 in the attached document]:
$$\Delta w_{kj} = \epsilon (t_k-a_k)a_k(1-a_k)a_j$$
or, equivalently,
$$\Delta w_{kj} = \epsilon \delta_k a_j$$
I followed the derivation up to this point.
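To check that I follow this step, here is how I would implement the hidden-to-output rule in NumPy. This is a minimal sketch; the sigmoid activations, toy layer sizes, and random values are my own assumptions, not from the document.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

eps = 0.1                                   # learning rate (epsilon)
rng = np.random.default_rng(0)

a_j = sigmoid(rng.standard_normal(3))       # hidden activations (toy values)
W_kj = rng.standard_normal((2, 3))          # hidden-to-output weights w_kj
t_k = rng.random(2)                         # target values

a_k = sigmoid(W_kj @ a_j)                   # output activations
delta_k = (t_k - a_k) * a_k * (1.0 - a_k)   # delta_k for a sigmoid output unit
W_kj += eps * np.outer(delta_k, a_j)        # Delta w_kj = eps * delta_k * a_j
```

The outer product just applies the same scalar rule $\Delta w_{kj} = \epsilon \delta_k a_j$ to every hidden-to-output weight at once.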
Now, the document continues the derivation as follows (I am quoting the exact wording from the document) [Section 4.5 in the attached document]:
Weight change rule for an input to hidden weight
Now we have to determine the appropriate weight change for an input to hidden weight. This is more complicated because it depends on the error at all of the nodes this weighted connection can lead to. $$\Delta w_{ji} \propto -[\sum_{k}\frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial net_k} \frac{\partial net_k}{\partial a_j}]\frac{\partial a_j}{\partial net_j}\frac{\partial net_j}{\partial w_{ji}}$$
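For concreteness, here is how I read this rule in code: the bracketed sum over $k$ shows up as the product `W_kj.T @ delta_k`. Again this is only a sketch under my own assumptions (sigmoid activations, toy layer sizes, and reading $\partial net_k / \partial a_j$ as $w_{kj}$).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

eps = 0.1                                    # learning rate (epsilon)
rng = np.random.default_rng(0)

a_i = rng.random(4)                          # input activations
W_ji = rng.standard_normal((3, 4))           # input-to-hidden weights w_ji
W_kj = rng.standard_normal((2, 3))           # hidden-to-output weights w_kj
t_k = rng.random(2)                          # target values

a_j = sigmoid(W_ji @ a_i)                    # hidden activations
a_k = sigmoid(W_kj @ a_j)                    # output activations

delta_k = (t_k - a_k) * a_k * (1.0 - a_k)    # output deltas from Section 4.4
# The sum over k, with the outer minus sign folded in: sum_k delta_k * w_kj
# (reading d(net_k)/d(a_j) as w_kj)
summed = W_kj.T @ delta_k
delta_j = summed * a_j * (1.0 - a_j)         # hidden deltas
W_ji += eps * np.outer(delta_j, a_i)         # Delta w_ji = eps * delta_j * a_i
```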
What I can't follow is why there is this extra $\sum_{k}$ in the document's equation for $\Delta w_{ji}$, since we already summed over all $k$ when we wrote the error as $E=\frac12 \sum_{k}(t_k-a_k)^2$. Could someone shed some light on this?
Tags: derivation, backpropagation, deep-learning, neural-network, machine-learning
Category: Data Science