Backpropagation with log likelihood cost function and softmax activation

In chapter 3 of Michael Nielsen's online book on neural networks, he introduces a new cost function, the log-likelihood cost, defined as
$$ C = -\ln(a_y^L) $$ Suppose we have 10 output neurons. When backpropagating the error, only the gradient w.r.t. the $y^{\text{th}}$ output neuron is non-zero and all the others are zero. Is that right?

If so, how is equation (81) below true? $$\frac{\partial C}{\partial b_j^L} = a_j^L - y_j$$ I'm getting the expression $$\frac{\partial C}{\partial b_j^L} = y_j (a_j^L - 1)$$
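For concreteness, here is a small NumPy check I put together (the random weighted inputs and the variable names are just for illustration) comparing both candidate expressions against a numerical gradient of $C = -\ln(a_y^L)$ through a softmax output layer with a one-hot target:

```python
import numpy as np

def softmax(z):
    """Softmax over the weighted inputs z^L of the output layer."""
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=10)                  # z^L_j = w^L_j . a^{L-1} + b^L_j (made up)
y_idx = 3                                # index of the true class
y = np.eye(10)[y_idx]                    # one-hot target vector

a = softmax(z)

# The two candidate expressions for dC/db^L_j (note dz^L_j/db^L_j = 1):
eq_81 = a - y                            # a^L_j - y_j
mine  = y * (a - 1)                      # y_j (a^L_j - 1)

# Central-difference numerical gradient of C = -ln(a^L_y) w.r.t. each z_j
eps = 1e-6
numerical = np.zeros(10)
for j in range(10):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    numerical[j] = (-np.log(softmax(zp)[y_idx]) + np.log(softmax(zm)[y_idx])) / (2 * eps)

print("max |numerical - (a - y)|   :", np.abs(numerical - eq_81).max())
print("max |numerical - y*(a - 1)| :", np.abs(numerical - mine).max())
```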

Tags: softmax, backpropagation, neural-network

Category: Data Science


No, you haven't quite understood how softmax works. It outputs a probability distribution: if there are 10 output neurons, you get 10 probabilities for the 10 respective classes, the neuron with the highest probability is the most activated, and none of the output neurons gives 0 as output. That is what softmax does: it exponentiates the weighted input of every class and normalizes, producing a probability distribution over k different classes (here k = 10).

Now, about your statement that with 10 output neurons, only the gradient w.r.t. the $y^{\text{th}}$ output neuron is non-zero while all the others are zero: this is wrong. If you go back and give it a read, the cost function when there are multiple output neurons is calculated as follows (this is the cross-entropy cost from the book):

$$ C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a_j^L + (1 - y_j) \ln(1 - a_j^L) \right] $$
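To make that concrete, here is a minimal NumPy sketch (the 10 logits are made up) showing that softmax turns arbitrary weighted inputs into 10 strictly positive probabilities that sum to 1:

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize to get a probability distribution."""
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5, 3.0, -2.0, 1.0, 0.0, -0.5, 1.5, -3.0])  # 10 logits
a = softmax(z)

print(a)                                 # 10 probabilities, one per class
print(a.sum())                           # 1.0
print((a > 0).all())                     # True: no output neuron gives exactly 0
print(a.argmax())                        # index of the most activated neuron (here 3)
```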

Now, as you can see, the cost is calculated over all the output neurons, for all the training examples in the batch of size n. So when you do backpropagation, that is, when you calculate dC/dw and dC/db, it involves the output from all the output neurons. Your statement can't be right, because if the gradients of the other output neurons were 0, how would backpropagation update their weight matrices? I know it's confusing, but if you read his chapter 2 you should be able to understand it. I have used cross-entropy to explain this, but the same method will work for any cost function you take.
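As a rough illustration (the batch size, shapes and random numbers here are made up), the cost sums over every output neuron j and every example in the batch, and the output-layer error $\delta^L = a^L - y$ is in general non-zero for every neuron, so every output-layer weight and bias gets updated:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

n, k = 5, 10                             # batch of n = 5 examples, k = 10 output neurons
z = rng.normal(size=(n, k))              # weighted inputs z^L, one row per example
a = sigmoid(z)                           # output activations a^L
y = np.eye(k)[rng.integers(0, k, n)]     # one-hot targets, one row per example

# Cross-entropy cost: sum over all output neurons j, average over the batch
C = -np.mean(np.sum(y * np.log(a) + (1 - y) * np.log(1 - a), axis=1))

# Output-layer error delta^L = a^L - y (sigmoid output + cross-entropy cost)
delta = a - y

print("cost over the whole batch:", C)
print("any exactly-zero entries in delta^L?", (delta == 0).any())
```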

As far as the derivation is concerned, it is quite easy. Just go back and study his chapter 2, all four equations (BP1, BP2, BP3, BP4), and understand their derivations. It will take some time, but it is easy once you understand the composite-function nature of neural networks and how to differentiate composite functions using the chain rule.
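For the specific derivation you asked about (log-likelihood cost with a softmax output layer, the setup in chapter 3), here is a sketch of the chain-rule calculation. The key step is that every $a_k^L$ depends on every $z_j^L$ through the softmax denominator, so you have to sum over $k$:

$$ \frac{\partial a_k^L}{\partial z_j^L} = a_k^L \left( \delta_{kj} - a_j^L \right) $$

$$ \frac{\partial C}{\partial b_j^L} = \sum_k \frac{\partial C}{\partial a_k^L} \, \frac{\partial a_k^L}{\partial z_j^L} \, \frac{\partial z_j^L}{\partial b_j^L} = -\frac{1}{a_y^L} \, a_y^L \left( \delta_{yj} - a_j^L \right) = a_j^L - \delta_{yj} = a_j^L - y_j $$

since $\partial z_j^L / \partial b_j^L = 1$ and $y$ is one-hot, so $y_j = \delta_{yj}$. The expression $y_j (a_j^L - 1)$ is what you get if you keep only the diagonal term $k = j$ of the softmax Jacobian, which is where the discrepancy comes from.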
