I would like to add a couple of dimensions to the above answers:
true label = [1 0 0 0 0]
predicted = [0.1 0.5 0.1 0.1 0.2]
cross-entropy (CE) boils down to taking the negative log of the prediction for the lone +ve class. So CE = -ln(0.1) = 2.3.
This means that the -ve predictions don't play any role in calculating CE. This is intentional.
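If it helps to see this in code, here is a minimal NumPy sketch of the standard (categorical) CE for this example - the array names are just illustrative:

```python
import numpy as np

true_label = np.array([1, 0, 0, 0, 0])
predicted  = np.array([0.1, 0.5, 0.1, 0.1, 0.2])

# Categorical CE: only the prediction at the +ve position contributes.
ce = -np.sum(true_label * np.log(predicted))
print(ce)  # ~2.30, i.e. -ln(0.1)
```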
On rare occasions, you may need to make the -ve voices count. This can be done by treating the above sample as a series of binary predictions. So:
true labels = [1,0], [0,1], [0,1], [0,1], [0,1]
predicted = [0.1, 0.9], [.5, .5], [.1, .9], [.1, .9], [.2, .8]
Now we compute 5 separate cross-entropies - one for each of the above 5 true label/predicted combos - and sum them up. Then:
CE = -[ ln(.1) + ln(0.5) + ln(0.9) + ln(0.9) + ln(0.8)] = 3.4
The CE has a different scale but continues to be a measure of the difference between the expected and predicted values. The only difference is that in this scheme, the -ve values are also penalized/rewarded along with the +ve values.
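As a rough sketch (again plain NumPy, names illustrative), this binary-expanded version is just the summed binary CE over all 5 classes; note that many frameworks average over classes instead of summing, so the absolute number can differ:

```python
import numpy as np

true_label = np.array([1, 0, 0, 0, 0])
predicted  = np.array([0.1, 0.5, 0.1, 0.1, 0.2])

# Each class is treated as its own present/absent problem; sum the 5 binary CEs.
ce = -np.sum(true_label * np.log(predicted)
             + (1 - true_label) * np.log(1 - predicted))
print(ce)  # ~3.43: -[ln(.1) + ln(.5) + ln(.9) + ln(.9) + ln(.8)]
```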
Most frameworks use the first definition of CE by default, and this is the right approach in the vast majority of cases. However, if your problem is such that you are going to use the output probabilities (both +ve and -ve) instead of just taking the max() to predict the one +ve label, then you may want to consider this version of CE.
The latter situation could be a multi-label one. What if multiple classes could be present in a single sample? Something like:
true label = [1 0 0 0 1]
and predicted is = [0.1 0.5 0.1 0.1 0.9]
By definition, CE measures the difference between two probability distributions. But the above two lists are not probability distributions: a probability distribution should always add up to 1, whereas the true label sums to 2 and the prediction to 1.7. How do we handle this?
Solution: Firstly, in a multi-label problem there is more than a single '1' in the output, so we should get rid of the softmax and bring in sigmoids - one for every neuron in the last layer (note that the number of neurons = number of classes). Secondly, we use the above approach to calculate the loss - we break the expected and predicted values into 5 individual binary probability distributions:
true labels = [1,0], [0,1], [0,1], [0,1], [1,0]
predicted = [.1, .9], [.5, .5], [.1, .9], [.1, .9], [.9, .1]
Now, just like before, we compute the cross-entropy for each of the above 5 true label/predicted pairs and sum them up. Then:
CE = -[ ln(.1) + ln(0.5) + ln(0.9) + ln(0.9) + ln(0.9)] = 3.3
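A minimal sketch of this multi-label case, assuming the 5 values in `predicted` are per-class sigmoid outputs:

```python
import numpy as np

true_label = np.array([1, 0, 0, 0, 1])
predicted  = np.array([0.1, 0.5, 0.1, 0.1, 0.9])  # per-class sigmoid outputs

# Summed binary CE over the 5 classes.
ce = -np.sum(true_label * np.log(predicted)
             + (1 - true_label) * np.log(1 - predicted))
print(ce)  # ~3.31: -[ln(.1) + ln(.5) + ln(.9) + ln(.9) + ln(.9)]
```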
Occasionally, the number of classes may be very high - say 1,000 - and there may be only a couple of them present in each sample. So the true label is something like: [1,0,0,0,0,0,1,0,0,0.....990 zeroes]. The predicted could be something like: [.8, .1, .1, .1, .1, .1, .8, .1, .1, .1.....990 0.1's]
In this case:
CE = -[ln(.8) + ln(.8)] (for the 2 +ve classes) - 998 * ln(0.9) (for the 998 -ve classes)
   = 0.44 (for the +ve classes) + 105 (for the -ve classes)
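To make the imbalance concrete, here is the same calculation sketched in NumPy (the positions of the two +ve classes are just illustrative):

```python
import numpy as np

true_label = np.zeros(1000)
true_label[[0, 6]] = 1                 # the 2 +ve classes
predicted = np.full(1000, 0.1)
predicted[[0, 6]] = 0.8

pos = -np.sum(true_label * np.log(predicted))            # ~0.45 from the 2 +ve classes
neg = -np.sum((1 - true_label) * np.log(1 - predicted))  # ~105 from the 998 -ve classes
print(pos, neg)
```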
You can see how the -ve classes are beginning to create a nuisance when calculating the loss. The voice of the +ve samples (which may be all that we care about) is getting drowned out. What do we do? We can't use categorical CE (the version where only the +ve samples are considered in the calculation), because we are forced to break the targets up into multiple binary probability distributions - otherwise they would not be probability distributions in the first place. And once we break them into multiple binary probability distributions, we have no choice but to use binary CE, which of course gives weight to the -ve classes.
One option is to dampen the voice of the -ve classes with a multiplier. Facebook did that and much more in their 2017 focal loss paper; refer to focal loss for more details.
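For intuition, here is a rough per-class sketch in the spirit of focal loss - the alpha/gamma defaults follow the paper's common choices, but treat this as an illustration rather than a reference implementation:

```python
import numpy as np

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0):
    # p_t is the predicted probability of the true outcome for each class.
    p_t = np.where(y_true == 1, y_pred, 1 - y_pred)
    # alpha balances +ve vs -ve classes; (1 - p_t)**gamma down-weights easy predictions.
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return -np.sum(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

true_label = np.zeros(1000); true_label[[0, 6]] = 1
predicted  = np.full(1000, 0.1); predicted[[0, 6]] = 0.8
print(focal_loss(true_label, predicted))  # the 998 easy -ve classes now contribute ~0.8 instead of ~105
```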
For a more in-depth treatment of this subject, you can refer to: https://towardsdatascience.com/cross-entropy-classification-losses-no-math-few-stories-lots-of-intuition-d56f8c7f06b0