Cross-entropy loss explanation

Suppose I build a neural network for classification. The last layer is a dense layer with Softmax activation, and I have five different classes to classify. Suppose that for a single training example the true label is [1 0 0 0 0] while the predictions are [0.1 0.5 0.1 0.1 0.2]. How would I calculate the cross-entropy loss for this example?

Topic softmax deep-learning neural-network machine-learning

Category Data Science


I would like to add a couple of dimensions to the above answers:

true label = [1 0 0 0 0]
predicted = [0.1 0.5 0.1 0.1 0.2]

cross-entropy (CE) boils down to taking the negative log of the lone +ve prediction, so CE = -ln(0.1) ≈ 2.3.

This means that the -ve predictions don't have a role to play in calculating CE. This is by intention.
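As a quick sanity check, here is a minimal NumPy sketch of that computation (the arrays are just the example values above):

```python
import numpy as np

y_true = np.array([1, 0, 0, 0, 0])
y_pred = np.array([0.1, 0.5, 0.1, 0.1, 0.2])

# Only the term where y_true == 1 survives the sum, giving -ln(0.1).
ce = -np.sum(y_true * np.log(y_pred))
print(ce)  # ~2.303
```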

On rare occasions, it may be necessary to make the -ve voices count. This can be done by treating the above sample as a series of binary predictions. So:

true labels = [1,0], [0,1], [0,1], [0,1], [0,1]
predicted = [0.1, 0.9], [.5, .5], [.1, .9], [.1, .9], [.2, .8]

Now we proceed to compute 5 different cross-entropies - one for each of the above 5 true label/predicted combos - and sum them up. Then:

CE = -[ ln(.1) + ln(0.5) + ln(0.9) + ln(0.9) + ln(0.8)] = 3.4
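A small NumPy sketch of this per-class binary treatment (same example values, just summed as five binary cross-entropies):

```python
import numpy as np

y_true = np.array([1, 0, 0, 0, 0])
y_pred = np.array([0.1, 0.5, 0.1, 0.1, 0.2])

# Each class is its own binary problem: -[y*ln(p) + (1-y)*ln(1-p)]
bce_per_class = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(bce_per_class.sum())  # ~3.4
```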

The CE has a different scale but continues to be a measure of the difference between the expected and predicted values. The only difference is that in this scheme, the -ve values are also penalized/rewarded along with the +ve values.

Frameworks use the first definition of CE by default, and this is the right approach in 99% of cases. However, if your problem is such that you are going to use the output probabilities (both +ve and -ve) instead of using max() to predict just the 1 +ve label, then you may want to consider this version of CE.

The last situation could be a multi-label one, where multiple classes could be present in a single sample - something like:

true label = [1 0 0 0 1]
and predicted is = [0.1 0.5 0.1 0.1 0.9]

By definition, CE measures the difference between 2 probability distributions. But the above two lists are not probability distributions. Probability distributions should always add up to 1. How do we handle this?

Solution: Firstly, in a multi-label problem there can be more than a single '1' in the output, so we should get rid of the softmax and bring in sigmoids - one for every neuron in the last layer (note that the number of neurons = number of classes). Secondly, we use the above approach to calculate the loss - wherein we break the expected and predicted values into 5 individual probability distributions of:

true labels = [1,0], [0,1], [0,1], [0,1], [1,0]
predicted = [.1, .9], [.5, .5], [.1, .9], [.1, .9], [.9, .1]

Now just like before, we proceed to take the cross entropy of the above 5 true labels and the 5 predicted probability distributions and sum them up. Then:

CE = -[ ln(.1) + ln(0.5) + ln(0.9) + ln(0.9) + ln(0.9)] = 3.3
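The same binary cross-entropy sum, sketched in NumPy for the multi-label example (here the predictions are assumed to come from per-class sigmoids):

```python
import numpy as np

y_true = np.array([1, 0, 0, 0, 1])
y_pred = np.array([0.1, 0.5, 0.1, 0.1, 0.9])  # per-class sigmoid outputs

bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(bce.sum())  # ~3.3
```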

Occasionally, the number of classes may be very high - say 1,000 - and there may be only a couple of them present in each sample. So the true label is something like [1,0,0,0,0,0,1,0,0,0, ...990 zeroes] and the predicted could be something like [.8, .1, .1, .1, .1, .1, .8, .1, .1, .1, ...990 0.1's].

In this case the CE =

-[ ln(.8) + ln(.8) + 998 * ln(0.9) ]  (the first two terms come from the 2 +ve classes; each of the 998 -ve classes contributes ln(1 - 0.1) = ln(0.9))

= 0.44 (for the +ve classes) + 105 (for the -ve classes)
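A sketch of the same calculation in NumPy, splitting the loss into its +ve and -ve contributions (the class positions are arbitrary, chosen only to mirror the example above):

```python
import numpy as np

n_classes = 1000
y_true = np.zeros(n_classes)
y_true[[0, 6]] = 1                    # the 2 +ve classes
y_pred = np.full(n_classes, 0.1)
y_pred[[0, 6]] = 0.8                  # confident on the +ve classes

bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(bce[y_true == 1].sum())  # ~0.45  (2 +ve classes)
print(bce[y_true == 0].sum())  # ~105   (998 -ve classes)
```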

You can see how the -ve classes are beginning to create a nuisance when calculating the loss: the voice of the +ve samples (which may be all that we care about) is getting drowned out. What do we do? We can't use categorical CE (the version where only +ve samples are considered in the calculation), because we are forced to break the output into multiple binary probability distributions - otherwise it would not be a probability distribution in the first place. Once we break it into multiple binary probability distributions, we have no choice but to use binary CE, and this of course gives weight to the -ve classes.

One option is to drown out the voice of the -ve classes with a multiplier. Facebook did that and much more in a 2017 paper; you can refer to focal loss for more details.
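For reference, the core idea of focal loss is to scale each class's binary cross-entropy term by $(1-p_t)^\gamma$ (optionally with a class weight $\alpha$), so that easy, confidently classified terms - mostly the flood of -ve classes - contribute very little. A rough sketch of that idea (framework implementations differ in details; the $\gamma$ and $\alpha$ values here are the paper's suggested defaults):

```python
import numpy as np

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    # p_t: predicted probability assigned to the true label of each class
    p_t = np.where(y_true == 1, y_pred, 1 - y_pred)
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    # (1 - p_t)**gamma down-weights easy (well-classified) terms,
    # which are mostly the large number of -ve classes.
    return -(alpha_t * (1 - p_t) ** gamma * np.log(p_t)).sum()
```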

For a more in-depth treatment of this subject, you can refer to: https://towardsdatascience.com/cross-entropy-classification-losses-no-math-few-stories-lots-of-intuition-d56f8c7f06b0


The problem is that the probabilities are coming from a 'complicated' function that incorporates the other outputs into the given value. The outcomes are inter-connected, so we are not differentiating with respect to the actual outcome alone, but with respect to all the inputs of the last activation function (softmax), for each and every outcome.

I have found a very nice description here where the author shows that the actual derivative is $p_i - y_i$.

Another neat description can be found here.

I think that using a simple sigmoid as the last activation layer would lead to the approved answer, but using softmax gives a different answer.
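A quick numerical check of that $p_i - y_i$ claim - assuming the loss is cross-entropy applied to a softmax output and the derivative is taken with respect to the logits - can be done with finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.2, -1.0, 0.5, 1.5, -0.3])   # arbitrary logits
y = np.array([1.0, 0.0, 0.0, 0.0, 0.0])     # one-hot target

analytic = softmax(z) - y                   # the claimed derivative p_i - y_i
h = 1e-6
numeric = np.array([(ce_loss(z + h * e, y) - ce_loss(z - h * e, y)) / (2 * h)
                    for e in np.eye(len(z))])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```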


The cross entropy formula takes in two distributions, $p(x)$, the true distribution, and $q(x)$, the estimated distribution, defined over the discrete variable $x$ and is given by

$$H(p,q) = -\sum_{\forall x} p(x) \log(q(x))$$

For a neural network, the calculation is independent of the following:

  • What kind of layer was used.

  • What kind of activation was used - although many activations will not be compatible with the calculation because their outputs are not interpretable as probabilities (i.e., their outputs are negative, greater than 1, or do not sum to 1). Softmax is often used for multiclass classification because it guarantees a well-behaved probability distribution function.

For a neural network, you will usually see the equation written in a form where $\mathbf{y}$ is the ground truth vector and $\mathbf{\hat{y}}$ (or some other value taken direct from the last layer output) is the estimate. For a single example, it would look like this:

$$L = - \mathbf{y} \cdot \log(\mathbf{\hat{y}})$$

where $\cdot$ is the inner product.

Your example ground truth $\mathbf{y}$ gives all probability to the first value, and the other values are zero, so we can ignore them, and just use the matching term from your estimates $\mathbf{\hat{y}}$

$L = -(1\times \log(0.1) + 0 \times \log(0.5) + ...)$

$L = - \log(0.1) \approx 2.303$

An important point from comments

That means, the loss would be same no matter if the predictions are $[0.1, 0.5, 0.1, 0.1, 0.2]$ or $[0.1, 0.6, 0.1, 0.1, 0.1]$?

Yes, this is a key feature of multiclass logloss, it rewards/penalises probabilities of correct classes only. The value is independent of how the remaining probability is split between incorrect classes.
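A quick check of this (both prediction vectors give the same loss, because only the first entry is multiplied by a non-zero label):

```python
import numpy as np

y_true = np.array([1, 0, 0, 0, 0])
for y_hat in ([0.1, 0.5, 0.1, 0.1, 0.2], [0.1, 0.6, 0.1, 0.1, 0.1]):
    print(-np.sum(y_true * np.log(y_hat)))  # both print ~2.303
```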

You will often see this equation averaged over all examples as a cost function. It is not always strictly adhered to in descriptions, but usually a loss function is lower level and describes how a single instance or component determines an error value, whilst a cost function is higher level and describes how a complete system is evaluated for optimisation. A cost function based on multiclass log loss for a data set of size $N$ might look like this:

$$J = - \frac{1}{N}\left(\sum_{i=1}^{N} \mathbf{y_i} \cdot \log(\mathbf{\hat{y}_i})\right)$$
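A minimal NumPy sketch of such a cost function (the batch shapes and example values here are my own, for illustration):

```python
import numpy as np

def cross_entropy_cost(Y, Y_hat):
    """Mean multiclass log loss; Y and Y_hat have shape (N, num_classes),
    and each row of Y is a one-hot ground-truth vector."""
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=1))

Y = np.array([[1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0]])
Y_hat = np.array([[0.1, 0.5, 0.1, 0.1, 0.2],
                  [0.2, 0.6, 0.1, 0.05, 0.05]])
print(cross_entropy_cost(Y, Y_hat))  # mean of -ln(0.1) and -ln(0.6) ~ 1.41
```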

Many implementations will require your ground truth values to be one-hot encoded (with a single true class), because that allows for some extra optimisation. However, in principle the cross entropy loss can be calculated - and optimised - when this is not the case.


Let's start with understanding entropy in information theory: Suppose you want to communicate a string of letters "aaaaaaaa". You could easily do that as 8*"a". Now take another string "jteikfqa". Is there a compressed way of communicating this string? There isn't, is there? We can say that the entropy of the 2nd string is higher because, to communicate it, we need more "bits" of information.

This analogy applies to probabilities as well. If you have a set of items, fruits for example, the binary encoding of those fruits would take $\log_2(n)$ bits, where $n$ is the number of fruits. For 8 fruits you need 3 bits, and so on. Another way of looking at this is that, given the probability of someone selecting a fruit at random is 1/8, the uncertainty reduction when a fruit is selected is $-\log_{2}(1/8)$, which is 3. More specifically,

$$-\sum_{i=1}^{8}\frac{1}{8}\log_{2}(\frac{1}{8}) = 3$$ This entropy tells us about the uncertainty involved with certain probability distributions; the more uncertainty/variation in a probability distribution, the larger is the entropy (e.g. for 1024 fruits, it would be 10).
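A quick NumPy check of those numbers (3 bits for 8 equally likely fruits, 10 bits for 1024):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

print(entropy_bits([1 / 8] * 8))        # 3.0
print(entropy_bits([1 / 1024] * 1024))  # 10.0
```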

In "cross"-entropy, as the name suggests, we focus on the number of bits required to explain the difference in two different probability distributions. The best case scenario is that both distributions are identical, in which case the least amount of bits are required i.e. simple entropy. In mathematical terms,

$$H(\mathbf{y},\mathbf{\hat{y}}) = -\sum_{i}\mathbf{y}_i\log_{e}(\mathbf{\hat{y}}_i)$$

Where $\mathbf{\hat{y}}$ is the predicted probability vector (the Softmax output) and $\mathbf{y}$ is the ground-truth vector (e.g. one-hot). The reason we use the natural log is that it is easy to differentiate (ref. calculating gradients), and the reason we do not take the log of the ground-truth vector is that it contains a lot of 0's which simplify the summation.

Bottom line: In layman terms, one could think of cross-entropy as the distance between two probability distributions in terms of the amount of information (bits) needed to explain that distance. It is a neat way of defining a loss which goes down as the probability vectors get closer to one another.
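To see that behaviour, here is a small sketch where the predicted vector moves closer to the ground truth and the cross-entropy falls (the intermediate vectors are made up for illustration):

```python
import numpy as np

y = np.array([1, 0, 0, 0, 0])
for y_hat in ([0.2, 0.2, 0.2, 0.2, 0.2],
              [0.5, 0.2, 0.1, 0.1, 0.1],
              [0.9, 0.025, 0.025, 0.025, 0.025]):
    print(-np.sum(y * np.log(y_hat)))  # 1.61, 0.69, 0.11 - falls as y_hat approaches y
```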


Let's see how the gradient of the loss behaves... We have the cross-entropy as a loss function, which is given by

$$ H(p,q) = -\sum_{i=1}^n p(x_i) \log(q(x_i)) = -\left(p(x_1)\log(q(x_1)) + \ldots + p(x_n)\log(q(x_n))\right) $$

Going from here, we would like to know the derivative with respect to some $x_i$: $$ \frac{\partial}{\partial x_i} H(p,q) = -\frac{\partial}{\partial x_i} p(x_i)\log(q(x_i)), $$ since all the other terms cancel under differentiation. We can take this equation one step further to $$ \frac{\partial}{\partial x_i} H(p,q) = -p(x_i)\frac{1}{q(x_i)}\frac{\partial q(x_i)}{\partial x_i}. $$

From this we can see that we are still only penalizing the true classes (for which $p(x_i)$ is non-zero). Otherwise we just have a gradient of zero.

I do wonder how software packages deal with a predicted value of 0 while the true value is larger than zero, since we would be dividing by zero in that case.
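One common way implementations avoid this (a sketch of the general idea, not any particular library's code) is to clip the predicted probabilities away from 0 and 1 by a small epsilon before taking the log:

```python
import numpy as np

def safe_cross_entropy(y_true, y_pred, eps=1e-7):
    # Clip so log() never sees an exact 0 (or an exact 1 on the other side).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.sum(y_true * np.log(y_pred))

print(safe_cross_entropy(np.array([1, 0, 0]), np.array([0.0, 0.7, 0.3])))  # large but finite
```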


The answer from Neil is correct. However, I think it's important to point out that while the loss does not depend on the distribution between the incorrect classes (only on the distribution between the correct class and the rest), the gradient of this loss function does affect the incorrect classes differently depending on how wrong they are. So when you use cross-entropy in machine learning you will change weights differently for [0.1 0.5 0.1 0.1 0.2] and [0.1 0.6 0.1 0.1 0.1]. This is because the score of the correct class is normalized by the scores of all the other classes to turn it into a probability.
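Assuming the usual softmax-plus-cross-entropy setup, where the gradient with respect to the logits is $p - y$ (as derived in another answer above), a small sketch shows that the two prediction vectors produce different gradients even though their loss is identical:

```python
import numpy as np

y = np.array([1, 0, 0, 0, 0])
for p in (np.array([0.1, 0.5, 0.1, 0.1, 0.2]),
          np.array([0.1, 0.6, 0.1, 0.1, 0.1])):
    # Same loss (-ln(0.1)), but different gradients with respect to the logits.
    print(p - y)
```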
