What metrics work well on unbalanced datasets?

I wanted to know if there are metrics that work well when working with an unbalanced dataset. I know that accuracy is a very poor metric for evaluating a classifier when the data are unbalanced, but what about, for example, the Kappa index?

Best regards and thanks.

Topic: metric evaluation

Category: Data Science


Here is the answer I gave on the stats SE:

The choice of metric depends on the needs of the application, not the problems with the methods/tools.

Accuracy is not a very bad metric; the main problem is that practitioners fail to use the relative class frequencies to calibrate their expectations. If 95% of the data belong to the majority class and you get 94% accuracy, then of course that isn't very impressive. One way to get around this is to look at the accuracy gain, something like

$$\frac{\text{Accuracy} - \pi}{1 - \pi}$$

where $\pi$ is the relative frequency of the majority class. If you achieve perfect performance you get a score of 1; if you do only as well as the majority classifier you get a score of 0 (indicating that your model has probably learned nothing of interest from the attributes). In the example above, you would get a negative score, indicating that the classifier is useless. Note that this is an affine transformation of accuracy, so it still measures exactly the same thing, just on a more interpretable scale.
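As a quick illustration, here is a minimal Python sketch of that rescaling, using the hypothetical numbers from the example above (95% majority class, 94% raw accuracy):

```python
def accuracy_gain(accuracy, majority_freq):
    """Rescale accuracy so that the majority-class baseline scores 0
    and perfect classification scores 1."""
    return (accuracy - majority_freq) / (1.0 - majority_freq)

# Numbers from the example: 95% majority class, 94% raw accuracy.
print(accuracy_gain(0.94, 0.95))  # -0.2 -> worse than always predicting the majority class
print(accuracy_gain(1.00, 0.95))  #  1.0 -> perfect performance
print(accuracy_gain(0.95, 0.95))  #  0.0 -> no better than the majority classifier
```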

Imbalanced problems often have unequal misclassification costs, with false negatives usually being more costly than false positives, in which case you should probably look at the expected loss of the classifier rather than its accuracy. Again, this means focussing on the needs of the application rather than on the methods.
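For concreteness, here is a small sketch of computing the expected (average) loss from a hand-made cost matrix; the 10:1 cost ratio and the tiny label vectors are purely illustrative:

```python
import numpy as np

# Hypothetical cost matrix: rows = true class, columns = predicted class.
# Here a false negative is assumed to cost 10x as much as a false positive.
costs = np.array([[0.0,  1.0],   # true negative: correct = 0, false positive = 1
                  [10.0, 0.0]])  # true positive: false negative = 10, correct = 0

def expected_loss(y_true, y_pred, cost_matrix):
    """Average misclassification cost over a labelled sample."""
    return cost_matrix[y_true, y_pred].mean()

y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1])
print(expected_loss(y_true, y_pred, costs))  # (0 + 1 + 0 + 10 + 0) / 5 = 2.2
```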

However, for this sort of problem you should use a probabilistic classifier, such as [kernel] logistic regression, and look at metrics that measure the quality of the predicted probabilities, such as the cross-entropy (log loss) or the Brier score. Probabilistic classifiers are likely to be better because you can experiment with misclassification costs without refitting the model (and do things like implement a reject option). Once you have that as a baseline, you can experiment with non-probabilistic classifiers to see whether they offer any benefit.
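A minimal sketch of such a baseline, assuming scikit-learn and a synthetic imbalanced dataset (the sampling weights, model choice and cost values are illustrative assumptions, not prescriptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (~95% negatives) as a stand-in for real data.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain logistic regression as the probabilistic baseline.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_test = model.predict_proba(X_test)[:, 1]  # predicted P(y = 1)

print("cross-entropy (log loss):", log_loss(y_test, p_test))
print("Brier score:             ", brier_score_loss(y_test, p_test))

# With calibrated probabilities you can change the decision threshold to reflect
# misclassification costs without refitting, e.g. predict positive whenever
# P(y = 1) > cost_fp / (cost_fp + cost_fn).
```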
