You have $98\%$ in one class, right? This means that, knowing nothing about the data, you could get $98\%$ of the predictions right by always guessing that majority class. If your model gets $97\%$ of them right, that sounds like an $\text{A}$ in school and thus a good model, yet it does worse than blindly guessing the majority class every time!
Better yet, compare using proper scoring rules like log loss (cross-entropy) or Brier score, against a baseline model that always predicts the prior probability $P(y=1) = 0.02$. This is analogous to how $R^2$ works in linear regression, where the baseline always guesses the mean of the $y$ variable; in your case, the mean of the $y$ variable is the class ratio. If you can't beat the model that always guesses $P(y=1) = 0.02$, perhaps you have a poor model. (Specifics would depend on the misclassification costs, which you might or might not know.)
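A minimal sketch of that baseline comparison, assuming scikit-learn and NumPy are available; the arrays `y_true` and `y_prob` are hypothetical stand-ins for your labels and your model's predicted probabilities of class $1$:

```python
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss

rng = np.random.default_rng(0)
n = 10_000
y_true = (rng.random(n) < 0.02).astype(int)                 # ~2% positives
y_prob = np.clip(rng.beta(1, 40, size=n), 1e-6, 1 - 1e-6)   # stand-in model output

# Baseline: always predict the class ratio, i.e. the prior P(y=1).
prior = np.full(n, y_true.mean())

print("model    log loss:", log_loss(y_true, y_prob))
print("baseline log loss:", log_loss(y_true, prior))
print("model    Brier   :", brier_score_loss(y_true, y_prob))
print("baseline Brier   :", brier_score_loss(y_true, prior))
```

If the model's scores are not lower (better) than the baseline's, its predicted probabilities carry no more information than the class ratio alone.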
$$
\text{Log Loss}\\
L(y, \hat y) = -\frac{1}{N}\sum_{i = 1}^N \bigg( y_i\log(\hat y_i) + (1 - y_i)\log(1 - \hat y_i) \bigg)
$$
$$
\text{Brier Score}\\
L(y, \hat y) = \frac{1}{N}\sum_{i = 1}^N \big( y_i - \hat y_i \big)^2
$$
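Direct NumPy translations of the two formulas, under the same assumptions as below ($y_i \in \{0, 1\}$ and $\hat y_i$ a predicted probability); the function names are just illustrative:

```python
import numpy as np

def log_loss_manual(y, y_hat):
    """Mean negative Bernoulli log-likelihood of labels y given probabilities y_hat."""
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def brier_score_manual(y, y_hat):
    """Mean squared difference between labels y and probabilities y_hat."""
    return np.mean((y - y_hat) ** 2)
```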
This assumes your $y_i\in\{0, 1\}$. If you use $y_i\in\{-1, 1\}$, you would have to modify the loss functions or change how you label your categories. The $\hat y_i$ values are probabilities. Note that the log loss blows up if you predict a probability of exactly $0$ or $1$ and turn out to be wrong, since $\log(0)$ is undefined (the penalty is infinite, as the sketch below shows). Some see this infinite penalty for confident wrong predictions as an upside of log loss, while others see it as a downside.
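A quick illustration of that edge case, again with hypothetical data; the clipping constant `eps` is a common workaround, a judgment call rather than part of the formula:

```python
import numpy as np

y = np.array([1, 0])
y_hat = np.array([0.0, 1.0])   # both predictions confidently wrong

with np.errstate(divide="ignore"):
    raw = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(raw)                      # inf: log(0) makes the penalty infinite

eps = 1e-15                     # clip probabilities away from the endpoints
y_hat_clipped = np.clip(y_hat, eps, 1 - eps)
clipped = -np.mean(y * np.log(y_hat_clipped) + (1 - y) * np.log(1 - y_hat_clipped))
print(clipped)                  # large but finite (~34.5)
```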
This kind of evaluation of the predicted probabilities, rather than of the hard classifications, is why statisticians tend not to see class imbalance as a problem in and of itself.