How is the F1 score useful with an imbalanced dataset?

I have read on this site that the F1 score is recommended when the dataset is imbalanced and you want to strike a balance between recall and precision. Could you please explain how the F1 score is useful for imbalanced datasets?

Topic f1score class-imbalance

Category Data Science


F1-score


The formula for F1-score is:

\begin{align*} F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \end{align*}

The F1-score is the harmonic mean of precision and recall, so the two metrics contribute equally to it. The F1-score reaches its best value at $1$ and its worst at $0$.
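To see why the harmonic mean matters, here is a minimal Python sketch (the precision/recall values are made up for illustration): the F1-score stays low whenever either precision or recall is low, whereas a plain arithmetic average would not.

```python
# Minimal sketch: the F1-score (harmonic mean) penalises a large gap between
# precision and recall far more than a simple arithmetic average would.
def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))  # 0.9  -> both metrics high, F1 high
print(f1(0.9, 0.1))  # 0.18 -> arithmetic mean would be 0.5, F1 stays low
print(f1(0.5, 0.5))  # 0.5
```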

What we are trying to achieve with the F1-score is an equal balance between precision and recall, which is especially useful when working with imbalanced datasets (i.e., datasets with a non-uniform distribution of class labels), because a metric like accuracy can look good simply by favouring the majority class.
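As a quick sanity check, here is a sketch with hypothetical labels (assuming scikit-learn is available): on a 95/5 imbalanced dataset, a classifier that always predicts the majority class scores 95% accuracy, yet its F1-score for the minority class is 0.

```python
# Minimal sketch with hypothetical labels: always predicting the majority
# (negative) class looks great by accuracy but is exposed by the F1-score
# of the minority (positive) class.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 95 + [1] * 5   # 95 negatives, 5 positives
y_pred = [0] * 100            # classifier that always predicts "negative"

print(accuracy_score(y_true, y_pred))             # 0.95 -- looks impressive
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- no positive ever found
```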

For example, if we write the two metrics PRE (precision) and REC (recall) in terms of True Positives (TP), False Positives (FP), and False Negatives (FN), we get:

\begin{align*} PRE = \frac{TP}{TP + FP} \end{align*}

\begin{align*} REC = \frac{TP}{TP + FN} \end{align*}

Thus, in a spam-filter setting, the precision score (ranging from 0.0 to 1.0, from bad to good) tells us what proportion of the emails we classified as spam (TP + FP) actually are spam (TP). In contrast, the recall (also ranging from 0.0 to 1.0) tells us what proportion of all the actual spam emails (TP + FN) we "retrieved" or "recalled" (TP).
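To make this concrete, here is a small sketch with made-up spam-filter counts, plugging TP, FP and FN into the formulas above:

```python
# Minimal sketch with made-up spam-filter counts, following the formulas above.
TP, FP, FN = 40, 10, 20  # hypothetical confusion-matrix counts

precision = TP / (TP + FP)  # 40/50 = 0.80: 80% of flagged emails really were spam
recall = TP / (TP + FN)     # 40/60 ~ 0.67: 67% of the actual spam was caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```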

When we build a classifier, we often have to trade recall against precision, and it is hard to compare a model with high recall and low precision against one with high precision and low recall. The F1-score merges the two metrics into a single measure that we can use to compare models. This is not to say that the model with the higher F1-score is always better; that depends on the use case, and for imbalanced classification problems it is often recommended to also look at the precision and recall scores individually to fully evaluate a model.

A model with high recall but low precision returns many positive results, but most of its predicted labels are incorrect when compared to the ground truth. On the other hand, a model with high precision but low recall returns very few results, but most of its predicted labels are correct. The ideal would be a model with both high precision and high recall, returning many results with all of them labelled correctly. Unfortunately, precision and recall are often in tension: improving one typically reduces the other.
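One way to see this tension is to sweep the decision threshold of a scoring classifier. Here is a minimal sketch with made-up scores: a higher threshold makes the positive predictions more precise but misses more of the actual positives.

```python
# Minimal sketch with made-up scores: raising the decision threshold
# increases precision but decreases recall.
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.55, 0.6, 0.7, 0.75, 0.9, 0.95])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(threshold, round(precision, 2), round(recall, 2))
# 0.3 0.56 1.0   |   0.5 0.57 0.8   |   0.7 0.75 0.6
```

Which point on this trade-off (and therefore which F1-score) is acceptable ultimately depends on the relative cost of false positives and false negatives in your application.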
