Why is the F-measure preferred for classification tasks?

Why is the F-measure usually used for (supervised) classification tasks, whereas the G-measure (or Fowlkes–Mallows index) is generally used for (unsupervised) clustering tasks?

The F-measure is the harmonic mean of the precision and recall.

The G-measure (or Fowlkes–Mallows index) is the geometric mean of the precision and recall.

Below is a plot of the different means.

F1 (harmonic) $= 2\cdot\frac{\text{precision}\cdot \text{recall}}{\text{precision} + \text{recall}}$

Geometric $= \sqrt{\text{precision}\cdot \text{recall}}$

Arithmetic $= \frac{\text{precision} + \text{recall}}{2}$

The reason I ask is that I need to decide which average to use in an NLG task where I measured BLEU and ROUGE (treating BLEU as roughly analogous to precision and ROUGE to recall). How should I calculate the mean of these scores?



The scoring function is used as an objective measure of performance. The choice of scoring function itself is subjective and should reflect what you or the problem deems to be important in terms of the balance between whatever metrics you are tracking (e.g., precision & recall, or sensitivity & specificity, or BLEU & ROUGE).

Arithmetic mean, geometric mean, and harmonic mean are all special cases of the generalized (power) mean family, so they are conceptually related. For your task, the arithmetic mean is indifferent to how the total is split between BLEU and ROUGE: increasing one value and decreasing the other by the same amount makes no difference. The geometric and harmonic means both penalize differences between BLEU and ROUGE, with the harmonic mean being more "pessimistic" than the geometric mean. This can be seen in your plot, where the arithmetic curve sits above the geometric curve, and the harmonic curve is at the bottom.

Using the generalized mean, you could equally have chosen a curve above the arithmetic one, below the harmonic one, or anywhere in between. There is no inherent reason why the harmonic or geometric mean is more meaningful; they just have simple formulas. Pick whichever most closely matches how you value the trade-off between BLEU and ROUGE. Equally, you may decide not to use any mean from the generalized mean family at all.
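To make the relationship between the three means concrete, here is a minimal Python sketch of the two-value power mean, $M_p(x, y) = \left(\frac{x^p + y^p}{2}\right)^{1/p}$, where $p = 1$ gives the arithmetic mean, the limit $p \to 0$ gives the geometric mean, and $p = -1$ gives the harmonic mean. The function name `power_mean` and the BLEU/ROUGE values are made up for illustration, and the sketch assumes both scores are strictly positive.

```python
import math


def power_mean(x, y, p):
    """Generalized (power) mean of two strictly positive numbers.

    p = 1 is the arithmetic mean, p = -1 the harmonic mean,
    and p = 0 is handled as the limiting case, the geometric mean.
    """
    if p == 0:
        # The limit of the power mean as p -> 0 is the geometric mean.
        return math.sqrt(x * y)
    return ((x ** p + y ** p) / 2) ** (1 / p)


# Hypothetical scores, chosen only for illustration.
bleu, rouge = 0.30, 0.60

arithmetic = power_mean(bleu, rouge, 1)   # (x + y) / 2
geometric = power_mean(bleu, rouge, 0)    # sqrt(x * y)
harmonic = power_mean(bleu, rouge, -1)    # 2xy / (x + y), the F1 form

print(f"arithmetic={arithmetic:.3f}")
print(f"geometric={geometric:.3f}")
print(f"harmonic={harmonic:.3f}")
```

Any other value of `p` gives another member of the family: values below -1 punish imbalance even harder than the harmonic mean, and values above 1 reward the larger of the two scores.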


If Precision and Recall are similar, F1 is a good single measure to compare different models.

Short and sweet :)


The $F_1$-score is preferred to simple classification accuracy in order to counter the problem of imbalanced datasets; if the thing you are looking for occurs only rarely anyway then a naive classifier can always say no and appear to be working very well! A variant on $F_1$ is $F_\beta$, where

$F_\beta = (1+\beta^2)\cdot\frac{P\cdot R}{\beta^2\cdot P + R}$

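Vary β to shift the balance between precision and recall: β > 1 gives more weight to recall, β < 1 gives more weight to precision. As to why F is used for classification and G for clustering, I believe it to be empirical. You don't say whether you are classifying or clustering in your own application?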
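As a rough illustration of both points, the Python sketch below scores an "always say no" classifier on a made-up dataset with 5 positives in 100 examples, and then evaluates the F-beta formula for a few values of beta. The helper name `f_beta`, the data, and the convention of returning 0 when precision or recall is undefined are all assumptions made for this example.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score from precision and recall.

    Returns 0.0 when the denominator is zero (an assumed convention).
    """
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Hypothetical imbalanced dataset: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# Naive classifier that always predicts the negative class.
y_pred = [0] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Precision is undefined with no positive predictions; treated as 0 here.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f}, F1={f_beta(precision, recall):.2f}")

# Effect of beta for a fixed, made-up precision/recall pair.
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_beta(0.5, 0.8, beta):.3f}")
```

The naive classifier gets high accuracy but an F1 of zero, because it never finds a single positive. In the loop, recall is the higher of the two values, so the score rises as beta grows and recall is weighted more heavily.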
