Why is the F-measure preferred for classification tasks?
Why is the F-measure usually used for (supervised) classification tasks, whereas the G-measure (or Fowlkes–Mallows index) is generally used for (unsupervised) clustering tasks?
The F-measure is the harmonic mean of the precision and recall.
The G-measure (or Fowlkes–Mallows index) is the geometric mean of the precision and recall.
Below is a plot of the different means.
F1 (harmonic) $= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
Geometric $= \sqrt{\text{precision} \cdot \text{recall}}$
Arithmetic $= \frac{\text{precision} + \text{recall}}{2}$
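For concreteness, here is a minimal Python sketch (function names and plotting choices are my own) that computes the three means and reproduces a comparison plot like the one described above:

```python
import numpy as np
import matplotlib.pyplot as plt

def harmonic_mean(p, r):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def geometric_mean(p, r):
    """G-measure / Fowlkes-Mallows index: geometric mean of precision and recall."""
    return np.sqrt(p * r)

def arithmetic_mean(p, r):
    """Arithmetic mean of precision and recall."""
    return (p + r) / 2

# Fix recall and vary precision to see how the three means diverge:
# the harmonic mean is dragged down hardest by the smaller of the two values.
recall = 0.5
precision = np.linspace(0.01, 1.0, 100)

plt.plot(precision, harmonic_mean(precision, recall), label="F1 (harmonic)")
plt.plot(precision, geometric_mean(precision, recall), label="Geometric")
plt.plot(precision, arithmetic_mean(precision, recall), label="Arithmetic")
plt.xlabel("precision")
plt.ylabel("mean (recall fixed at 0.5)")
plt.legend()
plt.show()
```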
The reason I ask is that I need to decide which average to use in an NLG task, where I measured BLEU and ROUGE (BLEU being precision-oriented and ROUGE recall-oriented). How should I calculate the mean of these scores? A sketch of one option follows.
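For example, if the harmonic mean were the right choice, combining the two scores would look like the snippet below (the BLEU and ROUGE values are made up for illustration):

```python
# Hypothetical precomputed scores for one generated text (made-up values).
bleu = 0.42   # precision-oriented n-gram overlap
rouge = 0.58  # recall-oriented n-gram overlap

# Harmonic mean, analogous to F1; it penalizes imbalance between the two scores.
combined = 2 * bleu * rouge / (bleu + rouge)
print(f"Combined score: {combined:.3f}")  # Combined score: 0.487
```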
Tags: nlg, metric, scoring, evaluation, machine-learning
Category: Data Science