Precision, recall and/or F1: which should I use? Or something different?

I am trying to use TensorFlow to predict a decision based on a time-series dataset.

I have three classes: Wait, Forwards, Backwards.

The dataset is highly imbalanced: about 90% of the time the decision is to Wait, so accuracy is not a useful metric.

I need my model to focus on correctly identifying patterns that are either Forwards or Backwards, so I have implemented the following metrics to track precision and recall for the classes I consider relevant.

metrics=[
    tf.keras.metrics.Recall(class_id=1, name='Bkwd_R'),
    tf.keras.metrics.Recall(class_id=2, name='Fwd_R'),
    tf.keras.metrics.Precision(class_id=1, name='Bkwd_P'),
    tf.keras.metrics.Precision(class_id=2, name='Fwd_P'),
]

My understanding is that these are calculated per class:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
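As a rough illustration of what "per class" means here, the sketch below computes precision and recall in plain Python by treating one class as positive and everything else as negative. The label encoding (0 = Wait, 1 = Backwards, 2 = Forwards) is an assumption inferred from the metric names in the question, and the helper `precision_recall` and the sample labels are made up for the example; this is not how Keras computes the metrics internally.

```python
def precision_recall(y_true, y_pred, class_id):
    """Return (precision, recall), treating `class_id` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == class_id and p == class_id)
    fp = sum(1 for t, p in pairs if t != class_id and p == class_id)
    fn = sum(1 for t, p in pairs if t == class_id and p != class_id)
    # Guard against division by zero when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Assumed encoding: 0 = Wait, 1 = Backwards, 2 = Forwards (illustrative data only)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 2, 1, 0, 2, 1]

bkwd_p, bkwd_r = precision_recall(y_true, y_pred, class_id=1)
fwd_p, fwd_r = precision_recall(y_true, y_pred, class_id=2)
print(f"Backwards: precision={bkwd_p:.3f}, recall={bkwd_r:.3f}")
print(f"Forwards:  precision={fwd_p:.3f}, recall={fwd_r:.3f}")
```

Note that the Wait predictions only enter these numbers as false positives or false negatives for the class of interest, which is why the metrics are not swamped by the majority class the way accuracy is.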

I know the formula for F1, but I don't really understand what it represents, so I am not sure whether I should use it:

F1 Score = 2*(Recall * Precision) / (Recall + Precision)

Or should I be using some other type of metric?
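To make the F1 formula above a little more concrete, here is a small sketch comparing it with a plain arithmetic mean. The function name `f1_score` and the input values are chosen only for illustration. The point it tries to show is that F1 stays low unless precision and recall are both reasonably high.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Same arithmetic mean (0.5) in both cases, very different F1.
balanced = f1_score(0.5, 0.5)
lopsided = f1_score(0.9, 0.1)
arithmetic_mean = (0.9 + 0.1) / 2

print(f"balanced F1:     {balanced:.2f}")
print(f"lopsided F1:     {lopsided:.2f}")
print(f"arithmetic mean: {arithmetic_mean:.2f}")
```

With precision 0.9 and recall 0.1 the arithmetic mean is 0.5, but F1 is about 0.18, so a model cannot score well on F1 by trading one quantity away entirely for the other.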

For my predictions, the focus is to correctly identify Forwards or Backwards amongst the noise of Waits.

It would be costly to misidentify Backwards as Forwards or vice versa, but not so costly to have either identified as Wait, or to have a Wait identified as either of the other two.
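One way to keep an eye on the costly mistakes described above is to read them directly off a confusion matrix. The sketch below is only an illustration under assumptions: the encoding 0 = Wait, 1 = Backwards, 2 = Forwards is inferred from the metric names, and the helper `confusion_matrix` and the sample labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Assumed encoding (illustrative data only)
WAIT, BACKWARDS, FORWARDS = 0, 1, 2
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 2, 1, 0, 2, 1]

cm = confusion_matrix(y_true, y_pred, n_classes=3)

# Costly errors: Backwards predicted as Forwards, or Forwards predicted as Backwards.
swaps = cm[BACKWARDS][FORWARDS] + cm[FORWARDS][BACKWARDS]
# Cheaper errors: anything confused with Wait, in either direction.
wait_confusions = (cm[WAIT][BACKWARDS] + cm[WAIT][FORWARDS]
                   + cm[BACKWARDS][WAIT] + cm[FORWARDS][WAIT])

print("confusion matrix:", cm)
print("Backwards/Forwards swaps:", swaps)
print("confusions involving Wait:", wait_confusions)
```

Per-class precision and recall do not distinguish a Forwards/Backwards swap from a confusion with Wait, so tracking the swap count separately may be a useful complement.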

Topic keras tensorflow time-series python machine-learning

Category Data Science


"I know the formula for F1, but I don't really understand what it represents, so I am not sure whether I should use it."

The F1 score is the harmonic mean of precision and recall, so it is high only when both are high. In a two-class problem it is straightforward to calculate: you assign one of the two classes as positive and the other as negative, calculate precision and recall on that basis, and then compute the F1 score. In multiclass problems there are different approaches to calculating the F1 score, which I briefly described in this answer.
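Two commonly used multiclass approaches are macro averaging (compute F1 per class, then average) and micro averaging (pool the TP/FP/FN counts over classes, then compute one F1). The sketch below is a plain-Python illustration with made-up labels and hypothetical helper names; the encoding 0 = Wait, 1 = Backwards, 2 = Forwards is an assumption. It may or may not match the approaches covered in the linked answer.

```python
def count_errors(y_true, y_pred, class_id):
    """Return (tp, fp, fn) with `class_id` treated as the positive class."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == class_id and t == class_id:
            tp += 1
        elif p == class_id:
            fp += 1
        elif t == class_id:
            fn += 1
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic-mean form."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Assumed encoding: 0 = Wait, 1 = Backwards, 2 = Forwards (illustrative data only)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 2, 1, 0, 2, 1]

counts = [count_errors(y_true, y_pred, c) for c in range(3)]

# Macro: average the per-class F1 scores, every class weighted equally.
macro_f1 = sum(f1_from_counts(*c) for c in counts) / len(counts)
# Micro: sum TP, FP and FN over classes first, then compute a single F1.
micro_f1 = f1_from_counts(*(sum(col) for col in zip(*counts)))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"macro F1: {macro_f1:.3f}")
print(f"micro F1: {micro_f1:.3f}")
print(f"accuracy: {accuracy:.3f}")
```

When every sample has exactly one label and all classes are included, micro-averaged F1 equals accuracy, so on a dataset dominated by one class it inherits the same weakness.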

"Or should I be using some other type of metric? For my predictions, the focus is to correctly identify Forwards or Backwards amongst the noise of Waits."

Because the misclassification costs are unequal, I would not recommend using an F1 score across all three classes. Instead, you could measure precision and recall separately for "Forwards" and "Backwards". Alternatively, define an F1 score over just these two classes to get a single overall measure.
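One possible reading of "an F1 score just for these two classes" is to compute F1 for Backwards and for Forwards separately and average the two, leaving Wait out. The sketch below does that from a dictionary of metric values keyed by the names used in the question ('Bkwd_P', 'Bkwd_R', 'Fwd_P', 'Fwd_R'). The numbers are invented, and the helpers `f1_from` and `two_class_f1` are hypothetical names, not part of any library.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def two_class_f1(metrics):
    """Average of the Backwards and Forwards F1 scores, ignoring Wait."""
    bkwd_f1 = f1_from(metrics["Bkwd_P"], metrics["Bkwd_R"])
    fwd_f1 = f1_from(metrics["Fwd_P"], metrics["Fwd_R"])
    return (bkwd_f1 + fwd_f1) / 2


# Made-up values standing in for the metrics reported after evaluation.
results = {"Bkwd_P": 0.5, "Bkwd_R": 0.25, "Fwd_P": 0.8, "Fwd_R": 0.4}
score = two_class_f1(results)
print(f"F1 over Backwards and Forwards: {score:.3f}")
```

Averaging per-class F1 gives both minority classes equal weight. Pooling the TP/FP/FN counts of the two classes before computing F1 is another option, but that needs the raw counts rather than the already-computed precision and recall.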

More generally, "Imbalanced Datasets: From Sampling to Classifiers" by Hoens and Chawla might be worth reading to better understand which measures are suitable for imbalanced datasets and which are commonly used.
