Evaluation Metric for Imbalanced and Ordinal Classification

I'm looking for an ML evaluation metric that would work well with imbalanced and ordinal multiclass datasets:

Imagine you want to predict the severity of a disease that has 4 grades of severity, where 1 is mild and 4 represents the worst outcome. Realistically, such a dataset would have the vast majority of patients in the mild range (classes 1 or 2) and far fewer in classes 3 and 4 (an imbalanced/skewed dataset).

Now, in this example, a classifier that predicts a grade 4 as grade 1 should be penalised more than one that predicts a grade 4 as grade 3 (ordinal classes).

If I use MCC, Cohen's kappa, etc., I can account for the imbalance in the dataset but not for the ordinal nature of its classes. Is there a metric that accounts for both, or a way to modify/combine metrics so that both aspects of the dataset are taken into account? (Python preferred, but other languages or a mathematical explanation would also work.)

Topic model-evaluations multiclass-classification class-imbalance evaluation scikit-learn

Category Data Science


There are variations of Cohen's kappa that are meant to be applied to ordinal scales. scikit-learn's cohen_kappa_score has a weights option with "linear" and "quadratic" settings.
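A minimal sketch of the weighted kappa, using made-up toy labels for the four severity grades; with weights="quadratic", disagreements are penalised by the squared distance between grades, so a 4 predicted as 1 costs much more than a 4 predicted as 3:

```python
from sklearn.metrics import cohen_kappa_score

# Toy severity grades (1=mild ... 4=worst); values are illustrative only
y_true = [1, 1, 2, 2, 3, 4, 4]
y_pred = [1, 2, 2, 1, 4, 3, 1]

# weights="quadratic" -> quadratic-weighted kappa, sensitive to class order
kappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(kappa)
```

weights="linear" instead penalises disagreements proportionally to the absolute distance between grades, which is less harsh on large errors.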


I would consider using a regression evaluation measure such as RMSE:

  • It takes into account the ordinal nature of the values since it's based on the error between the predicted and true value. This means that an error between 2 and 4 is penalized more than an error between 2 and 3 for instance.
  • It would also partially account for the imbalance, since errors on the rare severe classes (e.g. predicting 1 for a true 4) tend to be large, but it does not reweight the classes explicitly, so it might not be perfect.
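The RMSE idea above can be sketched by simply treating the grade labels as numbers; the toy labels are made up for illustration:

```python
import numpy as np

# Toy severity grades (1=mild ... 4=worst); values are illustrative only
y_true = np.array([1, 1, 2, 2, 3, 4, 4])
y_pred = np.array([1, 2, 2, 1, 4, 3, 1])

# RMSE over the grade labels: a 4-vs-1 error contributes 9x as much
# to the mean squared error as an adjacent-grade error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)  # sqrt(13/7) ~ 1.363
```

Equivalently, MAE (mean absolute error) over the labels penalises errors linearly in the grade distance rather than quadratically.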

If more refinement is needed, a solution would be to define a custom error measure, i.e. define the penalty for every possible error. For example:

  • errors 1->2, 2->1, 2->3, 3->4 counted as 1 (small error)
  • errors 1->3, 2->4, 3->2, 4->3 counted as 5 (moderate error)
  • errors 1->4, 4->2, 3->1 counted as 10 (serious error)
  • errors 4->1 counted as 20 (critical error)

In the notation above, "x->y" means true value x wrongly predicted as y.

Of course these values are made up; the point is that one can go as fine-grained as needed depending on the task.
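A sketch of this custom measure as a cost matrix, filled in with exactly the example penalties above (diagonal = 0 for correct predictions); the helper name mean_penalty and the toy labels are my own:

```python
import numpy as np

# COST[i, j] = penalty for true grade i+1 predicted as grade j+1,
# using the example penalties from the answer (rows = true, cols = predicted)
COST = np.array([
    [ 0,  1,  5, 10],  # true 1
    [ 1,  0,  1,  5],  # true 2
    [10,  5,  0,  1],  # true 3
    [20, 10,  5,  0],  # true 4
])

def mean_penalty(y_true, y_pred):
    """Average custom penalty; grades are 1-based, so shift to 0-based indices."""
    return COST[np.asarray(y_true) - 1, np.asarray(y_pred) - 1].mean()

# Toy labels for illustration
y_true = [1, 1, 2, 3, 4, 4]
y_pred = [1, 2, 2, 4, 3, 1]
print(mean_penalty(y_true, y_pred))  # (0+1+0+1+5+20)/6 = 4.5
```

Lower is better; unlike kappa, this score has no fixed scale, so it is mainly useful for comparing classifiers on the same dataset and cost matrix.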
