ROC-AUC Imbalanced Data Score Interpretation

I have a binary response variable (label) in a dataset with around 50,000 observations.

The training set is somewhat imbalanced, with =1 making up about 33% of the observations and =0 making up about 67%. Right now, with XGBoost, I'm getting a ROC-AUC score of around 0.67.

The response variable is binary, so the baseline is 50% in terms of chance, but at the same time the data is imbalanced, so if the model just guessed =0 it would also achieve a ROC-AUC score of 0.67. Does this indicate the model isn't doing better than chance at 0.67?

Tags: binary-classification, xgboost, roc, class-imbalance



"if the model just guessed =0 it would also achieve a ROC-AUC score of 0.67."

This is incorrect. The ROC curve is defined by varying a decision threshold, so it requires a probability or other confidence score, not just a hard class prediction. Guessing a single class for every point corresponds to a single point in ROC space (the top-right or bottom-left corner), which by itself says little about the AUC; the equivalent constant-score classifier traces the diagonal and gets an AUC of 0.5, not 0.67.
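A minimal sketch of that point, assuming scikit-learn and NumPy (not part of the original post): always predicting the majority class does give ~67% accuracy on this class balance, but the corresponding constant score carries no ranking information, so ROC-AUC comes out at 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.33, size=50_000)            # ~33% positives, as in the question

constant_scores = np.zeros_like(y, dtype=float)   # "always guess =0"
hard_predictions = (constant_scores > 0.5).astype(int)

print(accuracy_score(y, hard_predictions))        # ~0.67, matches the class balance
print(roc_auc_score(y, constant_scores))          # 0.5, no better than chance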

One interpretation of AUROC is "the probability that, given a random positive instance and a random negative instance, the predicted probability (or confidence) of the positive instance is higher than that of the negative instance". Stated that way, it's clear that a classifier that predicts random probabilities for every instance will have an expected AUC of 0.5, regardless of class balance.
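A quick check of that pairwise interpretation, again my own illustration with NumPy/scikit-learn: computing the fraction of (positive, negative) pairs where the positive instance gets the higher score matches roc_auc_score, and random scores land near 0.5 despite the 33%/67% split.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = rng.binomial(1, 0.33, size=2_000)
scores = rng.random(y.shape)                      # random "confidences"

pos = scores[y == 1]
neg = scores[y == 0]
# Probability a random positive outranks a random negative (ties count as half).
pairwise = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

print(pairwise)                                   # ~0.5
print(roc_auc_score(y, scores))                   # same value, ~0.5
```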
