Can precision-recall be improved for an imbalanced sample?

I tried out a few models on a highly imbalanced sample (~2:100). On the test sample I get a decent AUC from the ROC curve, but when I plot precision-recall on the same test sample it looks horrible, roughly like the worst PR curve in box (d).

This article is where the picture below comes from; it argues that ROC is better suited because it is invariant to the class distribution.

My question is whether there is anything that can be done to improve precision-recall.
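
For reference, here is a rough sketch of how I compare the two summaries. It uses a synthetic dataset from make_classification with about 2% positives, not my actual data or models, just to show that ROC AUC can look fine while average precision (the PR-curve summary) is much lower:

    # Synthetic imbalanced data (~2% positives); placeholder for the real sample
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, average_precision_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=20000, n_features=20,
                               weights=[0.98, 0.02], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]

    # ROC AUC is insensitive to the class ratio; average precision is not
    print("ROC AUC:          ", roc_auc_score(y_test, scores))
    print("Average precision:", average_precision_score(y_test, scores))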


Probably yes. If you use logistic regression with L2 regularization, try increasing C from 1 (the scikit-learn default) by powers of 10, e.g. from 1 to 10, 100, 1000, and so on.

C is the inverse of the regularization strength, so increasing it weakens the regularization: bias goes down and variance goes up. In other words, push it too far and you will probably overfit, so check the precision-recall curve on a held-out set as you go.
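
A minimal sketch of that sweep, assuming scikit-learn and train/test splits named X_train, X_test, y_train, y_test (placeholder names, not from the question):

    # Sweep C upward by powers of 10 and watch the PR-curve summary
    # (average precision) on the held-out split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score

    for C in [1, 10, 100, 1000]:
        clf = LogisticRegression(penalty="l2", C=C, max_iter=1000)
        clf.fit(X_train, y_train)                 # assumed training split
        scores = clf.predict_proba(X_test)[:, 1]  # assumed test split
        print(f"C={C:>5}: average precision = "
              f"{average_precision_score(y_test, scores):.3f}")

If average precision keeps rising and then drops as C grows, you have found the point where weaker regularization stops helping and overfitting starts.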


I am not sure whether you are asking how to improve the power of the models or how to better represent their effectiveness on the minority class. Precision and recall are information retrieval measures, and they are only sometimes appropriate outside that setting. If you are working on an information retrieval task, they are probably fine. If it is more of a statistical analysis, sensitivity (which is the same as recall) and specificity may be better metrics for describing your results; perhaps your model is useful precisely because it can rule out negatives. These metrics can also be judged relative to the prior probability of the minority class in the data.

Sensitivity and specificity have an additional benefit if you need to communicate your results to an audience that reviews a lot of medical test results, since those are the measures used most frequently in that field.
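
As a small sketch of how you might report them, assuming you already have test labels y_test and binary predictions y_pred (placeholder names, not from the question):

    # Sensitivity and specificity from the binary confusion matrix
    from sklearn.metrics import confusion_matrix

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # same as recall: positives correctly caught
    specificity = tn / (tn + fp)   # how well the model rules out negatives
    print(f"Sensitivity: {sensitivity:.3f}  Specificity: {specificity:.3f}")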

HTH
