Determining a threshold in a region with very few positive-label samples

I have a binary classification task where I want to either keep or discard samples. I have about a million samples, of which about 1% should be kept. I want to discard as many samples as possible, but discarding the wrong one carries a heavy penalty. I have concluded that I want to maximize something like the following:

n_discards - n_false_discards * penalty

where I expect penalty to be around 5000.

Now, it's easy enough to sweep thresholds over my validation data (around 100k samples) and find where this value is maximized. However, the threshold this returns is quite unstable. The high penalty means the optimal threshold lies in a region with very few false discards, but their very scarcity makes it hard to estimate exactly where the threshold should be. This leads to a lot of threshold variance between folds and, more importantly, between what validation suggests and what is right for the test set.
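For reference, the sample-by-sample sweep described above might look like the sketch below. The data here is synthetic (the score distributions and class ratio are assumptions standing in for the real validation set), but the objective is the one from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a 100k validation set: y_val == 1 means "keep" (~1%),
# scores are model probabilities of the keep class. Both are assumptions,
# not the OP's actual data.
n = 100_000
y_val = (rng.random(n) < 0.01).astype(int)
scores = np.clip(rng.normal(0.2 + 0.6 * y_val, 0.15, n), 0.0, 1.0)

PENALTY = 5000

def objective(threshold):
    """n_discards - penalty * n_false_discards when discarding below threshold."""
    discard = scores < threshold
    n_discards = discard.sum()
    n_false_discards = (discard & (y_val == 1)).sum()
    return n_discards - PENALTY * n_false_discards

# Naive grid sweep: pick the threshold that maximizes the empirical objective.
thresholds = np.linspace(0.0, 1.0, 201)
values = np.array([objective(t) for t in thresholds])
best = thresholds[values.argmax()]
```

The instability the question describes comes from `n_false_discards` being a count over a handful of samples in the left tail: moving the threshold past a single rare keep-sample changes the objective by 5000.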

I feel like there must be a better way than going on a sample-by-sample basis, but I haven't found the right one. I've considered fitting distributions from scipy.stats and using weighted CDFs to optimize the objective above, but I haven't been able to get these fits to work very well. I've looked into sklearn.isotonic, and I've tried some heuristics of my own (linear probability-based interpolation between the false discards to approximate the expected number of false discards at a given probability). So far I'm not really satisfied with any of the outcomes.
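For concreteness, the scipy.stats idea mentioned above might be sketched as follows: fit a parametric distribution to the scarce keep-class scores and replace the raw false-discard count with its expected value under the fitted CDF. The beta distribution and the synthetic score distributions are assumptions for illustration, not a claim about what fits the OP's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins: ~1% keep-class scores and discard-class scores,
# clipped into (0, 1) so a beta distribution can be fitted.
keep_scores = np.clip(rng.normal(0.8, 0.1, 1_000), 1e-6, 1 - 1e-6)
discard_scores = np.clip(rng.normal(0.2, 0.1, 99_000), 1e-6, 1 - 1e-6)

PENALTY = 5000

# Fit a beta distribution to the scarce keep-class scores; its CDF gives a
# smooth estimate of the false-discard count in the sparse left tail.
a, b, loc, scale = stats.beta.fit(keep_scores, floc=0, fscale=1)

def smoothed_objective(t):
    # Expected false discards from the fitted CDF instead of a raw count.
    expected_false = len(keep_scores) * stats.beta.cdf(t, a, b)
    n_discards = (discard_scores < t).sum() + expected_false
    return n_discards - PENALTY * expected_false

thresholds = np.linspace(0.01, 0.99, 99)
best = thresholds[np.argmax([smoothed_objective(t) for t in thresholds])]
```

Because the CDF is smooth, the optimized threshold no longer jumps by a whole grid step when a single tail sample moves between folds; whether the fit is trustworthy in the extreme tail is exactly the difficulty the question raises.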

In case it's relevant: the model I use is xgboost.

I feel like this should be a rather common dilemma, but I haven't been able to find any robust methodology for dealing with it. What is a good approach to this issue?

Topic probability-calibration xgboost cross-validation class-imbalance python

Category Data Science


First, I'd consider resampling the data, provided it isn't too high-dimensional. A combination of SMOTE and Tomek links should work fine as a starting point.

There's a library with these methods and nice docs.


You need a cost-sensitive learning mechanism for this type of task. You basically create a cost matrix that specifies the cost of each kind of wrong prediction. For example, when predicting whether a patient has cancer, the cost of predicting that he doesn't have cancer when he does is much higher than that of predicting he has cancer when he actually doesn't. Here's a good article: https://mlr.mlr-org.com/articles/tutorial/cost_sensitive_classif.html

Here's one for Python: https://towardsdatascience.com/fraud-detection-with-cost-sensitive-machine-learning-24b8760d35d9
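One simple way to apply a cost matrix in practice is to collapse it into per-sample weights at training time. The sketch below uses sklearn's LogisticRegression for brevity; xgboost's sklearn wrapper accepts the same kind of weights through its `sample_weight` argument. The data and the 5000:1 weighting are illustrative assumptions mirroring the question's penalty:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data with rare positives ("keep" samples).
X = rng.normal(size=(5000, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=5000) > 2.0).astype(int)

# Cost matrix collapsed to per-class weights: misclassifying a "keep"
# sample (a false discard) is treated as ~5000x worse than a needless keep.
weights = np.where(y == 1, 5000.0, 1.0)

clf = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)
probs = clf.predict_proba(X)[:, 1]
```

With costs baked into training, the default 0.5 probability cutoff already reflects the asymmetry, which sidesteps some of the post-hoc threshold tuning, though the model's probabilities are then no longer calibrated to the true class ratio.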
