Determining threshold in an area with very few samples of positive label
I have a binary classification task where I want to either keep or discard samples. I have about a million samples, and about 1% should be kept. I want to discard as many samples as possible, but discarding one that should have been kept carries a heavy penalty. I have concluded that I want to maximize something like the following:
n_discards - n_false_discards * penalty
where I expect penalty to be around 5000.
Now, it's easy enough to sweep thresholds over my validation data (around 100k samples) and find where this value is maximized. However, the threshold this returns is quite unstable. The high penalty means the optimal threshold sits in a region with very few false discards, and the scarcity of those samples makes it hard to estimate exactly where the threshold should be. This leads to a lot of threshold variance between folds and, more importantly, between what the validation set suggests and what is actually right for the test set.
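To make that concrete, here is a minimal sketch of the per-sample sweep I'm describing; the data is synthetic and the names (val_scores, val_labels, PENALTY) are just placeholders for my actual validation fold:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a validation fold: ~1% positives ("keep"), rest "discard".
val_labels = (rng.random(100_000) < 0.01).astype(int)                  # 1 = keep, 0 = discard
val_scores = 1 / (1 + np.exp(-rng.normal(-3 + 6 * val_labels, 1.5)))   # predicted P(keep)

PENALTY = 5000

def objective(threshold, scores, labels, penalty=PENALTY):
    """Net value of discarding every sample scored below `threshold`."""
    discards = scores < threshold
    n_discards = int(discards.sum())
    n_false_discards = int((discards & (labels == 1)).sum())
    return n_discards - penalty * n_false_discards

# Brute-force sweep over candidate thresholds taken from the score quantiles.
candidates = np.quantile(val_scores, np.linspace(0, 1, 1001))
best_threshold = max(candidates, key=lambda t: objective(t, val_scores, val_labels))
print(best_threshold, objective(best_threshold, val_scores, val_labels))
```

With only ~1000 positives, the empirical count of false discards near the optimum changes a lot between folds, which is exactly the instability I mean.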
I feel like there must be a better way than going on a sample-by-sample basis, but I haven't found it. I've considered fitting distributions from scipy.stats and using weighted CDFs to optimize the above function, but I haven't been able to get these fits to work very well (a sketch of that idea is below). I've also looked into sklearn.isotonic, and I've tried some heuristics of my own (linear, probability-based interpolation between the false discards to approximate the expected number of false discards at a given probability). So far I'm not really satisfied with any of the outcomes.
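For reference, this is roughly the weighted-CDF idea I was attempting, continuing from the val_scores / val_labels / PENALTY of the previous snippet; scipy.stats.beta is just one arbitrary choice of distribution, not necessarily the right one:

```python
import numpy as np
from scipy import stats

# Fit a parametric distribution to the scores of the positive ("keep") class only.
# A beta distribution is one plausible choice for probabilities in (0, 1).
pos_scores = val_scores[val_labels == 1]
a, b, loc, scale = stats.beta.fit(pos_scores, floc=0, fscale=1)
n_pos = len(pos_scores)

def smoothed_objective(threshold):
    # Expected false discards: number of positives times the fitted CDF at the
    # threshold, instead of counting the handful of actual false discards below it.
    exp_false_discards = n_pos * stats.beta.cdf(threshold, a, b, loc=loc, scale=scale)
    n_discards = (val_scores < threshold).sum()
    return n_discards - PENALTY * exp_false_discards

candidates = np.quantile(val_scores, np.linspace(0, 1, 1001))
best_threshold = max(candidates, key=smoothed_objective)
```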
In case it's relevant: the model I use is xgboost.
I feel like this should be a rather common dilemma, but I haven't been able to find any robust methodology for it. What is a good approach to this problem?
Topic probability-calibration xgboost cross-validation class-imbalance python
Category Data Science