Optimizing decision threshold on model with oversampled/imbalanced data
I'm working on a model with a highly imbalanced dataset (0.7% minority class). To remedy the imbalance, I plan to oversample using algorithms from the imbalanced-learn library. I have a workflow in mind that I wanted to share, to get an opinion on whether I'm heading in the right direction or missed something.
- Split Train/Test/Val
- Set up a pipeline for GridSearchCV and optimize hyper-parameters (the pipeline will only oversample the training folds; see the first sketch after this list)
- The scoring metric will be AUC, since the training folds are balanced at that point
- Because the model is trained on a balanced dataset, its predicted probabilities will be skewed toward the minority class and it will likely produce a lot of false positives
- Taking the above into consideration, the model will be calibrated to give more accurate probabilities (CalibratedClassifierCV; second sketch below)
- View the precision/recall curve with the calibrated probability thresholds on the validation set and determine the optimal operating point (third sketch below)
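To make the pipeline step concrete, here is a minimal sketch of what I mean, assuming SMOTE as the oversampler and a logistic regression as a placeholder estimator (both just for illustration), with synthetic data standing in for my real dataset:

```python
# Sketch of steps 1-3: stratified splits, then SMOTE + classifier inside
# an imblearn Pipeline so resampling happens only when the pipeline is fit.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for my data: ~0.7% minority class
X, y = make_classification(n_samples=20_000, weights=[0.993], flip_y=0,
                           random_state=42)

# Train/validation/test split, stratified to preserve the class ratio
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# GridSearchCV refits the pipeline inside each fold, so SMOTE only ever
# sees the training folds; the scoring folds keep the original class ratio.
param_grid = {
    "smote__k_neighbors": [3, 5],
    "clf__C": [0.01, 0.1, 1.0],
}
search = GridSearchCV(
    pipe, param_grid, scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1)
search.fit(X_train, y_train)
```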
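For the calibration step, my plan is roughly the following (continuing from the sketch above; sigmoid/Platt scaling is just one choice, isotonic would be the other obvious candidate):

```python
from sklearn.calibration import CalibratedClassifierCV

# Wrap the winning pipeline; cv=5 refits it on 4/5 of the training data
# and fits the calibration map on the held-out fifth, so the calibration
# is learned on original (non-SMOTE) samples at the true class ratio.
calibrated = CalibratedClassifierCV(search.best_estimator_,
                                    method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
```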
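And for the last step, picking the threshold from the precision/recall curve on the validation set. Maximizing F1 here is just a placeholder criterion; the real choice depends on the relative cost of false positives vs. false negatives:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

proba_val = calibrated.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, proba_val)

# precision_recall_curve returns one more (precision, recall) pair than
# thresholds, so drop the last pair before combining them.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1)
print(f"threshold={thresholds[best]:.3f}  "
      f"precision={precision[best]:.3f}  recall={recall[best]:.3f}")
```

The selected threshold would then get a final sanity check on the untouched test set.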
Does this process sound reasonable? I'd appreciate any feedback or suggestions.
Tags: grid-search, smote, model-selection, cross-validation