Cross-validation scheme for an imbalanced dataset
Based on a previous post, I understand that when training a binary classification model on an imbalanced dataset, the validation folds used during CV should keep the same imbalanced class distribution as the original dataset. My question is about the best training scheme.
Let’s assume that I have an imbalanced dataset with 5M samples, of which 90% belong to the pos class and 10% to the neg class, and that I am going to use 5-fold CV for model tuning. Let’s also assume I hold out a random 100K samples for testing (90K pos vs 10K neg). Now I have two options:
Option 1)
- Step 1: Draw a random, imbalanced sample of 200K for training (180K pos vs 20K neg)
- Step 2: During each CV iteration:
  - The training fold will have 160K samples (144K pos vs 16K neg)
  - The validation fold will have 40K samples (36K pos vs 4K neg)
- Step 3: Apply data balancing to the training fold only (e.g., downsampling, upsampling, SMOTE) and fit a model
- Step 4: Validate the model on the imbalanced validation fold
However, given that I have enough data, I want to avoid using any data balancing algorithm for the training folds.
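For reference, Option 1 could be sketched roughly as follows. This is only an illustration under assumptions that are not part of my setup: the data is a small synthetic stand-in, `downsample_majority` is a hypothetical helper, and `LogisticRegression` with ROC AUC is an arbitrary choice of model and metric. The point is that balancing is applied inside the loop, to the training fold only, while the validation fold keeps the 90/10 ratio.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 9,000 pos (1) vs 1,000 neg (0).
y = np.array([1] * 9_000 + [0] * 1_000)
X = rng.normal(size=(len(y), 4)) + 0.5 * y[:, None]


def downsample_majority(X_tr, y_tr, rng):
    """Hypothetical helper: randomly drop rows until both classes have equal size."""
    pos_idx = np.flatnonzero(y_tr == 1)
    neg_idx = np.flatnonzero(y_tr == 0)
    n_keep = min(len(pos_idx), len(neg_idx))
    keep = np.concatenate([
        rng.choice(pos_idx, size=n_keep, replace=False),
        rng.choice(neg_idx, size=n_keep, replace=False),
    ])
    return X_tr[keep], y_tr[keep]


# StratifiedKFold keeps the original class ratio in every validation fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores, val_pos_rates = [], []
for train_idx, val_idx in cv.split(X, y):
    # Step 3: balance the training fold only, then fit.
    X_bal, y_bal = downsample_majority(X[train_idx], y[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

    # Step 4: score on the untouched, still imbalanced validation fold.
    proba = model.predict_proba(X[val_idx])[:, 1]
    scores.append(roc_auc_score(y[val_idx], proba))
    val_pos_rates.append(y[val_idx].mean())

print("mean ROC AUC:", np.mean(scores))
print("pos rate per validation fold:", val_pos_rates)
```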
Option 2)
- Step 1: Draw a random, balanced sample of 200K for training (100K pos vs 100K neg)
- Step 2: During each CV iteration:
  - The training fold will have 160K samples (80K pos vs 80K neg)
  - The validation fold will have 40K samples (20K pos vs 20K neg)
- Step 3: Fit a model on the already balanced training fold
- Step 4: Can I downsample the balanced validation fold to restore the original imbalanced ratio? If so, how can I do that in sklearn?
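To make Step 4 concrete, here is a rough sketch of what I have in mind. It assumes the goal is to keep every pos row and randomly drop neg rows until pos is about 90% of the fold. `restore_imbalance` and its `pos_fraction` parameter are hypothetical names of my own; `sklearn.utils.resample` is only used to draw the subset without replacement, and the data is synthetic.

```python
import numpy as np
from sklearn.utils import resample


def restore_imbalance(X_val, y_val, pos_fraction=0.9, random_state=0):
    """Hypothetical helper: drop neg (0) rows so pos (1) is ~pos_fraction of the fold."""
    pos_idx = np.flatnonzero(y_val == 1)
    neg_idx = np.flatnonzero(y_val == 0)

    # Number of neg rows that yields the requested pos:neg ratio.
    # Assumes the fold holds at least this many neg rows.
    n_neg = int(round(len(pos_idx) * (1 - pos_fraction) / pos_fraction))

    neg_keep = resample(
        neg_idx, replace=False, n_samples=n_neg, random_state=random_state
    )
    keep = np.sort(np.concatenate([pos_idx, neg_keep]))
    return X_val[keep], y_val[keep]


# Synthetic balanced validation fold: 2,000 pos vs 2,000 neg.
rng = np.random.default_rng(0)
y_val = np.array([1] * 2_000 + [0] * 2_000)
X_val = rng.normal(size=(len(y_val), 4))

X_imb, y_imb = restore_imbalance(X_val, y_val)
print(len(y_imb), y_imb.mean())
```

With a 20K pos vs 20K neg fold, this would keep all 20K pos rows and roughly 2.2K neg rows, so the validation fold shrinks from 40K to about 22K samples.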
I am also aware of a third option, based on Option 1 above, in which the model is trained directly on the imbalanced data, so that a data balancing algorithm can be avoided altogether.
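A minimal sketch of this third option, assuming stratified folds and no resampling at all. The synthetic data, the `LogisticRegression` model, and the `roc_auc` scoring string are placeholders chosen for illustration, not part of my actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 9,000 pos (1) vs 1,000 neg (0).
y = np.array([1] * 9_000 + [0] * 1_000)
X = rng.normal(size=(len(y), 4)) + 0.5 * y[:, None]

# Both training and validation folds keep the original 90/10 ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)
print(scores.mean())
```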
My questions are:
- Is Option 2 better than Option 1?
- How can I downsample a balanced validation dataset (Option 2, Step 4)?