Correctly evaluate model with oversampling and cross-validation
I'm dealing with a classic case of a dataset with an imbalanced binary target (event 3%, non-event 97%). My idea is to apply some form of sampling (over-/under-sampling, SMOTE, etc.) to address the imbalance.
As I understand it, the correct way to do this is to sample ONLY the training set, so that the test performance is closer to what I would see in reality. Moreover, I want to use CV for hyperparameter tuning. So, the steps in order are (a code sketch follows the list):
1. Divide the dataset into train and test sets
2. Perform 5-fold CV, as follows:
3. Sample the "training" portion of the CV split
4. Sample the "validating" portion of the CV split
5. Train the model on the "training" portion
6. Validate it on the "validating" portion
7. Repeat steps 3-6 for each of the 5 folds
8. Evaluate performance on the test set
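For concreteness, here is a minimal sketch of the pipeline I have in mind, assuming scikit-learn and imbalanced-learn are available (the synthetic data, SMOTE, the random forest and ROC AUC are just placeholders). Note that this sketch implements the variant *without* step 4: imblearn's `Pipeline` applies the sampler only at fit time, so inside `cross_val_score` each validation fold is left at its original class ratio, and the held-out test set is never resampled.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Synthetic stand-in for the real data: ~3% positive class
X, y = make_classification(n_samples=5000, weights=[0.97, 0.03], random_state=0)

# Step 1: hold out an untouched test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0)

# Steps 2-7: inside cross_val_score the pipeline re-fits SMOTE on the
# training folds only; the validation fold of each split stays unsampled.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc")
print("CV ROC AUC: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Step 8: refit on the full training set, evaluate on the untouched test set
pipe.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
print("Test ROC AUC: %.3f" % test_auc)
```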
My doubt is: how can I compare the CV performance with the test performance, given that the former is based on sampled data and the latter is not?
One idea is to skip step 4 and sample only the "training" portion, but in that case how can I compare the "training" performance with the "validating" performance?
EDIT: added target ratio.
Tags: overfitting, sampling, cross-validation
Category: Data Science