Correctly evaluate model with oversampling and cross-validation

I'm dealing with a classic case of a dataset with an imbalanced binary target (3% events, 97% non-events). My idea is to apply some form of sampling (over-/under-sampling, SMOTE, etc.) to address the imbalance.

As I understand it, the correct way to do this is to sample ONLY the training set, so that the test performance stays closer to what I would see in reality. Moreover, I want to use CV for hyperparameter tuning. So the tasks, in order, are as follows (steps 1-2 are sketched in code after the list):

  1. Divide the dataset into a train set and a test set
  2. Perform 5-fold CV on the train set, as follows:
  3. Sample the "training" portion of the CV fold
  4. Sample the "validating" portion of the CV fold
  5. Train the model on the "training" portion
  6. Validate it on the "validating" portion
  7. Repeat steps 3-6 for each of the 5 folds
  8. Evaluate performance on the test set
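
To make steps 1-2 concrete, here is a minimal sketch of what I have in mind, assuming scikit-learn; the synthetic data and the 80/20 split are just placeholders for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for my data: roughly 3% positive class
X, y = make_classification(n_samples=10_000, weights=[0.97, 0.03], random_state=0)

# Step 1: hold out a test set that keeps the original class ratio (never resampled)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: set up 5-fold CV on the training set; any sampling would happen per fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
```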

My doubt is: how can I compare the CV performances with the test performance, since the former are based on sampled data and the latter is not?

An idea is to skip step 4 and sample only the "training" portion, but in that case how can I compare the "training" performance with the "validating" performance?

EDIT: added target ratio.

Topic: overfitting, sampling, cross-validation

Category: Data Science


I believe the sequence for combining CV and SMOTE should be as below (a sketch of the loop follows the list).

1. Perform the 5-fold CV (loop over each fold)
2. Split into training and validation samples (for each fold)
3. Apply SMOTE to the training samples only
4. Train the model on the resampled training samples
5. Predict on the validation samples (left untouched)
6. Evaluate performance on the validation samples
Repeat for the next fold
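
A minimal sketch of that loop, assuming scikit-learn and imbalanced-learn are available; the random forest, the average-precision metric, and the synthetic data are placeholders, not prescriptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data (~3% positives) as a stand-in for the real training set
X, y = make_classification(n_samples=5_000, weights=[0.97, 0.03], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in cv.split(X, y):          # steps 1-2: loop over folds
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]

    # step 3: SMOTE on the training fold only, never on the held-out fold
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

    # step 4: train on the resampled training fold
    model = RandomForestClassifier(random_state=0).fit(X_res, y_res)

    # steps 5-6: predict and score on the untouched validation fold
    proba = model.predict_proba(X_val)[:, 1]
    scores.append(average_precision_score(y_val, proba))

print(f"CV average precision: {np.mean(scores):.3f}")
```

If you also tune hyperparameters, imbalanced-learn's Pipeline (imblearn.pipeline.Pipeline) can wrap the sampler and the estimator so that, inside GridSearchCV, SMOTE is fitted only on each training fold, which reproduces this loop without writing it by hand.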
