Correctly evaluate model with oversampling and cross-validation
I'm dealing with a classic case of a dataset with an imbalanced binary target (event 3%, non-event 97%). My idea is to apply some form of sampling (over-/under-sampling, SMOTE, etc.) to address the imbalance.
As I understand it, the correct way to do this is to sample ONLY the training set, so that the test performance is closer to what I would see in reality. Moreover, I want to use CV for hyperparameter tuning. So, the steps in order are (a code sketch follows the list):
1. Divide the dataset into train and test sets
2. Perform 5-fold CV, as follows:
3. Sample the "training" portion of the CV split
4. Sample the "validating" portion of the CV split
5. Train the model on the "training" portion
6. Validate it on the "validating" portion
7. Repeat steps 3-6 for each of the 5 folds
8. Evaluate performance on the test set
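For concreteness, here is a minimal sketch of the pipeline I have in mind, assuming scikit-learn and imbalanced-learn are available (the synthetic data, SMOTE, the random forest and ROC AUC are just placeholders). Note that this sketch implements the variant *without* step 4: imblearn's `Pipeline` applies the sampler only at fit time, so inside `cross_val_score` each validation fold is left at its original class ratio, and the held-out test set is never resampled.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Synthetic stand-in for the real data: ~3% positive class
X, y = make_classification(n_samples=5000, weights=[0.97, 0.03], random_state=0)

# Step 1: hold out an untouched test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0)

# Steps 2-7: inside cross_val_score the pipeline re-fits SMOTE on the
# training folds only; the validation fold of each split stays unsampled.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc")
print("CV ROC AUC: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Step 8: refit on the full training set, evaluate on the untouched test set
pipe.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
print("Test ROC AUC: %.3f" % test_auc)
```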
My doubt is: how can I compare the CV performance with the test performance, given that the former is based on sampled data and the latter is not?
One idea is to skip step 4 and sample only the "training" portion, but in that case how can I compare the "training" performance with the "validating" performance?
EDIT: added target ratio.
Tags: overfitting, sampling, cross-validation
Category: Data Science