GridSearch on imbalanced datasets

I'm trying to use grid search to find the best parameters for my model. Given that I have to apply the NearMiss undersampling method while doing cross-validation, should I fit my grid search on the undersampled dataset (regardless of which undersampling technique is used), or on my entire training set before applying cross-validation?

Topic hyperparameter-tuning imbalance scikit-learn machine-learning

Category Data Science


You can build the grid search / cross-validation loop manually instead of using GridSearchCV, and for each split upsample the rare class only in the folds used for training. The fold used for validation stays untouched. The reason is that if you upsample the whole training set before cross-validation, your validation error will be too optimistic: the training portion and the validation portion of each split would then contain the same duplicated instances (possibly plenty of them).
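As a minimal sketch of that manual loop, the example below uses scikit-learn with imbalanced-learn's RandomOverSampler applied only to the training fold of each split (NearMiss from imblearn.under_sampling could be swapped in at the same place for the undersampling case from the question). The toy dataset, the parameter grid, and the F1 scoring are illustrative assumptions, not anything fixed by the question:

```python
from itertools import product

import numpy as np
from imblearn.over_sampling import RandomOverSampler  # or NearMiss from imblearn.under_sampling
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced data standing in for the real training set (class 1 is the rare class)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [5, None]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for values in product(*param_grid.values()):
    params = dict(zip(param_grid, values))
    scores = []
    for train_idx, val_idx in cv.split(X, y):
        # Resample ONLY the training portion of the split ...
        X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X[train_idx], y[train_idx])
        model = RandomForestClassifier(random_state=0, **params).fit(X_res, y_res)
        # ... and score on the untouched, original validation fold
        scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
    results[tuple(params.items())] = np.mean(scores)

best_params = max(results, key=results.get)
print(dict(best_params), results[best_params])
```

An equivalent and less error-prone option is to put the sampler and the estimator into an imbalanced-learn Pipeline and pass that to GridSearchCV: imblearn samplers are applied only when the pipeline is fitted, so each validation fold is still scored on the original, un-resampled data.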


Do the grid search at the same level of imbalance that you plan (and are able) to use for your training and evaluation.

That means: if you have seen that the imbalanced dataset does not skew your model's predictions or cause other unwanted outcomes, use the largest dataset possible. If, on the other hand, your model is strongly overfitting because of the imbalance, then optimising with grid search on that data will only push it further in that direction.
