GridSearch on imbalanced datasets

I'm trying to use grid search to find the best parameters for my model. Given that I have to apply the NearMiss undersampling method while doing cross-validation, should I fit my grid search on the undersampled dataset (regardless of which undersampling technique is used), or on my entire training set before applying cross-validation?

Topic hyperparameter-tuning imbalance scikit-learn machine-learning

Category Data Science


You can build the grid search / cross-validation loop manually instead of using GridSearchCV, and for each split upsample the rare class only in the folds used for training. The fold used for validation stays untouched. The reason is that if you upsample the whole training set before cross-validation, your validation error will be too optimistic: the training portion and the validation portion of each split would then contain the same duplicated instances (possibly plenty of them).
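As a minimal sketch of that manual loop, the example below uses scikit-learn with imbalanced-learn's RandomOverSampler applied only to the training fold of each split (NearMiss from imblearn.under_sampling could be swapped in at the same place for the undersampling case from the question). The toy dataset, the parameter grid, and the F1 scoring are illustrative assumptions, not anything fixed by the question:

```python
from itertools import product

import numpy as np
from imblearn.over_sampling import RandomOverSampler  # or NearMiss from imblearn.under_sampling
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced data standing in for the real training set (class 1 is the rare class)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [5, None]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for values in product(*param_grid.values()):
    params = dict(zip(param_grid, values))
    scores = []
    for train_idx, val_idx in cv.split(X, y):
        # Resample ONLY the training portion of the split ...
        X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X[train_idx], y[train_idx])
        model = RandomForestClassifier(random_state=0, **params).fit(X_res, y_res)
        # ... and score on the untouched, original validation fold
        scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
    results[tuple(params.items())] = np.mean(scores)

best_params = max(results, key=results.get)
print(dict(best_params), results[best_params])
```

An equivalent and less error-prone option is to put the sampler and the estimator into an imbalanced-learn Pipeline and pass that to GridSearchCV: imblearn samplers are applied only when the pipeline is fitted, so each validation fold is still scored on the original, un-resampled data.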


Do the grid search at the same level of imbalance that you plan (and are able) to use for your training and evaluation.

That means: if you have seen that the imbalanced dataset does not skew your model's predictions or cause other unwanted outcomes, use the largest dataset possible. If, on the other hand, your model is strongly overfitting because of the imbalance, then optimising with grid search on that data will only push it further in that direction.
