Explaining the logic behind the pipeline method for cross-validation on imbalanced datasets

Reading the following article: https://kiwidamien.github.io/how-to-do-cross-validation-when-upsampling-data.html

It explains how to use from imblearn.pipeline import make_pipeline in order to perform cross-validation on an imbalanced dataset while avoiding data leakage.

Here I copy the code used in the notebook linked by the article:

# Imports assumed by the snippet (not shown in the excerpt):
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=45)
rf = RandomForestClassifier(n_estimators=100, random_state=13)

# kf (the CV splitter) and params (the hyperparameter grid) are defined
# earlier in the notebook.
imba_pipeline = make_pipeline(SMOTE(random_state=42),
                              RandomForestClassifier(n_estimators=100, random_state=13))
cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf)

new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf, scoring='recall',
                         return_train_score=True)
grid_imba.fit(X_train, y_train);

grid_imba.best_params_
grid_imba.best_score_

I do not understand why using the pipeline avoids the problem of a validation set contaminated by oversampling done before the split, i.e. data leakage.

Secondly, can we use the pipeline for the same purpose but pass the original dataset X, y as arguments to the functions in the pipeline? Like this:

imba_pipeline = make_pipeline(SMOTE(random_state=42), 
                              RandomForestClassifier(n_estimators=100, random_state=13))
cross_val_score(imba_pipeline, X, y, scoring='recall', cv=kf)

new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf, scoring='recall',
                         return_train_score=True)
grid_imba.fit(X, y);

grid_imba.best_params_
grid_imba.best_score_


The page linked already gives a really good explanation:

To see why this is an issue, consider the simplest method of over-sampling (namely, copying the data point). Let's say every data point from the minority class is copied 6 times before making the splits. If we did a 3-fold validation, each fold has (on average) 2 copies of each point! If our classifier overfits by memorizing its training set, it should be able to get a perfect score on the validation set! Our cross-validation will choose the model that overfits the most. We see that CV chose the deepest trees it could!
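
To make the quoted point concrete, here is a minimal sketch (not from the linked notebook; the toy dataset and classifier are chosen only for illustration) of what happens when the minority class is copied before the cross-validation splits are made: copies of the same point end up in both the training and validation folds, so even a classifier that simply memorizes its training data scores well.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced dataset (roughly 10% minority class).
X_toy, y_toy = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Naive upsampling: copy every minority point 6 times BEFORE any splitting.
minority = np.where(y_toy == 1)[0]
X_up = np.vstack([X_toy] + [X_toy[minority]] * 6)
y_up = np.concatenate([y_toy] + [y_toy[minority]] * 6)

kf_demo = KFold(n_splits=3, shuffle=True, random_state=0)
tree = DecisionTreeClassifier(random_state=0)  # unlimited depth, can memorize

# Recall looks close to perfect, but only because each validation fold
# contains copies of points the tree has already seen during training.
print(cross_val_score(tree, X_up, y_up, scoring='recall', cv=kf_demo).mean())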

The main point is that the data must first be split into training and validation folds by the cross-validation before any upsampling is applied. A pipeline achieves this because it bundles the upsampling and the model fitting into a single estimator, so both steps run only on the training fold of each split and the validation fold never contains any synthetic or duplicated samples. It would still be possible to do this without a pipeline by applying the two steps manually, but then you would have to do the cross-validation splitting yourself using the indices from sklearn.model_selection.KFold (a sketch of this is shown below), since cross_val_score and GridSearchCV have no option for this and only accept a single estimator (which can itself be a pipeline of multiple steps).
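
For reference, a rough sketch of that manual approach might look like the following (the helper name and variables are illustrative and not taken from the article; X and y are assumed to be NumPy arrays). SMOTE is fit only on the training indices of each fold, which is what the imblearn pipeline does for you internally:

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

def manual_cv_recall(X, y, n_splits=5, random_state=42):
    """Cross-validated recall with SMOTE applied only inside each training fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = []
    for train_idx, val_idx in kf.split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]

        # Oversample *only* the training fold; the validation fold is untouched.
        X_res, y_res = SMOTE(random_state=random_state).fit_resample(X_tr, y_tr)

        clf = RandomForestClassifier(n_estimators=100, random_state=13)
        clf.fit(X_res, y_res)
        scores.append(recall_score(y_val, clf.predict(X_val)))
    return np.mean(scores)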
