Explaining the logic behind the pipeline method for cross-validation on imbalanced datasets
I am reading the following article: https://kiwidamien.github.io/how-to-do-cross-validation-when-upsampling-data.html
It explains how to use from imblearn.pipeline import make_pipeline
to perform cross-validation on an imbalanced dataset while avoiding data leakage.
Here is the code I copied from the notebook linked in the article:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=45)

rf = RandomForestClassifier(n_estimators=100, random_state=13)

imba_pipeline = make_pipeline(SMOTE(random_state=42),
                              RandomForestClassifier(n_estimators=100, random_state=13))

cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf)

# kf (a cross-validation splitter) and params (a grid of random forest
# hyperparameters) are defined earlier in the notebook
new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf, scoring='recall',
                         return_train_score=True)
grid_imba.fit(X_train, y_train);
grid_imba.best_params_
grid_imba.best_score_
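I assume the tuned model is then evaluated on the held-out test set; the exact evaluation code below is my addition, not copied from the notebook:

from sklearn.metrics import recall_score
# Final check on the held-out test set (my assumption of what comes next):
# grid_imba.predict uses the refitted best estimator; the SMOTE step in the
# pipeline is only active during fit, so no resampling happens at predict time
y_pred = grid_imba.predict(X_test)
recall_score(y_test, y_pred)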
I do not understand why using the pipeline avoids the data leakage problem that arises when the data is oversampled before splitting, so that synthetic samples derived from the training data can end up in the validation set.
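To make the leakage concern concrete, this is the naive approach I have in mind, where the oversampling happens once, before the folds are split (reusing the variables defined above):

X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
# Synthetic minority points now appear in the validation folds, and each was
# interpolated from neighbours that may land in the training folds, so the
# cross-validated recall is optimistically biased:
cross_val_score(rf, X_res, y_res, scoring='recall', cv=kf)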
Secondly, can we use the pipeline for the same purpose but pass the original dataset X, y directly to cross_val_score and GridSearchCV? Like this:
imba_pipeline = make_pipeline(SMOTE(random_state=42),
                              RandomForestClassifier(n_estimators=100, random_state=13))

cross_val_score(imba_pipeline, X, y, scoring='recall', cv=kf)

new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf, scoring='recall',
                         return_train_score=True)
grid_imba.fit(X, y);
grid_imba.best_params_
grid_imba.best_score_
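My mental model is that, in both cases, cross_val_score clones the pipeline for every fold and fits it only on that fold's training part, roughly like the simplified sketch below (the real implementation also handles scoring, parallelism, etc.):

from sklearn.base import clone
from sklearn.metrics import recall_score

# Simplified sketch of what I think cross_val_score does with the pipeline
# (assuming X and y are numpy arrays):
scores = []
for train_idx, val_idx in kf.split(X, y):
    fold_pipe = clone(imba_pipeline)
    # SMOTE is fit and applied only to the training fold here...
    fold_pipe.fit(X[train_idx], y[train_idx])
    # ...while the validation fold passes through unresampled at predict time
    scores.append(recall_score(y[val_idx], fold_pipe.predict(X[val_idx])))

If this picture is right, is the only difference of passing X, y instead of X_train, y_train that no data is left over for a final held-out test set?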
Tags: oversampling, pipelines, imbalanced-learn, methodology, class-imbalance