How to use SMOTE in Stacking in SKLearn?

I have a data set X, y and split it into train and test sets: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=10). To handle the class imbalance, I want to apply SMOTE and then train a classifier. However, I am going to use stacking as my classification method. At which point should I apply SMOTE? Should I apply it when defining the lower-level classifiers or at the higher level?

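from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts samplers such as SMOTE
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Option 1: SMOTE inside each level-0 pipeline
level0 = list()
oversample = SMOTE()
RF = RandomForestClassifier(random_state=13)
pipe1 = Pipeline(steps=[('OverSampling', oversample), ('Classifier', RF)])
level0.append(('rf', pipe1))  # StackingClassifier expects (name, estimator) tuples

DT = DecisionTreeClassifier(random_state=0)
pipe2 = Pipeline(steps=[('OverSampling', oversample), ('Classifier', DT)])
level0.append(('dt', pipe2))

level1 = LogisticRegression()  # an instance, not the class itself
model = StackingClassifier(estimators=level0, final_estimator=level1, cv=10, passthrough=True)
model.fit(X_train, y_train)
model.predict(X_test)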

Or I should use the following code?

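from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts samplers such as SMOTE
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Option 2: SMOTE once, in front of the whole stacked model
level0 = list()
oversample = SMOTE()
RF = RandomForestClassifier(random_state=13)
level0.append(('rf', RF))  # StackingClassifier expects (name, estimator) tuples

DT = DecisionTreeClassifier(random_state=0)
level0.append(('dt', DT))

level1 = LogisticRegression()  # an instance, not the class itself
model = StackingClassifier(estimators=level0, final_estimator=level1, cv=10, passthrough=True)

pipe1 = Pipeline(steps=[('OverSampling', oversample), ('Classifier', model)])

pipe1.fit(X_train, y_train)
pipe1.predict(X_test)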

Another question: we use SMOTE in the training step to get a better model. But in a pipeline the first step is SMOTE, so I think that when predicting on the test data, the test data is first oversampled and then the classification model is applied. Is that correct? I don't know how I should handle SMOTE for the final prediction. I would be thankful if someone could explain this and modify my code.

Topic pipelines stacking smote scikit-learn classification

Category Data Science


First question: whether to apply SMOTE at the first or the second level of a stacked classifier. Generally, SMOTE should be applied before any classification, since it gives the minority class an increased likelihood of being successfully learned. The first classifier should be given the most useful input. Another approach is to look for empirical evidence: train the models both ways, evaluate them on data that has not been oversampled, and choose the arrangement that performs better.

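Second question: SMOTE is applied only to the training dataset. During prediction, only the samples actually present in the test data are predicted; nothing is oversampled. If you use imblearn's Pipeline, this is handled automatically: the sampler runs during fit and is skipped during predict.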
