How to use SMOTE in Stacking in SKLearn?
I have a data set X, y and split it into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=10)
To handle the class imbalance, I want to apply SMOTE and then use classification algorithms. However, I am using stacking as my classification method, and I am not sure where SMOTE belongs. Should it go into the lower-level classifiers, or at the higher level, around the stacked model?
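# Assumes imbalanced-learn; its Pipeline accepts samplers such as SMOTE
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Option 1: oversample inside each lower-level estimator
level0 = list()
oversample = SMOTE()
RF = RandomForestClassifier(random_state=13)
pipe1 = Pipeline(steps=[('OverSampling', oversample), ('Classifier', RF)])
level0.append(('RF', pipe1))  # StackingClassifier expects (name, estimator) tuples
DT = DecisionTreeClassifier(random_state=0)
pipe2 = Pipeline(steps=[('OverSampling', oversample), ('Classifier', DT)])
level0.append(('DT', pipe2))
level1 = LogisticRegression()  # an instance, not the class itself
model = StackingClassifier(estimators=level0, final_estimator=level1, cv=10, passthrough=True)
model.fit(X_train, y_train)
model.predict(X_test)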
Or should I use the following code?
# Option 2: oversample once, in front of the whole stacking model
level0 = list()
oversample = SMOTE()
RF = RandomForestClassifier(random_state=13)
level0.append(('RF', RF))  # StackingClassifier expects (name, estimator) tuples
DT = DecisionTreeClassifier(random_state=0)
level0.append(('DT', DT))
level1 = LogisticRegression()  # an instance, not the class itself
model = StackingClassifier(estimators=level0, final_estimator=level1, cv=10, passthrough=True)
pipe1 = Pipeline(steps=[('OverSampling', oversample), ('Classifier', model)])
pipe1.fit(X_train, y_train)
pipe1.predict(X_test)
Another question: SMOTE is meant to be used in the training step to get a better model. But the first step of the pipeline is SMOTE, so I think that when predicting on the test data, the test data is first oversampled and only then passed to the classification model. Is that correct? I don't know how SMOTE should be handled for the final prediction. I would be thankful if someone could explain this and correct my code.
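On the prediction question, a toy sketch may help make the mechanics concrete. As far as the imbalanced-learn documentation describes it, `imblearn.pipeline.Pipeline` applies samplers only during `fit` and skips them in `predict`, so the test data would not be oversampled. The classes below (`DuplicatingOversampler`, `NearestMeanClassifier`, `SamplerAwarePipeline`) are hypothetical stand-ins written in plain Python to mimic that behaviour. They are not the library's implementation, and the duplicating sampler is a simplification, not SMOTE itself.

```python
from collections import Counter
import random


class DuplicatingOversampler:
    """Hypothetical stand-in for SMOTE: duplicates minority rows until classes balance."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.calls = 0  # counts how often resampling actually ran

    def fit_resample(self, X, y):
        self.calls += 1
        counts = Counter(y)
        target = max(counts.values())
        X_out, y_out = list(X), list(y)
        for label, n in counts.items():
            rows = [x for x, lab in zip(X, y) if lab == label]
            for _ in range(target - n):
                X_out.append(self.rng.choice(rows))
                y_out.append(label)
        return X_out, y_out


class NearestMeanClassifier:
    """Tiny 1-D classifier: predicts the class whose training mean is closest."""

    def fit(self, X, y):
        self.n_seen_ = len(X)  # number of rows the classifier was trained on
        grouped = {}
        for x, lab in zip(X, y):
            grouped.setdefault(lab, []).append(x)
        self.means_ = {lab: sum(v) / len(v) for lab, v in grouped.items()}
        return self

    def predict(self, X):
        return [min(self.means_, key=lambda lab: abs(x - self.means_[lab])) for x in X]


class SamplerAwarePipeline:
    """Hypothetical two-step pipeline: resample in fit, pass data through untouched in predict."""

    def __init__(self, sampler, classifier):
        self.sampler = sampler
        self.classifier = classifier

    def fit(self, X, y):
        X_res, y_res = self.sampler.fit_resample(X, y)  # training data only
        self.classifier.fit(X_res, y_res)
        return self

    def predict(self, X):
        # The sampler is deliberately skipped: no rows are added to X here.
        return self.classifier.predict(X)


# Imbalanced toy training set: 8 rows of class 0, 2 rows of class 1
X_train = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 5.0, 5.2]
y_train = [0] * 8 + [1] * 2
X_test = [0.25, 4.9, 0.6]

sampler = DuplicatingOversampler(seed=0)
pipe = SamplerAwarePipeline(sampler, NearestMeanClassifier()).fit(X_train, y_train)
preds = pipe.predict(X_test)

print(pipe.classifier.n_seen_)  # rows seen in training, after oversampling
print(len(preds), len(X_test))  # one prediction per original test row
print(sampler.calls)            # resampling ran once, during fit only
```

In this sketch the classifier is trained on the balanced data, while `predict` returns exactly one label per test row. If the real pipeline behaves as documented, the same holds for both code variants in the question, and no separate SMOTE handling should be needed for the final prediction.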
Topic pipelines stacking smote scikit-learn classification
Category Data Science