How to use SMOTE with stacking in scikit-learn?
I have a data set X, y and split it into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, stratify = y, random_state=10)
To handle the class imbalance, I want to apply SMOTE and then fit a classification algorithm. However, my classification method is stacking, and I am not sure at which point SMOTE should be applied. Should I use it when defining the lower-level classifiers, or at the higher level, around the stacking model?
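from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts sampler steps
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

level0 = list()
oversample = SMOTE()
RF = RandomForestClassifier(random_state=13)
pipe1 = Pipeline(steps=[('OverSampling', oversample), ('Classifier', RF)])
level0.append(('rf', pipe1))  # StackingClassifier expects (name, estimator) tuples
DT = DecisionTreeClassifier(random_state=0)
pipe2 = Pipeline(steps=[('OverSampling', oversample), ('Classifier', DT)])
level0.append(('dt', pipe2))
level1 = LogisticRegression()  # instantiate the final estimator
model = StackingClassifier(estimators=level0, final_estimator=level1, cv=10, passthrough=True)
model.fit(X_train, y_train)
model.predict(X_test)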
Or should I use the following code instead?
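from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts sampler steps
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

level0 = list()
oversample = SMOTE()
RF = RandomForestClassifier(random_state=13)
level0.append(('rf', RF))  # StackingClassifier expects (name, estimator) tuples
DT = DecisionTreeClassifier(random_state=0)
level0.append(('dt', DT))
level1 = LogisticRegression()  # instantiate the final estimator
model = StackingClassifier(estimators=level0, final_estimator=level1, cv=10, passthrough=True)
pipe1 = Pipeline(steps=[('OverSampling', oversample), ('Classifier', model)])
pipe1.fit(X_train, y_train)
pipe1.predict(X_test)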
Another question: we apply SMOTE during training to get a better model. But in a pipeline, SMOTE is the first step, so I think that when predicting on test data, the test data is oversampled first and the classification model is applied afterwards. Is that correct? I don't know how SMOTE should be handled for the final prediction. I would be thankful if someone could explain this and correct my code.