What is the correct order for applying SMOTE in a pipeline?

I have an imbalanced dataset and want to use SMOTE to oversample it. However, when I combine it with a scikit-learn pipeline, I get an error because of missing values.

This is my code:

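from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler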

# FEATURE SELECTION
categorical_features = [
    "MARRIED",
    "RACE",
]

continuous_features = [
    "AGE",
    "SALARY",
]

features = [
    "MARRIED",
    "RACE",
    "AGE",
    "SALARY",
]


# PIPELINE
# Numeric columns: fill gaps with the median, then standardize
continuous_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", continuous_transformer, continuous_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)

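X = df[features]
y = df["binary_response"]  # 1-D target avoids a column-vector warning


X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

# `oversample` was not defined in the original snippet; a SMOTE instance is assumed
from imblearn.over_sampling import SMOTE

oversample = SMOTE(random_state=42)

# This call raises the error: X_train has not been imputed yet and still contains NaN
X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train)

pipeline.fit(X_train_smote, y_train_smote)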

That doesn't work because the data still contains missing values when SMOTE runs. The imputation happens inside the pipeline, so I'm not sure in which order the steps should go.

Any thoughts on that?



Resampling should happen after preprocessing (imputation, scaling, and encoding) but before the classifier.

It is best to use imblearn's Pipeline instead of scikit-learn's Pipeline. imblearn's Pipeline is designed to work with resampling: samplers such as SMOTE are applied only while fitting, never when predicting.

from imblearn.pipeline import Pipeline
