In what order should SMOTE be applied in a pipeline?
I have an imbalanced dataset and was thinking about using SMOTE to oversample it. However, when I combine it with a sklearn pipeline, I get an error because of missing values.
This is my code:
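from imblearn.over_sampling import SMOTE
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# FEATURE SELECTION
categorical_features = [
    "MARRIED",
    "RACE",
]
continuous_features = [
    "AGE",
    "SALARY",
]
features = [
    "MARRIED",
    "RACE",
    "AGE",
    "SALARY",
]

# PIPELINE
continuous_transformer = Pipeline(
    steps=[
        # median imputation for numeric columns
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
categorical_transformer = Pipeline(
    steps=[
        # most-frequent imputation for categorical columns
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", continuous_transformer, continuous_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)

# df is the DataFrame holding the data (loaded elsewhere)
X = df[features]
y = df["binary_response"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

# assumed definition: `oversample` was not defined in the snippet
oversample = SMOTE()
# SMOTE runs here on the raw training data, before any imputation
X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train)
pipeline.fit(X_train_smote, y_train_smote)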
This fails because the training data still has missing values when SMOTE is applied. I'm not sure how to combine SMOTE with the pipeline, or in what order the steps should run.
Any thoughts on that?