Dynamic creation of sklearn pipeline

Question

lazarea

April 28, 2022 16:03

I am trying to create an automatic pipeline builder functionality that takes into account a large set of conditions such as the existence of missing values, the scale of numerical features, etc., and automatically creates a scikit-learn pipeline instead of having to manually create them every time.

I'm aware that pipeline.steps.append() lets me add new pipeline steps dynamically. However, it does not seem to be possible to initialize an empty pipeline to append to; the following yields an error:

from sklearn.pipeline import Pipeline
pipe = Pipeline([])

This raises ValueError: not enough values to unpack (expected 2, got 0).

I also tried passing if conditions directly to the pipeline steps as follows, again without success:

pipe = Pipeline([
    ('numerical_scaler', StandardScaler(), num_columns_to_scale) if num_columns_to_scale,
    ('categorical_encoder', OneHotEncoder(), cat_columns_to_encode) if cat_columns_to_encode
])

This raises SyntaxError: invalid syntax.

What would be the best way to create such auto-pipelining functionality? As a dirty workaround, I could obviously write a huge collection of if-else conditions to create the pipelines, but that is particularly error-prone and difficult to maintain.

Edit: The point of this auto-pipeline functionality is to speed up the creation of custom pipelines. Ideally, I want to input a dataset, specify the target(s), and let the algorithm create a custom pipeline for the given dataset.

Topic pipelines preprocessing scikit-learn machine-learning

Category Data Science

Multivac · Accepted Answer · February 25, 2022 15:51

It is still unclear what you want to do and why; if you add more context, I will try to help.

You could solve the second point with a column transformer (built here with make_column_transformer), which routes columns to different transformers by dtype:

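from sklearn.compose import make_column_transformer, make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Numeric columns: impute with the median, then bin and one-hot encode the bins
numeric_transformer = Pipeline([("imputer", SimpleImputer(strategy="median")),
                                ("binning", KBinsDiscretizer(encode="onehot-dense", strategy="kmeans"))])

# Categorical columns: fill missing values with a constant, then one-hot encode
categorical_transformer = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                                    ("encoding", OneHotEncoder(handle_unknown="ignore"))])

# Route columns to each transformer by dtype
preprocessor = make_column_transformer(
    (numeric_transformer, selector(dtype_exclude="object")),
    (categorical_transformer, selector(dtype_include="object")),
)

pipeline = Pipeline([("preprocessing", preprocessor), ("model", LogisticRegression())])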
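
For the first point, one option is to collect the steps in a plain Python list and only construct the Pipeline once the list is complete, so an empty Pipeline never has to exist. Below is a minimal sketch of that idea, assuming the input is a pandas DataFrame. The helper name build_pipeline and the particular rules (impute only when missing values are present, scale numeric columns, one-hot encode everything else) are hypothetical choices for illustration, not something scikit-learn provides:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(X: pd.DataFrame, model=None) -> Pipeline:
    """Hypothetical helper: assemble a pipeline from what X actually contains."""
    num_cols = X.select_dtypes(include="number").columns.tolist()
    cat_cols = X.select_dtypes(exclude="number").columns.tolist()

    # (name, transformer, columns) triples, filled in only when needed
    transformers = []

    if num_cols:
        num_steps = []
        if X[num_cols].isna().any().any():  # impute only if values are missing
            num_steps.append(("imputer", SimpleImputer(strategy="median")))
        num_steps.append(("scaler", StandardScaler()))
        transformers.append(("numeric", Pipeline(num_steps), num_cols))

    if cat_cols:
        cat_steps = []
        if X[cat_cols].isna().any().any():
            cat_steps.append(
                ("imputer", SimpleImputer(strategy="constant", fill_value="missing"))
            )
        cat_steps.append(("encoder", OneHotEncoder(handle_unknown="ignore")))
        transformers.append(("categorical", Pipeline(cat_steps), cat_cols))

    # Build the outer step list first, then create the Pipeline in one go
    steps = []
    if transformers:
        steps.append(("preprocessing", ColumnTransformer(transformers)))
    steps.append(("model", model if model is not None else LogisticRegression()))
    return Pipeline(steps)


# Toy data (made up for this example): one numeric column with a gap, one categorical
df = pd.DataFrame({"age": [25.0, None, 40.0, 31.0], "city": ["a", "b", "a", "b"]})
y = [0, 1, 0, 1]

pipe = build_pipeline(df)
pipe.fit(df, y)
```

Because each condition only appends to a list, new rules can be added as further if blocks without nesting if-else branches for every combination of conditions.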
