Dynamic creation of sklearn pipeline
I am trying to write an automatic pipeline builder that takes into account a large set of conditions, such as the existence of missing values and the scale of numerical features, and creates a scikit-learn pipeline automatically, so that I don't have to build one by hand every time.
I'm aware of the pipeline.steps.append() functionality, which allows adding new pipeline steps dynamically. However, initializing an empty pipeline to append to does not seem to be allowed; doing the following yields an error:
from sklearn.pipeline import Pipeline
pipe = Pipeline([])
This returns ValueError: not enough values to unpack (expected 2, got 0).
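One possible way around the empty-constructor problem is to collect the steps in a plain Python list and construct the Pipeline only once, at the end. The following is a minimal sketch rather than a definitive implementation; the function name build_pipeline and the flags has_missing and needs_scaling are hypothetical and stand in for whatever dataset checks are actually used.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(has_missing, needs_scaling):
    """Assemble pipeline steps conditionally (hypothetical helper)."""
    steps = []  # ordinary list; no Pipeline object exists yet

    if has_missing:
        steps.append(("imputer", SimpleImputer(strategy="median")))
    if needs_scaling:
        steps.append(("scaler", StandardScaler()))

    # Assumption: when no condition applies, fall back to a single
    # "passthrough" step so the list handed to Pipeline is never empty.
    if not steps:
        steps.append(("passthrough", "passthrough"))

    return Pipeline(steps)


pipe = build_pipeline(has_missing=True, needs_scaling=True)
print([name for name, _ in pipe.steps])
```

Because the list is built before the Pipeline is created, each condition stays a one-line check and the order of the steps is explicit.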
I also tried putting if conditions directly inside the list of pipeline steps, again without success:
pipe = Pipeline([
('numerical_scaler', StandardScaler(), num_columns_to_scale) if num_columns_to_scale,
('categorical_encoder', OneHotEncoder(), cat_columns_to_encode) if cat_columns_to_encode
])
This returns SyntaxError: invalid syntax.
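Two things seem to be going on in the attempt above: a Python conditional expression needs an else branch, and Pipeline steps are (name, estimator) pairs, whereas (name, estimator, columns) triples are what ColumnTransformer expects. A hedged sketch of how the same idea might be written follows; build_preprocessor is a hypothetical name, and the choices remainder="passthrough" and handle_unknown="ignore" are assumptions, not requirements.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(num_columns_to_scale, cat_columns_to_encode):
    """Add a transformer only when its column list is non-empty."""
    transformers = []

    if num_columns_to_scale:
        transformers.append(
            ("numerical_scaler", StandardScaler(), num_columns_to_scale)
        )
    if cat_columns_to_encode:
        transformers.append(
            (
                "categorical_encoder",
                OneHotEncoder(handle_unknown="ignore"),
                cat_columns_to_encode,
            )
        )

    # Assumption: columns not listed above are passed through unchanged.
    column_transformer = ColumnTransformer(transformers, remainder="passthrough")
    return Pipeline([("preprocess", column_transformer)])


pipe = build_preprocessor(["age", "income"], [])
print([name for name, _, _ in pipe.named_steps["preprocess"].transformers])
```

The same filtering could also be written as a list comprehension over candidate triples, keeping only those whose column list is non-empty.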
What would be the best way to create such auto-pipelining functionality? As a dirty workaround, I could obviously write a huge collection of if-else conditions that build the pipelines, but that is particularly error-prone and difficult to maintain.
Edit: The point of this auto-pipeline functionality is to speed up the creation of custom pipelines. Ideally, I want to input a dataset, specify the target(s), and let the algorithm create a custom pipeline for the given dataset.
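To make the goal in the edit concrete, here is a rough sketch of what such a builder might look like when it inspects a pandas DataFrame directly. Everything here is an assumption for illustration: the name build_auto_preprocessor, the rule that every non-numeric column is treated as categorical, and the particular imputation strategies.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_auto_preprocessor(df, target):
    """Inspect df and return a ColumnTransformer tailored to its features.

    target may be a single column name or a list of column names.
    """
    X = df.drop(columns=target)

    # Assumption: numeric dtypes are scaled, everything else is one-hot encoded.
    num_cols = X.select_dtypes(include="number").columns.tolist()
    cat_cols = X.select_dtypes(exclude="number").columns.tolist()

    transformers = []

    if num_cols:
        num_steps = []
        if X[num_cols].isna().any().any():  # impute only if values are missing
            num_steps.append(("imputer", SimpleImputer(strategy="median")))
        num_steps.append(("scaler", StandardScaler()))
        transformers.append(("numerical", Pipeline(num_steps), num_cols))

    if cat_cols:
        cat_steps = []
        if X[cat_cols].isna().any().any():
            cat_steps.append(("imputer", SimpleImputer(strategy="most_frequent")))
        cat_steps.append(("encoder", OneHotEncoder(handle_unknown="ignore")))
        transformers.append(("categorical", Pipeline(cat_steps), cat_cols))

    return ColumnTransformer(transformers)


df = pd.DataFrame(
    {
        "age": [20.0, None, 40.0],
        "city": ["a", "b", "a"],
        "y": [0, 1, 0],
    }
)
preprocessor = build_auto_preprocessor(df, "y")
features = preprocessor.fit_transform(df.drop(columns="y"))
```

A final estimator could then be chained after the returned transformer in an outer Pipeline; which checks to run (missing values, scale, cardinality, and so on) is left open here.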