How to combine preprocessor/estimator selection with hyperparameter tuning using sklearn pipelines?

I'm aware of how to use sklearn.pipeline.Pipeline() for simple and slightly more complicated use cases alike. I know how to set up pipelines for homogeneous as well as heterogeneous data, in the latter case making use of sklearn.compose.ColumnTransformer().

Yet, in practical ML one must oftentimes not only experiment with a large set of model hyperparameters, but also with a large set of potential preprocessor classes and different estimators/models.

My question is a dual one:

  1. What would be the preferred way to set up a pipeline where the selection of text vectorizers is treated as an additional hyperparameter for grid or randomized search?
  2. Additionally, what would be the preferred way to set up a pipeline where the selection of multiple models can also be treated as an additional hyperparameter? What about optimizing the model-specific hyperparameters in this case?

In the first case, a common use case is text vectorization: treating the choice between CountVectorizer() and TfidfVectorizer() as a hyperparameter to be optimized.

In the second case, a practical use case could be selecting between various algorithms or, in the case of multiclass classification, choosing between OneVsOneClassifier() and OneVsRestClassifier().

I understand that this might be exactly what AutoML solutions were developed for. I have heard of out-of-the-box AutoML solutions that do automatic model selection with hyperparameter tuning, but I have no experience with any of them, so I don't know whether they actually address the general topics I described in this post.

Topic pipelines hyperparameter-tuning python machine-learning

Category Data Science

Some approaches that stay within the scikit-learn ecosystem:

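  • When preprocessing involves data balancing and sampling strategies, consider imbalanced-learn components (e.g. RandomUnderSampler), which you can embed directly in your pipelines (using imbalanced-learn's own Pipeline class, since samplers are not plain transformers). This lets you tune their parameters along with everything else.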

  • Rely on the "passthrough" option in the grid search when deciding whether a given preprocessing step is needed at all: setting a pipeline step to "passthrough" skips it. A single grid cannot, however, express that one step is a required precondition of another (e.g. StandardScaler + MLPClassifier).

  • Consider scikit-optimize's BayesSearchCV, which picks where to search next in the parameter space based on previous runs, rather than following a fixed grid or sampling at random as GridSearchCV and RandomizedSearchCV do. For searches over many parameters this may converge faster.
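
As a rough illustration of the "passthrough" idea applied to the question, here is a minimal sketch in which whole pipeline steps are treated as hyperparameters. It uses only scikit-learn. The toy corpus, the step names ("vect", "scale", "clf") and the candidate values are assumptions made up for the example, not recommendations. Passing a list of grids is one way to keep each model's own hyperparameters separate and to pin a step that a particular model needs.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 6 "pets" documents and 6 "finance" documents.
docs = [
    "the cat sat on the mat", "dogs chase cats in the yard",
    "my cat likes fish", "the dog barked at the cat",
    "kittens and puppies play", "a puppy chewed my shoe",
    "stocks fell sharply today", "the market rallied on earnings",
    "investors bought bonds", "the bank raised interest rates",
    "shares of the company rose", "traders sold stocks and bonds",
]
labels = [0] * 6 + [1] * 6

# The estimators given here are placeholders; the grids below replace them.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("scale", MaxAbsScaler()),
    ("clf", LogisticRegression()),
])

param_grid = [
    {   # Grid 1: logistic regression, scaling is optional.
        "vect": [CountVectorizer(), TfidfVectorizer()],  # swap the whole step
        "vect__ngram_range": [(1, 1), (1, 2)],           # applies to either vectorizer
        "scale": [MaxAbsScaler(), "passthrough"],        # "passthrough" skips the step
        "clf": [LogisticRegression(max_iter=1000)],
        "clf__C": [0.1, 1.0],
    },
    {   # Grid 2: linear SVM, scaling is always kept.
        "vect": [CountVectorizer(), TfidfVectorizer()],
        "scale": [MaxAbsScaler()],
        "clf": [LinearSVC()],
        "clf__C": [0.1, 1.0],
    },
]

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(docs, labels)
print(search.best_params_)
```

The same list-of-grids structure should also work with RandomizedSearchCV, which accepts a list of parameter distributions.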

⚠️ In practice I find it's often extremely expensive, in both time and compute, to run full end-to-end pipelines that try to learn everything (parameters, metrics, model types, normalization stages, features, architectures, etc.), so tune what matters most.
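
To get a feel for how quickly the cost grows, one can count the candidates before running anything. The sketch below uses scikit-learn's ParameterGrid; the parameter names and values are hypothetical and only there to show the arithmetic.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid for a vectorizer + classifier pipeline.
grid = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],  # 3 options
    "vect__min_df": [1, 2, 5],                      # 3 options
    "clf__C": [0.01, 0.1, 1.0, 10.0],               # 4 options
    "clf__penalty": ["l1", "l2"],                   # 2 options
}
cv_folds = 5

n_candidates = len(ParameterGrid(grid))  # 3 * 3 * 4 * 2
n_fits = n_candidates * cv_folds         # one fit per candidate per fold
print(n_candidates, n_fits)              # prints: 72 360
```

Every extra option multiplies the total, so adding a choice between two vectorizers and two scalers to this grid would already quadruple the number of fits.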

