Reduce multiclass classification targets to binary classification targets in scikit-learn

I would like to reduce multiclass classification targets to binary classification targets. Ideally, this mapping would happen within scikit-learn so the same transformation applies during both training and prediction.

I looked at the scikit-learn documentation on transforming the prediction target (y) but did not find anything that fits. Ideally, there would be a classifier counterpart to TransformedTargetRegressor.

Something like this mapping:

targets_multi  = {'A', 'B', 'C', 'D'}
targets_binary = {0: {'A', 'B'},
                  1: {'C', 'D'}}



Of the three stated purposes of pipelines, you'd get the "convenience and encapsulation" one, but not the others:

  • Joint parameter selection: you don't have any parameters for this transformation.
  • Safety (from data leakage): your mapping is fixed in advance and learns nothing from the data, so applying it to the entire dataset up front cannot leak information.

This feels like part of the definition of the targets, and is best treated as part of data retrieval.
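As a minimal sketch of doing the mapping up front, the grouping from the question can be inverted into a flat lookup and applied to the labels before any model sees them. The names label_to_binary, to_binary and y_multi are illustrative only, and the example labels are made up.

```python
# Grouping taken from the question: A/B -> 0, C/D -> 1.
targets_binary = {0: {'A', 'B'},
                  1: {'C', 'D'}}

# Invert the grouping into a flat "original label -> binary class" lookup.
label_to_binary = {label: binary
                   for binary, labels in targets_binary.items()
                   for label in labels}


def to_binary(y):
    """Map multiclass labels to binary ones; raises KeyError on an unknown label."""
    return [label_to_binary[label] for label in y]


y_multi = ['A', 'C', 'B', 'D', 'A']  # made-up example labels
y_binary = to_binary(y_multi)        # pass this to fit() instead of y_multi
```

Because the lookup is a plain dictionary, the same function can be reused wherever true labels are loaded, for example when scoring on a held-out set.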


scikit-learn expects transform methods to take only X as input, not y. You can partly work around that by overriding fit_transform from TransformerMixin, which does receive y. However, nothing downstream expects to get two return values (the transformed X and y), so this won't work.

You can make a little more headway with the imbalanced-learn package, which provides its own Pipeline with more flexible transformation syntax. Its purpose is to implement resamplers, and that raises a major issue: resamplers are not applied at prediction time.
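If the mapping has to travel with the model, one alternative to a transformer is a thin wrapper around the classifier itself, loosely in the spirit of TransformedTargetRegressor. The sketch below is hypothetical and is not a scikit-learn class: BinaryTargetClassifier and its arguments are made-up names, and it is duck-typed, relying only on the wrapped object having fit and predict.

```python
class BinaryTargetClassifier:
    """Hypothetical wrapper that maps multiclass labels to binary before fitting."""

    def __init__(self, classifier, label_to_binary):
        self.classifier = classifier            # any object with fit/predict
        self.label_to_binary = label_to_binary  # e.g. {'A': 0, 'B': 0, 'C': 1, 'D': 1}

    def _map(self, y):
        # Raises KeyError if a label is missing from the mapping.
        return [self.label_to_binary[label] for label in y]

    def fit(self, X, y):
        # Train the wrapped classifier on the binary labels.
        self.classifier.fit(X, self._map(y))
        return self

    def predict(self, X):
        # The wrapped classifier was trained on binary labels,
        # so its predictions are already binary.
        return self.classifier.predict(X)

    def score(self, X, y):
        # Accuracy against the mapped true labels.
        y_true = self._map(y)
        y_pred = self.predict(X)
        hits = sum(int(p == t) for p, t in zip(y_pred, y_true))
        return hits / len(y_true)
```

It would be used as BinaryTargetClassifier(some_classifier, mapping).fit(X, y). Note that this sketch fits the classifier it is given in place; to work with scikit-learn utilities such as clone or GridSearchCV it would need to subclass BaseEstimator and fit a cloned copy instead.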
