Reduce multiclass classification targets to binary classification targets in scikit-learn

Question

Brian Spiering

May 18, 2022 23:00

I would like to reduce multiclass classification targets to binary classification targets. Ideally, this mapping would happen within scikit-learn so the same transformation applies during both training and prediction.

I looked at the "Transforming the prediction target (y)" documentation but did not see anything that would work. Ideally, it would be a classifier version of TransformedTargetRegressor.

Something like this mapping:

targets_multi  = {'A', 'B', 'C', 'D'}
targets_binary = {0: {'A', 'B'},
                  1: {'C', 'D'}}

Topic multiclass-classification scikit-learn binary classification

Category Data Science

Ben Reiniger · Accepted Answer · August 13, 2021 23:10

Of the three stated purposes of pipelines, you'd get the "convenience and encapsulation" one, but not the others:

Joint parameter selection: you don't have any parameters for this transformation.
Safety (from data leakage): your transformation is a fixed mapping that does not depend on the data it is applied to, so there is no data leakage in applying it to the entire dataset up front.

This feels like part of the definition of the targets, and is best handled as part of data retrieval.
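Applying the mapping up front can be sketched as follows. This is a minimal illustration in plain Python, reusing the grouping from the question; the helper name to_binary and the example targets are made up for illustration.

```python
# Grouping taken from the question: binary label -> set of multiclass labels.
targets_binary = {0: {"A", "B"}, 1: {"C", "D"}}

# Invert the grouping into a flat lookup: multiclass label -> binary label.
multi_to_binary = {
    label: binary
    for binary, labels in targets_binary.items()
    for label in labels
}


def to_binary(y_multi):
    """Map an iterable of multiclass labels to binary labels.

    Raises KeyError for a label that is missing from the grouping.
    (Hypothetical helper, not part of scikit-learn.)
    """
    return [multi_to_binary[label] for label in y_multi]


y_multi = ["A", "C", "B", "D", "A"]  # made-up example targets
y_binary = to_binary(y_multi)
```

Because the lookup is fixed and learns nothing from the data, it can be applied once to the full target column before any train/test split.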

scikit-learn expects transform methods to take only X as input, not y. For the most part, you can work around that by overriding fit_transform from TransformerMixin. However, nothing downstream expects two return values (the transformed X and y), so this won't work.

You can make a little more headway with the imbalanced-learn package, which provides its own Pipeline with more flexible transformation syntax. The purpose there is to implement resamplers, and that raises a major issue: resamplers are not applied at prediction time.
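If keeping the mapping inside the estimator is still preferred, one alternative to a transformer is a small meta-estimator that wraps a classifier, loosely in the spirit of TransformedTargetRegressor. The sketch below is only an assumption about how that could look: GroupedTargetClassifier is a hypothetical class, not part of scikit-learn or imbalanced-learn, and the toy data is made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression


class GroupedTargetClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical wrapper: groups multiclass targets before fitting."""

    def __init__(self, classifier, mapping):
        self.classifier = classifier
        self.mapping = mapping  # dict: multiclass label -> binary label

    def _group(self, y):
        # Apply the fixed lookup to every target label.
        return np.array([self.mapping[label] for label in y])

    def fit(self, X, y):
        # Fit a fresh copy of the wrapped classifier on the grouped targets.
        self.classifier_ = clone(self.classifier).fit(X, self._group(y))
        self.classes_ = self.classifier_.classes_
        return self

    def predict(self, X):
        # Predictions are already binary; no target transformation is needed.
        return self.classifier_.predict(X)

    def score(self, X, y):
        # Accept multiclass y and score against its grouped version.
        return self.classifier_.score(X, self._group(y))


# Made-up toy data: one feature, four original classes.
X = np.array([[0.0], [0.2], [0.1], [0.3], [2.0], [2.2], [2.1], [2.3]])
y = np.array(["A", "B", "A", "B", "C", "D", "C", "D"])
mapping = {"A": 0, "B": 0, "C": 1, "D": 1}

model = GroupedTargetClassifier(LogisticRegression(), mapping).fit(X, y)
pred = model.predict(np.array([[0.05], [2.15]]))
```

The design choice here is to transform y only where a y is available (fit and score); predict returns the binary labels directly, so nothing has to be applied at prediction time.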
