sklearn serialize label encoder for multiple categorical columns

I have a model with several categorical features that need to be converted to a numeric format. I am using a combination of LabelEncoder and OneHotEncoder to do this. In production, I need to apply the same encoding to new incoming data before the model can use it. I have saved the model and the encoders to disk using pickle. The problem is that a LabelEncoder keeps only the last set of classes (those of the last feature it was fitted on), so it cannot be used to encode all the categorical features in the new data. To work around this, I am saving a separate LabelEncoder for each categorical feature, but this does not seem to scale well, especially with a large number of categorical features.

What is the common practice for this situation? Is it possible to serialize and save just one encoder for all the categorical features to be used in production?

Topic encoder categorical-encoding labels scikit-learn categorical-data

Category Data Science


LabelEncoder is meant for the labels (the target, or dependent variable), not for the features. OrdinalEncoder is the equivalent for features: it takes a 2D array rather than the 1D array LabelEncoder requires, so a single transformer can handle all your categorical columns. (If you also have continuous columns, you can use a ColumnTransformer to select just the categorical ones.)
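As a rough sketch of that idea, the snippet below fits one OrdinalEncoder on two categorical columns, pickles it, and applies the restored copy to new rows. The data and variable names are made up for illustration, and the pickle is kept in memory here; in practice you would write it to a file.

```python
import pickle

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy training data (made up): two categorical feature columns.
X_train = np.array(
    [
        ["red", "small"],
        ["green", "large"],
        ["blue", "medium"],
        ["red", "large"],
    ],
    dtype=object,
)

# A single encoder learns a separate category list for each column.
encoder = OrdinalEncoder()
encoder.fit(X_train)

# One serialized object covers every categorical column.
blob = pickle.dumps(encoder)
restored = pickle.loads(blob)

# New rows are encoded with the categories learned at fit time.
X_new = np.array([["blue", "small"], ["green", "medium"]], dtype=object)
encoded = restored.transform(X_new)
print(restored.categories_)
print(encoded)
```

The per-column category lists live in `categories_`, so there is no need to keep one encoder per feature.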

However, you should no longer put OrdinalEncoder in front of OneHotEncoder: for some time now (since scikit-learn 0.20), OneHotEncoder has worked directly on string categorical columns.
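A minimal sketch of that setup, assuming a pandas DataFrame with two string columns and one numeric column (the column names and values are invented for the example): OneHotEncoder is applied straight to the string columns inside a ColumnTransformer, and the whole transformer is pickled as one object. `handle_unknown="ignore"` is an optional choice here so that categories unseen during fitting encode as all zeros instead of raising an error.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy training frame (made up): two string columns and one numeric column.
df_train = pd.DataFrame(
    {
        "color": ["red", "green", "blue", "red"],
        "size": ["small", "large", "medium", "large"],
        "price": [1.0, 2.5, 3.0, 4.5],
    }
)

# OneHotEncoder works on the string columns directly, with no ordinal step.
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color", "size"]),
        ("num", "passthrough", ["price"]),
    ],
    sparse_threshold=0,  # always return a dense array, for readability
)
preprocessor.fit(df_train)

# The fitted transformer is a single picklable object.
restored_preprocessor = pickle.loads(pickle.dumps(preprocessor))

df_new = pd.DataFrame(
    {
        "color": ["green", "purple"],  # "purple" was not seen during fit
        "size": ["small", "large"],
        "price": [2.0, 3.5],
    }
)
encoded_new = restored_preprocessor.transform(df_new)
print(encoded_new)
```

The same transformer could be placed in a Pipeline ahead of the model, in which case pickling the pipeline would save the encoding and the model together.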
