How to handle categorical feature engineering in ML production?

I have a classification dataset with a lot of categorical columns. I one-hot encoded them (i.e. created dummy variables) for training. How should this be handled on the production side of ML? Future datasets can drift and introduce new categories that were not among those seen when the model was trained.

What I did was this: after one-hot encoding all the features, I saved the resulting list of encoded columns to a pickle file. During deployment I load the pickle file, match the production set's features against it, and remove the extras.
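A minimal sketch of that approach, assuming pandas and `pd.get_dummies` for the encoding. The toy data, the column names, and the `train_columns.pkl` file name are made up for illustration. It uses `reindex`, which drops unseen categories but also adds any training column that is absent in production (filled with 0), which goes slightly beyond "remove the extras":

```python
import os
import pickle
import tempfile

import pandas as pd

# --- Training side (toy data, hypothetical column names) ---
train = pd.DataFrame({"color": ["red", "blue", "green"], "size": ["S", "M", "S"]})
train_encoded = pd.get_dummies(train, dtype=int)

# Persist the encoded column names next to the model artifact.
columns_path = os.path.join(tempfile.mkdtemp(), "train_columns.pkl")
with open(columns_path, "wb") as f:
    pickle.dump(list(train_encoded.columns), f)

# --- Production side ---
with open(columns_path, "rb") as f:
    train_columns = pickle.load(f)

# "purple" was never seen in training; "blue", "green" and "S" are absent here.
prod = pd.DataFrame({"color": ["red", "purple"], "size": ["M", "M"]})
prod_encoded = pd.get_dummies(prod, dtype=int)

# Keep exactly the training columns, in training order:
# unseen categories are dropped, missing ones are added as all-zero columns.
prod_aligned = prod_encoded.reindex(columns=train_columns, fill_value=0)
```

A row whose category was never seen in training ends up with zeros in every dummy column for that feature, so the model treats it as "none of the known categories".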

How is this done in production, the correct way?

Topic deployment machine-learning

Category Data Science


This seems reasonable for your problem. But sometimes you also face cases where some categories are missing from the production data, which can lead to errors in production. How I handled it in production was a little different:

  1. Pickle a single row of the one-hot encoded pandas training dataframe used for the production model, so that the same columns are available at prediction time.

  2. Once you have one-hot encoded the production dataframe, append it to the pickled dataframe so that it has the same columns as the training df. Then remove the first row, which came from training. This ensures that no column used during training is missing, and extra columns are ignored as long as you keep only the training columns.
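The two steps above can be sketched roughly as follows, assuming pandas. The toy data and the `template_row.pkl` file name are hypothetical. Note that concatenation alone produces the union of columns with NaN for gaps, so this sketch also selects the training columns and fills NaN with 0, which is an assumption about how the extras are "ignored":

```python
import os
import tempfile

import pandas as pd

# --- Training side (toy data, hypothetical column names) ---
train = pd.DataFrame({"color": ["red", "blue", "green"], "size": ["S", "M", "S"]})
train_encoded = pd.get_dummies(train, dtype=int)

# Step 1: pickle a single encoded training row as a column template.
template_path = os.path.join(tempfile.mkdtemp(), "template_row.pkl")
train_encoded.iloc[[0]].to_pickle(template_path)

# --- Production side ---
template = pd.read_pickle(template_path)

prod = pd.DataFrame({"color": ["red", "purple"], "size": ["M", "M"]})
prod_encoded = pd.get_dummies(prod, dtype=int)

# Step 2: append production rows to the template row. The result has the
# union of columns, with NaN wherever a frame lacked a column.
combined = pd.concat([template, prod_encoded], ignore_index=True, sort=False)

aligned = (
    combined.iloc[1:]            # drop the template row that came from training
    [list(template.columns)]     # keep only training columns, in training order
    .fillna(0)                   # training columns absent in production become 0
    .astype(int)
    .reset_index(drop=True)
)
```

The end result is the same as reindexing the production frame against a saved column list; the template row simply carries the column names (and dtypes) inside a dataframe instead of a plain list.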
