Getting dummies for both train and test data

Should I apply pd.get_dummies() to both the train and test data? And would that not result in data leakage?


If you call pandas.get_dummies on the train and test data separately, the two results can end up with different columns: a category that appears in only one of the datasets gets a column only there, so the encoded test set may not line up with what the model was trained on. It is therefore better to use something like sklearn.preprocessing.OneHotEncoder, which stores the categories it saw when fitted on the training data and encodes the test data against that same set. Because the encoder is fitted on the training data only, nothing about the test set feeds into the encoding.
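As a rough illustration of the difference, here is a minimal sketch. The "color" column and its values are made up for the example, and handle_unknown="ignore" is just one possible way to deal with categories that only show up in the test data (they are encoded as all zeros); whether that is the right choice depends on your model.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "purple" appears only in the test set.
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
test = pd.DataFrame({"color": ["green", "purple"]})

# Calling get_dummies separately yields different column sets.
train_dummies = pd.get_dummies(train)  # color_blue, color_green, color_red
test_dummies = pd.get_dummies(test)    # color_green, color_purple

# Fit the encoder on the training data only, then reuse it for the test data.
# handle_unknown="ignore" encodes unseen categories as a row of zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["color"]])

train_encoded = encoder.transform(train[["color"]]).toarray()
test_encoded = encoder.transform(test[["color"]]).toarray()

# Both arrays have one column per category seen during fit.
print(encoder.categories_[0])
print(test_encoded)
```

Both encoded arrays have the same three columns, so a model trained on train_encoded can be applied to test_encoded directly. In practice the encoder is often placed inside a sklearn Pipeline so that it is fitted only on the training folds during cross-validation.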
