Getting dummies for both train and test data
Should I apply pd.get_dummies() to both the train and test data? And would that not result in data leakage?
Topic encoding
Category Data Science
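If you call pandas.get_dummies on the train and test data separately, you will likely run into problems: the test dataset may contain category values that are not in the training dataset (or be missing some that are), so the two encoded frames end up with different columns. It is therefore better to use something like sklearn.preprocessing.OneHotEncoder, which saves state: fit it on the training dataset, and it then encodes the test dataset based on the values seen during training. Because the encoder learns its categories from the training data only, nothing from the test set enters the encoding, which also addresses the leakage concern.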