Should I encode the categorical data before making a training validation split?

Question

blueglass

April 16, 2022, 06:10

I am looking at some examples on Kaggle and I'm not sure which approach is correct. If I split the data into training and validation sets and encode the categorical data only in the training part, sometimes some unique values are left out of the encoding, and I'm not sure whether that is correct.

Topic encoding

Category Data Science

Adam · Accepted Answer · April 16, 2022, 06:10

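Yes, encode the data before the split. The point of the split is to obtain two i.i.d. samples from the same data-generating process. An encoding that only relabels the categories, such as one-hot or ordinal encoding, simply represents the data in a different way, so applying it to the full dataset first does not change what either sample contains. It also means the training and validation parts share the same set of encoded categories, so no unique value is left behind.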
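As a rough illustration of the order of operations, here is a minimal sketch using only the Python standard library. The toy rows, the `color` column, and the helper names `one_hot_encode` and `train_validation_split` are all made up for this example and are not from any particular Kaggle notebook.

```python
import random

# Hypothetical toy data: one categorical feature and a binary label.
rows = [
    {"color": "red", "label": 1},
    {"color": "blue", "label": 0},
    {"color": "green", "label": 1},
    {"color": "red", "label": 0},
    {"color": "blue", "label": 1},
    {"color": "purple", "label": 0},  # rare value that appears only once
]


def one_hot_encode(rows, column):
    """Replace `column` with one 0/1 indicator per category seen in `rows`."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


def train_validation_split(rows, validation_fraction=0.33, seed=0):
    """Shuffle a copy of `rows` and return (train, validation)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Step 1: encode the full dataset, so every category gets a column.
encoded = one_hot_encode(rows, "color")

# Step 2: split the already-encoded rows.
train, validation = train_validation_split(encoded)
```

Because the encoding runs before the split, both parts end up with identical columns, including `color_purple`, regardless of which part the single `purple` row lands in.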
