How do I handle categorical variables in a random forest with scikit-learn?
One of my features is a department id, with values like 1001, 1002, ..., 1218. The ids are nominal, not ordinal: they are just labels, and department 1002 is in no sense higher than department 1001. I am feeding this feature to a random forest in scikit-learn. How should I encode it?
Some people recommend one-hot encoding. However:
- Others say that one-hot encoding degrades random forest performance.
- I have over 200 departments, so one-hot encoding would add over 200 new columns.
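
For concreteness, here is a minimal sketch of what the one-hot route would look like. The ids and the sample of 1000 rows are made up for illustration, and the variable names are placeholders. It uses scikit-learn's `OneHotEncoder`, which returns a sparse matrix by default, so the extra columns cost little memory even though there are many of them.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 218 distinct department ids (1001..1218), 1000 rows in total.
rng = np.random.default_rng(0)
all_ids = np.arange(1001, 1219)
dept = np.concatenate([all_ids, rng.choice(all_ids, size=782)]).reshape(-1, 1)

# One column per distinct id; ids unseen at fit time are encoded as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
X_onehot = encoder.fit_transform(dept)  # SciPy sparse matrix by default

n_rows, n_cols = X_onehot.shape
print(n_rows, n_cols)  # one row per sample, one column per distinct department
```

Each row of `X_onehot` contains exactly one nonzero entry, so the matrix stays small in memory, but the forest still has to choose among one binary column per department when it splits.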
But if I just use the original values (1001, 1002, etc.), will the random forest treat department 1002 as higher than department 1001?
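
One way to check this concern is to fit a single shallow tree on raw ids and look at the split it learns. The sketch below uses six made-up ids and made-up labels. scikit-learn's trees split numeric features with a threshold test of the form `feature <= t`, so the raw ids are compared by magnitude.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: raw department ids as the only feature.
dept = np.array([1001, 1002, 1003, 1004, 1005, 1006]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])

# A depth-1 tree makes exactly one split, which is easy to inspect.
tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(dept, y)

# The root node stores a numeric threshold: samples go left if dept <= threshold.
threshold = tree.tree_.threshold[0]
print(threshold)  # a cut point between 1003 and 1004
```

Because every split is a cut along the number line, isolating a single department (say 1100) from its numeric neighbours takes two splits on raw ids, whereas a one-hot column can do it in one.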
Thanks.
Topic categorical-encoding one-hot-encoding random-forest
Category Data Science