How to handle categorical variables with Random Forest using Scikit Learn?

Question

Fred Chang

March 14, 2022, 21:09

One of my features is the department id, with values such as 1001, 1002, ..., 1218. The ids are nominal, not ordinal: they are just identifiers, and department 1002 is by no means higher than department 1001. I feed this feature to a random forest using Scikit Learn. How should I deal with it?

Some people say to use one-hot encoding. However:

Others say that one-hot encoding degrades random forest's performance.
Also, I have over 200 departments, so one-hot encoding would add about 200 more variables.

But if I just use the original values, 1001, 1002, etc., will random forest think that department 1002 is higher than department 1001?

Thanks.

Topic categorical-encoding one-hot-encoding random-forest

Category Data Science

Erwan · Accepted Answer · March 14, 2022, 21:09

One-hot encoding (OHE) is the standard method to represent a categorical feature.

In my opinion, 200 is not high dimensionality: it's very common to use OHE on text data with a much higher number of dimensions. Keep in mind that these are 200 boolean features, which are simpler to model than, for instance, a single numerical value with a non-standard distribution.

However, the question of dimensionality should be considered relative to the number of instances. In particular, it's likely that some of these ids are not frequent enough to provide a sufficiently representative sample. The corresponding instances should be discarded, or the id should be replaced with some generic value.

Multivac · Accepted Answer · March 14, 2022, 20:17

Is one-hot encoding an option?

It seems not. Because of the high cardinality of your feature, one-hot encoding might lead to curse-of-dimensionality problems if your sample size is small. Also, if you use mean decrease in impurity as a measure of feature importance, you have to account for its bias toward high-cardinality features.

So, to avoid having that many categories (about 200), you could group them. For example, you could check the distribution of this feature on the training set and group the departments whose share is below x% into an OTHERS category.
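A minimal pandas sketch of this grouping follows. The function name `group_rare_categories`, the `min_share` parameter, and the example values are hypothetical; the ids are cast to strings so that they cannot be read as numbers.

```python
import pandas as pd


def group_rare_categories(train, test, min_share=0.05, other_label="OTHERS"):
    """Replace categories whose share of the training rows is below
    min_share with other_label, in both the training and the test series."""
    train_str = train.astype(str)
    test_str = test.astype(str)

    # Shares are computed on the training set only.
    shares = train_str.value_counts(normalize=True)
    frequent = set(shares[shares >= min_share].index)

    # Anything not in the frequent set, including ids unseen in training,
    # falls into the OTHERS bucket.
    grouped_train = train_str.where(train_str.isin(frequent), other_label)
    grouped_test = test_str.where(test_str.isin(frequent), other_label)
    return grouped_train, grouped_test


train_ids = pd.Series([1001] * 6 + [1002] * 3 + [1003])
test_ids = pd.Series([1001, 1003, 1500])
grouped_train, grouped_test = group_rare_categories(train_ids, test_ids, min_share=0.2)
```

Here department 1003 covers only 10% of the training rows, so it is grouped with the unseen id 1500 under OTHERS.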

If I just use the original values, 1001, 1002, etc., will random forest think that department 1002 is higher than department 1001?

Yes. It will be treated as a continuous feature, so a meaningless order will be imposed on the departments.

What options do I have?

The simplest yet most efficient way to encode categorical features is target encoding. In short:

Target encoding is the process of replacing a categorical value with the mean of the target variable. Any non-categorical columns are automatically dropped by the target encoder model.

To avoid leakage, you could exclude the target value of observation $i$ itself when computing the mean used to encode observation $i$.
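Excluding each observation's own target value in this way is often called leave-one-out target encoding. The sketch below is one possible pandas implementation, not a reference one: the function name is hypothetical, and falling back to the global mean for categories that occur only once is an assumption.

```python
import pandas as pd


def leave_one_out_target_encode(categories, target):
    """Encode each row with the mean target of the other rows in its category."""
    grouped = target.groupby(categories)
    sums = grouped.transform("sum")
    counts = grouped.transform("count")

    # Remove the row's own target from its category's sum and count.
    loo_mean = (sums - target) / (counts - 1)

    # A category seen only once has no other rows; use the global mean
    # (an assumption, other fallbacks are possible).
    return loo_mean.where(counts > 1, target.mean())


departments = pd.Series(["a", "a", "a", "b", "b", "c"])
outcome = pd.Series([1, 0, 1, 0, 0, 1])
encoded_departments = leave_one_out_target_encode(departments, outcome)
```

This applies to the training rows only. Test rows have no known target, so they would be encoded with the plain per-category means computed on the training set.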

Another alternative is weight of evidence (WOE), a more sophisticated encoding on a logarithmic scale that is widely used in credit scoring.
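For a binary target, one common formulation of WOE for a category is the log of the ratio between that category's share of all positives and its share of all negatives. The sketch below assumes this formulation; sign conventions differ between sources, and the function name and `smoothing` parameter (added so that a category with no positives or no negatives does not produce a log of zero) are assumptions.

```python
import numpy as np
import pandas as pd


def weight_of_evidence(categories, target, smoothing=0.5):
    """Per-category WOE for a 0/1 target:
    ln(share of positives in the category / share of negatives in the category)."""
    frame = pd.DataFrame({"category": categories, "target": target})
    stats = frame.groupby("category")["target"].agg(["sum", "count"])

    # Smoothed counts of positives and negatives per category.
    positives = stats["sum"] + smoothing
    negatives = stats["count"] - stats["sum"] + smoothing

    return np.log((positives / positives.sum()) / (negatives / negatives.sum()))


departments = pd.Series(["a", "a", "a", "b", "b", "b"])
outcome = pd.Series([1, 1, 0, 0, 0, 1])
woe = weight_of_evidence(departments, outcome)

# Replace each category with its WOE value: still a single numeric column.
woe_feature = departments.map(woe)
```

As with target encoding, the WOE values should be estimated on the training data only and then mapped onto the test data.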

None of those encodings will increase the feature dimension.

Finally, if you are using Python, both of these encodings and many others are available in the category_encoders package.
