Custom Encoding for Categorical Features - sklearn

Just wanted to check whether there are any obvious flaws in a custom encoding idea I have for categorical features used with RandomForestClassifier (or any tree-based classifier).

As you all know, sklearn can only handle numerical features, so categorical features must somehow be encoded as numerical values. The most commonly recommended encoding techniques on the web are OneHotEncoding and OrdinalEncoding (and LabelEncoding, though many posts say it could make the model flawed, and I wasn't able to find literature that backs this). For the categorical feature set in the picture, OneHotEncoding appears to be the go-to choice, as the feature values have no inherent order. But since a few features have high cardinality and OHE could result in a very sparse matrix, we thought of encoding them a bit differently, as follows:

Cat features → LabelEncoder → binarize the label-encoded values

For example, say we have a categorical feature called City:

| City   |
|--------|
| Paris  |
| NY     |
| London |
| Tokyo  |

- After label encoding:

| City |
|------|
| 0    |
| 1    |
| 2    |
| 3    |

- After binarization:

| City_I | City_II |
|--------|---------|
| 0      | 0       |
| 0      | 1       |
| 1      | 0       |
| 1      | 1       |
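For reference, the transformation above can be sketched as follows (a minimal illustration; the helper name `binarize_labels` is my own, not a sklearn API):

```python
import numpy as np

def binarize_labels(codes, n_bits):
    """Expand integer label codes into their binary-digit columns.

    codes: 1-D sequence of label-encoded integers (0 .. n_categories - 1).
    n_bits: number of binary columns, i.e. ceil(log2(n_categories)).
    Returns an array of shape (len(codes), n_bits), most significant bit first.
    """
    codes = np.asarray(codes)
    # Shift each code right by (n_bits-1 .. 0) positions and keep the lowest bit.
    shifts = np.arange(n_bits - 1, -1, -1)
    return (codes[:, None] >> shifts) & 1

# Paris=0, NY=1, London=2, Tokyo=3 after label encoding
print(binarize_labels([0, 1, 2, 3], 2))
# [[0 0]
#  [0 1]
#  [1 0]
#  [1 1]]
```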

Is there any problem using this custom encoding with tree-based models? Will the performance metrics related to the model help prove that this sort of encoding doesn't tamper with the conceptual soundness of the model?

Topic scikit-learn python categorical-data machine-learning

Category Data Science


I'm afraid your solution makes no sense. In your example, 'Paris' and 'NY' are equal in City_I, 'NY' and 'Tokyo' are equal in City_II, etc. What would a split on these columns mean?
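To make the objection concrete: a split on a single bit column partitions the cities into two groups fixed entirely by the arbitrary label-encoding order, not by anything meaningful in the data (a small illustrative snippet using the example's encoding):

```python
# Label codes from the question: Paris=0, NY=1, London=2, Tokyo=3
codes = {"Paris": 0, "NY": 1, "London": 2, "Tokyo": 3}

# A tree split "City_I == 1" groups cities by their high bit,
# and "City_II == 1" by their low bit -- both groupings are arbitrary.
city_i = {c for c, k in codes.items() if (k >> 1) & 1}   # high bit set
city_ii = {c for c, k in codes.items() if k & 1}         # low bit set

print(city_i)   # {'London', 'Tokyo'} -- so Paris and NY are indistinguishable here
print(city_ii)  # {'NY', 'Tokyo'}
```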

Try the category_encoders package instead. Also, sklearn's HistGradientBoostingClassifier has native support for categorical features.


There is nothing inherently wrong with any of these encodings. The best way to find out which works best is to compare them on a validation set. Label encoding is usually not preferred for sklearn's tree-based models because the model treats the code as a numerical value and might learn a split such as "if x > 5 go to the left subtree, else go to the right subtree", which does not make sense for unordered categories. One-hot encoding solves this issue but uses a lot of memory. Another option is to use a model such as XGBoost, which allows you to pass integers as categorical variables.
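The validation-set comparison suggested above might look like this (a hedged sketch on an invented toy dataset, where the target is fully determined by the category, so both encodings should do well; real data will separate them more):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.RandomState(0)
cities = rng.choice(["Paris", "NY", "London", "Tokyo"], size=(200, 1))
y = (cities[:, 0] == "Paris").astype(int)  # toy target tied to the category

scores = {}
for name, enc in [("one-hot", OneHotEncoder()), ("ordinal", OrdinalEncoder())]:
    X = enc.fit_transform(cities)  # one-hot yields a sparse matrix; RF accepts it
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    scores[name] = model.score(X_va, y_va)  # held-out accuracy

print(scores)
```

Whichever encoding scores better on held-out data is the one to keep; the comparison itself is the point, not the toy numbers.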
