Custom Encoding for Categorical Features - sklearn

Just wanted to check whether there are any obvious flaws in a custom encoding idea I have for categorical features used with RandomForestClassifier (or any tree-based classifier).

As you all know, sklearn can only handle numerical features, so categorical features must somehow be encoded as numerical values. The most commonly recommended encoding techniques on the web are OneHotEncoding and OrdinalEncoding (and LabelEncoding, though many posts say it could make the model flawed, and I wasn't able to find literature that backs this). For the categorical feature set in the picture, OneHotEncoding appears to be the go-to choice, as the feature values have no inherent order. But since a few features have high cardinality and OHE could result in a very sparse matrix, we thought of encoding them a bit differently, as follows:

Cat features → LabelEncoder → binarize the label-encoded values

For example, say we have a categorical feature called City:

| City   |
|--------|
| Paris  |
| NY     |
| London |
| Tokyo  |

- After label encoding:

| City |
|------|
| 0    |
| 1    |
| 2    |
| 3    |

- After binarization:

| City_I | City_II |
|--------|---------|
| 0      | 0       |
| 0      | 1       |
| 1      | 0       |
| 1      | 1       |
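For reference, the transformation above can be sketched as follows (a minimal illustration; the helper name `binarize_labels` is my own, not a sklearn API):

```python
import numpy as np

def binarize_labels(codes, n_bits):
    """Expand integer label codes into their binary-digit columns.

    codes: 1-D sequence of label-encoded integers (0 .. n_categories - 1).
    n_bits: number of binary columns, i.e. ceil(log2(n_categories)).
    Returns an array of shape (len(codes), n_bits), most significant bit first.
    """
    codes = np.asarray(codes)
    # Shift each code right by (n_bits-1 .. 0) positions and keep the lowest bit.
    shifts = np.arange(n_bits - 1, -1, -1)
    return (codes[:, None] >> shifts) & 1

# Paris=0, NY=1, London=2, Tokyo=3 after label encoding
print(binarize_labels([0, 1, 2, 3], 2))
# [[0 0]
#  [0 1]
#  [1 0]
#  [1 1]]
```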

Is there any problem using this custom encoding with tree-based models? Will the performance metrics related to the model help prove that this sort of encoding doesn't tamper with the conceptual soundness of the model?

Topic scikit-learn python categorical-data machine-learning

Category Data Science


I'm afraid your solution makes no sense. In your example, 'Paris' and 'NY' are equal in City_I, 'NY' and 'Tokyo' are equal in City_II, etc. What would a split on these columns mean?
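To make the objection concrete: a split on a single bit column partitions the cities into two groups fixed entirely by the arbitrary label-encoding order, not by anything meaningful in the data (a small illustrative snippet using the example's encoding):

```python
# Label codes from the question: Paris=0, NY=1, London=2, Tokyo=3
codes = {"Paris": 0, "NY": 1, "London": 2, "Tokyo": 3}

# A tree split "City_I == 1" groups cities by their high bit,
# and "City_II == 1" by their low bit -- both groupings are arbitrary.
city_i = {c for c, k in codes.items() if (k >> 1) & 1}   # high bit set
city_ii = {c for c, k in codes.items() if k & 1}         # low bit set

print(city_i)   # {'London', 'Tokyo'} -- so Paris and NY are indistinguishable here
print(city_ii)  # {'NY', 'Tokyo'}
```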

Try the category_encoders package instead. Also, sklearn's HistGradientBoostingClassifier has native support for categorical features.


There is nothing inherently wrong with any of these encodings. The best way to find out which works best is to compare them on a validation set. Label encoding is usually not preferred for sklearn's tree-based models because the model treats the code as a numerical value and might learn a split such as "if x > 5 go to the left subtree, else go to the right subtree", which does not make sense for unordered categories. One-hot encoding solves this issue but uses a lot of memory. Another option is to use a model such as XGBoost, which allows you to pass integers as categorical variables.
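The validation-set comparison suggested above might look like this (a hedged sketch on an invented toy dataset, where the target is fully determined by the category, so both encodings should do well; real data will separate them more):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.RandomState(0)
cities = rng.choice(["Paris", "NY", "London", "Tokyo"], size=(200, 1))
y = (cities[:, 0] == "Paris").astype(int)  # toy target tied to the category

scores = {}
for name, enc in [("one-hot", OneHotEncoder()), ("ordinal", OrdinalEncoder())]:
    X = enc.fit_transform(cities)  # one-hot yields a sparse matrix; RF accepts it
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    scores[name] = model.score(X_va, y_va)  # held-out accuracy

print(scores)
```

Whichever encoding scores better on held-out data is the one to keep; the comparison itself is the point, not the toy numbers.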
