Custom Encoding for Categorical Features - sklearn
Just wanted to check whether there are any obvious flaws with a custom encoding idea I have for categorical features used with RandomForestClassifier (or any tree-based classifier).
As you all know, sklearn can only handle numerical-valued features, so categorical features must somehow be encoded as numbers. The most commonly recommended encoding techniques on the web are OneHotEncoding and OrdinalEncoding (and LabelEncoding, but many posts claim it could make the model flawed, though I wasn't able to find literature backing this). For the categorical feature set in the picture, OneHotEncoding appears to be the go-to choice, since the feature values have no inherent order. But because a few features have high cardinality, and OHE could produce a very sparse matrix, we thought of encoding them a bit differently, as follows:
Cat features → LabelEncoder → binarize the label-encoded values
For example, say we have a categorical feature called City:
| City   |
|--------|
| Paris  |
| NY     |
| London |
| Tokyo  |
- After label encoding:
| City |
|------|
| 0    |
| 1    |
| 2    |
| 3    |
- After binarization:
| City_I | City_II |
|--------|---------|
| 0      | 0       |
| 0      | 1       |
| 1      | 0       |
| 1      | 1       |
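The two steps above can be sketched as follows (`binary_encode` is a hypothetical helper, not part of sklearn; note that LabelEncoder assigns codes in alphabetical order, so the actual integer codes may differ from the table above):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

def binary_encode(values):
    """Label-encode a column, then spell each integer code out in
    binary using ceil(log2(n_categories)) indicator columns."""
    codes = LabelEncoder().fit_transform(values)
    n_bits = max(1, int(np.ceil(np.log2(len(set(values))))))
    # Extract bit b of each code, most significant bit first
    return np.array([[(c >> b) & 1 for b in range(n_bits - 1, -1, -1)]
                     for c in codes])

cities = ["Paris", "NY", "London", "Tokyo"]
encoded = binary_encode(cities)
print(encoded)  # 4 categories fit in 2 binary columns
```

This is essentially what the `category_encoders` package calls a BinaryEncoder, so you may not need to roll your own.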
Is there any problem with using this custom encoding with tree-based models? And can the model's performance metrics help show that this sort of encoding doesn't compromise the conceptual soundness of the model?
Topic scikit-learn python categorical-data machine-learning
Category Data Science