Dealing with categorical variables in Isolation Forest

Question

Carlos Mougan

June 9, 2020 09:47

Isolation Forest is widely used for outlier/anomaly detection when no labels are available. The idea is to split the data at random values of randomly chosen features and count how many splits are needed to isolate an instance: instances that are isolated after only a few splits are more likely to be outliers.
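To make the splitting idea concrete, here is a minimal sketch in plain Python. It is not the scikit-learn implementation: there is no subsampling and no tree structure, and the helper names (`isolation_depth`, `mean_depth`) and the toy data are made up for illustration. It only counts how many random splits it takes to isolate one point.

```python
import random

def isolation_depth(points, target, rng, max_tries=200):
    """Count random axis-aligned splits needed to isolate `target` (hypothetical helper)."""
    remaining = list(points)
    depth = 0
    for _ in range(max_tries):
        if len(remaining) <= 1:
            break
        col = rng.randrange(len(target))      # pick a random feature
        lo = min(p[col] for p in remaining)
        hi = max(p[col] for p in remaining)
        if lo == hi:
            continue                          # feature is constant here, try another
        split = rng.uniform(lo, hi)           # pick a random split value
        keep_left = target[col] < split
        # Keep only the side of the split that contains the target.
        remaining = [p for p in remaining if (p[col] < split) == keep_left]
        depth += 1
    return depth

rng = random.Random(0)
# Toy data (assumption): 200 Gaussian inliers plus one far-away point.
inliers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = inliers + [outlier]
central = min(inliers, key=lambda p: p[0] ** 2 + p[1] ** 2)

def mean_depth(target, trials=200):
    """Average isolation depth over repeated random splitting runs."""
    return sum(isolation_depth(data, target, rng) for _ in range(trials)) / trials

outlier_depth = mean_depth(outlier)
inlier_depth = mean_depth(central)
print(f"outlier: {outlier_depth:.1f} splits, central inlier: {inlier_depth:.1f} splits")
```

The far-away point should need noticeably fewer splits on average than the point near the centre of the cloud, which is the signal the anomaly score is built on.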

I have categorical features and I am not sure how to deal with them:

Label Encoding: Imposes an arbitrary order and arbitrary distances on the categories, so it misrepresents the data in Euclidean space.
One-Hot Encoding: Adds one column per category, and since the source code first selects a column and then a split value, the algorithm gets an unrealistically high probability of selecting one of the one-hot encoded columns.
Target Encoding: Won't work, since we have no target.
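The one-hot concern can be put into numbers with a small sketch. It assumes that each split picks one column uniformly at random from all available columns. The function name and the example counts (4 numeric features, one categorical feature with 10 levels) are hypothetical.

```python
from fractions import Fraction

def categorical_pick_probability(n_numeric, n_levels, one_hot):
    """Chance that a uniformly random column choice lands on the categorical feature."""
    # One-hot encoding turns one categorical column into n_levels binary columns.
    cat_columns = n_levels if one_hot else 1
    return Fraction(cat_columns, n_numeric + cat_columns)

# Assumed example: 4 numeric features and one categorical feature with 10 levels.
single = categorical_pick_probability(4, 10, one_hot=False)
expanded = categorical_pick_probability(4, 10, one_hot=True)
print(f"as a single column: {single}, after one-hot encoding: {expanded}")
```

Under that assumption the categorical feature is chosen in 1 out of 5 splits as a single column, but in 5 out of 7 splits once it is expanded, so high-cardinality categoricals would dominate the splits.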

How should categorical features be encoded for Isolation Forest? Is there a way to encode them in a space that suits the algorithm?

Topic isolation-forest unsupervised-learning decision-trees categorical-data machine-learning

Category Data Science
