Dealing with categorical variables in Isolation Forest
Isolation Forest is widely used for outlier/anomaly detection when no labels are available. The idea is to split the data recursively on a randomly chosen feature at a randomly chosen value and count how many splits are needed to isolate an instance: instances that are isolated after only a few splits are likely to be outliers.
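To make the splitting idea concrete, here is a minimal one-dimensional sketch using only the standard library. It is not the scikit-learn implementation: there is no subsampling and no feature selection, and the names `isolation_depth` and `mean_depth`, the toy data, and the depth cap are all assumptions made for illustration.

```python
import random

def isolation_depth(point, data, rng, max_depth=50):
    """Count the random splits needed until `point` is alone in its partition (1-D)."""
    depth = 0
    current = list(data)
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:
            # All remaining values are identical, so no further split is possible.
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains `point`.
        if point < split:
            current = [x for x in current if x < split]
        else:
            current = [x for x in current if x >= split]
        depth += 1
    return depth

def mean_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth of `point` over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Toy data (hypothetical): 100 standard-normal inliers plus one obvious outlier.
data_rng = random.Random(1)
data = [data_rng.gauss(0, 1) for _ in range(100)] + [10.0]
inlier = sorted(data)[len(data) // 2]  # a point in the dense middle of the data

outlier_depth = mean_depth(10.0, data)
inlier_depth = mean_depth(inlier, data)
print(outlier_depth, inlier_depth)
```

In this sketch the outlier should need noticeably fewer splits on average than the point in the dense region, which is the signal the anomaly score is built on.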
I have categorical features and I am not sure how to deal with them:
- Label Encoding: Will misrepresent the data, since it imposes an arbitrary order (and arbitrary distances) on categories that have none.
- One Hot Encoding: Will give me more features, and since the source code first selects a column and then a split value, the algorithm will select the one-hot encoded columns with an unrealistically high probability.
- Target Encoding: Won't work, since we have no target.
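The one-hot concern can be put in numbers. The sketch below assumes every split draws its column uniformly at random from all encoded columns, which is a simplification; the helper `column_selection_share` and the feature names are hypothetical.

```python
def column_selection_share(n_numeric, cardinalities):
    """Share of uniform column draws that land on each original feature
    after one-hot encoding.

    n_numeric:      number of numeric columns (one column each)
    cardinalities:  {categorical feature name: number of distinct levels}
    """
    total_columns = n_numeric + sum(cardinalities.values())
    # A categorical feature with k levels occupies k of the encoded columns.
    shares = {name: k / total_columns for name, k in cardinalities.items()}
    # Every numeric feature occupies exactly one column.
    shares["each numeric feature"] = 1 / total_columns
    return shares

# Hypothetical dataset: 5 numeric columns and one categorical with 20 levels.
shares = column_selection_share(n_numeric=5, cardinalities={"city": 20})
print(shares)
```

Under that assumption the single categorical feature takes 20 of the 25 encoded columns, so it receives 80% of the column draws, while each numeric feature receives 4%.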
How should categorical features be encoded for Isolation Forest? Could we encode them in a space that suits the algorithm?