Dealing with categorical variables in Isolation Forest

Isolation Forest is widely used for outlier/anomaly detection when no labels are available. The idea is to split the data on a randomly chosen feature at a randomly chosen value, repeat, and count how many splits it takes to isolate an instance. Instances that are isolated after only a few splits are more likely to be outliers.
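To make the splitting idea concrete, here is a minimal pure-Python sketch of the path-length intuition. It is not the reference implementation: the helper names (`isolation_depth`, `average_depth`), the toy data, and the omission of subsampling and score normalization are all assumptions made for illustration.

```python
import random

def isolation_depth(point, data, rng, max_depth=50):
    """Count random splits until `point` is alone in its partition.

    `point` must be one of the rows in `data`.
    """
    current = list(data)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        # Only features that still vary in this partition can be split.
        splittable = [
            f for f in range(len(point))
            if min(r[f] for r in current) < max(r[f] for r in current)
        ]
        if not splittable:
            break
        f = rng.choice(splittable)                      # random feature
        lo = min(r[f] for r in current)
        hi = max(r[f] for r in current)
        threshold = rng.uniform(lo, hi)                 # random split value
        # Keep only the side of the split that contains `point`.
        if point[f] < threshold:
            current = [r for r in current if r[f] < threshold]
        else:
            current = [r for r in current if r[f] >= threshold]
        depth += 1
    return depth

def average_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth over many independent random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Toy data (hypothetical): a tight 2-D cluster plus one far-away point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
outlier = (8.0, 8.0)
data = cluster + [outlier]

# The cluster point closest to the origin serves as a typical inlier.
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)

outlier_depth = average_depth(outlier, data)
inlier_depth = average_depth(inlier, data)
print("average depth, outlier:", outlier_depth)
print("average depth, inlier: ", inlier_depth)
```

The far-away point should need noticeably fewer splits on average than the point in the middle of the cluster, which is the signal the algorithm turns into an anomaly score.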

I have categorical features and I am not sure how to deal with them:

  • Label Encoding: Will misrepresent the data, because it imposes an arbitrary order (and arbitrary distances) on categories that have none.
  • One Hot Encoding: Will give me more features, one per category. Since the source code first selects the column and then the split value, the one-hot encoded columns will be selected with an unrealistically high probability compared with the other features.
  • Target Encoding: Won't work, since we have no target.
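The one-hot concern can be quantified with a small sketch. It assumes, as described above, that the split column is drawn uniformly at random from all columns. The function names and the example schema (2 numeric columns, one categorical feature with 10 levels) are hypothetical.

```python
def one_hot(values):
    """One-hot encode a list of category labels.

    Returns the sorted category list and one 0/1 row per input value.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def categorical_column_share(n_numeric, n_categories):
    """Share of columns that come from the one-hot encoded feature.

    Under uniform column selection this is also the probability that a
    given split lands on the categorical feature.
    """
    return n_categories / (n_numeric + n_categories)

categories, encoded = one_hot(["red", "green", "red", "blue"])
print(categories)
print(encoded)

# Before encoding: 1 categorical feature out of 3 features in total.
share_before = 1 / 3
# After encoding: 10 indicator columns out of 12 columns in total.
share_after = categorical_column_share(n_numeric=2, n_categories=10)
print(share_before, share_after)
```

In this hypothetical schema the categorical feature goes from one third of the candidate columns to ten twelfths, so a high-cardinality feature can dominate the splits. Each split on an indicator column also separates only one category from all the others.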

How should categorical features be encoded for Isolation Forest? Can we encode them in a space that suits the algorithm?

