How do I determine the top "reason" for an anomaly when using Isolation Forests

I am using Isolation Forests for anomaly detection. Say my data set has 10 variables, var1, var2, ..., var10, and I found an anomaly. Can I rank the 10 variables var1, var2, ..., var10 in such a way that I can say the anomaly's main reason is, say, var6?

For example, if I had only var1, var2, and var3, and my data set were:

var1  var2  var3
5     25    109
7     26    111
6     23    108
6     26    109
6     978   108
5     25    110
7     24    107

I would say that the row (6, 978, 108) is an anomaly and that, in particular, the reason is var2.

Is there a way to determine the main reason why a particular entry is an anomaly?

Topic isolation-forest anomaly anomaly-detection outlier machine-learning



For some time now, one can use SHAP to explain scikit-learn Isolation Forest models.
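Here is a minimal sketch on the question's toy data, assuming the `shap` package is installed. The `TreeExplainer` call is SHAP's standard API for tree ensembles; the anomalous row index and the expectation that var2 dominates come from the question, while the exact SHAP magnitudes depend on the fitted forest:

```python
import numpy as np
import shap  # pip install shap
from sklearn.ensemble import IsolationForest

# The toy data from the question; row index 4 has the extreme var2 value (978).
X = np.array([
    [5,  25, 109],
    [7,  26, 111],
    [6,  23, 108],
    [6,  26, 109],
    [6, 978, 108],
    [5,  25, 110],
    [7,  24, 107],
])

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)

# TreeExplainer supports scikit-learn's IsolationForest directly.
explainer = shap.TreeExplainer(iso)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Rank the features of the flagged row by absolute SHAP contribution;
# the feature with the largest magnitude is the main "reason".
anomaly_idx = 4
contribution = np.abs(shap_values[anomaly_idx])
ranking = np.argsort(contribution)[::-1]
print("feature ranking (most influential first):", ranking)  # var2 (index 1) expected on top
```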


A naive approach would be to train a supervised model to predict the anomaly / no-anomaly target that your IsolationForest model outputs. Then, if and only if this supervised binary classifier performs well (for example, judged by its cross-validation score), you can use your favorite feature-importance tool to examine the impact/contribution of each feature (see the sketch after this list):

  1. Mean decrease in impurity, if your model is tree-based (it is also useful to plot a tree of the model to understand the rules that make an observation an outlier).
  2. Permutation importance, which is model-agnostic (it works with any predefined metric).
  3. SHAP values, to quantify more precisely the influence of each feature on your target (anomaly / no anomaly).
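For instance, here is a sketch of this surrogate-model workflow on synthetic data, where anomalies are injected into feature index 5 (playing the role of var6). The data generation, model choices, and size parameters are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

# Synthetic data: 500 rows, 10 features; every 50th row gets a large
# shift in feature 5, so feature 5 should come out as the main "reason".
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 10))
X[::50, 5] += 8.0

# Step 1: label the data with the unsupervised IsolationForest.
iso = IsolationForest(random_state=0).fit(X)
y = (iso.predict(X) == -1).astype(int)  # 1 = anomaly, 0 = normal

# Step 2: fit a supervised surrogate and check that it reproduces the
# labels well; the importances are only meaningful if this score is high.
clf = RandomForestClassifier(random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# Step 3: model-agnostic permutation importance (option 2 in the list above).
clf.fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print("most important feature index:", np.argmax(result.importances_mean))
```

Note that permutation importance gives a global ranking over the whole data set; to explain why a single flagged observation is anomalous, per-instance SHAP values (option 3) are the better fit.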

Edit:

I did some research and found that the SHAP library has support for Isolation Forest.
