How do I determine the top "reason" for an anomaly when using Isolation Forests

I am using Isolation Forests for anomaly detection. Say my data set has 10 variables, var1, var2, ..., var10, and I found an anomaly. Can I rank the 10 variables var1, var2, ..., var10 in such a way that I can say the anomaly's main reason is, say, var6?

For example, if I had only var1, var2, and var3, and my data set were:

var1  var2  var3
5     25    109
7     26    111
6     23    108
6     26    109
6     978   108
5     25    110
7     24    107

I would say that the row (6, 978, 108) is an anomaly and that, in particular, the reason is var2.

Is there a way to determine the main reason why a particular entry is an anomaly?

Topic isolation-forest anomaly anomaly-detection outlier machine-learning



For some time now, one can use SHAP to explain scikit-learn Isolation Forest models.
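Here is a minimal sketch on the question's toy data, assuming the `shap` package is installed. The `TreeExplainer` call is SHAP's standard API for tree ensembles; the anomalous row index and the expectation that var2 dominates come from the question, while the exact SHAP magnitudes depend on the fitted forest:

```python
import numpy as np
import shap  # pip install shap
from sklearn.ensemble import IsolationForest

# The toy data from the question; row index 4 has the extreme var2 value (978).
X = np.array([
    [5,  25, 109],
    [7,  26, 111],
    [6,  23, 108],
    [6,  26, 109],
    [6, 978, 108],
    [5,  25, 110],
    [7,  24, 107],
])

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)

# TreeExplainer supports scikit-learn's IsolationForest directly.
explainer = shap.TreeExplainer(iso)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Rank the features of the flagged row by absolute SHAP contribution;
# the feature with the largest magnitude is the main "reason".
anomaly_idx = 4
contribution = np.abs(shap_values[anomaly_idx])
ranking = np.argsort(contribution)[::-1]
print("feature ranking (most influential first):", ranking)  # var2 (index 1) expected on top
```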


A naive approach would be to train a supervised model to predict the anomaly / no-anomaly target that your IsolationForest model outputs. Then, if and only if this supervised binary classifier performs well (for example, judged by its cross-validation score), you can use your favorite feature-importance tool to examine the impact/contribution of each feature (see the sketch after this list):

  1. Mean decrease in impurity, if your model is tree-based (it is also useful to plot a tree of the model to understand the rules that make an observation an outlier).
  2. Permutation importance, which is model-agnostic (it works with any predefined metric).
  3. SHAP values, to quantify more precisely the influence of each feature on your target (anomaly / no anomaly).
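For instance, here is a sketch of this surrogate-model workflow on synthetic data, where anomalies are injected into feature index 5 (playing the role of var6). The data generation, model choices, and size parameters are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

# Synthetic data: 500 rows, 10 features; every 50th row gets a large
# shift in feature 5, so feature 5 should come out as the main "reason".
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 10))
X[::50, 5] += 8.0

# Step 1: label the data with the unsupervised IsolationForest.
iso = IsolationForest(random_state=0).fit(X)
y = (iso.predict(X) == -1).astype(int)  # 1 = anomaly, 0 = normal

# Step 2: fit a supervised surrogate and check that it reproduces the
# labels well; the importances are only meaningful if this score is high.
clf = RandomForestClassifier(random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# Step 3: model-agnostic permutation importance (option 2 in the list above).
clf.fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print("most important feature index:", np.argmax(result.importances_mean))
```

Note that permutation importance gives a global ranking over the whole data set; to explain why a single flagged observation is anomalous, per-instance SHAP values (option 3) are the better fit.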

Edit:

I did some research and found that the SHAP library has support for Isolation Forest.
