The effect of bootstrap on Isolation Forest

I've been using Isolation Forest for anomaly detection and reviewing its parameters in scikit-learn (link). Looking at `bootstrap`, I'm not quite clear what enabling it would do. For supervised learning, bootstrapping should reduce overfitting, but I'm not sure what its effect on anomaly detection would be.

I think it would require the trees to reach more consensus about what an anomaly is, thereby reducing the influence of any single feature. That is, an anomalous observation would probably need to be consistently anomalous across a number of features (?).

Is this a correct interpretation of this parameter?

Topic isolation-forest anomaly-detection python

Category Data Science


This is explained well in Section 3 of the original paper.

As in a supervised random forest, Isolation Forest samples both features and instances; the instance sampling in this case helps alleviate two main problems:

  1. Swamping

Swamping refers to wrongly identifying normal instances as anomalies. When normal instances are too close to anomalies, the number of partitions required to separate anomalies increases, which makes it harder to distinguish anomalies from normal instances.

  2. Masking

Masking is the existence of too many anomalies concealing their own presence.

Contrary to existing methods where large sampling size is more desirable, the isolation method works best when the sampling size is kept small. Large sampling size reduces iForest's ability to isolate anomalies, as normal instances can interfere with the isolation process. Thus, sub-sampling provides a favourable environment for iForest to work well. Throughout this paper, sub-sampling is conducted by random selection of instances without replacement.
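To see the parameter in context, here is a minimal sketch on synthetic data (the dataset and parameter values are illustrative, not from the paper). By default scikit-learn's `IsolationForest` draws `max_samples` instances per tree without replacement, as the paper describes; setting `bootstrap=True` switches that per-tree sampling to with-replacement:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# A normal cluster plus a few obvious outliers (synthetic, illustrative)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
X_outliers = rng.uniform(low=6.0, high=8.0, size=(10, 2))
X = np.vstack([X_normal, X_outliers])

# bootstrap=True: each tree draws max_samples instances WITH replacement;
# bootstrap=False (the default): without replacement, as in the paper.
clf = IsolationForest(
    n_estimators=100,
    max_samples=256,   # small sub-sample, as the paper recommends
    bootstrap=True,
    random_state=0,
).fit(X)

scores = clf.decision_function(X)  # lower score = more anomalous
# The injected outliers should receive markedly lower scores
print(scores[-10:].mean() < scores[:-10].mean())
```

In practice the difference between the two sampling modes is often small; the small `max_samples` sub-sample (not the `bootstrap` flag itself) is what mitigates swamping and masking.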

