Why do Isolation Forest implementations turn it into a supervised learning problem (with random values for the target)?

I am looking at various implementations of Isolation Forest in Python and R. Both sklearn in Python and solitude in R use a y variable with the ExtraTrees regressor.

Since Isolation Forest is unsupervised, I am wondering why it is being turned into a supervised problem. Wouldn't this be an issue when scoring on previously unseen data sets?

For example, sklearn (Python) has this at line 248.

And solitude (R) has the same at line 144.

Tags: isolation-forest, python, r

Category: Data Science


Extra-random Trees (ExtraTrees) requires a target variable, so Isolation Forest generates a random target (sklearn, solitude). At prediction time, no y values are used, and the ExtraTrees model doesn't actually make a prediction; instead, the samples are propagated down to the leaves and their depths are extracted (sklearn).
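The idea can be sketched with a single tree. This is a hypothetical illustration, not the actual Isolation Forest code: it fits an `ExtraTreeRegressor` on a uniform-random target (as the implementations do internally) and then, at "prediction" time, ignores y entirely and reads off the depth of each sample's leaf via `decision_path`.

```python
import numpy as np
from sklearn.tree import ExtraTreeRegressor

rng = np.random.RandomState(0)
X = rng.randn(256, 2)  # toy data, 256 samples

# The regressor needs a target, so (like Isolation Forest) we feed it
# random values -- the target carries no information.
y_random = rng.uniform(size=X.shape[0])

# max_depth ~ log2(n) mirrors the height limit used by Isolation Forest.
tree = ExtraTreeRegressor(
    max_features=1,
    splitter="random",
    max_depth=int(np.ceil(np.log2(X.shape[0]))),
    random_state=0,
)
tree.fit(X, y_random)

# "Scoring": no y involved. Propagate samples to their leaves and count
# the nodes on each decision path; depth = path length minus the root.
node_indicator = tree.decision_path(X)
depths = np.asarray(node_indicator.sum(axis=1)).ravel() - 1
```

Shorter depths mean a sample was isolated quickly, which is what Isolation Forest turns into an anomaly score; the same depth extraction works on previously unseen rows, which is why scoring new data is not a problem.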

As for the tree-building process, sklearn at least doesn't make use of the y values, because the ExtraTrees model is built with max_features=1 and splitter='random' (source): each node considers a single randomly chosen feature and a single random threshold, so there is nothing for y to select between. I'm less sure about solitude, since it uses mtry=ncol-1 (source); maybe further down, using splitrule='extratrees' takes care of that? Otherwise, the chosen splits will try to optimize on the random y, though since those values are random it probably doesn't matter (certainly I wouldn't call it a supervised model, anyway).
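The claim that y plays no role under sklearn's settings can be checked directly. In this sketch (my own experiment, not code from either library), two trees are grown with identical hyperparameters and seed but completely different random targets; if y influenced the splits, the structures should diverge.

```python
import numpy as np
from sklearn.tree import ExtraTreeRegressor

rng = np.random.RandomState(42)
X = rng.randn(200, 3)

def fit_tree(y):
    # Same settings Isolation Forest uses: one random candidate feature
    # per node and a uniform-random threshold, so the "best" split is
    # chosen from a field of one -- y never gets a vote.
    t = ExtraTreeRegressor(max_features=1, splitter="random", random_state=0)
    t.fit(X, y)
    return t

t1 = fit_tree(rng.uniform(size=200))  # one random target
t2 = fit_tree(rng.uniform(size=200))  # a different random target

# With a shared random_state, the split features and thresholds come from
# the same random sequence, so the two trees should be structurally identical.
same_structure = np.array_equal(t1.tree_.feature, t2.tree_.feature)
```

Under solitude's mtry=ncol-1, by contrast, multiple candidate splits per node would be compared on the random y, so this equality would not be expected to hold there.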
