Why do Isolation Forest implementations turn it into a supervised learning problem (with random values for the target)?

I am looking at various implementations of Isolation Forest in Python and R. Both sklearn in Python and solitude in R use a y variable with the ExtraTrees regressor.

Since Isolation Forest is unsupervised, I am wondering why it is being turned into a supervised problem. Wouldn't this be an issue when scoring on previously unseen data sets?

For example, sklearn (Python) has this at line 248.

And solitude (R) has the same at line 144.

Tags: isolation-forest, python, r

Category: Data Science


Extra-random Trees (ExtraTrees) requires a target variable, so Isolation Forest generates a random target (sklearn, solitude). At prediction time, no y values are used, and the ExtraTrees model doesn't actually make a prediction; instead, the samples are propagated down to the leaves and their depths are extracted (sklearn).
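The idea can be sketched with a single tree. This is a hypothetical illustration, not the actual Isolation Forest code: it fits an `ExtraTreeRegressor` on a uniform-random target (as the implementations do internally) and then, at "prediction" time, ignores y entirely and reads off the depth of each sample's leaf via `decision_path`.

```python
import numpy as np
from sklearn.tree import ExtraTreeRegressor

rng = np.random.RandomState(0)
X = rng.randn(256, 2)  # toy data, 256 samples

# The regressor needs a target, so (like Isolation Forest) we feed it
# random values -- the target carries no information.
y_random = rng.uniform(size=X.shape[0])

# max_depth ~ log2(n) mirrors the height limit used by Isolation Forest.
tree = ExtraTreeRegressor(
    max_features=1,
    splitter="random",
    max_depth=int(np.ceil(np.log2(X.shape[0]))),
    random_state=0,
)
tree.fit(X, y_random)

# "Scoring": no y involved. Propagate samples to their leaves and count
# the nodes on each decision path; depth = path length minus the root.
node_indicator = tree.decision_path(X)
depths = np.asarray(node_indicator.sum(axis=1)).ravel() - 1
```

Shorter depths mean a sample was isolated quickly, which is what Isolation Forest turns into an anomaly score; the same depth extraction works on previously unseen rows, which is why scoring new data is not a problem.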

As for the tree-building process, sklearn at least doesn't make use of the y values, because the ExtraTrees model is built with max_features=1 and splitter='random' (source): each node considers a single randomly chosen feature and a single random threshold, so there is nothing for y to select between. I'm less sure about solitude, since it uses mtry=ncol-1 (source); maybe further down, using splitrule='extratrees' takes care of that? Otherwise, the chosen splits will try to optimize on the random y, though since those values are random it probably doesn't matter (certainly I wouldn't call it a supervised model, anyway).
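The claim that y plays no role under sklearn's settings can be checked directly. In this sketch (my own experiment, not code from either library), two trees are grown with identical hyperparameters and seed but completely different random targets; if y influenced the splits, the structures should diverge.

```python
import numpy as np
from sklearn.tree import ExtraTreeRegressor

rng = np.random.RandomState(42)
X = rng.randn(200, 3)

def fit_tree(y):
    # Same settings Isolation Forest uses: one random candidate feature
    # per node and a uniform-random threshold, so the "best" split is
    # chosen from a field of one -- y never gets a vote.
    t = ExtraTreeRegressor(max_features=1, splitter="random", random_state=0)
    t.fit(X, y)
    return t

t1 = fit_tree(rng.uniform(size=200))  # one random target
t2 = fit_tree(rng.uniform(size=200))  # a different random target

# With a shared random_state, the split features and thresholds come from
# the same random sequence, so the two trees should be structurally identical.
same_structure = np.array_equal(t1.tree_.feature, t2.tree_.feature)
```

Under solitude's mtry=ncol-1, by contrast, multiple candidate splits per node would be compared on the random y, so this equality would not be expected to hold there.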
