Semi-supervised anomaly detection

I am currently exploring anomaly detection methods for my work. So far I have gone through Local Outlier Factor and Isolation Forests, both unsupervised methods.

Now, the thing is, there may be cases where I do not want a point that is far away to be considered an outlier, so I would need some sort of supervised or semi-supervised method for the outlier detection.

So what I am thinking is:

1. Label a bunch of points as outliers using LOF/IF.

2. Train a classifier on top of those labels, and then make manual adjustments if needed.
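The two steps above can be sketched roughly as follows. This is a minimal illustration assuming scikit-learn and a made-up toy dataset; the manual-adjustment step is only marked as a comment.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)
# Toy data: mostly "normal" points plus a few far-away ones.
X = np.vstack([rng.normal(0, 1, size=(980, 2)),
               rng.normal(6, 1, size=(20, 2))])

# Step 1: pseudo-label with Isolation Forest (predict returns -1 for
# outliers, 1 for inliers).
iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
y_pseudo = (iso.predict(X) == -1).astype(int)  # 1 = outlier

# (Manual adjustment of y_pseudo would happen here.)

# Step 2: train a supervised classifier on the pseudo-labels.
clf = RandomForestClassifier(random_state=0).fit(X, y_pseudo)
```

Note that, as the answer below points out, the classifier can at best reproduce the pseudo-labels unless they are corrected by hand first.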

Is this what is considered a semi-supervised method? Does anybody have experience with this sort of problem who could say if I am missing something here?

Also, because I am labeling outliers, the dataset will be very imbalanced. My idea is to use bagging for this. Let's say my dataset is 1% outliers: I would train 100 equally proportioned models (the outlier part remains the same in each model, but the normal points change until I have gone over the entirety of the dataset), and then the final prediction is a vote of all the models. Is this stupid or a good idea?
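The balanced-bagging idea can be sketched like this: keep all the (rare) outliers in every bag, rotate through disjoint chunks of the normal points so every normal point is seen once, and take a majority vote. This is a hedged sketch with a made-up helper name (`balanced_bagging_predict`) and a stand-in base model, not an established API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def balanced_bagging_predict(X, y, X_new, n_models=10, seed=0):
    """Train n_models bags (all outliers + one chunk of normals each)
    and return the majority-vote prediction for X_new. y: 1 = outlier."""
    rng = np.random.default_rng(seed)
    out_idx = np.flatnonzero(y == 1)              # fixed outlier part
    norm_idx = rng.permutation(np.flatnonzero(y == 0))
    chunks = np.array_split(norm_idx, n_models)   # cover all normals once
    votes = np.zeros(len(X_new))
    for chunk in chunks:
        idx = np.concatenate([out_idx, chunk])
        model = LogisticRegression().fit(X[idx], y[idx])
        votes += model.predict(X_new)
    return (votes / n_models >= 0.5).astype(int)  # majority vote
```

With 1% outliers and 100 chunks, each bag would be roughly balanced (1 part outliers to ~1 part normals), which is the proportion described above.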

Topic anomaly-detection semi-supervised-learning outlier class-imbalance

Category Data Science


If you use the anomaly detector to label the data directly, there is no way the supervised step that follows can be better than the detector itself. One can of course go in and "adjust" the labels afterwards, but there is a risk of being biased by the pre-existing labels if a human sees them up front.

Instead of sampling data to label at random, you could sample weighted by the anomaly score. This has two effects: 1) it reduces class imbalance, and 2) it focuses the labeling effort on likely anomalies.
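A minimal sketch of score-weighted sampling, assuming scikit-learn's Isolation Forest and a toy dataset: points with higher anomaly scores are more likely to be picked for human labeling.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(990, 2)),
               rng.normal(7, 1, size=(10, 2))])

iso = IsolationForest(random_state=0).fit(X)
# score_samples is higher for more normal points, so negate it to get a
# positive "anomaly weight" per point.
scores = -iso.score_samples(X)
probs = scores / scores.sum()

# Draw 50 points to send for manual labeling, weighted by anomaly score.
to_label = rng.choice(len(X), size=50, replace=False, p=probs)
```

Here the sampling is proportional to the raw score; in practice one might sharpen or temper the weights (e.g. exponentiate the scores) to control how concentrated the labeling budget is on the top-ranked points.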

This assumes a well-tuned anomaly detector, though, and there is no good way to tune one without a validation set. That validation set should preferably be sampled randomly, to avoid bias.
