Cross-Validation in Anomaly Detection with Labelled Data

I am working on a project where I train anomaly detection algorithms, namely Isolation Forest and an Auto-Encoder. My data is labelled, so I have the ground truth, but the nature of the problem requires an unsupervised/semi-supervised anomaly detection approach rather than simple classification. Thus I will use the labels for validation only.

Since I will not train the model with the labels, unlike in supervised learning where I would have X_train, X_test, y_train and y_test, what is the right approach for model validation here?

If this were supervised learning, I would split the data into 3 parts: train, CV and test, doing K-Fold CV. But now I feel like I can simply divide my data into 2 parts, train and test, fit on all of the train data, predict, and tune the models accordingly. Then, finally, predict on the test data.

So my question is, should I include some kind of CV in this model? What is the right way here?

Topic isolation-forest autoencoder anomaly-detection cross-validation scikit-learn

Category Data Science


In an Anomaly Detection scenario the labeled data is useful for several things:

  1. Hyper-parameter optimization. Selecting the anomaly threshold, feature/preprocessing settings, etc.
  2. Estimating performance on unseen data ("generalization").
  3. Estimating the robustness of our AD model pipeline.

To be able to do 1. and 2. we at minimum need to split the labeled data into a validation set (to do hyper-parameter optimization on) and a test set (to estimate performance on). With small labeled data sets (typical of Anomaly Detection) our estimates might be quite sensitive to minor changes. Then we might want to do multiple checks on the validation and test sets, to get a distribution instead of a single score.
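
For 1. and 2. concretely, here is a minimal sketch assuming scikit-learn's IsolationForest and an already-made train/val/test split; X_train, X_val, y_val, X_test, y_test are placeholders (labels use 1 for anomalies) and the parameter grid is purely illustrative:

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.metrics import average_precision_score

    best_score, best_model = -np.inf, None
    for n_estimators in (100, 300):
        for max_samples in (0.5, 1.0):
            model = IsolationForest(
                n_estimators=n_estimators,
                max_samples=max_samples,
                random_state=0,
            ).fit(X_train)  # labels are never used for fitting

            # score_samples: higher = more normal, so negate to get an anomaly score
            val_scores = -model.score_samples(X_val)
            score = average_precision_score(y_val, val_scores)  # 1. tune on val
            if score > best_score:
                best_score, best_model = score, model

    # 2. estimate generalization once, on the untouched test set
    test_scores = -best_model.score_samples(X_test)
    print("val AP:", best_score)
    print("test AP:", average_precision_score(y_test, test_scores))
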

Point 3 is most relevant in deployment scenarios where AD models are trained/updated regularly and automatically. This is the case when the system being monitored has non-stationary distributions, i.e. it typically changes over time. Thus we may want to test with multiple training sets, to check sensitivity to the training data.

These considerations may lead us to use cross-validation. Either a single stage, selecting k tuples of (train, val, test). Or nested cross-validation, doing k outer splits into train and val+test, and then k inner splits of val+test into val and test.
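
A rough sketch of the single-stage variant, collecting a distribution of test scores over k (train, val, test) tuples. X and y are placeholder numpy arrays, and evaluate_pipeline is a hypothetical helper that fits on train, tunes on val and returns a single test-set metric:

    import numpy as np
    from sklearn.model_selection import ShuffleSplit

    outer = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    test_scores = []
    for train_val_idx, test_idx in outer.split(X):
        # split the non-test part again into train and val
        inner = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
        train_rel, val_rel = next(inner.split(train_val_idx))
        train_idx, val_idx = train_val_idx[train_rel], train_val_idx[val_rel]

        # evaluate_pipeline is a hypothetical helper: fit on train (unlabeled),
        # tune hyper-parameters/threshold on val, return one test-set metric
        test_scores.append(
            evaluate_pipeline(X[train_idx], X[val_idx], y[val_idx],
                              X[test_idx], y[test_idx])
        )

    print("test score over splits:", np.mean(test_scores), "+/-", np.std(test_scores))
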

It is important to ensure independence between the train/val/test sets. In many Anomaly Detection problems there might be time dependencies or dependencies within subgroups (users/devices/etc.). Using random splitting in such scenarios will lead to overconfident results.
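
For example, with scikit-learn one might split by group or by time rather than randomly; X is a placeholder feature array and groups a placeholder array of user/device IDs:

    from sklearn.model_selection import GroupShuffleSplit, TimeSeriesSplit

    # keep all samples of one user/device on the same side of the split
    gss = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for train_idx, test_idx in gss.split(X, groups=groups):
        ...  # fit on X[train_idx], evaluate on X[test_idx]

    # for time-dependent data, always evaluate on samples that come after training
    tss = TimeSeriesSplit(n_splits=5)
    for train_idx, test_idx in tss.split(X):
        ...  # fit on X[train_idx], evaluate on X[test_idx]
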


After thinking this through and having worked on the problem I mentioned, I will answer my own question.

In supervised learning, we divide the data into three parts, namely the train, dev and test sets. We use the dev/validation set to see how our fitted model does on unseen data. What is unseen in this case is the target variable of the held-out split. During training, y_train is seen by the model while the held-out labels are kept unseen. In order to use every part of the train and dev data, and to eliminate the effect of chance, we do this with K-Fold CV.

In the case of unsupervised learning, the model does not see any y_train because only X_train is fitted. Since the model is not exposed to the labels, we can simply go with a train and a test set, rather than train, dev and test. In this scheme no K-Fold CV is required either, because we can simply predict on all of the train + dev data at once.

Hence, it makes sense to use the labels of the training data, y_train, for validation during model development, and it is not required to use K-Fold CV.
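
A minimal sketch of this scheme, again assuming scikit-learn's IsolationForest; X_train, y_train, X_test, y_test are placeholders (labels use 1 for anomalies) and the threshold grid is illustrative:

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.metrics import f1_score

    model = IsolationForest(random_state=0).fit(X_train)  # labels never used for fitting

    train_scores = -model.score_samples(X_train)  # higher = more anomalous
    # use the training labels only to pick an anomaly threshold
    candidates = np.quantile(train_scores, np.linspace(0.80, 0.99, 20))
    best_t = max(candidates, key=lambda t: f1_score(y_train, train_scores >= t))

    # final, one-shot evaluation on the held-out test set
    test_scores = -model.score_samples(X_test)
    print("test F1:", f1_score(y_test, test_scores >= best_t))
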
