Decision trees for anomaly detection

Problem

From what I understand, a common approach to anomaly detection is to build a predictive model trained on non-anomalous data, and then to flag anomalies using the model's prediction error on the observed data. This approach requires the user to identify non-anomalous data beforehand.
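For concreteness, here is a minimal sketch of that residual-based approach, assuming scikit-learn; the synthetic data, the choice of model, and the 3-sigma threshold are my own illustrative assumptions, not a prescribed recipe:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Train on data known to be non-anomalous
X_clean = rng.normal(size=(800, 4))
y_clean = 5 * X_clean[:, 0] + 25 + rng.normal(scale=1, size=800)
model = RandomForestRegressor(random_state=0).fit(X_clean, y_clean)

# Screen observed data by prediction error
X_obs = rng.normal(size=(200, 4))
y_obs = 5 * X_obs[:, 0] + 25 + rng.normal(scale=1, size=200)
y_obs[:10] -= 15  # inject a few artificially low performers

# Flag observations whose residual is large relative to the training error
train_resid = y_clean - model.predict(X_clean)
residuals = y_obs - model.predict(X_obs)
anomalies = np.abs(residuals) > 3 * train_resid.std()
print(anomalies.sum(), "anomalies flagged")
```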

  • What if it's not possible to label non-anomalous data to train the model?
  • Is there anything in the literature that explains how to overcome this issue?

I have an idea, but I was wondering whether someone has heard of something similar before and could point me in the right direction (links to papers/blogs, or explanations of existing methods).

Idea

I'd like to train a decision tree on a dataset $X$ with $N$ rows and $p$ columns, with a real-valued target variable $Y$ (it's a regression problem). The dataset $X$ contains both anomalous and non-anomalous objects. The training process generates groups of objects by iteratively splitting the dataset along one dimension at each iteration. During prediction, the decision tree assigns each object to a specific leaf node, and each leaf node has a certain distribution of values of the target variable $Y$. In this problem, an anomalous object is one that doesn't perform well; more precisely, an object for which $Y$ is too low.

  • Can I use the distribution in a leaf node to perform anomaly detection?
  • Assuming that a bigger value of $Y$ is preferred, can I say that the objects below the 5th percentile of $Y$ in a node are outliers? (See the sketch after this list.)
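A minimal sketch of this idea, assuming scikit-learn; the synthetic data and the hyperparameters (e.g. `min_samples_leaf=50`) are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the dataset X (N rows, p columns) and target Y
rng = np.random.default_rng(0)
N, p = 1000, 4
X = rng.normal(size=(N, p))
Y = 5 * X[:, 0] + 25 + rng.normal(scale=3, size=N)

# Keep leaves reasonably large so per-leaf percentiles are meaningful
tree = DecisionTreeRegressor(min_samples_leaf=50, random_state=0).fit(X, Y)

# apply() returns the index of the leaf each object falls into
leaf_ids = tree.apply(X)

# Within each leaf, flag objects whose Y falls below the leaf's 5th percentile
is_outlier = np.zeros(N, dtype=bool)
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    threshold = np.percentile(Y[mask], 5)
    is_outlier[mask] = Y[mask] < threshold

print(f"Flagged {is_outlier.sum()} of {N} objects")
```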

Example

The decision tree assigns to node $j$ the objects satisfying the rule $2 \le x_1 \le 4$ and $5 \le x_2 \le 7$, where $x_1$ and $x_2$ are two columns of the dataset. If I run a prediction on the entire dataset, the values of $Y$ in node $j$ have a Gaussian distribution (mean $=25$, std $=3$), and I consider the values with $Y \le 20.5$ outliers. The idea is that objects with $x_1$ and $x_2$ within a given range should perform similarly, and never below a target value.
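As a quick sanity check on these numbers (assuming the leaf distribution really is Gaussian): $20.5 = 25 - 1.5 \times 3$, i.e. 1.5 standard deviations below the mean, whereas the exact 5th percentile of $N(25, 3^2)$ sits slightly lower:

```python
from scipy.stats import norm

mu, sigma = 25.0, 3.0

print(mu - 1.5 * sigma)                     # 20.5, the threshold above
print(norm.ppf(0.05, loc=mu, scale=sigma))  # ~20.07, the exact 5th percentile
```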

Considerations

  • First of all, I can see that this problem can be generalized to other methods. However, I find decision trees easier to explain here; moreover, I find the hard clustering property of trees useful for my problem, as I also need to cluster objects together.
  • Second, I can see the issue of training the model on both anomalous and non-anomalous data. Still, I'm wondering whether, given a large amount of data, the leaf node's distribution moves toward the optimum (the values of $Y$ of the non-anomalous objects) due to some property. But I guess this depends on the ratio of anomalous to non-anomalous objects (illustrated below).
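A quick illustration of that last point: the leaf statistics shift with the contamination ratio, though robust estimators like the median are less affected. The numbers here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A leaf with 5% contamination: most objects perform around Y=25,
# but a few anomalous ones perform around Y=10
normal = rng.normal(25, 3, size=950)
anomalous = rng.normal(10, 3, size=50)
y_leaf = np.concatenate([normal, anomalous])

print(y_leaf.mean())      # pulled below 25 by the anomalies (~24.25)
print(np.median(y_leaf))  # closer to 25; more robust to contamination
```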

Any hint in the right direction would help.



I will try to address the points I have an opinion about:

  1. "What if it's not possible to label non-anomalous data to train the model?" In this case, you face unsupervised learning problem. there are plenty of reading material regarding this topic and plenty of approaches. here is one for example: https://towardsdatascience.com/unsupervised-learning-for-anomaly-detection-44c55a96b8c1

This article may be good as well:

https://www.researchgate.net/publication/224375498_Anomaly_Detection_by_Combining_Decision_Trees_and_Parametric_Densities

  2. If you can distinguish anomalies by their distribution, it sounds like you do know what the anomalous data looks like; am I wrong?

  3. The issue with defining anomalies as values below the 5th percentile is that 5% of the data examined by each leaf will always be flagged as anomalous, so if you run your algorithm frequently, there is a good chance you will see a lot of anomalies. Maybe try using confidence intervals instead (a rough sketch follows below).
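One way to read that suggestion (my own sketch, not a standard recipe): instead of always flagging the bottom 5%, flag only values falling a fixed number of standard deviations below the leaf mean, so a well-behaved leaf can flag nothing at all:

```python
import numpy as np

def leaf_outliers(y_leaf, k=3.0):
    """Flag values more than k standard deviations below the leaf mean.

    Unlike a percentile cutoff, this flags nothing when the leaf's
    values are tightly clustered.
    """
    mu, sigma = y_leaf.mean(), y_leaf.std()
    return y_leaf < mu - k * sigma

rng = np.random.default_rng(0)
print(leaf_outliers(rng.normal(25, 3, size=1000)).sum())  # ~1-2, not 50
```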

  4. From what I have read, decision trees are not the classic method for anomaly detection. Read about fraud detection, isolation forests (built from isolation trees), and one-class SVMs; there are more approaches as well.
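For example, a minimal isolation forest sketch with scikit-learn; the synthetic data and the `contamination` value are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))

# contamination is the expected fraction of anomalies in the data
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)

labels = iso.predict(X)  # -1 for anomalies, +1 for normal points
print((labels == -1).sum(), "objects flagged as anomalous")
```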
