Anomaly detection without any knowledge about structure

I have an interesting problem: my code needs to handle structured data whose structure I don't know much about at development time. I do know the samples follow a schema that can be nested, and the leaves contain basic primitives like floats, integers, and categoricals. I will have access to the schema at training time. I want to train a model that can detect anomalies, or more directly, whether or not a sample comes from the original data distribution.

One way would be to just flatten the data and train an isolation forest. My intuition is that this breaks down with highly correlated features. Another approach would be to build a neural network architecture that reflects some of the structure and primitive types of the given schema. You could take an autoencoder approach and measure reconstruction error, which I feel would handle correlated features a lot better because those correlations are captured by the encoder-decoder network. A possible issue is how to build the reconstruction error metric across different primitives.
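For reference, the flatten-and-isolation-forest baseline I have in mind looks roughly like this (a sketch with scikit-learn; the columns and values are made up):

```python
# Hypothetical flattened samples: nested fields already joined into columns.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({
    "sensor.temp": [20.1, 19.8, 21.0, 55.0],
    "sensor.kind": ["a", "a", "b", "a"],
})
X = pd.get_dummies(df, columns=["sensor.kind"])  # naive categorical handling

clf = IsolationForest(random_state=0).fit(X)
scores = clf.score_samples(X)  # lower score = more anomalous
print(scores)
```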

Does anyone have any ideas on this or possibly papers that handle related challenges?

EDIT:

I will clarify some things, sorry for being vague. At the time of designing the algorithm I don't know the structure of the dataset involved, but at the moment of training the model this structure is known. The fact that the schema could be nested is not so relevant; let's ignore that part and assume we have a flat, tabular dataset whose dtypes we know. These dtypes are likely mixed, so I'm looking for a way to reason about anomalies on such a dataset.



The reconstruction error is standard and doesn't need much design; it is usually $\left\| x_i - x_i' \right\|^2$, which measures the difference between an input vector $x_i$ and its reconstruction $x_i'$. Also, "without knowledge about structure" is too broad: if you don't know the informational units of your data, you cannot proceed.

First, you need to know all your features and their types, and decide how to represent them numerically. For example, the structure $x=(\mbox{age} \in [0, 100]\,\mbox{yrs}, \mbox{height} \in [0, 3]\,m, \mbox{education} \in \{level_1, level_2, level_3\})$ must be known a priori. Then $x$ can be represented numerically as a vector by normalizing values to unit-less quantities close to 1 and replacing categorical values with one-hot vectors. For example, age = 25 becomes 25/100 = 0.25, height = 1.8 becomes 1.8/3 = 0.6, and education = $level_2$ becomes $(0, 1, 0)$. This way, the final vector is $x=(0.25, 0.6, 0, 1, 0)$.
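A minimal sketch of this encoding, assuming the ranges and category sets are known from the schema (the values are the ones from the example above):

```python
import numpy as np

# One sample following the example schema.
age, height, education = 25, 1.8, "level_2"

# Normalize numeric features by the ranges known a priori from the schema.
x_numeric = np.array([age / 100.0, height / 3.0])

# One-hot encode the categorical feature against its known category set.
categories = ["level_1", "level_2", "level_3"]
x_categorical = np.array([1.0 if education == c else 0.0 for c in categories])

x = np.concatenate([x_numeric, x_categorical])
print(x)  # [0.25 0.6  0.   1.   0.  ]
```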

Let's assume the task is unsupervised, meaning data points are not tagged with "anomaly" and "normal" labels. To find an anomaly, a system must know the distribution of the data, so it can quantitatively assess $x$ with "I've seen many lookalikes before" or "I've seen nothing like this before". A straightforward approach would be to fit a mixture of Gaussians (MoG) to the data. Suppose the number of clusters is k=3 (k can be estimated or brute-forced in many ways). You feed $x$ to a trained MoG and it returns $P(x)=\sum_{c \in \{1,2,3\}}P(x|c)P(c)$; you can now use $P(x)$ to judge how anomalous $x$ is (the lower, the more anomalous). The threshold is definitely domain-dependent: in some domains anomalies happen 1% of the time, in others 0.0001% of the time. So a decision rule could be: if $P(x) < 0.001$, $x$ is potentially an anomaly and must be investigated.
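Here is a sketch of that idea using scikit-learn's GaussianMixture; the training matrix is random stand-in data for the encoded vectors, and since score_samples returns log-densities the threshold is applied on the log scale:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))             # stand-in for encoded training data
x_new = np.array([[5.0, 5.0, 5.0, 5.0, 5.0]])   # an obviously unusual sample

gm = GaussianMixture(n_components=3, random_state=0).fit(X_train)
log_p = gm.score_samples(x_new)                 # log P(x) under the fitted mixture

threshold = np.log(0.001)                       # domain-dependent, as noted above
print(log_p, log_p < threshold)                 # flag as a potential anomaly
```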

However, as the dimension of the data increases from tens to hundreds and thousands, these methods begin to suffer from the "curse of dimensionality", an umbrella term for the many performance problems caused by high-dimensional data. This is where methods such as auto-encoders can help: you embed a 100-dimensional data point $x$ into a 2-10 dimensional space, using the bottleneck layer of the auto-encoder as the new representation of $x$. Then a MoG can be used the same way as before. There is one issue to consider: since auto-encoders try to compress the data, they may discard as noise some information that is actually a true characteristic of anomalous data. This needs further tuning with domain-driven analysis, such as increasing the width of the bottleneck layer.
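A minimal PyTorch sketch of this pipeline, with the bottleneck representation fed into the same mixture model; the layer sizes, training loop, and random stand-in data are illustrative assumptions rather than a recipe:

```python
import torch
import torch.nn as nn
from sklearn.mixture import GaussianMixture

class AutoEncoder(nn.Module):
    def __init__(self, n_features, n_bottleneck=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Linear(32, n_bottleneck),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_bottleneck, 32), nn.ReLU(),
            nn.Linear(32, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

X = torch.randn(500, 100)                        # stand-in for 100-dimensional encoded data
model = AutoEncoder(n_features=100)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):                             # short illustrative training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)   # reconstruction error ||x - x'||^2
    loss.backward()
    opt.step()

with torch.no_grad():
    Z = model.encoder(X).numpy()                 # bottleneck representation of each sample

gm = GaussianMixture(n_components=3, random_state=0).fit(Z)
log_p = gm.score_samples(Z)                      # density in the embedded space
```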


Graph Embeddings

You could try to use graph embeddings. These offer a way to map your graph-based data into a single latent space, based on the vertices, the edges, or the graph schema as a whole. This has become an exciting research topic over the last few years, combining graph analysis with modern embedding techniques.
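As a toy illustration of what such an embedding looks like (not one of the methods from the survey mentioned next; just a simple spectral, Laplacian-eigenmaps-style baseline using networkx and scikit-learn):

```python
import networkx as nx
from sklearn.manifold import SpectralEmbedding

G = nx.karate_club_graph()                 # small example graph
A = nx.to_numpy_array(G)                   # adjacency matrix, used as the affinity

embedder = SpectralEmbedding(n_components=2, affinity="precomputed")
Z = embedder.fit_transform(A)              # one 2-d latent vector per vertex
print(Z.shape)                             # (34, 2)
```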

Here is one survey paper that lists approaches to these embeddings, along with the authors' own libraries for both static graphs and dynamic graphs. Looking through the results of that survey paper (they perform some benchmarks towards the end), it would seem that the most promising model would be:


Anomaly Detection

As to the end goal of anomaly detection, I am not sure which of the embedding methods would be best suited. It may come down to whether you want to preserve more local or more global structure (as with the trade-off between methods such as t-SNE and PCA).

I think you can imagine what to try with the embeddings in order to detect anomalies. There might be more clues in the papers linked above as to which way researchers tend to go for particular problems.

Given the hierarchical nature of your graph data, I would be tempted to start by looking into Hierarchical Density-Based Spatial Clustering of Applications with Noise (i.e. HDBSCAN) - the main library specifically includes a few methods for outlier detection (see the sketch after the quote)! Here is a quote from the linked paper:

Based on these definitions, we can devise an algorithm DBSCAN* (similar to DBSCAN) that conceptually finds clusters as the connected components of a graph in which the objects of X are vertices and every pair of vertices is adjacent if and only if the corresponding objects are ε-reachable w.r.t. user-defined parameters ε and mpts. Noncore objects are labeled as noise.
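A short sketch using the hdbscan package's built-in outlier scores; X here is random stand-in data for whatever vector or embedding representation you end up with:

```python
import numpy as np
import hdbscan

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),       # one dense cluster
    rng.normal(8, 1, size=(200, 2)),       # another dense cluster
    [[20.0, 20.0]],                        # one obvious outlier
])

clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(X)
scores = clusterer.outlier_scores_         # higher = more outlier-like
print(np.argsort(scores)[-5:])             # indices of the most anomalous points
```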
