I will try to clarify the point as best I can.
Ideally, a model for anomaly detection should be trained on typical data only, so that atypical data (anomalies) stand out.
However, in practice this may not be achievable. One can instead train the model on mixed data, provided typical cases overwhelmingly outnumber atypical ones. (How overwhelming this ratio needs to be is a grey area that depends on the model used, the type of data, etc.; as in many cases in machine learning, there is no fixed hard limit.)
In that case the model will still learn the typical data with very high accuracy, and can therefore be used to detect atypical data.
I hope the analysis above is clear enough.
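To make this concrete, here is a minimal sketch (assuming Python with scikit-learn and a synthetic dataset in which typical points overwhelmingly dominate; the contamination value is an illustrative choice, not a fixed rule) of fitting an Isolation Forest on mixed data so the small fraction of anomalies stands out:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Synthetic mixed data: ~99% typical points near the origin, ~1% atypical outliers
typical = rng.normal(loc=0.0, scale=1.0, size=(990, 2))
atypical = rng.uniform(low=6.0, high=9.0, size=(10, 2))
X = np.vstack([typical, atypical])

# Train on the mixed data; because typical points dominate,
# the model effectively learns the "normal" region
model = IsolationForest(contamination=0.01, random_state=0).fit(X)

# -1 marks predicted anomalies, +1 marks predicted typical points
labels = model.predict(X)
print("flagged as anomalous:", int(np.sum(labels == -1)))
```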
As the following references point out, the approach to anomaly detection and training depends on whether the data are labeled and whether supervised or unsupervised algorithms are used.
If the data are labeled, even if mixed, then supervised algorithms will usually work (including LSTMs). If the data are mixed and unlabeled, then some clustering method (e.g. k-means) can help partition the data into typical/atypical sets, after which one can proceed as before (see the sketch below).
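A rough sketch of that clustering idea (assuming scikit-learn and synthetic unlabeled data; the choice of two clusters and the "largest cluster = typical" rule are illustrative assumptions) could look like:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 1, size=(980, 3)),   # mostly typical points
               rng.normal(8, 1, size=(20, 3))])   # a few atypical points

# Cluster the unlabeled mixed data
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
sizes = np.bincount(km.labels_)

# Treat the dominant cluster as the "typical" partition and the small
# cluster(s) as candidate anomalies; the typical partition can then be
# used to train a detector as described above
typical_mask = km.labels_ == np.argmax(sizes)
print("typical:", int(typical_mask.sum()), "atypical:", int((~typical_mask).sum()))
```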
Some references on variations of the anomaly detection problem and how they are approached:
- Anomaly Detection with Machine Learning: An Introduction
- Supervised
Training data is labeled with “nominal” or “anomaly”.
The supervised setting is the ideal setting. It is the instance when a
dataset comes neatly prepared for the data scientist with all data
points labeled as anomaly or nominal. In this case, all anomalous
points are known ahead of time: the anomalous data points are
identified as such for the model to train on.
Popular ML algorithms for structured data (a short sketch using one of them follows this list):
- Support vector machines (SVM)
- k-nearest neighbors (KNN)
- Bayesian networks
- Decision trees
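A minimal supervised sketch (assuming scikit-learn and a toy labeled dataset; any of the algorithms listed above could take the classifier's place) might look like:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(2)

# Toy labeled data: 0 = "nominal", 1 = "anomaly"
X = np.vstack([rng.normal(0, 1, size=(950, 4)),
               rng.normal(5, 1, size=(50, 4))])
y = np.array([0] * 950 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

# k-nearest neighbours (from the list above), trained on the labeled points
clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```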
- Clean
In the Clean setting, all data are assumed to be “nominal”, but they
are in fact contaminated with “anomaly” points.
The clean setting is a less-ideal case where a bunch of data is
presented to the modeler, and it is clean and complete, but all data
are presumed to be nominal data points. Then, it is up to the modeler
to detect the anomalies inside this dataset (one common way to do so is sketched below).
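One common way to handle the clean setting (a sketch assuming scikit-learn; the small `nu` value encodes the presumption that almost all points are nominal, and is an illustrative choice) is to fit a one-class model on the whole dataset and flag the points it rejects:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(3)

# Data presumed to be nominal, but possibly contaminated with a few anomalies
X = np.vstack([rng.normal(0, 1, size=(490, 2)),
               rng.normal(6, 0.5, size=(10, 2))])

# nu is an upper bound on the fraction of training points treated as outliers
ocsvm = OneClassSVM(nu=0.02, gamma="scale").fit(X)

# -1 marks points the model considers anomalous within the "clean" data
flags = ocsvm.predict(X)
print("suspected contamination:", int(np.sum(flags == -1)))
```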
- Unsupervised
In Unsupervised settings, the training data is unlabeled and consists
of “nominal” and “anomaly” points.
The hardest case, and an increasingly common one given the ever-growing
amounts of dark data, is the unsupervised instance.
The datasets in the unsupervised case do not have their parts labeled
as nominal or anomalous. There is no ground truth that says what the
outcome should be. The model must show the modeler what is anomalous
and what is nominal.
“The most common tasks within unsupervised learning are clustering,
representation learning, and density estimation. In all of these
cases, we wish to learn the inherent structure of our data without
using explicitly-provided labels.” – Devin Soni
In the Unsupervised setting, a different set of tools is needed to
create order in the unstructured data. With unstructured data, the
primary goal is to create clusters out of the data, then find the few
groups that don’t belong. Really, all anomaly detection algorithms are
some form of approximate density estimation (a sketch of this view follows the list below).
Popular ML algorithms for unstructured data are:
- Self-organizing maps (SOM)
- K-means
- C-means
- Expectation-maximization meta-algorithm (EM)
- Adaptive resonance theory (ART)
- One-class support vector machine
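To illustrate the density-estimation view mentioned above (a sketch assuming scikit-learn; the Gaussian mixture fitted by EM stands in for the expectation-maximization item in the list, and the percentile cut-off is an illustrative assumption), one can fit a density model to the unlabeled data and flag the lowest-likelihood points:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(4)
X = np.vstack([rng.normal(0, 1, size=(980, 2)),      # mostly typical points
               rng.uniform(-15, 15, size=(20, 2))])  # a few scattered atypical points

# Fit a density model with the EM algorithm
gmm = GaussianMixture(n_components=2, random_state=4).fit(X)

# Low log-likelihood under the learned density = candidate anomaly
log_dens = gmm.score_samples(X)
threshold = np.percentile(log_dens, 2)               # illustrative cut-off
print("candidate anomalies:", int(np.sum(log_dens < threshold)))
```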
- Factor Analysis of Mixed Data for Anomaly Detection
Anomaly detection aims to identify observations that deviate from the
typical pattern of data. Anomalous observations may correspond to
financial fraud, health risks, or incorrectly measured data in
practice. We show detecting anomalies in high-dimensional mixed data
is enhanced through first embedding the data then assessing an
anomaly scoring scheme. We focus on unsupervised detection and the
continuous and categorical (mixed) variable case. We propose a
kurtosis-weighted Factor Analysis of Mixed Data for anomaly detection,
FAMDAD, to obtain a continuous embedding for anomaly scoring. We
illustrate that anomalies are highly separable in the first and last
few ordered dimensions of this space, and test various anomaly scoring
experiments within this subspace. Results are illustrated for both
simulated and real datasets, and the proposed approach (FAMDAD) is
highly accurate for high-dimensional mixed data throughout these
diverse scenarios.
- A comprehensive survey of anomaly detection techniques for high dimensional big data
Anomaly detection in high dimensional data is becoming a fundamental
research problem that has various applications in the real world.
However, many existing anomaly detection techniques fail to retain
sufficient accuracy due to so-called “big data” characterised by
high volume and high velocity, generated by a variety of sources.
This phenomenon of having both problems together can be referred to
as the “curse of big dimensionality,” which affects existing techniques
in terms of both performance and accuracy. To address this gap and to
understand the core problem, it is necessary to identify the unique
challenges brought by the anomaly detection with both high
dimensionality and big data problems. Hence, this survey aims to
document the state of anomaly detection in high dimensional big data
by representing the unique challenges using a triangular model of
vertices: the problem (big dimensionality), techniques/algorithms
(anomaly detection), and tools (big data applications/frameworks).
Authors’ works that fall directly into any of the vertices, or are
closely related to them, are taken into consideration for review. Furthermore,
the limitations of traditional approaches and current strategies of
high dimensional data are discussed along with recent techniques and
applications on big data required for the optimization of anomaly
detection.