I will try to clarify the point as best I can.
Ideally, a model for anomaly detection should be trained on typical data only, so that atypical data (anomalies) stand out.
However, in practice this may not be achievable. One can instead train the model on mixed data, provided typical cases overwhelmingly outnumber atypical ones. (How overwhelming this ratio needs to be is a grey area that depends on the model used, the type of data, etc.; as in many cases in machine learning, there is no fixed hard limit.)
In that case the model will still learn the typical data with very high accuracy, and can therefore be used to detect atypical data.
I hope the analysis above is clear enough.
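To make this concrete, here is a minimal sketch (assuming Python with scikit-learn and a synthetic dataset in which typical points overwhelmingly dominate; the contamination value is an illustrative choice, not a fixed rule) of fitting an Isolation Forest on mixed data so the small fraction of anomalies stands out:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Synthetic mixed data: ~99% typical points near the origin, ~1% atypical outliers
typical = rng.normal(loc=0.0, scale=1.0, size=(990, 2))
atypical = rng.uniform(low=6.0, high=9.0, size=(10, 2))
X = np.vstack([typical, atypical])

# Train on the mixed data; because typical points dominate,
# the model effectively learns the "normal" region
model = IsolationForest(contamination=0.01, random_state=0).fit(X)

# -1 marks predicted anomalies, +1 marks predicted typical points
labels = model.predict(X)
print("flagged as anomalous:", int(np.sum(labels == -1)))
```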
As the following references point out, the approach to anomaly detection and training depends on whether the data are labeled and whether supervised or unsupervised algorithms are used.
If the data are labeled, even if mixed, then supervised algorithms will usually work (including LSTMs). If the data are mixed and unlabeled, then some clustering method (e.g. k-means) can help partition the data into typical/atypical sets, after which one can proceed as before (see the sketch below).
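A rough sketch of that clustering idea (assuming scikit-learn and synthetic unlabeled data; the choice of two clusters and the "largest cluster = typical" rule are illustrative assumptions) could look like:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 1, size=(980, 3)),   # mostly typical points
               rng.normal(8, 1, size=(20, 3))])   # a few atypical points

# Cluster the unlabeled mixed data
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
sizes = np.bincount(km.labels_)

# Treat the dominant cluster as the "typical" partition and the small
# cluster(s) as candidate anomalies; the typical partition can then be
# used to train a detector as described above
typical_mask = km.labels_ == np.argmax(sizes)
print("typical:", int(typical_mask.sum()), "atypical:", int((~typical_mask).sum()))
```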
Some references on variations of the anomaly detection problem and how they are approached:
- Anomaly Detection with Machine Learning: An Introduction
- Supervised
Training data is labeled with “nominal” or “anomaly”.
The supervised setting is the ideal setting. It is the instance when a
dataset comes neatly prepared for the data scientist with all data
points labeled as anomaly or nominal. In this case, all anomalous
points are known ahead of time: the anomalous data points are
identified as such for the model to train on.
Popular ML algorithms for structured data (a short sketch using one of them follows this list):
- Support vector machines (SVM)
- k-nearest neighbors (KNN)
- Bayesian networks
- Decision trees
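A minimal supervised sketch (assuming scikit-learn and a toy labeled dataset; any of the algorithms listed above could take the classifier's place) might look like:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(2)

# Toy labeled data: 0 = "nominal", 1 = "anomaly"
X = np.vstack([rng.normal(0, 1, size=(950, 4)),
               rng.normal(5, 1, size=(50, 4))])
y = np.array([0] * 950 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

# k-nearest neighbours (from the list above), trained on the labeled points
clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```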
- Clean
In the Clean setting, all data are assumed to be “nominal”, but they
are in fact contaminated with “anomaly” points.
The clean setting is a less-ideal case where a bunch of data is
presented to the modeler, and it is clean and complete, but all data
are presumed to be nominal data points. Then, it is up to the modeler
to detect the anomalies inside this dataset (one common way to do so is sketched below).
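One common way to handle the clean setting (a sketch assuming scikit-learn; the small `nu` value encodes the presumption that almost all points are nominal, and is an illustrative choice) is to fit a one-class model on the whole dataset and flag the points it rejects:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(3)

# Data presumed to be nominal, but possibly contaminated with a few anomalies
X = np.vstack([rng.normal(0, 1, size=(490, 2)),
               rng.normal(6, 0.5, size=(10, 2))])

# nu is an upper bound on the fraction of training points treated as outliers
ocsvm = OneClassSVM(nu=0.02, gamma="scale").fit(X)

# -1 marks points the model considers anomalous within the "clean" data
flags = ocsvm.predict(X)
print("suspected contamination:", int(np.sum(flags == -1)))
```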
- Unsupervised
In Unsupervised settings, the training data is unlabeled and consists
of “nominal” and “anomaly” points.
The hardest case, and an increasingly common one given the ever-growing
amounts of dark data, is the unsupervised instance.
The datasets in the unsupervised case do not have their parts labeled
as nominal or anomalous. There is no ground truth that says what the
outcome should be. The model must show the modeler what is anomalous
and what is nominal.
“The most common tasks within unsupervised learning are clustering,
representation learning, and density estimation. In all of these
cases, we wish to learn the inherent structure of our data without
using explicitly-provided labels.” – Devin Soni
In the Unsupervised setting, a different set of tools is needed to
create order in the unstructured data. With unstructured data, the
primary goal is to create clusters out of the data, then find the few
groups that don’t belong. Really, all anomaly detection algorithms are
some form of approximate density estimation (a sketch of this view follows the list below).
Popular ML algorithms for unstructured data are:
- Self-organizing maps (SOM)
- K-means
- C-means
- Expectation-maximization meta-algorithm (EM)
- Adaptive resonance theory (ART)
- One-class support vector machine
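To illustrate the density-estimation view mentioned above (a sketch assuming scikit-learn; the Gaussian mixture fitted by EM stands in for the expectation-maximization item in the list, and the percentile cut-off is an illustrative assumption), one can fit a density model to the unlabeled data and flag the lowest-likelihood points:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(4)
X = np.vstack([rng.normal(0, 1, size=(980, 2)),      # mostly typical points
               rng.uniform(-15, 15, size=(20, 2))])  # a few scattered atypical points

# Fit a density model with the EM algorithm
gmm = GaussianMixture(n_components=2, random_state=4).fit(X)

# Low log-likelihood under the learned density = candidate anomaly
log_dens = gmm.score_samples(X)
threshold = np.percentile(log_dens, 2)               # illustrative cut-off
print("candidate anomalies:", int(np.sum(log_dens < threshold)))
```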
- Factor Analysis of Mixed Data for Anomaly Detection
Anomaly detection aims to identify observations that deviate from the
typical pattern of data. Anomalous observations may correspond to
financial fraud, health risks, or incorrectly measured data in
practice. We show detecting anomalies in high-dimensional mixed data
is enhanced through first embedding the data then assessing an
anomaly scoring scheme. We focus on unsupervised detection and the
continuous and categorical (mixed) variable case. We propose a
kurtosis-weighted Factor Analysis of Mixed Data for anomaly detection,
FAMDAD, to obtain a continuous embedding for anomaly scoring. We
illustrate that anomalies are highly separable in the first and last
few ordered dimensions of this space, and test various anomaly scoring
experiments within this subspace. Results are illustrated for both
simulated and real datasets, and the proposed approach (FAMDAD) is
highly accurate for high-dimensional mixed data throughout these
diverse scenarios.
- A comprehensive survey of anomaly detection techniques for high dimensional big data
Anomaly detection in high dimensional data is becoming a fundamental
research problem that has various applications in the real world.
However, many existing anomaly detection techniques fail to retain
sufficient accuracy due to so-called “big data” characterised by
high volume and high velocity, generated by a variety of sources.
This phenomenon of having both problems together can be referred to
as the “curse of big dimensionality,” which affects existing techniques
in terms of both performance and accuracy. To address this gap and to
understand the core problem, it is necessary to identify the unique
challenges brought by the anomaly detection with both high
dimensionality and big data problems. Hence, this survey aims to
document the state of anomaly detection in high dimensional big data
by representing the unique challenges using a triangular model of
vertices: the problem (big dimensionality), techniques/algorithms
(anomaly detection), and tools (big data applications/frameworks).
Authors’ works that fall directly into any of the vertices, or are
closely related to them, are taken into consideration for review. Furthermore,
the limitations of traditional approaches and current strategies of
high dimensional data are discussed along with recent techniques and
applications on big data required for the optimization of anomaly
detection.