How can I detect anomalies/outliers in online streaming data in real time?

Say I have a huge (effectively infinite) stream of data consisting of alternating sine waves and step pulses, one after the other. I want my model to parse the data sequence-wise or point-wise: the first time it has parsed a sine wave and starts seeing step pulses, it should raise an alert for an outlier, but as it keeps parsing it should recognise the alternating sine and step pulses as a normal pattern. If it then faces something outside this trend it should treat that as an outlier, but if the new pattern repeats consistently it should come to treat it as normal again. In other words, my model must "remember" what it saw in the past, to some extent, to predict what is "normal" in the near future, and detect anomalies in my constantly streaming data on that basis.

I've tried implementing a conventional stateless LSTM to meet these requirements, but an LSTM is trained in a supervised fashion: it needs an initial training set and always predicts based on that initial data. So if the pattern it learned during training deviates in the test phase, it keeps treating the test-phase pattern as an outlier, irrespective of how many times that pattern repeats. Simply put, it fails to update itself over time.

I've gone through relevant papers on anomaly detection for online streaming data and found that HTM, implemented by Numenta and tested on the NAB benchmark, is the best solution in this respect, but I am looking for something open source and absolutely free to use.

Being a newbie in this field, I would highly appreciate any existing open-source implementation; writing something from scratch is not preferred, but if required it will be my last option.

Topic: stacked-lstm, unsupervised-learning, anomaly-detection, deep-learning, classification

Category: Data Science


There are two well-known algorithms for outlier detection: Isolation Forest and One-Class SVM. You will find implementations of both in scikit-learn.
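As a starting point, here is a minimal sketch using both estimators on synthetic windows of a sine/step stream; the window length, contamination rate and nu value are illustrative guesses you would tune on your own data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Synthetic stream: mostly sine-wave windows, a few step-pulse windows.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 50)
sines = np.array([np.sin(t + rng.uniform(0, np.pi)) for _ in range(200)])
steps = np.array([np.where(t > np.pi, 1.0, 0.0) for _ in range(10)])
X = np.vstack([sines, steps])  # each row is one 50-sample window

# Isolation Forest isolates points with random splits; contamination is
# a guess at the fraction of outliers present in the data.
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
print(iso.predict(X[-10:]))  # -1 = outlier, +1 = inlier

# One-Class SVM learns a boundary around the "normal" sine windows only.
ocsvm = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(sines)
print(ocsvm.predict(steps))  # step windows should fall outside the boundary
```

Note that both are batch learners. To approximate your streaming setting you would have to refit them periodically on a sliding window of recent data, which also gives you, crudely, the behaviour where a new pattern that keeps repeating eventually becomes normal.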

Searching GitHub for "Anomaly Detection", there seem to be entries to the NAB competition available publicly, e.g. nareshkumar66675/Numenta. That one has a Jupyter notebook which mainly uses scikit-learn and some custom, but simple, feature engineering, and it may serve your purpose. Although the author has not included licensing information, it seems simple enough to re-implement.

However, as I understand it, the NAB datasets are more about "time series" detection, i.e. a signal is an anomaly if it is very different from previous/recent values. That framing has no notion of patterns in the data, such as sine waves following step pulses, and does not include learning larger patterns as the dataset grows in size.

I'm not aware of algorithms solving your specific problem, though they may well exist in the literature. The key issue is that you cannot decide whether a long sequence is an anomaly until you've seen enough data, and the space of candidate patterns can suffer from combinatorial explosion.

The sines and pulses of your problem can be replaced with 0s and 1s, so your problem becomes one of detecting patterns in strings; a sketch of this reduction follows below. Genomics is concerned with patterns in DNA, so that body of work may have what you need. (Note that this is very different from genetic algorithms.)
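To make the string view concrete, here is a hedged sketch under two assumptions I am making up for illustration: windows can be labelled sine-like or step-like by how much time the signal spends pinned at its extremes, and a pattern counts as "normal" once it has recurred a few times. All thresholds are arbitrary:

```python
import numpy as np
from collections import Counter

def symbolize(stream, window=50):
    """Map each fixed-length window of a NumPy array to '0' (sine-like)
    or '1' (step-like). Crude heuristic: a step pulse spends nearly all
    its time pinned at its extremes; a sine sweeps through its range."""
    symbols = []
    for i in range(0, len(stream) - window + 1, window):
        w = stream[i:i + window]
        mid = (w.min() + w.max()) / 2
        half_range = (w.max() - w.min()) / 2
        pinned = np.mean(np.abs(w - mid) > 0.9 * half_range)
        symbols.append('1' if pinned > 0.8 else '0')
    return ''.join(symbols)

def ngram_anomalies(symbols, n=2, min_count=3):
    """Flag each n-gram as anomalous until it has recurred min_count
    times, after which it is treated as normal again."""
    counts, flags = Counter(), []
    for i in range(len(symbols) - n + 1):
        gram = symbols[i:i + n]
        counts[gram] += 1
        flags.append(counts[gram] <= min_count)  # True = still anomalous
    return flags
```

The counting step is the part that matches your requirement: a pattern raises an alert the first few times it appears and quietly becomes "normal" once it keeps repeating.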

There is an older family of algorithms known variously as Market Basket Analysis, the Apriori algorithm, or association rule mining, which has the flavor of increasing set size, though not of anomaly detection. See this video explaining it. Apriori creates sets of items commonly bought together: when you have small amounts of data you can reliably create only small patterns, and as the amount of data increases you can create larger patterns.
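For a quick feel of Apriori, here is a toy run using the open-source mlxtend implementation (the baskets are made-up data and min_support is arbitrary):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Toy transactions: items bought together in four shopping baskets.
transactions = [
    ['bread', 'milk'],
    ['bread', 'milk', 'eggs'],
    ['milk', 'eggs'],
    ['bread', 'milk', 'eggs'],
]

# One-hot encode the baskets into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets: with more transactions, larger itemsets can clear
# the support threshold, mirroring "larger patterns as the data grows".
print(apriori(onehot, min_support=0.5, use_colnames=True))
```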
