Anomaly detection - relation between thresholds and anomalies
I'm developing an anomaly detection program in Python.
The main idea is to create a new LSTM model every day, train it on the previous 7 days, and predict the next day.
Then, using thresholds, find anomalies day by day.
I've already implemented that and these thresholds are working well:
upper threshold = trimmed_mean + (K * interquartile_range)
lower threshold = trimmed_mean - (K * interquartile_range)
where trimmed_mean and interquartile_range are calculated on the prediction error (real curve minus predicted one) and K is set to 5.
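For concreteness, the thresholding scheme above can be sketched as follows. The trimming fraction (10% from each tail) is my assumption, since the post does not state which one is used:

```python
import numpy as np
from scipy.stats import trim_mean, iqr

def thresholds(errors, k=5.0, trim=0.1):
    """Anomaly thresholds from prediction errors.

    `trim` (fraction cut from each tail of the trimmed mean) is an
    illustrative choice, not taken from the post.
    """
    errors = np.asarray(errors, dtype=float)
    center = trim_mean(errors, proportiontocut=trim)  # robust center
    spread = iqr(errors)                              # Q3 - Q1
    return center - k * spread, center + k * spread

# Synthetic example: small errors plus one large spike at the end
errs = np.concatenate([np.random.default_rng(0).normal(0, 0.1, 287), [3.0]])
lo, hi = thresholds(errs)
anomalies = (errs < lo) | (errs > hi)
```

With K = 5 only the injected spike falls outside the band, which matches the intent of flagging peaks.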
What I would like to know is whether there is a rule or a method to configure the value of K correctly, because so far my approach has been to adjust the thresholds by looking at false positives.
I'm looking for any relation between the anomalies and how the thresholds should be set.
I already tried calculating the AUC (area under the curve) for each model and looking for a relation, but without success. Since I create a new model every day and on most days there are no anomalous values, I'm not able to compute the true positive rate correctly.
Thank you
EDIT
Based on Ben's comment, I'm going to add details about my problem.
Let me start by explaining better what an anomalous value is for me.
So, I'm analyzing a time series built from kWh readings acquired every 5 minutes, i.e. 288 records per day.
An anomalous value is basically a peak: a value that is very different from the others.
Furthermore, I need to create a model day by day because the customer does not have much data, and I also need to capture seasonality.
Last but not least, I'm working on a univariate problem, so I only have the acquisition value and its timestamp, that's it.
EDIT REPLYING TO BEN'S QUESTIONS
First of all, yes, kWh means power consumption.
I have another type of anomaly: blocked (stuck) values (e.g. 0, 0, 0, etc.), but I think I can identify them by adding a simple rule that detects these situations.
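Such a "simple rule" could be a run-length check on (near-)constant readings. This is a sketch of one possible rule; the minimum run length and tolerance are illustrative parameters, not taken from the post:

```python
import numpy as np

def stuck_runs(values, min_len=3, tol=0.0):
    """Flag indices belonging to a run of (near-)constant readings.

    `min_len` (shortest run considered 'blocked') and `tol` (how close
    two readings must be to count as equal) are assumed parameters.
    """
    v = np.asarray(values, dtype=float)
    flagged = np.zeros(len(v), dtype=bool)
    start = 0
    for i in range(1, len(v) + 1):
        # Close the current run when the series ends or the value changes
        if i == len(v) or abs(v[i] - v[start]) > tol:
            if i - start >= min_len:
                flagged[start:i] = True
            start = i
    return flagged

flags = stuck_runs([1.0, 0.0, 0.0, 0.0, 2.0])
```

Here the three consecutive zeros are flagged while the surrounding values are not.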
About seasonality:
Each new model is trained on the previous 7 days, and this time window moves forward day by day. So, every day the model is deleted and retrained using just the previous 7 days.
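The rolling scheme above can be sketched as follows, with the LSTM fit/predict step replaced by a placeholder callable so the loop structure is runnable on its own (the naive per-slot average predictor is purely illustrative):

```python
import numpy as np

SAMPLES_PER_DAY = 288  # 24 h * 12 five-minute slots
WINDOW_DAYS = 7

def rolling_predict(series, predict_day):
    """For each day after the first 7, use the previous 7 days as the
    training window and predict the next day. `predict_day` stands in
    for the real fit-then-predict step of the daily LSTM."""
    n_days = len(series) // SAMPLES_PER_DAY
    preds = {}
    for d in range(WINDOW_DAYS, n_days):
        train = series[(d - WINDOW_DAYS) * SAMPLES_PER_DAY : d * SAMPLES_PER_DAY]
        preds[d] = predict_day(train)
    return preds

# Naive stand-in: predict each 5-minute slot as its average over the window
def naive_predictor(train):
    return train.reshape(WINDOW_DAYS, SAMPLES_PER_DAY).mean(axis=0)

# 9 identical synthetic days, so the window always covers full days
series = np.tile(np.sin(np.linspace(0, 2 * np.pi, SAMPLES_PER_DAY)), 9)
preds = rolling_predict(series, naive_predictor)
```

Only days 8 and 9 (indices 7 and 8) get predictions, since the first 7 days are needed to fill the window.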
I already handled the issue you mentioned (training the new model on anomalous data) by substituting each anomalous value with a new value calculated using the difference between the next value and the previous value.
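One reading of that substitution rule is linear interpolation between the neighbouring non-anomalous values; that interpretation is an assumption on my part, since the post's phrasing is ambiguous:

```python
import numpy as np

def replace_anomalies(values, is_anomaly):
    """Replace flagged points before retraining.

    Interprets the rule as linear interpolation between the nearest
    good neighbours (an assumed reading of the post's description).
    """
    v = np.asarray(values, dtype=float).copy()
    idx = np.arange(len(v))
    good = ~np.asarray(is_anomaly, dtype=bool)
    # np.interp fills each flagged index from the surrounding good points
    v[~good] = np.interp(idx[~good], idx[good], v[good])
    return v

cleaned = replace_anomalies([1.0, 2.0, 50.0, 4.0], [False, False, True, False])
```

The spike at index 2 is replaced by the midpoint of its neighbours, so the retrained model never sees the anomalous value.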
Here are a couple of plots of my data:
image 1
image 2
where the green line is the prediction and the blue one is the real time series.