huge doubt on anomaly detection

Question


Ram Varun

May 12, 2022 04:04

To the naked eye, it is obvious that network usage is high in region 5161, so that is the anomaly in my case. Why, then, do we need to apply k-means and other machine learning algorithms to find anomalies in our data?

Topic data-science-model anomaly-detection bigdata data-mining machine-learning

Category Data Science

earthgecko · Accepted Answer · October 5, 2018 06:48

@ram-varun because we have too much data.

Whether you are analysing a real-time data stream or a historic data set, there is too much data for a person to look through and analyse visually.

Your example shows network usage, which means there are probably a lot of other metrics too: lo txpackets, lo rxpackets, eth0, cpu user, cpu system, etc., and that is just for one device. More often than not, that device will belong to a population of more than one device.

Automated anomaly detection by a machine process is the only reasonable way to do it.

As for having to apply K-means or other ML algorithms to find anomalies in data in general, these are not the only options, just the fashionable ones. Many of the ML algorithms are computationally expensive and there are other simpler means to find anomalies or outliers which can be almost as effective as any machine learning algorithms, but much faster and cheaper.
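As a rough illustration of one such "simpler means", here is a minimal sketch of a median/MAD (median absolute deviation) outlier check using only the Python standard library. The function name `mad_outliers`, the threshold of 3.5, and the sample values are all assumptions made up for this example; they are not taken from the question's data.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return indices of points whose modified z-score exceeds the threshold."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data is roughly normally distributed
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical network-usage samples with one obvious spike
usage = [12, 14, 13, 15, 11, 14, 13, 95, 12, 14]
print(mad_outliers(usage))  # [7]
```

This is a single pass over the data plus two medians, so it is cheap to run across many metrics and devices. The threshold is a tuning knob, not a universal constant.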

As for K-means specifically, clustering has been shown to be ineffective for anomaly detection for quite some time now; see the 2005 paper by Eamonn Keogh and Jessica Lin, "Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research": http://www.cs.ucr.edu/~eamonn/meaningless.pdf However, there have been attempts to show that it works on test data sets: http://amid.fish/anomaly-detection-with-k-means-clustering

However, the real problem here is not which method(s) you use for anomaly detection. The real problem is: what is an anomaly?

Whatever method(s) you use, you are always going to detect things that "seem" anomalous but are not actually anomalies at all: the false positives. For example, that network usage (local activity) could be a normal but infrequent occurrence in a large time window, such as the device dumping a backup or pulling down an update.

Arpit Sisodia · Accepted Answer · October 1, 2018 04:32

@Ram, you are correct. You might not need any algorithm to detect known abnormalities like these.

1) But even if you apply K-means, it will give you the same result, and it will do so automatically.

2) It seems you have just one variable, total activities. Things get messier when you have multiple variables; then you might have to look into LOF, ABOD, or other algorithms, after identifying what exactly counts as an abnormality for you. Have a look here:

https://machinelearningstories.blogspot.com/2018/07/anomaly-detection-anomaly-detection-by.html

3) In ML, you often do not even know what exactly an abnormality is. Here you assume that high activity is the abnormality. Depending on the problem statement, we need an unsupervised algorithm to identify possible abnormalities.
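To make point 1 concrete, here is a minimal sketch of K-means on a single variable, written in plain Python so that it is self-contained. The helper `kmeans_1d`, the sample values, and the rule "treat the smaller cluster as the anomalous one" are assumptions for illustration only; a real setup would more likely use a library implementation and a rule chosen for the problem at hand.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Tiny 1-D k-means (requires k >= 2); returns (centroids, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic start: spread the centroids evenly between min and max
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels

# Hypothetical "total activities" samples with a short burst of high activity
activity = [20, 22, 19, 21, 23, 20, 180, 175, 21, 22]
centroids, labels = kmeans_1d(activity)

# Assumption: the cluster with the fewest points is the anomalous one
counts = [labels.count(c) for c in range(len(centroids))]
rare = counts.index(min(counts))
anomalies = [i for i, lab in enumerate(labels) if lab == rare]
print(anomalies)  # [6, 7]
```

On data like this the result matches what the eye picks out, which is the point being made: the algorithm adds automation here, not insight.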
