Are cluster features and micro-clusters good summary statistics for outlier detection in high-dimensional data streams?

I'm dealing with outlier detection in data streams. I'm looking for a way to summarize my data and obtain important statistics such as the mean and variance. I want to know whether cluster features or micro-clusters are suitable for this.

Topic anomaly anomaly-detection outlier data-stream-mining clustering

Category Data Science


Traditional clustering algorithms that rely on Euclidean distance fail to yield good results on high-dimensional data because of the curse of dimensionality.

This is because, as the number of dimensions grows, pairwise distances concentrate: the gap between the nearest and the farthest point becomes small relative to the distances themselves, so Euclidean distance, the most common distance used for clustering, loses its meaning.
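To make the effect concrete, here is a minimal sketch that measures how the relative contrast between the farthest and nearest neighbor shrinks with dimension. It assumes uniformly random points in the unit cube, which is only an illustration and not a model of your stream; the function name `relative_contrast` and the chosen dimensions are hypothetical.

```python
import numpy as np

def relative_contrast(n_points, dim, seed=0):
    """(d_max - d_min) / d_min for Euclidean distances from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))   # assumption: uniform data in [0, 1]^dim
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Print the contrast for a few dimensionalities; smaller values mean
# "nearest" and "farthest" are harder to tell apart.
for dim in (2, 10, 100, 1000):
    print(dim, relative_contrast(1000, dim))
```

On real data the effect depends on the intrinsic dimensionality and on correlations between features, so it may be weaker than on uniform noise.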

So if you are using any Euclidean-based clustering algorithm, I would strongly suggest not doing that.

But if your clustering algorithm is not affected by the high-dimensionality problem, such as hierarchical DBSCAN (HDBSCAN), you can do what you are suggesting.


No.

Because assignment to micro-clusters is distance-based, and distances no longer work in high-dimensional data. Most likely, one micro-cluster will become the most central by chance and collect all the samples.
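For reference, a minimal sketch of what such a summary looks like, assuming the usual BIRCH-style cluster feature (count, linear sum, squared sum, here kept per dimension). The class `ClusterFeature` and the helper `nearest` are hypothetical names for illustration, not any library's API. The mean and variance come straight from the sums; the assignment step is the distance-based part that degrades in high dimensions.

```python
import numpy as np

class ClusterFeature:
    """Hypothetical minimal cluster feature: count, linear sum, per-dimension squared sum."""

    def __init__(self, dim):
        self.n = 0
        self.ls = np.zeros(dim)   # linear sum of absorbed points
        self.ss = np.zeros(dim)   # sum of squares, per dimension

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += x * x

    def mean(self):
        return self.ls / self.n

    def variance(self):
        # Population variance per dimension; clip tiny negatives caused by rounding.
        return np.maximum(self.ss / self.n - self.mean() ** 2, 0.0)

def nearest(cfs, x):
    """Index of the micro-cluster whose centroid is closest in Euclidean distance."""
    x = np.asarray(x, dtype=float)
    return int(np.argmin([np.linalg.norm(cf.mean() - x) for cf in cfs]))
```

The summary itself is cheap and exact in any dimension; the weak point is `nearest`, since every incoming point is routed by a Euclidean comparison against the centroids.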
