Anomaly detection using clustering of highly correlated Categorical data

My data has two highly correlated columns: for example, if column1 has the value ABC, column2 should be XYZ (ABC--XYZ). If column2 has anything else, it's an anomaly. There are thousands of such combinations. I already tried KModes clustering with the number of clusters equal to the number of unique values in column1. However, the clusters do not have equal density, so some bad data with high density is classified as normal and some good data with low density is marked anomalous.

I want an unsupervised algorithm that I can force to use column1 as the primary criterion for clustering. For each unique value of column1, the column2 value with the highest frequency is good data; the rest is anomalous. Kindly suggest what the best algorithm would be and how to approach this problem.

Topic anomaly-detection scikit-learn categorical-data clustering

Category Data Science


One option is counting patterns: count how often each (column1, column2) combination occurs, then define the less common patterns as anomalies.

The counting approach is deterministic, whereas clustering is probabilistic. It might solve your problem. If not, it will at least provide summary statistics and a baseline model.
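A minimal sketch of the counting approach in pandas might look like the following. The function name `flag_rare_pairs`, the `min_share` threshold of 0.5, and the toy data are all assumptions made for illustration; only the column names come from the question. It computes, for every row, the share of its (column1, column2) pair within all rows that have the same column1 value, and flags pairs whose share falls below the threshold.

```python
import pandas as pd

def flag_rare_pairs(df, key="column1", value="column2", min_share=0.5):
    """Flag rows whose (key, value) pair is uncommon among rows with the same key.

    min_share is an assumed cut-off, not something given in the question.
    """
    # Number of rows for each (key, value) combination.
    pair_counts = df.groupby([key, value]).size().rename("pair_count").reset_index()
    # Number of rows for each key on its own.
    key_counts = df.groupby(key).size().rename("key_count").reset_index()

    # Attach both counts to every row (left merges keep the original row order).
    out = df.merge(pair_counts, on=[key, value], how="left")
    out = out.merge(key_counts, on=key, how="left")

    # Relative frequency of the pair within its key group.
    out["share"] = out["pair_count"] / out["key_count"]
    out["is_anomaly"] = out["share"] < min_share
    return out

# Toy data (made up): XYZ is the usual partner of ABC, UVW the usual partner of DEF.
df = pd.DataFrame({
    "column1": ["ABC"] * 5 + ["DEF"] * 4,
    "column2": ["XYZ", "XYZ", "XYZ", "XYZ", "QQQ", "UVW", "UVW", "UVW", "XYZ"],
})
result = flag_rare_pairs(df)
print(result[result["is_anomaly"]])
```

Because the share is computed within each column1 group, a large group cannot drown out a small one, which is the density issue described in the question. In the toy data, XYZ is normal next to ABC but flagged next to DEF. If only the single most frequent column2 value per column1 should count as good, the threshold test could be swapped for a comparison of `pair_count` against the group maximum; how to treat ties would still have to be decided.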


Your problem is actually a regression problem rather than a general clustering problem: you are looking for the values far away from the regression line, i.e. outliers in the sense of regression. Therefore, fit the regression line and filter out the values with the largest residual errors. These are your "bad" values in the sense that they do not follow the correlation structure given by your two variables.
