Anomaly detection by clustering highly correlated categorical data
My data has two columns that are highly correlated: if column1 has the value ABC, column2 should be XYZ (i.e. ABC--XYZ), and any other value in column2 is an anomaly. There are thousands of such combinations. I already tried KModes clustering with the number of clusters set to the number of unique values in column1. However, the clusters do not have equal density, so some bad data with high density is classified as normal and some good data with low density is marked as anomalous.
I want an unsupervised algorithm that I can force to use column1 as the primary criterion for clustering. For each unique value of column1, the column2 value with the highest frequency is good data; the rest is anomalous. What would be the best algorithm, and how should I approach this problem?
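To make the intended rule concrete, here is a minimal sketch in pandas of "the most frequent column2 value per column1 value is good, the rest is anomalous". It is only an illustration of the rule as stated, not a clustering method and not a tested solution. The function name `flag_minority_pairs`, the output columns `expected_value` and `is_anomaly`, and the toy data are assumptions made up for this example. Ties and missing values are not handled in any deliberate way.

```python
import pandas as pd


def flag_minority_pairs(df, key="column1", value="column2"):
    """Flag rows whose `value` is not the most frequent one for their `key`.

    Hypothetical helper for illustration; names are not from any library.
    """
    # Count how often each (key, value) combination occurs.
    counts = df.groupby([key, value]).size().reset_index(name="pair_count")

    # For each key, keep the combination with the highest count.
    # If two values tie, which one is kept is arbitrary in this sketch.
    expected = (
        counts.sort_values([key, "pair_count"], ascending=[True, False])
        .drop_duplicates(subset=key)
        .rename(columns={value: "expected_value"})[[key, "expected_value"]]
    )

    # Attach the expected value to every row and compare.
    out = df.merge(expected, on=key, how="left")
    out["is_anomaly"] = out[value] != out["expected_value"]
    return out


# Toy data (made up): ABC mostly maps to XYZ, DEF mostly maps to UVW.
df = pd.DataFrame({
    "column1": ["ABC", "ABC", "ABC", "DEF", "DEF", "DEF", "DEF"],
    "column2": ["XYZ", "XYZ", "QRS", "UVW", "UVW", "UVW", "XYZ"],
})

result = flag_minority_pairs(df)
print(result)
```

In this toy data, the rows ABC--QRS and DEF--XYZ are the ones flagged. Because the comparison is made within each column1 group, a rare but consistent combination is not penalized for having low overall density, which is the problem seen with KModes.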
Topic anomaly-detection scikit-learn categorical-data clustering
Category Data Science