Anomaly detection on sparse categorical data

I have a large dataset with a column clientid and a categorical column choice. I want to find the clients that have unusual (infrequent) combinations of choices, and I want to be able to flag unusual combinations from new clients immediately in the future.

clientid choice
cl1 a
cl2 b
cl2 c
cl3 d
cl4 b
cl4 c

If I pivot the table by clientid, I get one row per client and one column per choice, which gives a sparse dataset of categorical variables (the choices). Some clients have only one choice and some have several, and I want to find the outlier records (clientid values).
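For illustration, the pivoted layout described above could be built with pandas along these lines. This is only a sketch on the six sample rows; the variable names df and wide are made up for the example.

```python
import pandas as pd

# The six sample rows from the table above
df = pd.DataFrame({
    "clientid": ["cl1", "cl2", "cl2", "cl3", "cl4", "cl4"],
    "choice":   ["a", "b", "c", "d", "b", "c"],
})

# One row per client, one 0/1 indicator column per choice.
# clip(upper=1) guards against a client listing the same choice twice.
wide = pd.crosstab(df["clientid"], df["choice"]).clip(upper=1)
print(wide)
```

Most cells are 0 once there are many distinct choices, which is where the sparsity comes from.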

Which type of algorithm could help with this kind of problem? It is unsupervised, so I don't know which combinations are normal, and the data is sparse and categorical.

Topic sparse unsupervised-learning anomaly-detection categorical-data

Category Data Science


No need for machine learning here.

After you've transposed the dataframe, just count how often each unique combination occurs in the new column, and then rank the combinations by frequency. Set a suitable threshold of "rareness" (like freq=2 below), treat every combination at or below it as rare, and you will have your list of strange combinations.

There's a tool in Pandas for this called value_counts() (available as both Series.value_counts() and DataFrame.value_counts()).

e.g.

combination freq
a,b 1
a,c 1
a,d 1
a,b,c 2
a,b,c,d 10
b,d 10
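A frequency table like the one above could be produced roughly as follows. This is a minimal sketch on the question's sample data: the comma-joined, sorted combination key and the RARE_THRESHOLD name are assumptions for the example, and the cutoff is set to 1 only because the sample is tiny.

```python
import pandas as pd

# The six sample rows from the question
df = pd.DataFrame({
    "clientid": ["cl1", "cl2", "cl2", "cl3", "cl4", "cl4"],
    "choice":   ["a", "b", "c", "d", "b", "c"],
})

# Collapse each client's choices into one order-independent key, e.g. "b,c"
combos = df.groupby("clientid")["choice"].agg(lambda s: ",".join(sorted(set(s))))

# How many clients share each distinct combination, most frequent first
freq = combos.value_counts()

RARE_THRESHOLD = 1  # assumption: tune this to the size of the real dataset
rare_combos = set(freq[freq <= RARE_THRESHOLD].index)

# Clients whose combination is rare
outlier_clients = combos[combos.isin(rare_combos)].index.tolist()
print(freq)
print(outlier_clients)
```

Sorting the choices before joining means that "b,c" and "c,b" count as the same combination.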

Then just compare your new combinations with your "bank of rare combinations", and update the bank when a combination is no longer rare.
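One way to keep such a bank up to date is to store the counts themselves and derive rareness from them at lookup time. The sketch below assumes that design; combo_key, check_and_update and RARE_THRESHOLD are hypothetical names, and the starting counts are copied from the example table.

```python
from collections import Counter

RARE_THRESHOLD = 2  # assumption: same cutoff as freq=2 in the example table

# Frequencies copied from the example table
combo_counts = Counter({
    "a,b": 1, "a,c": 1, "a,d": 1,
    "a,b,c": 2, "a,b,c,d": 10, "b,d": 10,
})

def combo_key(choices):
    """Order-independent key for a client's set of choices."""
    return ",".join(sorted(set(choices)))

def check_and_update(choices, counts=combo_counts, threshold=RARE_THRESHOLD):
    """Return True if the combination was rare before this client, then count it."""
    key = combo_key(choices)
    is_rare = counts[key] <= threshold  # unseen combinations have count 0
    counts[key] += 1
    return is_rare

print(check_and_update(["d", "b"]))       # common combination "b,d"
print(check_and_update(["c", "d"]))       # never seen before
print(check_and_update(["a", "b", "c"]))  # still rare at count 2
print(check_and_update(["a", "b", "c"]))  # count is now 3, above the cutoff
```

Because rareness is recomputed from the counts, a combination stops being flagged as soon as enough new clients have chosen it, with no separate list to maintain.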
