Anomaly detection on sparse categorical data

Question

DataLover

March 16, 2021, 16:38

I have a large dataset with a column clientid and a categorical column choice. I want to find the clients that have strange (infrequent) combinations of choices, and to be able to flag strange combinations from new clients immediately in the future.

clientid	choice
cl1	a
cl2	b
cl2	c
cl3	d
cl4	b
cl4	c

If I transpose the table by clientid, I get one row per client and one column per choice, which gives a sparse dataset of categorical variables (the choices). Some clients have only one choice and some have several, and I want to find the outlier records (clientids).

Which type of algorithm could help with this kind of problem? It is unsupervised, so I don't know which combinations are normal, and the data is sparse and categorical.

Topic sparse unsupervised-learning anomaly-detection categorical-data

Category Data Science

WBM · Accepted Answer · March 16, 2021, 16:38

No need for machine learning here.

After you've transposed the dataframe, build one column holding each client's combination of choices, count how many times each unique combination occurs, and rank the combinations by frequency. Set a suitable "rareness" threshold (like freq ≤ 2 in the example below) and you will have your list of strange combinations.

Pandas has a method for this: value_counts(), available on both Series and DataFrame.

e.g.

combination	freq
a,b	1
a,c	1
a,d	1
a,b,c	2
a,b,c,d	10
b,d	10
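As a rough sketch of the counting step, the snippet below uses the toy table from the question rather than the frequencies shown above. The names combos, freq, RARE_MAX_FREQ and strange_clients are made up for illustration, and the threshold of 1 is only an assumption chosen to suit six rows of data. It skips the wide transposed table and joins each client's choices into a single sorted string, which gives the same combinations.

```python
import pandas as pd

# Toy data in the same long format as the question's table
df = pd.DataFrame({
    "clientid": ["cl1", "cl2", "cl2", "cl3", "cl4", "cl4"],
    "choice":   ["a", "b", "c", "d", "b", "c"],
})

# One entry per client: sorted, de-duplicated choices joined into a string,
# so that "b,c" and "c,b" count as the same combination
combos = df.groupby("clientid")["choice"].agg(lambda s: ",".join(sorted(set(s))))

# How many clients share each combination, most frequent first
freq = combos.value_counts()

# Assumed threshold: a combination seen this many times or fewer is "rare"
RARE_MAX_FREQ = 1
rare_combos = set(freq[freq <= RARE_MAX_FREQ].index)

# Clients whose combination falls in the rare set
strange_clients = sorted(combos[combos.isin(rare_combos)].index)
print(freq)
print(strange_clients)
```

On real data the threshold would need tuning, for instance by looking at the distribution of freq before picking a cut-off.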

Then just compare your new combinations against your "bank of rare combinations", and update the bank when a combination is no longer rare.
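One possible way to handle new clients is to keep the running counts themselves instead of a separate list of rare combinations, so the "bank" updates itself as counts grow. This is a minimal sketch: combo_key, check_and_update and RARE_MAX_FREQ are hypothetical names, the starting counts are copied from the example table above, and the threshold of 2 is an assumption.

```python
from collections import Counter

# Assumed threshold: a combination seen this many times or fewer is "rare"
RARE_MAX_FREQ = 2

def combo_key(choices):
    """Order-independent key for a client's set of choices."""
    return ",".join(sorted(set(choices)))

# Historical counts, copied from the example table
counts = Counter({"a,b": 1, "a,c": 1, "a,d": 1,
                  "a,b,c": 2, "a,b,c,d": 10, "b,d": 10})

def check_and_update(choices, counts, rare_max_freq=RARE_MAX_FREQ):
    """Return True if the combination was rare before this client, then count it."""
    key = combo_key(choices)
    is_strange = counts[key] <= rare_max_freq  # unseen combinations count as 0
    counts[key] += 1                           # a combination stops being rare as it recurs
    return is_strange

print(check_and_update(["c", "a", "b"], counts))  # "a,b,c" had been seen twice
print(check_and_update(["b", "d"], counts))       # "b,d" had been seen ten times
```

Because the flag is computed from the counts before the update, a combination that keeps appearing eventually stops being flagged without any manual maintenance.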
