Real-Time Outlier/Anomaly Detection?

My data consists of usage/playing statistics for players of a specific game. Each data point for a user is the aggregated statistics for one week. The goal is to detect when a player's account was stolen or hacked, or when anything else went wrong. My idea is to keep, for each player, a set of data points that each represent one week, and then check whether the latest week is an outlier relative to that player's cluster. If it is, something is wrong with the account.

My question is: what algorithm or method would suit this situation? I am familiar with clustering and with things like autoencoders, but they don't feel well suited to my problem, because:

  • I have few samples for each user: we can go back 25 weeks, so there are only 25 samples of what is 'right'.
  • I don't need outlier detection over all the data; I only need to tell whether the latest sample is an outlier with respect to the other data points.

Currently I have two ideas:

  • Dixon's Q-test.
  • Simply measuring whether the latest sample is further from the cluster center than all the other samples.
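To make the second idea concrete, here is a minimal sketch. It assumes each week is a numeric feature vector, and the per-feature standardization and the function name `latest_is_farthest` are illustrative choices, not part of any library:

```python
import numpy as np

def latest_is_farthest(weeks):
    """weeks: array of shape (n_weeks, n_features), oldest first, latest last.

    Returns (flag, latest_distance, max_history_distance).
    """
    weeks = np.asarray(weeks, dtype=float)
    history, latest = weeks[:-1], weeks[-1]

    # Cluster center and per-feature spread come from the earlier weeks only,
    # so the week under test cannot pull the center towards itself.
    center = history.mean(axis=0)
    scale = history.std(axis=0, ddof=1)
    scale[scale == 0] = 1.0  # constant features: avoid dividing by zero

    # Standardized Euclidean distance of every week to the center.
    hist_dist = np.linalg.norm((history - center) / scale, axis=1)
    latest_dist = np.linalg.norm((latest - center) / scale)

    return bool(latest_dist > hist_dist.max()), float(latest_dist), float(hist_dist.max())
```

Note that with about 24 reference weeks, an ordinary week would still be the farthest one roughly once in 25 times, so this rule alone produces a steady trickle of false alarms.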

They could work, but they both sound a little 'hacky'. I feel like there should be a more elegant solution for such a relatively simple problem, but my mind is just blanking. What approach would you recommend?

Topic unsupervised-learning anomaly-detection outlier statistics clustering

Category Data Science


You could consider using the Holt-Winters time series model. It captures the trend in a series and can also model a seasonal component, so you can compare the latest week against what the model expects from the preceding weeks. An implementation is available in the statsmodels package.


To me, this sounds like time series anomaly detection. You can follow the example here: https://towardsdatascience.com/time-series-anomaly-detection-with-pycaret-706a6e2b2427.

PyCaret offers over 10 anomaly detection algorithms (it wraps the PyOD library). Try several of them, vary the fraction parameter (the threshold) and check which combination works best for your data. If you have a collection of known anomalies, you can instead train a classifier, provided your dataset has input features that describe the anomalies or you can generate such features yourself.
