Machine Learning algorithm for detecting anomalies in large sets of events
Let's start with the following hypothetical preconditions:
- There is traffic: normal and anomaly. Each traffic sample contains a list of events (of variable size)
- Events happen in order, the possible events set size is ~40000 elements
- Should run on relatively small amounts of memory and processing power
Having a traffic sample (of size 1000 events max), what is the best machine learning algorithm, that fits the preconditions, to identify whether it's an anomaly?
Given my limited knowledge in machine learning algorithms, here is what I came up with:
This system can be described very well as a markov process, but there are huge memory limitations in this scenario
1. Reduced Markov Chains
Store frequent pairs of events (that appeared more than 10 times in normal traffic) and then search the some pair there: if it doesn't appear, count as an anomaly. Then use some heuristic to identify if the traffic as a whole is an anomaly.
I called it reduced, because practically, we are using only a chain of two events, any other bigger size would become a huge combinatorial problem and fill any memory it's been given, which is infeasible.
2. Naive KNN
Get all the normal traffic samples (each sample may contain up to 1000 events) and analyze the number of appearances of each event in each sample. Separate the dataset into 10 parts, compute their means to get the mean frequency for each part (basically, we now have 10 mean-frequency vectors) and use them as positive data-points inside the KNN algorithm
Do the same with anomaly traffic and add 10 data points. Having the points, we can now use the KNN algorithm regression to compute a score and make a decision.
This is a bit tricky, because the frequency vectors are quite big, have too many of them becomes a problem. A solution would be to implement sparse vectors
Any other ideas, what am I missing?
Topic k-nn markov-process time-series machine-learning
Category Data Science