Open source Anomaly Detection in Python
Problem Background: I am working on a project that involves log files similar to those found in the IT monitoring space (to the best of my understanding of the IT space). These log files are time-series data, organized into hundreds or thousands of rows of various parameters. Each parameter is numeric (float), and there is a non-trivial, non-error value at each time point. My task is to monitor these log files for anomalies (spikes, falls, unusual patterns where some parameters are out of sync, strange 1st/2nd/etc. derivative behavior, etc.).
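To make the kinds of anomalies above concrete, here is a minimal sketch of a rolling z-score check applied to both the raw values and their first difference (a rough stand-in for the 1st derivative). It assumes pandas and NumPy are available; the function name `rolling_zscore_flags`, the window of 30 points, the threshold of 4 standard deviations, and the synthetic data are all illustrative assumptions, not something taken from the actual log files.

```python
import numpy as np
import pandas as pd


def rolling_zscore_flags(series, window=30, threshold=4.0):
    """Flag points that deviate strongly from the preceding window.

    Hypothetical helper for illustration; window and threshold are guesses
    that would need tuning on real data.
    """
    # Shift by one so the current point does not influence its own baseline.
    baseline = series.shift(1).rolling(window, min_periods=window)
    z = (series - baseline.mean()) / baseline.std()
    # Comparisons against NaN (not enough history yet) evaluate to False.
    return z.abs() > threshold


# Synthetic stand-in for one numeric parameter from a log file.
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(10.0, 1.0, 500))
values.iloc[300] += 20.0  # injected spike

level_flags = rolling_zscore_flags(values)         # spikes and falls in the level
diff_flags = rolling_zscore_flags(values.diff())   # jumps in the first difference

print(values[level_flags])
```

The same helper could be applied column by column to a DataFrame of parameters. It only catches point anomalies in a single parameter; it says nothing about parameters drifting out of sync with each other.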
On a similar assignment, I have tried Splunk with Prelert, but I am exploring open-source options at the moment.
Constraints: I am limiting myself to Python because I know it well, and would like to delay the switch to R and the associated learning curve. Unless there seems to be overwhelming support for R (or other languages/software), I would like to stick to Python for this task.
Also, I am working in a Windows environment for the moment. I would like to continue to sandbox in Windows on small log files, but I can move to a Linux environment if needed.
Resources: I have checked out the following with dead-ends as results:
Some info here is helpful, but unfortunately, I am struggling to find the right package because:
- Twitter's AnomalyDetection is in R, and I want to stick to Python. Furthermore, the Python port pyculiarity seems to have problems running in a Windows environment, at least for me.
- Skyline, my next attempt, seems to have been pretty much discontinued (judging from its GitHub issues). I haven't dived deep into this, given how little support there seems to be online.
- scikit-learn I am still exploring, but it seems to be much more manual. The down-in-the-weeds approach is OK by me, but my background in learning tools is weak, so I would like something closer to a black box for the technical aspects such as algorithms, similar to Splunk+Prelert.
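For reference, the scikit-learn route can be fairly close to a black box. Below is a minimal sketch using `sklearn.ensemble.IsolationForest` on two synthetic parameters that normally move together, with one row deliberately put out of sync. The data, the `contamination=0.01` setting, and the choice of Isolation Forest itself are assumptions for illustration; this is not a recommendation tuned to the log files described above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in: two parameters that normally track each other closely.
rng = np.random.default_rng(0)
base = rng.normal(0.0, 1.0, 1000)
X = np.column_stack([base, base + rng.normal(0.0, 0.1, 1000)])
X[500] = [3.0, -3.0]  # injected row where the two parameters disagree

# contamination is the assumed fraction of anomalous rows (a guess here).
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
labels = model.fit_predict(X)        # -1 = flagged as anomaly, 1 = normal
scores = model.decision_function(X)  # lower score = more anomalous

print("flagged rows:", np.flatnonzero(labels == -1))
print("score of injected row:", scores[500])
```

Note that this treats each row independently and ignores time ordering. To make it sensitive to temporal behavior, one option would be to add lagged values or differences as extra columns before fitting.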
Problem Definition and Questions: I am looking for open-source software that can help me with automating the process of anomaly detection from time-series log files in Python via packages or libraries.
- Do such things exist to assist with my immediate task, or are they imaginary in my mind?
- Can anyone assist with concrete steps to help me to my goal, including background fundamentals or concepts?
- Is this the best StackExchange community to ask in, or would Stats, Math, or even Security or Stack Overflow be a better option?
EDIT [2015-07-23] Note that the latest update to pyculiarity seems to have fixed the problems in the Windows environment! I have yet to confirm this, but it should be another useful tool for the community.
EDIT [2016-01-19] A minor update. I have not had time to work on or research this, but I am taking a step back to understand the fundamentals of the problem before continuing to research specific details. For example, two concrete steps that I am taking are:
- Starting with the Wikipedia articles for anomaly detection, understanding fully, and then either moving up or down in concept hierarchy of other linked Wikipedia articles, such as this, and then this.
- Exploring techniques in the great surveys done by Chandola et al 2009 Anomaly Detection: A Survey and Hodge et al 2004 A Survey of Outlier Detection Methodologies.
Once the concepts are better understood (I hope to play around with toy examples as I go to develop the practical side as well), I hope to understand which open source Python tools are better suited for my problems.
EDIT [2020-02-04] It has been a few years since I worked on this problem, and I am no longer working on this project, so I will not be following or researching this area until further notice. Thank you all very much for your input. I hope this discussion helps others who need guidance on anomaly detection work.
FWIW, if I had to do the same project now with the same resources (few thousand USD in expenses), I would pursue the deep learning/neural network approach. The ability of the method to automatically learn structure and hierarchy via hidden layers would've been very appealing since we had lots of data and (now) could spend the money on cloud compute. I would still use Python though ;).
Cheers!
Topic anomaly-detection library python data-mining machine-learning
Category Data Science