Open source Anomaly Detection in Python

Problem Background: I am working on a project that involves log files similar to those found in the IT monitoring space (to my best understanding of IT space). These log files are time-series data, organized into hundreds/thousands of rows of various parameters. Each parameter is numeric (float) and there is a non-trivial/non-error value for each time point. My task is to monitor said log files for anomaly detection (spikes, falls, unusual patterns with some parameters being out of sync, strange 1st/2nd/etc. derivative behavior, etc.).

On a similar assignment, I have tried Splunk with Prelert, but I am exploring open-source options at the moment.

Constraints: I am limiting myself to Python because I know it well, and would like to delay the switch to R and the associated learning curve. Unless there seems to be overwhelming support for R (or other languages/software), I would like to stick to Python for this task.

Also, I am working in a Windows environment for the moment. I would like to continue to sandbox in Windows on small-sized log files but can move to Linux environment if needed.

Resources: I have checked out the following with dead-ends as results:

  1. Some info here is helpful, but unfortunately, I am struggling to find the right package because:

  2. Twitter's AnomalyDetection is in R, and I want to stick to Python. Furthermore, the Python port pyculiarity seems to cause issues in implementing in Windows environment for me.

  3. Skyline, my next attempt, seems to have been pretty much discontinued (from github issues). I haven't dived deep into this, given how little support there seems to be online.

  4. scikit-learn I am still exploring, but this seems to be much more manual. The down-in-the-weeds approach is OK by me, but my background in learning tools is weak, so would like something like a black box for the technical aspects like algorithms, similar to Splunk+Prelert.

Problem Definition and Questions: I am looking for open-source software that can help me with automating the process of anomaly detection from time-series log files in Python via packages or libraries.

  1. Do such things exist to assist with my immediate task, or are they imaginary in my mind?
  2. Can anyone assist with concrete steps to help me to my goal, including background fundamentals or concepts?
  3. Is this the best StackExchange community to ask in, or is Stats, Math, or even Security or Stackoverflow the better options?

EDIT [2015-07-23] Note that the latest update to pyculiarity seems to be fixed for the Windows environment! I have yet to confirm, but should be another useful tool for the community.

EDIT [2016-01-19] A minor update. I have not had time to work on this or research it further, but I am taking a step back to understand the fundamentals of this problem before continuing to research specific details. For example, two concrete steps that I am taking are:

  1. Starting with the Wikipedia articles for anomaly detection, understanding fully, and then either moving up or down in concept hierarchy of other linked Wikipedia articles, such as this, and then this.

  2. Exploring techniques in the great surveys done by Chandola et al 2009 Anomaly Detection: A Survey and Hodge et al 2004 A Survey of Outlier Detection Methodologies.

Once the concepts are better understood (I hope to play around with toy examples as I go to develop the practical side as well), I hope to understand which open source Python tools are better suited for my problems.

EDIT [2020-02-04] It has been a few years since I worked on this problem, and am no longer working on this project, so I will not be following or researching this area until further notice. Thank you very much to all for their input. I hope this discussion helps others that need guidance on anomaly detection work.

FWIW, if I had to do the same project now with the same resources (few thousand USD in expenses), I would pursue the deep learning/neural network approach. The ability of the method to automatically learn structure and hierarchy via hidden layers would've been very appealing since we had lots of data and (now) could spend the money on cloud compute. I would still use Python though ;).


Topic anomaly-detection library python data-mining machine-learning

Category Data Science

There is still an active and developed version of Skyline, just in case someone lands here and is interested.

Skyline (documentation)

I am the current maintainer of the project. It is now far more advanced than the original Etsy version in terms of performance, UI, and handling of seasonality, and it adds an anomalies database, correlation calculation, and the ability to fingerprint and learn non-anomalous patterns.

I am currently at the same stage as you: researching the best option for anomaly detection.

What I have found, I think, matches your needs better than the tools you have already looked at (Twitter's AnomalyDetection and Skyline).

The better option I have found is Numenta's NAB (Numenta Anomaly Benchmark). It has very good community support and, as a plus for you, it is open source and developed in Python. You can add your own algorithm to it.

As for algorithms, I found LOF and CBLOF to be good options.

So check it out; it may help.

If you find a better option, please share.
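
For readers who want to try LOF right away, here is a minimal sketch using scikit-learn's LocalOutlierFactor rather than NAB. The synthetic data, the n_neighbors value, and the contamination value are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (an assumption): 200 "normal" points plus one far-away point.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[6.0, 6.0]]])

# contamination is the expected share of outliers; 1% is an illustrative guess.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)              # -1 = outlier, +1 = inlier

# scikit-learn stores the negated LOF, so flip the sign: larger = more anomalous.
scores = -lof.negative_outlier_factor_
most_anomalous = int(np.argmax(scores))
```

LOF compares the local density around each point with that of its neighbors, so n_neighbors usually needs tuning for the data at hand.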

I recently developed a toolbox: Python Outlier Detection toolbox (PyOD). See GitHub.

It is designed for identifying outlying objects in data with both unsupervised and supervised approaches. PyOD is featured for:

  • Unified APIs, detailed documentation, and interactive examples across various algorithms.
  • Advanced models, including Neural Networks/Deep Learning and Outlier Ensembles.
  • Optimized performance with JIT and parallelization when possible, using numba and joblib. Compatible with both Python 2 & 3 (scikit-learn compatible as well).

If you use PyOD in a scientific publication, we would appreciate citations to the following paper

  title={PyOD: A Python Toolbox for Scalable Outlier Detection},
  author={Zhao, Yue and Nasrullah, Zain and Li, Zheng},
  journal={arXiv preprint arXiv:1901.01588},

It is currently under review at JMLR (machine learning open-source software track). See preprint.

Quick Introduction

PyOD toolkit consists of three major groups of functionalities: (i) outlier detection algorithms; (ii) outlier ensemble frameworks and (iii) outlier detection utility functions.

Individual Detection Algorithms:

  • PCA: Principal Component Analysis (the sum of weighted projected distances to the eigenvector hyperplanes)
  • MCD: Minimum Covariance Determinant (use the mahalanobis distances as the outlier scores)
  • OCSVM: One-Class Support Vector Machines
  • LOF: Local Outlier Factor
  • CBLOF: Clustering-Based Local Outlier Factor
  • LOCI: LOCI: Fast outlier detection using the local correlation integral
  • HBOS: Histogram-based Outlier Score
  • kNN: k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score)
  • AvgKNN: Average kNN (use the average distance to k nearest neighbors as the outlier score)
  • MedKNN: Median kNN (use the median distance to k nearest neighbors as the outlier score)
  • ABOD: Angle-Based Outlier Detection
  • FastABOD: Fast Angle-Based Outlier Detection using approximation
  • SOS: Stochastic Outlier Selection
  • IForest: Isolation Forest
  • Feature Bagging
  • LSCP: LSCP: Locally Selective Combination of Parallel Outlier Ensembles
  • XGBOD: Extreme Boosting Based Outlier Detection (Supervised)
  • AutoEncoder: Fully connected AutoEncoder (use reconstruction error as the outlier score)
  • SO_GAAL: Single-Objective Generative Adversarial Active Learning
  • MO_GAAL: Multiple-Objective Generative Adversarial Active Learning

Outlier Detector/Scores Combination Frameworks:

  • Feature Bagging
  • LSCP: LSCP: Locally Selective Combination of Parallel Outlier Ensembles
  • Average: Simple combination by averaging the scores
  • Weighted Average: Simple combination by averaging the scores with detector weights
  • Maximization: Simple combination by taking the maximum scores
  • AOM: Average of Maximum
  • MOA: Maximization of Average
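
The score-combination ideas in this list can be sketched in a few lines of NumPy. This is not PyOD's implementation: the function names, the fixed contiguous buckets, and the toy score matrix are all assumptions made for illustration. The scores are standardized first so that detectors on different scales are comparable.

```python
import numpy as np

def zscore(scores):
    """Standardize each detector's scores (one column per detector)."""
    return (scores - scores.mean(axis=0)) / scores.std(axis=0)

def combine(scores, method="average", n_buckets=2):
    """Combine an (n_samples, n_detectors) score matrix into one score per sample."""
    if method == "average":
        return scores.mean(axis=1)
    if method == "maximization":
        return scores.max(axis=1)
    # Split the detector columns into contiguous buckets (illustrative choice).
    buckets = np.array_split(np.arange(scores.shape[1]), n_buckets)
    if method == "aom":  # Average of Maximum: max inside each bucket, then average
        return np.mean([scores[:, b].max(axis=1) for b in buckets], axis=0)
    if method == "moa":  # Maximization of Average: average inside each bucket, then max
        return np.max([scores[:, b].mean(axis=1) for b in buckets], axis=0)
    raise ValueError(f"unknown method: {method}")

# Toy raw scores from four hypothetical detectors for three samples.
raw = np.array([[0.1, 0.2, 0.1, 0.3],
                [0.2, 0.1, 0.2, 0.2],
                [0.9, 0.8, 0.1, 0.9]])
std_scores = zscore(raw)
combined = {m: combine(std_scores, m)
            for m in ("average", "maximization", "aom", "moa")}
```

Averaging dampens a single noisy detector, while maximization lets any one detector raise an alarm; AOM and MOA sit between the two.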

Utility Functions for Outlier Detection:

  1. score_to_label(): convert raw outlier scores to binary labels
  2. precision_n_scores(): one of the popular evaluation metrics for outlier mining (precision @ rank n)
  3. generate_data(): generate pseudo data for outlier detection experiment
  4. wpearsonr(): weighted pearson is useful in pseudo ground truth generation

A comparison of all implemented models is available (Figure, Code, Jupyter Notebooks).

If you are interested, please check Github for more information.

Since you have multivariate time series, I would go for a LSTM-RNN implementation that models the dynamics of your system based on training data, which are usually semi-supervised (only normal class included). This means that you train your model to learn what is "normal". During testing, you test both normal and anomalous conditions to see how well the model tells them apart.

An advantage of neural networks is that they "learn" the cross-correlations between input signals by themselves; you do not need to explore them manually. LSTM-RNNs, in particular, are an ideal choice when it comes to time series modelling simply because of their ability to keep memory of previous inputs, similar to a state space model in Control Theory (if you see the analogy).

In Python, it is almost trivial to implement an LSTM-RNN using the Keras API (on top of the TensorFlow backend). The network learns to estimate the signal(s) of interest given an arbitrary number of inputs, and you then compare its estimate with the actual measured value. If there is a "big" deviation, you have an anomaly (provided the model is accurate enough)!
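
The "train on normal data, predict, flag big deviations" loop does not depend on the model being an LSTM. The sketch below deliberately swaps in a much simpler least-squares autoregressive predictor (not an LSTM, and not Keras) so that it runs with NumPy alone. The lag, the threshold rule (4 standard deviations of the training residuals), and the synthetic signal are assumptions.

```python
import numpy as np

def make_windows(series, lag):
    """Build (past-window, next-value) pairs from a 1-D series."""
    X = np.stack([series[i:i + lag] for i in range(len(series) - lag)])
    y = series[lag:]
    return X, y

def fit_predictor(series, lag):
    """Fit a linear one-step-ahead predictor on data assumed to be normal."""
    X, y = make_windows(series, lag)
    A = np.column_stack([X, np.ones(len(X))])       # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, y - A @ coef                       # coefficients, training residuals

def flag_anomalies(series, coef, lag, threshold):
    """Return indices where |measured - predicted| exceeds the threshold."""
    X, y = make_windows(series, lag)
    A = np.column_stack([X, np.ones(len(X))])
    errors = np.abs(y - A @ coef)
    return np.flatnonzero(errors > threshold) + lag  # shift back to series indices

rng = np.random.default_rng(0)
t = np.arange(400)
normal = np.sin(2 * np.pi * t / 50) + rng.normal(0.0, 0.05, size=t.size)
test = np.sin(2 * np.pi * t / 50) + rng.normal(0.0, 0.05, size=t.size)
test[200] += 3.0                                     # injected anomaly

lag = 10
coef, residuals = fit_predictor(normal, lag)
threshold = 4.0 * residuals.std()                    # "big" deviation, by assumption
flagged = flag_anomalies(test, coef, lag, threshold)
```

With an LSTM the structure stays the same: only fit_predictor and the prediction step change. Note that the windows right after a spike also contain the spike, so a few following points may be flagged too.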

I assume the feature you use to detect abnormality is one row of data in a log file. If so, scikit-learn is your friend, and you can use it as a black box. Check the tutorials on one-class SVM and novelty detection.

However, if your feature is an entire log file, you first need to summarize it into a feature vector of fixed dimension, and then apply novelty detection.
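
A minimal sketch of both cases with scikit-learn's OneClassSVM follows. The synthetic rows, the nu value, and the summarize_log helper (a hypothetical function that is not part of scikit-learn) are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def summarize_log(log):
    """Hypothetical helper: reduce an (n_rows, n_params) log to one fixed-length vector."""
    log = np.asarray(log, dtype=float)
    return np.concatenate([log.mean(axis=0), log.std(axis=0),
                           log.min(axis=0), log.max(axis=0)])

# Row-level case: train on rows assumed to be normal, then score new rows.
rng = np.random.default_rng(0)
normal_rows = rng.normal(0.0, 1.0, size=(500, 3))
new_rows = np.vstack([rng.normal(0.0, 1.0, size=(5, 3)),
                      [[8.0, -8.0, 8.0]]])          # last row is clearly unusual

scaler = StandardScaler().fit(normal_rows)
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)  # nu ~ tolerated outlier share
model.fit(scaler.transform(normal_rows))
labels = model.predict(scaler.transform(new_rows))  # +1 = normal, -1 = novelty

# File-level case: one summary vector per log file, then the same model type applies.
file_features = summarize_log(normal_rows)
```

For the file-level case you would build one summary vector per log file, stack them into a matrix, and fit the same kind of model on the summaries of known-good files.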

H2O has an anomaly detection module. Traditionally the code was available in R, but from version 3 onward a similar module is available in Python as well, and since H2O is open source it might fit your bill.

You can see a working example here:

import sys
import h2o

def anomaly(ip, port):
    h2o.init(ip, port)

    print("Deep Learning Anomaly Detection MNIST")

    train = h2o.import_frame(h2o.locate("bigdata/laptop/mnist/train.csv.gz"))
    test = h2o.import_frame(h2o.locate("bigdata/laptop/mnist/test.csv.gz"))

    predictors = list(range(0, 784))
    resp = 784

    # unsupervised -> drop the response column (digit: 0-9)
    train = train[predictors]
    test = test[predictors]

    # train unsupervised Deep Learning autoencoder model on train_hex
    ae_model = h2o.deeplearning(x=train[predictors], training_frame=train, activation="Tanh", autoencoder=True,
                                hidden=[50], l1=1e-5, ignore_const_cols=False, epochs=1)

    # anomaly app computes the per-row reconstruction error for the test data set
    # (passing it through the autoencoder model and computing mean square error (MSE) for each row)
    test_rec_error = ae_model.anomaly(test)

    # Let's look at the test set points with low/median/high reconstruction errors.
    # We will now visualize the original test set points and their reconstructions obtained
    # by propagating them through the narrow neural net.

    # Convert the test data into its autoencoded representation (pass through narrow neural net)
    test_recon = ae_model.predict(test)

    # In python, the visualization could be done with tools like numpy/matplotlib or numpy/PIL

if __name__ == '__main__':
    h2o.run_test(sys.argv, anomaly)

Anomaly Detection or Event Detection can be done in different ways:

Basic Way

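Derivative! If your signal deviates sharply from its past and future values, you most probably have an event. Such events can be extracted by finding large zero crossings in the derivative of the signal, i.e. points where the derivative swings from a large positive value to a large negative one (or vice versa).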
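
A minimal NumPy sketch of the derivative idea: approximate the derivative with first differences and flag the points where it is unusually large. The threshold (3 standard deviations of the differences) and the synthetic signal are assumptions.

```python
import numpy as np

def derivative_events(signal, k=3.0):
    """Return indices where the first difference is more than k std devs from zero."""
    signal = np.asarray(signal, dtype=float)
    diff = np.diff(signal)                 # discrete approximation of the derivative
    threshold = k * diff.std()
    # diff[i] = signal[i + 1] - signal[i], so shift by one to index the later point.
    return np.flatnonzero(np.abs(diff) > threshold) + 1

# Smooth signal with one injected spike at index 120.
signal = np.sin(np.linspace(0.0, 4.0 * np.pi, 200))
signal[120] += 5.0
events = derivative_events(signal)
```

A single spike produces two flagged indices: the jump up into the spike and the drop right after it, which is the positive-to-negative swing of the derivative.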

Statistical Way

The mean of a signal is its usual, basic behavior. If something deviates from the mean, it is an event. Note that the mean of a time series is not trivial: it is not a constant but changes as the series changes, so you need to look at the "moving average" instead of the overall average.

Events are peaks more than 1 standard deviation above the moving average.

The Moving Average code can be found here. In signal processing terminology you are applying a "Low-Pass" filter by applying the moving average.

You can follow the code below:

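import numpy as np

# TimeSEries is your 1-D data; movingaverage() is the helper linked above
MOV = movingaverage(TimeSEries, 5).tolist()
STD = np.std(MOV)
events = []
ind = []
for ii in range(len(TimeSEries)):
    if TimeSEries[ii] > MOV[ii] + STD:
        events.append(TimeSEries[ii])  # value of the detected peak
        ind.append(ii)                 # index of the detected peak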
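
For readers without the linked helper at hand, here is a self-contained sketch of the same check. The moving_average function below is a stand-in built on np.convolve (an assumption, not the linked code), and the toy series is made up for illustration.

```python
import numpy as np

def moving_average(values, window):
    """Centered moving average with the same length as the input (a low-pass filter)."""
    return np.convolve(values, np.ones(window) / window, mode="same")

def detect_peaks(series, window=5, k=1.0):
    """Flag points more than k standard deviations above the moving average."""
    series = np.asarray(series, dtype=float)
    mov = moving_average(series, window)
    std = np.std(mov)                      # same choice as above: std of the smoothed series
    idx = np.flatnonzero(series > mov + k * std)
    return idx, series[idx]

# Flat toy series with a single peak at index 30.
series = np.ones(50)
series[30] = 10.0
ind, events = detect_peaks(series)
```

Here ind holds the positions of the detected events and events their values. With mode="same" the first and last few averages are pulled toward zero, so results near the edges of the series should be treated with care.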

Probabilistic Way

Probabilistic methods are more sophisticated, especially for people new to machine learning. A Kalman filter is a great way to find anomalies. Simpler probabilistic approaches using maximum-likelihood estimation also work well, but my suggestion is to stay with the moving-average idea. It works very well in practice.

I hope I could help :) Good Luck!

