Anomaly (Outlier) Detection with Isolation Forest too sensitive even with low contamination

I'm trying to use the sklearn implementation of the Isolation Forest algorithm to detect anomalies in my time series data. However, even with a very low contamination parameter (0.0001), it is detecting things that should not be outliers in my opinion, as shown in the picture below:

While this is the highest peak of the data, it doesn't really seem anomalous to me. How can I configure an Isolation Forest to only flag samples that are drastically different from the other samples? Or maybe an Isolation Forest is not the right way to go here? All help appreciated.


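The trick is not in fine-tuning the contamination parameter, but in checking the scores of the detected anomalies. By default (contamination='auto') the algorithm treats every sample whose score_samples value is below -0.5 as an anomaly. I inspect the scores myself and only consider samples below -0.6 to be anomalies. As far as I remember, the original paper suggests a limit of 0.6; score_samples returns the negated score from the paper, which is why the cutoff is -0.6 here.

This is how you can get the scores and filter them:

from sklearn.ensemble import IsolationForest

model = IsolationForest()
result = model.fit_predict( data )      # -1 = anomaly, 1 = inlier
score = model.score_samples( data )     # lower = more anomalous
anomalies = data[ score < -0.6 ]        # keep only the stricter cutoff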
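To make the difference between the default cutoff and the stricter one concrete, here is a minimal self-contained sketch. The data is a synthetic stand-in (a noisy sine wave with one injected spike), not the asker's series, and the -0.6 cutoff is the hand-picked value from above, so treat both as assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the time series (an assumption, not the real data):
# a noisy sine wave with one injected spike.
rng = np.random.RandomState(0)
values = np.sin(np.linspace(0, 20, 500)) + 0.1 * rng.randn(500)
values[250] = 8.0
data = values.reshape(-1, 1)  # scikit-learn expects a 2-D array

model = IsolationForest(random_state=0)  # contamination="auto" by default
labels = model.fit_predict(data)         # -1 = anomaly, 1 = inlier
scores = model.score_samples(data)       # lower = more anomalous

default_mask = labels == -1   # with "auto" this is the same as scores < -0.5
strict_mask = scores < -0.6   # stricter, hand-picked cutoff

print("flagged by default cutoff:", default_mask.sum())
print("flagged below -0.6:      ", strict_mask.sum())
```

Every sample kept by the stricter cutoff is also flagged by the default one, so raising the bar can only shrink the set of reported anomalies.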

Is the time series you show your training set? If so, and the contamination parameter is not strictly 0, the Isolation Forest will classify at least one training sample as an anomaly by construction, because the decision threshold is chosen as a percentile of the training scores. Below is a quick example showing that a very small contamination (but still above 0) gives you one anomaly on the training set:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# toy training data (assumed for this example): two Gaussian clusters
rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng, contamination=0.00001)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
X_error_train = X_train[y_pred_train == -1]
# plot the decision function and the samples
xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z)
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white',
                 s=20, edgecolor='k')
c = plt.scatter(X_error_train[:, 0], X_error_train[:, 1], c='red',
                s=40, edgecolor='k')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.legend([b1, c],
           ["training observations", "training samples considered anomalies"],
           loc="upper left")
plt.show()
# count the training samples flagged as anomalies
n_error_train = y_pred_train[y_pred_train == -1].size
print('number of anomalies: ', n_error_train)
print('error train ratio approx. contamination: ', n_error_train/len(X_train))

With contamination = 0.00001: [figure: training observations, with the training samples considered anomalies in red]

And now with contamination = 0.0001:

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng, contamination=0.0001)

[figure: training observations, with the training samples considered anomalies in red]

As you can see, the resulting ratio of anomalies is above the one defined by the contamination parameter. Nevertheless, if you set contamination to strictly 0, you get: [figure: the same plot with contamination = 0]
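The effect of the contamination value on the training set can also be checked without plotting. The sketch below is only an illustration on assumed toy data (two Gaussian clusters, nothing injected); it refits the same forest with several contamination values and counts how many training samples end up labelled -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Assumed toy data: two Gaussian clusters, no deliberately injected outliers.
rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]

counts = {}
for contamination in (0.05, 0.01, 0.0001, 0.00001):
    # Same integer seed every time, so only the threshold changes.
    clf = IsolationForest(max_samples=100, random_state=0,
                          contamination=contamination)
    y_pred = clf.fit_predict(X_train)
    counts[contamination] = int((y_pred == -1).sum())
    print(contamination, counts[contamination],
          counts[contamination] / len(X_train))
```

Because the forest is identical across runs and only the percentile threshold moves, the number of flagged training samples can only stay the same or drop as the contamination value gets smaller.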

I suggest you test it on an independent set of data (one not used for training).
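A minimal sketch of that idea, assuming a univariate series and a simple chronological split (the synthetic series, the 75/25 split, and the contamination value are all illustrative choices):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Assumed stand-in series: a noisy sine wave, reshaped to a 2-D array.
rng = np.random.RandomState(1)
series = np.sin(np.linspace(0, 30, 400)) + 0.1 * rng.randn(400)
X = series.reshape(-1, 1)

# Chronological split: fit on the first 75%, score the remaining 25%.
split = int(0.75 * len(X))
X_train, X_test = X[:split], X[split:]

clf = IsolationForest(random_state=0, contamination=0.0001)
clf.fit(X_train)

y_pred_test = clf.predict(X_test)        # -1 = anomaly, 1 = inlier
n_flagged = int((y_pred_test == -1).sum())
print("held-out samples flagged:", n_flagged, "of", len(X_test))
```

Unlike on the training set, nothing forces a sample of the held-out part to be flagged, so the count here reflects how unusual the new data looks relative to what the model has seen.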

Another interesting algorithm for novelty detection is the One-Class Support Vector Machine; you can find a worked-out example here.
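For orientation, a One-Class SVM in scikit-learn is used much like the Isolation Forest. This is only a rough sketch on assumed toy data; the kernel, gamma, and nu values are illustrative and would need tuning on real data.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Assumed toy data: train on one Gaussian cluster, then score new points.
rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(200, 2)
X_new = np.r_[0.3 * rng.randn(20, 2),                     # similar to training
              rng.uniform(low=-4, high=4, size=(20, 2))]  # scattered points

# nu is an upper bound on the fraction of training samples treated as errors.
ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05)
ocsvm.fit(X_train)

y_pred_new = ocsvm.predict(X_new)  # -1 = novelty, 1 = regular
print("new samples flagged:", int((y_pred_new == -1).sum()), "of", len(X_new))
```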

