Geolocation-Based Anomaly Detection in IPs Using Isolation Forest

I'm trying to detect anomalies based on geolocation from IP addresses in a server access log file. I created two features, country and geo_velocity, from the IP address and the timestamp of each request. However, since all the requests in the log file I have come from stationary clients in a single country, my dataset ends up looking something like this.

| Country     | geo_velocity|
| ----------- | ----------- |
| USA         | 0           |
| USA         | 0           |
| USA         | 0           |

Basically, if I plotted the whole dataset in a scatterplot, it would collapse onto a single point. Therefore, practically any other value for these features should be an anomaly in this dataset.
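
For context, geo_velocity is the travel speed implied by two consecutive requests from the same client. A minimal sketch of how it could be derived, assuming each request has already been geolocated to (lat, lon) and carries a timestamp (the helper below is illustrative, not my actual pipeline):

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in km between two (lat, lon) points
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def geo_velocity_kmh(prev_req, curr_req):
    # prev_req, curr_req: (lat, lon, timestamp) of consecutive requests
    dist = haversine_km(prev_req[0], prev_req[1], curr_req[0], curr_req[1])
    hours = (curr_req[2] - prev_req[2]).total_seconds() / 3600
    return dist / hours if hours > 0 else 0.0

Since all my clients are stationary, the distance is always zero, which is why geo_velocity is 0 in every row above.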

I used Isolation Forest with GridSearchCV to tune the hyperparameters, and for the scoring parameter in GridSearchCV I used a custom scoring function, as shown in the code below. The problem is that the model classifies everything as an inlier, even points that should clearly be outliers.

from sklearn.ensemble import IsolationForest
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV

# This is similar to my dataset after label-encoding the country
data = np.zeros((5000, 2), dtype=int)
dataset = pd.DataFrame(data, columns=['country', 'geo_velocity'])

# Note: contamination must be 'auto' or a float in (0, 0.5], so 0 is not a valid value
params = {'n_estimators': [70, 80, 100], 'max_samples': ['auto'],
          'contamination': ['auto', 0.001, 0.010, 0.1, 0.5],
          'max_features': [1, 2], 'bootstrap': [True, False],
          'n_jobs': [-1], 'random_state': [None, 1], 'warm_start': [True]}

# Custom scorer for GridSearchCV: mean anomaly score over the data
# (score_samples returns higher values for more "normal" points)
def scorer_f(estimator, X):
    return np.mean(estimator.score_samples(X))

isolation_forest = GridSearchCV(IsolationForest(), params, scoring=scorer_f)
model = isolation_forest.fit(dataset)

best_model = model.best_estimator_

predictions = best_model.predict([[0, 0], [100, 100]])
print(predictions)
# Output: [1 1]

The dataset generated in the above code is similar to the dataset I have after encoding. Even though the second point should clearly be an outlier, the model classifies it as an inlier.

What seems to be wrong with this approach? Thanks in advance!

Topic isolation-forest gridsearchcv unsupervised-learning anomaly-detection scikit-learn

Category Data Science


IsolationForest doesn't score points by Euclidean distance, so [0,0] looks almost as normal as [100,100] to it.

It builds random trees on the dataset and expects that an outlier will be singled out very early in a tree, while inliers will go deep. With that logic, it can identify the outliers.

> The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
> Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.
> This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.
> Random partitioning produces noticeably shorter paths for anomalies.

[scikit-learn documentation]


Your data is very clean but degenerate: every row is identical. Either point, [0,0] or [100,100], can be separated in a single split (or none at all, since every feature in the training data is constant), so the path length, and therefore the anomaly score, comes out the same for both cases.
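
A quick way to see this is to fit an IsolationForest directly on the same degenerate data and inspect the raw scores (a minimal sketch; the printed values are what I would expect here, since every tree degenerates to a single leaf):

from sklearn.ensemble import IsolationForest
import numpy as np

# 5000 identical rows, like the encoded dataset in the question
X = np.zeros((5000, 2), dtype=int)

clf = IsolationForest(random_state=1).fit(X)

# Every feature is constant, so no tree can split: both points
# terminate at the same depth and get identical anomaly scores
print(clf.score_samples([[0, 0], [100, 100]]))      # e.g. [-0.5 -0.5]
print(clf.decision_function([[0, 0], [100, 100]]))  # e.g. [0. 0.]

With identical scores, predict has no basis to call one point an inlier and the other an outlier.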

You may try a Euclidean-distance-based model instead, e.g. LocalOutlierFactor:

from sklearn.neighbors import LocalOutlierFactor

# novelty=True enables predict() on unseen points after fitting
model = LocalOutlierFactor(n_neighbors=10, novelty=True).fit(dataset)

model.predict([[0, 0], [100, 100]])

Output - array([ 1, -1])
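
In novelty mode you can also inspect the margin behind those labels with decision_function (negative values fall in the outlier region); a quick check, with illustrative output:

# Positive score = inlier region, negative = outlier region
print(model.decision_function([[0, 0], [100, 100]]))
# Expect a positive value for [0, 0] and a large negative one for [100, 100]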

Scikit-learn guide for outlier/novelty detection: https://scikit-learn.org/stable/modules/outlier_detection.html
