Geolocation Based Anomaly Detection in IPs Using Isolation Forest
I'm trying to detect geolocation-based anomalies from the IP addresses in a server access log. I have created two features, country and geo_velocity, from the IP address and the timestamp of each request. However, since all the requests in my log file come from stationary clients in a single country, my dataset ends up looking something like this:
| Country | geo_velocity|
| ----------- | ----------- |
| USA | 0 |
| USA | 0 |
| USA | 0 |
Basically, if I plot the whole dataset as a scatterplot, it collapses onto a single point. Therefore, any other value of these features should literally be an anomaly for this dataset.
I used Isolation Forest with GridSearchCV to tune the hyperparameters, passing a custom scoring function as the scoring parameter, as shown in the code below. The problem is that the model classifies everything as an inlier, even points that should clearly be outliers.
from sklearn.ensemble import IsolationForest
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
# this is similar to my dataset after label encoding the country
data = np.zeros((5000,2),dtype=int)
dataset = pd.DataFrame(data, columns=['country','geolocation'])
params = {'n_estimators': [70, 80, 100], 'max_samples': ['auto'],
          'contamination': [0, 0.001, 0.010, 0.1, 0.5], 'max_features': [1, 2],
          'random_state': [None, 1], 'warm_start': [True]}
def scorer_f(estimator, X):
    # mean anomaly score over the samples (higher means more "normal")
    return np.mean(estimator.score_samples(X))
isolation_forest = GridSearchCV(IsolationForest(), params, scoring=scorer_f)
model = isolation_forest.fit(dataset)
best_model = model.best_estimator_
predictions = best_model.predict([[0,0],[100,100]])
# output: [1, 1]
The dataset generated in the above code is similar to the dataset I have after encoding. Even though the second point should clearly be an outlier, the model classifies it as an inlier.
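The behaviour can also be reproduced without the grid search. The following is a minimal sketch, not the exact setup above: it assumes default IsolationForest settings apart from n_estimators, a fixed random_state of 1, and NumPy arrays instead of a DataFrame; the names X_train, X_test, mean_score and results are only for illustration. It compares the raw scores of a training-like point and a far-away point, and evaluates the same quantity that scorer_f computes for several forest sizes. If the trees cannot split on constant features, both points would be expected to receive the same score, and the scorer would give GridSearchCV nothing to distinguish the candidates by.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Same shape as the encoded dataset in the question: every row is (0, 0).
X_train = np.zeros((5000, 2), dtype=int)

# One point identical to the training data and one far away from it.
X_test = np.array([[0, 0], [100, 100]])


def mean_score(estimator, X):
    # Same quantity as scorer_f in the question: mean of score_samples.
    return float(np.mean(estimator.score_samples(X)))


results = {}
for n_estimators in (70, 80, 100):
    forest = IsolationForest(n_estimators=n_estimators, random_state=1)
    forest.fit(X_train)
    results[n_estimators] = {
        "test_scores": forest.score_samples(X_test),   # raw anomaly scores
        "predictions": forest.predict(X_test),         # 1 = inlier, -1 = outlier
        "mean_train_score": mean_score(forest, X_train),
    }
    print(n_estimators, results[n_estimators])
```

Comparing the two entries of test_scores for each forest size shows whether the model separates [100, 100] from [0, 0] at all, independently of the contamination threshold.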
What seems to be wrong with this approach? Thanks in advance!