Incorrect multi-variate anomaly detection - Isolation Forest Python

My data looks like the sample below. It has 333 rows and 2 columns. The first row is clearly an anomaly.

ndf:

+----+---------+-------------+
|    | ROW_CNT |    TOT_SALE |
+----+---------+-------------+
|  0 |      45 |     1411.27 |
+----+---------+-------------+
|  1 |   47754 |  1596200.68 |
+----+---------+-------------+
|  2 |  105894 |  3750304.55 |
+----+---------+-------------+
|  3 |  372953 | 14368324.86 |
+----+---------+-------------+
|  4 |  389915 | 14899302.85 |
+----+---------+-------------+
|  5 |  379473 | 14696309.67 |
+----+---------+-------------+
|  6 |  388571 | 14679457.93 |
+----+---------+-------------+
|  7 |  234409 |  8226472.95 |
+----+---------+-------------+
|  8 |   50587 |  1673114.75 |
+----+---------+-------------+
|  9 |  383779 | 14614106.80 |
+----+---------+-------------+
| 10 |  391525 | 14907049.92 |
+----+---------+-------------+
| 11 |  392012 | 13482471.85 |
+----+---------+-------------+
| 12 |  379081 | 14324222.03 |
+----+---------+-------------+
| 13 |  383681 | 14478162.98 |
+----+---------+-------------+
| 14 |  228857 |  7994892.44 |
+----+---------+-------------+

I am using the function below to detect anomalies across the 2 columns of the dataset:

from sklearn.ensemble import IsolationForest

def outlier_func(df):
    # Note: behaviour='new' is only accepted by older scikit-learn releases;
    # the parameter was removed in version 0.24.
    model = IsolationForest(behaviour='new', n_estimators=1000, max_samples='auto',
                            contamination='auto', max_features=1.0)
    model.fit(df[['ROW_CNT', 'TOT_SALE']])
    df['scores'] = model.decision_function(df[['ROW_CNT', 'TOT_SALE']])
    df['anomaly'] = model.predict(df[['ROW_CNT', 'TOT_SALE']])
    # Rows labelled -1 are the ones the model flags as anomalies.
    anomaly = df.loc[df['anomaly'] == -1]
    return anomaly

outlier_func(ndf)

What am I missing that makes it detect the anomaly incorrectly? Any help would be appreciated.

Topic isolation-forest ensemble-learning python-3.x anomaly-detection

Category Data Science


If you know beforehand the percentage of outliers present in your data, set the contamination parameter accordingly. It determines the threshold that the predict method uses to label each point as an outlier or an inlier. In your case it is ~0.06 (roughly one outlier among the 15 rows shown).

Code:

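from sklearn.ensemble import IsolationForest

features = frame[["row_cnt", "tot_sale"]]
isoForest = IsolationForest(random_state=42, contamination=.06).fit(features)
frame["outlier"] = isoForest.score_samples(features)  # raw score; lower means more abnormal
frame["class"] = isoForest.predict(features)  # -1 = outlier, 1 = inlier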

Outputs:

    row_cnt     tot_sale    outlier  class
0        45      1411.27  -0.682933     -1
1     47754   1596200.68  -0.532250      1
2    105894   3750304.55  -0.623476      1
3    372953  14368324.86  -0.486222      1
4    389915  14899302.85  -0.432155      1
5    379473  14696309.67  -0.426700      1
6    388571  14679457.93  -0.412645      1
7    234409   8226472.95  -0.575458      1
8     50587   1673114.75  -0.530636      1
9    383779  14614106.80  -0.408078      1
10   391525  14907049.92  -0.440645      1
11   392012  13482471.85  -0.561618      1
12   379081  14324222.03  -0.439916      1
13   383681  14478162.98  -0.417980      1
14   228857   7994892.44  -0.577208      1
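
To see how contamination feeds into predict, here is a minimal sketch. It assumes only the 15 rows shown in the question (not the full 333) and reuses the frame and column names from the code above. The fitted attribute offset_ holds the threshold: decision_function is score_samples minus offset_, and predict returns -1 wherever decision_function is negative. With contamination='auto' the offset is fixed at -0.5, while a numeric contamination sets it from the matching percentile of the training scores.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Assumption: only the 15 sample rows from the question are used here.
frame = pd.DataFrame({
    "row_cnt": [45, 47754, 105894, 372953, 389915, 379473, 388571, 234409,
                50587, 383779, 391525, 392012, 379081, 383681, 228857],
    "tot_sale": [1411.27, 1596200.68, 3750304.55, 14368324.86, 14899302.85,
                 14696309.67, 14679457.93, 8226472.95, 1673114.75,
                 14614106.80, 14907049.92, 13482471.85, 14324222.03,
                 14478162.98, 7994892.44],
})
X = frame[["row_cnt", "tot_sale"]]

# Same seed for both models, so only the contamination setting differs.
auto = IsolationForest(random_state=42, contamination="auto").fit(X)
fixed = IsolationForest(random_state=42, contamination=0.06).fit(X)

for name, model in [("auto", auto), ("0.06", fixed)]:
    labels = model.predict(X)
    flagged = list(X.index[labels == -1])
    print(f"contamination={name}: offset_={model.offset_:.4f}, flagged rows={flagged}")
```

Comparing the two printed lines shows how many rows each threshold flags. If 'auto' flags far more rows than expected, that suggests the fixed -0.5 cut-off is too loose for this data.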

One thing that may improve the prediction is converting the DataFrame's data type to float32; I got a better result after doing so.

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame(
    np.array([[45, 1411.27], [47754, 1596200.68], [105894, 3750304.55],
              [372953, 14368324.86], [389915, 14899302.85]]),
    columns=['ROW_CNT', 'TOT_SALE'], dtype=np.float32)

def outlier_func(df):
    # Note: behaviour='new' is only accepted by older scikit-learn releases;
    # the parameter was removed in version 0.24.
    model = IsolationForest(behaviour='new', n_estimators=1000, max_samples='auto',
                            contamination='auto', max_features=1.0)
    model.fit(df[['ROW_CNT', 'TOT_SALE']])
    df['scores'] = model.decision_function(df[['ROW_CNT', 'TOT_SALE']])
    df['anomaly'] = model.predict(df[['ROW_CNT', 'TOT_SALE']])
    # Rows labelled -1 are the ones the model flags as anomalies.
    anomaly = df.loc[df['anomaly'] == -1]
    return anomaly

outlier_func(df)
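
When comparing a float32 run against a float64 run, it may help to fix random_state, because an Isolation Forest built without a seed can give different scores on every fit. The sketch below assumes the same five rows as above. The helper name scores and the seed values are made up for illustration. It does not claim which data type performs better; it only makes the runs repeatable, so a difference can be traced to the data type and not to chance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rows = [[45, 1411.27], [47754, 1596200.68], [105894, 3750304.55],
        [372953, 14368324.86], [389915, 14899302.85]]
cols = ["ROW_CNT", "TOT_SALE"]
df64 = pd.DataFrame(rows, columns=cols, dtype=np.float64)
df32 = df64.astype(np.float32)

def scores(df, seed):
    """Hypothetical helper: fit a seeded forest and return its decision scores."""
    model = IsolationForest(n_estimators=1000, random_state=seed)
    return model.fit(df).decision_function(df)

# Same data and same seed: the two runs should agree.
run_a = scores(df64, seed=0)
run_b = scores(df64, seed=0)
print("repeatable with a fixed seed:", np.allclose(run_a, run_b))

# Same seed, different dtype: any gap here is attributable to the dtype.
print("float64 scores:", run_a)
print("float32 scores:", scores(df32, seed=0))
```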

