Incorrect multi-variate anomaly detection - Isolation Forest Python

Question


The AG

March 30, 2022 02:04

My data looks like the table below. It has 333 rows and 2 columns. Clearly, the first row is an anomaly.

ndf:

+----+---------+-------------+
|    | ROW_CNT |    TOT_SALE |
+----+---------+-------------+
|  0 |      45 |     1411.27 |
+----+---------+-------------+
|  1 |   47754 |  1596200.68 |
+----+---------+-------------+
|  2 |  105894 |  3750304.55 |
+----+---------+-------------+
|  3 |  372953 | 14368324.86 |
+----+---------+-------------+
|  4 |  389915 | 14899302.85 |
+----+---------+-------------+
|  5 |  379473 | 14696309.67 |
+----+---------+-------------+
|  6 |  388571 | 14679457.93 |
+----+---------+-------------+
|  7 |  234409 |  8226472.95 |
+----+---------+-------------+
|  8 |   50587 |  1673114.75 |
+----+---------+-------------+
|  9 |  383779 | 14614106.80 |
+----+---------+-------------+
| 10 |  391525 | 14907049.92 |
+----+---------+-------------+
| 11 |  392012 | 13482471.85 |
+----+---------+-------------+
| 12 |  379081 | 14324222.03 |
+----+---------+-------------+
| 13 |  383681 | 14478162.98 |
+----+---------+-------------+
| 14 |  228857 |  7994892.44 |
+----+---------+-------------+

I am using the function below to detect anomalies across the 2 columns of the dataset:

from sklearn.ensemble import IsolationForest

def outlier_func(df):
    cols = ['ROW_CNT', 'TOT_SALE']
    # behaviour='new' is omitted: the parameter was deprecated and later
    # removed from scikit-learn, so recent versions reject it.
    model = IsolationForest(n_estimators=1000, max_samples='auto',
                            contamination='auto', max_features=1.0)
    model.fit(df[cols])
    df['scores'] = model.decision_function(df[cols])  # negative = anomalous
    df['anomaly'] = model.predict(df[cols])           # -1 = anomaly, 1 = normal
    return df.loc[df['anomaly'] == -1]

outlier_func(ndf)

What am I missing that causes it to detect anomalies incorrectly? Any help would be appreciated.

Topic isolation-forest ensemble-learning python-3.x anomaly-detection

Category Data Science

Multivac · Accepted Answer · February 26, 2022 03:21

If you know the proportion of outliers in your data beforehand, you should set the contamination parameter. It determines the threshold that the predict method uses to label each row as an outlier or an inlier. In your case it is ~0.06.

Code:

from sklearn.ensemble import IsolationForest

cols = ["row_cnt", "tot_sale"]
isoForest = IsolationForest(random_state=42, contamination=0.06).fit(frame[cols])
frame["outlier"] = isoForest.score_samples(frame[cols])  # lower = more anomalous
frame["class"] = isoForest.predict(frame[cols])          # -1 = outlier, 1 = inlier

Outputs:

    row_cnt     tot_sale    outlier  class
 0       45      1411.27  -0.682933     -1
 1    47754   1596200.68  -0.532250      1
 2   105894   3750304.55  -0.623476      1
 3   372953  14368324.86  -0.486222      1
 4   389915  14899302.85  -0.432155      1
 5   379473  14696309.67  -0.426700      1
 6   388571  14679457.93  -0.412645      1
 7   234409   8226472.95  -0.575458      1
 8    50587   1673114.75  -0.530636      1
 9   383779  14614106.80  -0.408078      1
10   391525  14907049.92  -0.440645      1
11   392012  13482471.85  -0.561618      1
12   379081  14324222.03  -0.439916      1
13   383681  14478162.98  -0.417980      1
14   228857   7994892.44  -0.577208      1
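
To see why contamination matters here: in scikit-learn, decision_function is score_samples shifted by the fitted offset_, and predict flags the rows whose shifted score is negative. With contamination='auto' the offset is fixed at -0.5; with a numeric value it is taken from the matching percentile of the training scores. The following is only a minimal sketch on the 15 rows shown in the question. The names df, X and results, and the choice of random_state=42, are illustrative assumptions, and results on the full 333 rows may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# The 15 rows shown in the question (the full data set has 333 rows).
df = pd.DataFrame({
    "ROW_CNT": [45, 47754, 105894, 372953, 389915, 379473, 388571, 234409,
                50587, 383779, 391525, 392012, 379081, 383681, 228857],
    "TOT_SALE": [1411.27, 1596200.68, 3750304.55, 14368324.86, 14899302.85,
                 14696309.67, 14679457.93, 8226472.95, 1673114.75, 14614106.80,
                 14907049.92, 13482471.85, 14324222.03, 14478162.98, 7994892.44],
})
X = df[["ROW_CNT", "TOT_SALE"]].to_numpy()

results = {}
for contamination in ("auto", 0.06):
    model = IsolationForest(random_state=42, contamination=contamination).fit(X)
    raw = model.score_samples(X)          # lower = more anomalous
    shifted = model.decision_function(X)  # raw score minus model.offset_
    labels = model.predict(X)             # -1 where the shifted score is negative
    results[contamination] = (model.offset_, shifted, raw, labels)
    print(contamination, "offset:", model.offset_,
          "rows flagged:", int((labels == -1).sum()))
```

Comparing the two printed lines shows how many rows each threshold flags, which helps tell a thresholding problem apart from a scoring problem.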

Sebin Sunny · Accepted Answer · September 25, 2020 17:14

One thing that may improve the prediction is converting the DataFrame's dtype to float32. I got better results after converting the data to float32.

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# First five rows from the question, stored as float32
df = pd.DataFrame(np.array([[45, 1411.27], [47754, 1596200.68], [105894, 3750304.55],
                            [372953, 14368324.86], [389915, 14899302.85]]),
                  columns=['ROW_CNT', 'TOT_SALE'], dtype=np.float32)

def outlier_func(df):
    cols = ['ROW_CNT', 'TOT_SALE']
    # behaviour='new' is omitted: recent scikit-learn versions reject it.
    model = IsolationForest(n_estimators=1000, max_samples='auto',
                            contamination='auto', max_features=1.0)
    model.fit(df[cols])
    df['scores'] = model.decision_function(df[cols])  # negative = anomalous
    df['anomaly'] = model.predict(df[cols])           # -1 = anomaly, 1 = normal
    return df.loc[df['anomaly'] == -1]

outlier_func(df)
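
Because no random_state is set above, two runs can differ for reasons unrelated to the dtype. One way to check whether float32 itself changes anything is to fix the seed and compare both dtypes side by side. This is only a sketch: the helper flag_rows and the seed value are hypothetical names and choices introduced for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# The same five rows used in the answer above.
rows = [[45, 1411.27], [47754, 1596200.68], [105894, 3750304.55],
        [372953, 14368324.86], [389915, 14899302.85]]

def flag_rows(dtype, seed=0):
    """Fit on the five rows stored with the given dtype and return the labels."""
    data = pd.DataFrame(np.array(rows, dtype=dtype), columns=["ROW_CNT", "TOT_SALE"])
    model = IsolationForest(n_estimators=1000, contamination="auto",
                            random_state=seed)  # fixed seed removes run-to-run noise
    return model.fit_predict(data)              # -1 = anomaly, 1 = normal

labels64 = flag_rows(np.float64)
labels32 = flag_rows(np.float32)
print("float64:", labels64)
print("float32:", labels32)
print("same labels:", np.array_equal(labels64, labels32))
```

If the labels match under a fixed seed, the improvement seen after the conversion is more likely due to the randomness of the forest than to the dtype.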
