Many separation lines using the RBF kernel in SVM

Below is my code. It takes a column of numbers and creates a new label column that contains either -1 or 1.

If the number is higher than 14000, we label it with -1 (outlier). If the number is lower than 14000, we label it with 1 (normal).

## Here I just import all the libraries and load the column with my dataset into df
## Yes, I am trying to find anomalies using only the data from one column
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

df['label'] = [-1 if x > 14000 else 1 for x in df['data_numbers']]  # the labeling rule explained above

data = df.drop('label',axis=1)                         
target = df['label']
outliers = df[df['label']==-1]

outliers = outliers.drop('label',axis=1)

from sklearn.model_selection import train_test_split
train_data, test_data, train_target, test_target = train_test_split(data, target, train_size = 0.8)
train_data.shape

nu = outliers.shape[0] / target.shape[0]
print("nu", nu)

model = svm.OneClassSVM(nu=nu, kernel='rbf', gamma=0.00005) 
model.fit(train_data)

from sklearn import metrics
preds = model.predict(train_data)
targs = train_target 
print("accuracy: ", metrics.accuracy_score(targs, preds))
print("precision: ", metrics.precision_score(targs, preds)) 
print("recall: ", metrics.recall_score(targs, preds))
print("f1: ", metrics.f1_score(targs, preds))
print("area under curve (auc): ", metrics.roc_auc_score(targs, preds))
train_preds = preds

preds = model.predict(test_data)
targs = test_target 
print("accuracy: ", metrics.accuracy_score(targs, preds))
print("precision: ", metrics.precision_score(targs, preds)) 
print("recall: ", metrics.recall_score(targs, preds))
print("f1: ", metrics.f1_score(targs, preds))
print("area under curve (auc): ", metrics.roc_auc_score(targs, preds))
test_preds = preds


from mlxtend.plotting import plot_decision_regions   # since an RBF SVM is used, many decision boundaries are drawn, unlike the single one of a linear SVM
# the central points at the top with blue squares are outliers, while the orange triangles at the bottom are normal values
plot_decision_regions(np.array(train_data), np.array(train_target), model)
plt.show()

Output from training data

accuracy:  0.9050484526414505
precision:  0.9974137931034482
recall:  0.907095256762054
f1:  0.9501129131595154
area under curve (auc):  0.5876939698444417

Output from test data

accuracy:  0.9043451078462019
precision:  1.0
recall:  0.9040752351097179
f1:  0.9496213368455713
area under curve (auc):  0.9520376175548589

My graph seems to have many separation lines; I was expecting only one that differentiates between the outliers and the normal data.

Tags: kernel, anomaly-detection, outlier, svm, python

Category: Data Science


If you already know the decision boundary (14000), why do you need an ML algorithm at all? You could just apply an if condition. ML algorithms exist precisely to find a decision boundary for you. If you really want to experiment on this dataset, you first have to decide what type of classification you are doing. From the artificially labeled dataset I can see you have two classes (1 and -1), yet you are using OneClassSVM. That's the mistake! Just drop OneClassSVM and use a regular supervised SVM. There is also no need for 'rbf'; 'rbf' is for nonlinear separation, and your classification boundary is linear: it is just a line going through 14000.

from sklearn import svm
clf = svm.SVC(kernel='linear')   # linear kernel is enough: the boundary is just a threshold at 14000
clf.fit(train_data, train_target)

Note: If you want to do outlier detection, don't label the data with the 14000 threshold condition. Let the OneClassSVM find the outliers by looking at the training features only.
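Here is a minimal sketch of that unsupervised workflow. The data below is synthetic and only stands in for your df['data_numbers'] column, and nu=0.05 is just a guess at the outlier fraction:

import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)
normal = rng.normal(loc=7000, scale=2000, size=950)     # bulk of the data, below 14000
anomalies = rng.normal(loc=20000, scale=1000, size=50)  # a few large values
X = np.concatenate([normal, anomalies]).reshape(-1, 1)  # features only, no label column

model = svm.OneClassSVM(nu=0.05, kernel='rbf', gamma='scale')
model.fit(X)                                            # unsupervised: no target is passed

preds = model.predict(X)                                # +1 = inlier, -1 = outlier
print("fraction flagged as outliers:", (preds == -1).mean())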


The OneClassSVM is an unsupervised algorithm that is supposed to learn the normal data distribution. This means that the algorithm will model the boundaries of areas of high likelihood for your data points to be drawn from.

Even though you know which points are outliers, the SVM only knows that roughly a proportion nu of the points are outliers; the 14000 boundary therefore means nothing to it in terms of distribution.
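As a quick illustration of the role of nu (synthetic data, not yours): the fraction of training points flagged as outliers roughly tracks nu, regardless of where the 14000 boundary happens to be.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = rng.normal(loc=7000, scale=2000, size=(1000, 1))    # one feature, no labels

for nu in (0.01, 0.05, 0.2):
    preds = OneClassSVM(nu=nu, kernel='rbf', gamma='scale').fit(X).predict(X)
    print(f"nu={nu}: fraction flagged as outliers = {(preds == -1).mean():.3f}")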

What the SVM does, which explains your multiple lines, is the following:

  • If you have a hole in what you consider the "normal zone" - let's say as an example you have no data points between 2560 and 2985 - the SVM can decide that this is an area of low likelihood for your distribution and therefore build two vectors to exclude it from the learned normal distribution.

  • Conversely, if you have several points clustered together above 14000, the SVM has no clue that this is supposed to be the anomalous zone, so it can detect a zone of high likelihood for the data points to be drawn from and build two vectors to include it in the learned normal distribution (see the sketch after this list).
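Here is a small synthetic sketch of both effects. A gap inside the normal range and a small cluster above 14000 can each add sign changes to the decision function, i.e. extra separation lines (how many depends on gamma and on the data):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
low = rng.uniform(0, 2560, size=450)        # normal data below the gap
high = rng.uniform(2985, 14000, size=500)   # normal data above the gap
cluster = rng.normal(16000, 100, size=50)   # a dense little cluster above 14000
X = np.concatenate([low, high, cluster]).reshape(-1, 1)

model = OneClassSVM(nu=0.05, kernel='rbf', gamma=1e-5).fit(X)

grid = np.linspace(0, 18000, 2000).reshape(-1, 1)
scores = model.decision_function(grid)
# every sign change of the decision function is one "separation line" on the plot
boundaries = grid[:-1][np.sign(scores[:-1]) != np.sign(scores[1:])]
print("number of boundaries:", len(boundaries))
print("boundary locations:", boundaries.ravel().round(0))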

Now regarding your metrics:

  • You probably have an imbalanced dataset since we are talking about anomaly detection (otherwise it makes no sense!), so you should not use global accuracy here because it is misleading (see the sketch after this list).

  • There is one thing that should alert you: the train AUC is very low while the test AUC is very high, and this is not normal. It means your algorithm learns the target normal distribution poorly.
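To see why global accuracy is misleading here, consider this tiny synthetic example (made-up label counts, not your data): a "classifier" that calls every point normal still looks great on plain accuracy.

import numpy as np
from sklearn import metrics

y_true = np.array([1] * 950 + [-1] * 50)   # 5% outliers, a typical anomaly-detection imbalance
y_pred = np.ones_like(y_true)              # predict "normal" for every single point

print("accuracy:", metrics.accuracy_score(y_true, y_pred))                    # 0.95, looks great
print("balanced accuracy:", metrics.balanced_accuracy_score(y_true, y_pred))  # 0.5, reveals the problem
print("outlier recall:", metrics.recall_score(y_true, y_pred, pos_label=-1))  # 0.0, none detected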

The high test AUC can be explained, but first let's use some notations:

  • The real normal data distribution will be called "the target normal distribution", in the sense that you want to learn it. It is the interval [0; 14000[.
  • The real anomalous area (we cannot talk about a distribution in this setting) will be called "the target anomaly area". It is the interval [14000; +infinity[.
  • The learned normal distribution will be called the "learned normal distribution", obviously.
  • The learned anomalous area will be called the "learned anomalous area".

So the learned distribution consists of small clusters around areas of higher training data density. Since you (supposedly) have few outliers and a lot of normal points, the learned distribution still intersects well with the target normal distribution, but not with the target anomalous area, where only a few areas will belong to the learned normal distribution. You can verify this on your graph: there are only a few vectors in the target anomalous area.

Now when you apply the algorithm to the test set, it is unlikely that a test point in the target anomalous area falls into the intersection of the learned normal distribution and the target anomalous area, because this intersection is very small. But this is only due to the small number of data points in the anomalous area, not to a successful learning process. Hence these points will still be labeled as anomalies and you get a misleadingly high AUC score.

On the contrary, the target normal distribution and the learned normal distribution intersect well, because there were more data points to learn from. So most of the normal points in the test set are classified correctly.

To sum up, your algorithm learns a completely different distribution from what you expected. You can try a lower gamma value to smooth the decision boundary, but you will probably not prevent the denser areas above 14000 from being classified as normal data points. So watch out for the train AUC!

It would be very interesting to investigate these statements with graphs: you could, for example, plot the [14000; +infinity[ zone, highlight the decision-function boundaries, and plot the train and test set data points. You should get confirmation of what I am saying above (I will do it later and add the graphs if I find some time!).
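A rough sketch of that check, assuming the fitted model, train_data and test_data from your code (with a single feature, a 1-D plot of the decision function already shows every boundary):

import numpy as np
import matplotlib.pyplot as plt

x_train = np.asarray(train_data).ravel()
x_test = np.asarray(test_data).ravel()

grid = np.linspace(0, x_train.max() * 1.1, 2000).reshape(-1, 1)
scores = model.decision_function(grid)            # >0 = learned "normal", <0 = learned anomaly

plt.plot(grid.ravel(), scores, label='decision function')
plt.axhline(0, color='k', lw=0.5)                 # every sign change is one separation line
plt.axvline(14000, color='r', ls='--', label='target 14000 boundary')
# small vertical offsets just to keep train and test points visually apart
plt.scatter(x_train, np.full(x_train.shape, -0.02), s=5, alpha=0.3, label='train')
plt.scatter(x_test, np.full(x_test.shape, 0.02), s=5, alpha=0.3, label='test')
plt.xlabel('data_numbers')
plt.ylabel('decision function value')
plt.legend()
plt.show()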
