Intuition behind the fact that SVM uses only a measure of similarity between examples for classification

I have read about SVM and, although I did not understand the math behind it completely, I know that it produces a decision plane with maximum margin between examples of different classes, and I know the role of support vectors in the process. I also know that SVM is a kind of dual learning algorithm (an algorithm that operates only on the dot products between examples). It uses kernel functions to calculate the dot product (a measure of similarity) between training examples.

What I want to understand in simple terms is: suppose I have a similarity matrix of all training examples, specifying the amount of similarity between any two examples in the training sample. How can I build a classifier or clustering based only on this information?

Topic kernel classification svm clustering machine-learning

Category Data Science


I have read about SVM ... I know that it produces a decision plane

SVM does not explicitly produce a decision plane; it is not a parametric method. The decision plane implied by the fitted SVM can be visualized in two or three dimensions, but that plane merely results from the class labels and weights of the training observations. If the decision plane were estimated directly, each variable would obtain an estimated weight (and perhaps a non-linear transformation), which would be a parametric approach. With SVMs, each training observation obtains an estimated weight, making it a non-parametric method.

What I want to understand in simple terms is that: Suppose I have a similarity matrix of all training examples specifying amount of similarity between any (all) two examples in training sample. How can I make a classifier or cluster based only on this information?

To make a classification, we compute the similarity of a new observation $x_0$ to all training observations (note that $x_0$ can actually be new, or just resampled from the training data). This gives us a weight for each of the training observations' class labels; we basically weigh and sum the training observations' class labels to obtain the predicted class for $x_0$. For an SVM with a linear kernel, this looks like:

$$f(\mathbf{x}_0) = \mathrm{sgn}\left( \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i \cdot \mathbf{x}_0 + b \right)$$

Note that we can see $b$ as an intercept (estimated from the training observations), and $\alpha_i$ as the weights of the training observations ($i = 1, \dots, n$). $\mathbf{x}_i$ (a vector of predictor variable values) and $y_i$ (taking a value of -1 or 1) are simply observed in the training sample, and $\mathbf{x}_0$ is the vector of observed predictor variable values for the new observation.
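A minimal numeric sketch of this decision rule; the values of $\alpha_i$ and $b$ below are purely illustrative, whereas in practice they come from a fitted SVM:

```python
import numpy as np

# Toy training sample: x_i and y_i as observed; alpha_i and b as if they
# had been estimated by an SVM (illustrative values, not a real fit).
X_train = np.array([[0., 0.], [1., 0.], [3., 3.], [4., 3.]])  # x_i
y_train = np.array([-1, -1, 1, 1])                            # y_i in {-1, 1}
alpha   = np.array([0.2, 0.3, 0.25, 0.25])                    # weight per training point
b       = -1.5                                                # intercept

def predict(x0):
    # similarity of x0 to every training point: the dot products x_i . x0
    similarities = X_train @ x0
    # weigh the class labels by alpha and similarity, sum, add the intercept, take the sign
    return np.sign(np.sum(alpha * y_train * similarities) + b)

print(predict(np.array([0.5, 0.5])))  # -1.0: more similar to the first class
print(predict(np.array([3.5, 3.0])))  #  1.0: more similar to the second class
```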


What I want to understand in simple terms is: suppose I have a similarity matrix of all training examples, specifying the amount of similarity between any two examples in the training sample. How can I build a classifier or clustering based only on this information?

Let me try to explain this in simple terms. There is something called cosine similarity, which is calculated between two vectors: it is defined as the cosine of the angle (theta) between them.

Case 1: Let's say we have two vectors a and b, and let's assume they are perpendicular to each other, so the angle between them is 90 degrees. Then cos-sim(a,b) = cosine(90 degrees), which is 0. This means the similarity between vectors a and b is 0: the two vectors are highly dissimilar.

A related concept is cosine distance, which expresses how far apart two vectors lie in terms of their similarity value, i.e. what the distance between them is. It is related to the similarity by cos-dist(A,B) = 1 − cos-sim(A,B).

For our case, cos-dist(A,B) = 1 − cos-sim(A,B) = 1 − 0, which equals 1. This means the distance between the two vectors is at its maximum.

Case 2: When theta is 0, cos-sim(a,b) = cosine(0 degrees), which is 1. This means the similarity between vectors a and b is 1: the two vectors are highly similar, and now cos-dist(A,B) = 1 − cos-sim(A,B) = 1 − 1, which equals 0. This means the distance between the two vectors is at its minimum.
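A small sketch of the two cases above (the example vectors are arbitrary):

```python
import numpy as np

def cos_sim(a, b):
    # cosine similarity: cos(theta) between the two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Case 1: perpendicular vectors -> similarity 0, distance 1
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(cos_sim(a, b), 1 - cos_sim(a, b))   # 0.0 1.0

# Case 2: vectors pointing in the same direction -> similarity 1, distance 0
a, b = np.array([1.0, 2.0]), np.array([2.0, 4.0])
print(cos_sim(a, b), 1 - cos_sim(a, b))   # 1.0 0.0 (up to floating-point rounding)
```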

Now consider each of your data points as some n-dimensional vector, and suppose someone gives you a matrix with the cosine similarity of every point with respect to every other point in your data. Given these similarity values, you can compute the distance between any pair of points and thereby form clusters of points that are similar to each other. If you have more than 2 clusters, you can treat this as a multiclass classification problem. Hope this clears your doubt.
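As a sketch, assuming the similarity matrix S below is given (the values are made up), you can turn the similarities into distances and feed them to a hierarchical clustering routine that accepts a precomputed distance matrix:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical cosine-similarity matrix for 4 points (symmetric, 1s on the diagonal)
S = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.0, 0.1, 0.8, 1.0]])

D = 1.0 - S                       # cosine distance = 1 - cosine similarity
np.fill_diagonal(D, 0.0)          # make sure the diagonal is exactly zero
condensed = squareform(D)         # condensed form expected by scipy's linkage
Z = linkage(condensed, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # ask for 2 clusters
print(labels)                     # e.g. [1 1 2 2]: points 0,1 vs. points 2,3
```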


You can try running an SVM on just this similarity matrix. But you will then need to provide the similarities to the training points for new data points as well.
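For example, scikit-learn's SVC accepts a precomputed kernel (similarity) matrix directly; the data and kernel below are just a toy illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Toy training data: 6 points in 2-D with labels -1 / 1
X_train = np.array([[0., 0.], [0., 1.], [1., 0.],
                    [3., 3.], [3., 4.], [4., 3.]])
y_train = np.array([-1, -1, -1, 1, 1, 1])

# The "similarity matrix": here plain dot products (a linear kernel), n_train x n_train
K_train = X_train @ X_train.T

clf = SVC(kernel="precomputed")
clf.fit(K_train, y_train)

# For new points we must supply their similarities to the *training* points,
# as an n_test x n_train matrix -- this is the caveat mentioned above.
X_test = np.array([[0.5, 0.5], [3.5, 3.5]])
K_test = X_test @ X_train.T
print(clf.predict(K_test))   # expected: [-1  1]
```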

Furthermore, SVMs rely on the similarities being dot products in some vector space. If they aren't, you may get inconsistencies. They may also rely on the triangle inequality being satisfied for the induced distance function d(x,y) = sqrt(k(x,x) − 2k(x,y) + k(y,y)), although I cannot find a clear reference on whether this is required. If k is a scalar product in some vector space, this is obviously satisfied.
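A quick sanity check of that formula for the linear kernel, where the induced distance should coincide with the ordinary Euclidean distance (the vectors are arbitrary examples):

```python
import numpy as np

# For the linear kernel k(x, y) = x . y, the induced distance
# sqrt(k(x,x) - 2 k(x,y) + k(y,y)) equals the Euclidean distance ||x - y||.
k = lambda u, v: float(np.dot(u, v))
x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])

d_kernel = np.sqrt(k(x, x) - 2 * k(x, y) + k(y, y))
d_euclid = np.linalg.norm(x - y)
print(d_kernel, d_euclid)   # both 5.0
```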

Last but not least, SVMs are good for larger amounts of data, where you cannot afford to keep the entire similarity matrix in memory. By reducing the data set to the support vectors only, the resulting classifier needs much less memory and much less time. Much of the challenge of training an SVM is managing memory during training.
