SVM - Making sense of distance derivation

I am studying the math behind SVMs. The following question is about a small but important detail in the SVM derivation.

The question

Why can the distance between the hyperplane $w*x+b=0$ and a data point $p$ (in vector form), $d = \frac{w * p + b}{\|w\|}$, be simplified to $d = \frac{1}{\|w\|}$?

My argument

Since the data point $p$ is not on the hyperplane, we have $w*p+b=k$ with $k \ne 0$. Then $d=\frac{k}{\|w\|}$, but $k$ is not a constant, as it depends on $w$ and $b$. So optimizing $d=\frac{k}{\|w\|}$ is not the same as optimizing $d = \frac{1}{\|w\|}$.
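To make the concern concrete, here is a quick numeric illustration (the hyperplane, point, and numbers are made up, purely for illustration): the same geometric hyperplane can be written with rescaled parameters, and $k$ changes along with them.

```python
import numpy as np

# Made-up hyperplane w*x + b = 0 and an off-plane point p, purely illustrative.
w = np.array([3.0, 4.0])          # ||w|| = 5
b = 1.0
p = np.array([2.0, 1.0])

k = w @ p + b                     # k = 11, nonzero since p is off the plane
print(k, k / np.linalg.norm(w))   # 11.0 2.2

# The same hyperplane can be written with 2w, 2b -- and k changes with it.
k2 = (2 * w) @ p + 2 * b
print(k2, k2 / np.linalg.norm(2 * w))  # 22.0 2.2 -- k doubled, the ratio did not
```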

The distance-calculation step is mentioned in the following materials:

  1. page 4 of https://www.ugpti.org/smartse/resources/downloads/support-vector-machines.pdf
  2. slide 10 of http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf
  3. slide 11 of https://www.cs.upc.edu/~mmartin/SVMs.pdf



My answer

I believe I have found the answer. A great thank-you to @Javier TG for pointing me to Andrew Ng's CS229 lecture notes on Support Vector Machines. I will not write out the full context, since the lecture notes and the materials referenced in the question do a much better job than I ever could. I will, however, try to fill in the small logical gap left by many notes: why $w*x+b=1$ can stand in for any $w*x+b=\lambda, \lambda \ne 0$.

The key insight is that the distance $d_i$ from a point $x_i$ to the plane $w*x+b=0$ is invariant to a common scaling of $w$ and $b$. Below is a proof.

From the notes, we derived the distance as $$d_i=\frac{y_i(w*x_i+b)}{\|w\|}.$$ This shows that $d_i$ is invariant to scaling of $w, b$: replacing $w, b$ with $2w, 2b$ gives the same $d_i$, because the factor of 2 appears in both the numerator and the denominator. (You can verify this yourself; see the sketch below.) It is then straightforward that the margin, i.e. the minimum of the $d_i$, $$ d=\min_{i} d_i, \quad i=1,...,n, $$ is also invariant to the scaling.
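A quick numeric check of this invariance, with hypothetical points and labels of my own (none of this comes from the notes):

```python
import numpy as np

# Hypothetical labelled points, chosen only to check the algebra.
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, -1.0])
w, b = np.array([3.0, 4.0]), 1.0

def distances(w, b):
    # d_i = y_i * (w @ x_i + b) / ||w||
    return y * (X @ w + b) / np.linalg.norm(w)

print(distances(w, b))          # [2.2 2. ]
print(distances(2 * w, 2 * b))  # [2.2 2. ] -- identical: the scaling cancels
print(distances(w, b).min())    # the margin d is therefore scale-invariant too
```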

However, let $l_i$ denote the numerator of the distance equation, $l_i = y_i(w*x_i+b)$. Note that $l_i$ scales at the same rate as $w, b$: replacing $w, b$ with $2w, 2b$ gives $2l_i$. (Again, you can verify this yourself; see the sketch below.) Then the minimum of the $l_i$, $$ l=\min_{i} l_i, \quad i=1,...,n, $$ scales with $w, b$ as well.
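The same toy setup shows the numerators behaving differently (again, these are my own illustrative numbers):

```python
import numpy as np

# Same hypothetical points as in the sketch above.
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, -1.0])
w, b = np.array([3.0, 4.0]), 1.0

def functional_margins(w, b):
    # l_i = y_i * (w @ x_i + b), the numerator of d_i
    return y * (X @ w + b)

print(functional_margins(w, b))          # [11. 10.]
print(functional_margins(2 * w, 2 * b))  # [22. 20.] -- doubled along with (w, b)
print(functional_margins(w, b).min())    # l = 10 scales with (w, b) as well
```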

Now revisit the optimization problem derived in the notes, where $d = \frac{l}{\|w\|}$: $$ \begin{aligned} \max_{w,b} \quad & d\\ \textrm{s.t.} \quad & y_i(w*x_i+b)\geq l, i=1,...,n \end{aligned} $$

Tweaking $w, b$ to find the optimal solution is fine, but the extra quantities $d, l$ in the problem statement make it more complex than it needs to be. It would be great to get rid of $d, l$, and we can do so by leveraging the findings above. Since $l$ scales with $w, b$, we can choose the scaling factor $k=\frac{1}{l}$ so that $l'=kl=1$. To maintain the inequality constraints, we apply the same scaling to the left-hand side: $$y_i(kw*x_i+kb)\geq kl=1.$$ Since $d$ is invariant to the scaling, the objective function is unchanged; we are still solving the same optimization problem.
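Here is the rescaling step played out numerically, continuing the illustrative toy values from above:

```python
import numpy as np

X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, -1.0])
w, b = np.array([3.0, 4.0]), 1.0

l = (y * (X @ w + b)).min()   # l = 10, the smallest functional margin
k = 1.0 / l                   # scale chosen so that k*l = 1
w2, b2 = k * w, k * b         # same hyperplane, rescaled parameters

print(y * (X @ w2 + b2))                                # [1.1 1. ] -- all constraints now >= 1
print(l / np.linalg.norm(w), 1.0 / np.linalg.norm(w2))  # 2.0 2.0 -- the objective d is unchanged
```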

Finally, we rename $kw$ as $w$ and $kb$ as $b$, so that $d=\frac{l'}{\|w\|}=\frac{1}{\|w\|}$, and obtain the form of the optimization problem stated in the notes: $$ \begin{aligned} \max_{w,b} \quad & \frac{1}{\|w\|}\\ \textrm{s.t.} \quad & y_i(w*x_i+b)\geq 1, i=1,...,n \end{aligned} $$
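To see the final problem in action, here is a minimal sketch (toy data of my own, not from the notes) that solves the usual equivalent reformulation, minimizing $\frac{1}{2}\|w\|^2$ instead of maximizing $\frac{1}{\|w\|}$, with scipy's SLSQP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data, two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Pack the variables as z = [w1, w2, b]. Maximizing 1/||w|| under
# y_i(w*x_i + b) >= 1 is equivalent to minimizing ||w||^2 / 2.
def objective(z):
    return 0.5 * np.dot(z[:2], z[:2])

constraints = [
    {"type": "ineq", "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),
               method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]

print(w, b)                      # ~[0.333 0.333] -0.333: support vectors sit at y_i(w*x_i+b) = 1
print(1.0 / np.linalg.norm(w))   # ~2.121, the maximized geometric margin d
```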

Now we know that, in this optimization problem, any constraint $y_i(w*x_i+b)\geq l$ is equivalent to $y_i(w*x_i+b)\geq 1$, up to a rescaling of $w$ and $b$.
