Understanding the math behind linear classification
For example, suppose we have training data $X$, labels $y$ (with $y_i \in \{-1, +1\}$) and a weight vector $w$.
Our margin is $M_i = y_i \langle w, x_i \rangle$.
If $M_i > 0$ the classifier's prediction is correct; otherwise, if $M_i < 0$, the prediction is wrong.
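For instance, with made-up numbers: take $w = (1, -2)$, $x_i = (3, 1)$ and $y_i = +1$. Then $\langle w, x_i \rangle = 1 \cdot 3 + (-2) \cdot 1 = 1$, so $M_i = y_i \langle w, x_i \rangle = 1 > 0$ and the prediction is correct; with $y_i = -1$ we would instead get $M_i = -1 < 0$, a mistake.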
How does it work? If $\operatorname{sign}(y_i) = \operatorname{sign}(\langle w, x_i \rangle)$, i.e. the label and the score have the same sign, then their product is always positive, because plus times plus is plus and minus times minus is plus. Otherwise the signs differ, the product is negative, and the prediction is wrong.
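Here is a minimal sketch (with made-up toy data, not from the course) that checks this: the margin $M_i = y_i \langle w, x_i \rangle$ is positive exactly for the samples where $\operatorname{sign}(\langle w, x_i \rangle)$ matches $y_i$.

```python
import numpy as np

# Toy data: three samples, labels in {-1, +1}, and some weight vector w.
X = np.array([[2.0, 1.0],
              [-1.0, 3.0],
              [0.5, -2.0]])
y = np.array([1, -1, -1])
w = np.array([1.0, -0.5])

scores = X @ w                 # <w, x_i> for each sample
margins = y * scores           # M_i = y_i * <w, x_i>
predictions = np.sign(scores)  # classifier output: sign of the score

print(margins)                 # [ 1.5  2.5 -1.5]
print(predictions == y)        # [ True  True False] -- same as (margins > 0)
```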
Let's have a loss function $L(M) = [M < 0]$.
What the author of the course suggests is to construct an upper bound of this function and then minimize the bound, since we can't minimize the plain $L(M)$ directly (it is piecewise constant, so its gradient is zero almost everywhere).
And this is where the sigmoid, or some other smooth surrogate function, comes in.
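To make the bound concrete, here is a small sketch using the logistic loss $\log_2(1 + e^{-M})$ as the surrogate; this is my own choice for illustration, not necessarily the exact function from the course, but it does sit above $[M < 0]$ for every margin $M$:

```python
import numpy as np

M = np.linspace(-3, 3, 13)             # a range of margin values
zero_one = (M < 0).astype(float)       # L(M) = [M < 0]
surrogate = np.log2(1.0 + np.exp(-M))  # logistic upper bound (my assumption)

# The surrogate is >= the 0-1 loss at every point, so the summed surrogate
# loss over a training set bounds the number of misclassifications.
print(np.all(surrogate >= zero_one))   # True
```

Because the surrogate never drops below the 0-1 loss, driving the total surrogate loss down also caps how many samples can have $M_i < 0$.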
Now the author says that if we are able to minimize the upper-bound function, we will also minimize $L(M)$. That sounds reasonable to me, but I still don't understand how it minimizes the original $L(M)$.
Because it looks to me like the upper-bound function is symmetric, and if we change the argument of the function, the area under it will stay the same.