Hinge Loss understanding and proof
I hope this doesn't come off as a silly question, but I am looking at SVMs, and in principle I understand how they work. The idea is to maximize the margin between the different classes of points as much as possible, whatever the dimension of the feature space.
So, to understand the internal workings of the SVM classification algorithm, I decided to start by studying the cost function, the Hinge Loss...
$$L=\frac{1}{N} \sum_{i} \sum_{j \neq y_{i}}\left[\max \left(0, f\left(x_{i} ; W\right)_{j}-f\left(x_{i} ; W\right)_{y_{i}}+\Delta\right)\right]+\lambda \sum_{k} \sum_{l} W_{k, l}^{2}$$
Interpreting what the equation means is not so bad. For each example, we take the difference between the score of every incorrect class and the score of the correct class, add a margin of delta, and sum up only those terms that are greater than 0. By looking at this, it seems we are trying to measure how far off the classification is for every point.
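To make my reading of the formula concrete, here is a minimal sketch of the per-example (data) term in NumPy. The scores, the correct-class index, and the value of delta are all made up for illustration:

```python
import numpy as np

# Hypothetical class scores f(x_i; W) for a single example with 3 classes
scores = np.array([3.2, 5.1, -1.7])
y_i = 0          # assumed index of the correct class
delta = 1.0      # margin Δ

# max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + Δ) for every class j
margins = np.maximum(0, scores - scores[y_i] + delta)
margins[y_i] = 0             # enforce the j != y_i condition
loss_i = margins.sum()       # hinge (data) loss for this one example
print(loss_i)                # 2.9 = max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1)
```

Averaging this over all N examples and adding the λΣW² term gives the full L above, if I understand the equation correctly.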
However, I do not actually understand why the Hinge Loss works, or why it is useful for SVMs. I cannot find any proof of the Hinge Loss, nor any visual representation of why it works.
I understand that we are trying to minimize the Hinge Loss during training, but I have no idea how the equation for the Hinge Loss came about, or why it works. I am looking for some kind of mathematical proof or derivation of the Hinge Loss in relation to SVMs.
Topic hinge-loss loss-function svm
Category Data Science