Linear Discriminant - Least Squares Classification Bishop 4.1.3

Please refer to Section 4.1.3, "Least squares for classification", in Bishop's Pattern Recognition and Machine Learning:

In a two-class linear discriminant system, we classify a vector $\mathbf{x}$ as $\mathcal{C}_1$ if $y(\mathbf{x}) \geq 0$, and as $\mathcal{C}_2$ otherwise. Generalizing this in Section 4.1.3, we define $K$ linear discriminant functions, one for each class:

$y_{k}(\mathbf{x}) = \mathbf{w}_k^{\mathrm{T}}\mathbf{x} + w_{k0} \tag{4.13}$

Adding a leading 1 to the vector $\mathbf{x}$ yields the augmented vector $\tilde{\mathbf{x}}$.

The linear discriminant functions for all $K$ classes can then be written together as $\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^{\mathrm{T}}\tilde{\mathbf{x}}$ (Eq. 4.14), where the $k$-th column of $\widetilde{\mathbf{W}}$ is $\tilde{\mathbf{w}}_k = (w_{k0}, \mathbf{w}_k^{\mathrm{T}})^{\mathrm{T}}$. Bishop then presents the sum-of-squares error function as:

$E_D(\widetilde{\mathbf{W}}) = \frac{1}{2}\operatorname{Tr}\left\{(\tilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T})^{\mathrm{T}}(\tilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T})\right\} \tag{4.15}$
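As an aside, here is a minimal NumPy sketch of this setup on made-up toy data, just to fix the shapes; it assumes the pseudo-inverse solution $\widetilde{\mathbf{W}} = (\tilde{\mathbf{X}}^{\mathrm{T}}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^{\mathrm{T}}\mathbf{T}$ that Bishop derives next (Eq. 4.16); the data and variable names are illustrative, not from the book:

```python
import numpy as np

# Made-up toy data: N = 4 observations, D = 2 features, K = 3 classes.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [0.0, 1.0]])
T = np.array([[1.0, 0.0, 0.0],     # 1-of-K (one-hot) target rows, one per observation
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the leading 1 to each row

# Pseudo-inverse (least-squares) solution W~ = (X~^T X~)^{-1} X~^T T.
W_tilde = np.linalg.pinv(X_tilde) @ T                # shape (D + 1) x K

# Sum-of-squares error, Eq. (4.15).
E = X_tilde @ W_tilde - T
E_D = 0.5 * np.trace(E.T @ E)
print(E_D, 0.5 * np.sum(E ** 2))                     # the trace equals the plain sum of squares
```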

My questions are about equation (4.15) above.

Consider a 3-class system with only one observation, $\mathbf{x} \in \mathcal{C}_2$. My understanding is as follows:

  1. Since $\mathbf{x} \in \mathcal{C}_2$, will only $\operatorname{val}(\mathcal{C}_2)$ be positive, i.e. $y_{2}(\mathbf{x}) > \mathit{threshold}(\mathcal{C}_2)$? Is the value $\operatorname{val}(\mathcal{C}_k)$ negative for the other classes' discriminant functions? If not, could you briefly explain the reason?
  2. The error matrix $\mathbf{E} = \tilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T}$ is a $1 \times 3$ matrix, so $\mathbf{E}^{\mathrm{T}}\mathbf{E}$ will be a $3 \times 3$ matrix whose diagonal elements represent the squared error for each class. Does $\operatorname{Tr}$ in (4.15) stand for the trace, i.e. the sum of the diagonal elements? If so, why do we ignore the off-diagonal error values / why don't they matter?

P.S.: If my understanding is wrong or grossly wrong, I'd appreciate it if you pointed that out.


As Bishop points out throughout that section, least squares is ill-equipped for this problem, so maybe we shouldn't spend too much time understanding it. On the other hand, clearing up misconceptions here may help elsewhere.

  1. We would like (i.e., least squares strives for) $\operatorname{val}(\mathcal{C}_2)$ to be close to 1 and the others close to 0. But the values could very well be negative, positive, or greater than 1, for any of the classes (the correct one or not)! The final classification, though, is handled as Section 4.1.2 explains (just after equation 4.9): the class with the largest $y$-value wins.
  2. Yes, here $\operatorname{Tr}$ means the trace. In "least squares" we are minimizing the sum of squared errors, and that sum happens to be expressible compactly as the trace of this matrix product; the incidental off-diagonal entries of the matrix simply never enter the error, as the sketch below illustrates.
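To make both points concrete, here is a minimal NumPy sketch of a single observation $\mathbf{x} \in \mathcal{C}_2$ in a 3-class system (the observation and weights are made-up numbers, not from the book or the question). It shows that the fitted values for the other classes need not be negative, and that $\operatorname{Tr}(\mathbf{E}^{\mathrm{T}}\mathbf{E})$ is just the plain sum of squared per-class errors:

```python
import numpy as np

# Made-up example: one observation in a 3-class system.
x_tilde = np.array([[1.0, 0.5, -1.2]])        # 1 x (D+1): leading 1 already prepended
W_tilde = np.array([[ 0.1,  0.9, -0.1],       # (D+1) x K: some fitted weights
                    [ 0.4,  0.3,  0.6],
                    [-0.2,  0.1,  0.3]])
t = np.array([[0.0, 1.0, 0.0]])               # 1-of-K target row: x belongs to C_2

y = x_tilde @ W_tilde                         # 1 x 3 vector of discriminant values
print(y)                                      # e.g. [[0.54, 0.93, -0.16]]: C_1's value is positive too
print(np.argmax(y))                           # index 1, i.e. C_2 -- the largest y-value wins

E = y - t                                     # 1 x 3 error "matrix"
M = E.T @ E                                   # 3 x 3
print(np.trace(M), np.sum(E ** 2))            # identical: off-diagonal entries of M never contribute
```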
