Decision boundary in a classification task

I have 1000 data points from a bivariate normal distribution $\mathcal{N}$ with mean $(0,0)$, variances $\sigma_1^2=\sigma_2^2=10$, and zero covariance. There are also 20 more points from another bivariate normal distribution with mean $(15,15)$, variances $\sigma_1^2=\sigma_2^2=1$, and again zero covariance. I used the least squares method to calculate the parameters of the decision boundary $\theta_0 + \theta_1 x_1 + \theta_2 x_2=0$, that is $$\theta = (X^T X)^{-1}(X^Ty)$$ where $y$ is a column vector with labels $+1$ for points from the first class and $-1$ for points from the second. The resulting plot is as follows:

It is obvious that the decision boundary is wrong: it passes right through class $-1$, so it will not correctly classify future points drawn from that distribution. Now, there is the question of why this happens. I understand that the main problem here is the imbalance of the data set, as there are $1000$ points from one class but only $20$ from the other. Intuitively, this makes sense.

What I would like help with, if possible, is understanding how this imbalance problem is incorporated into the process of minimizing the least squares cost function $$J(\theta)=\sum_{n=1}^{1020}(y_n-\theta^T x_n)^2$$

How does the fact that there are only $20$ points from the second class cause the minimization task $\frac{\partial J(\theta)}{\partial \theta}=0$ to fail? How does the insufficient number of these points cause the line to pass right through them? If there is some mathematical way to show this, it would be nice, as I already have the intuition.
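For concreteness, here is a minimal NumPy sketch of this kind of setup (data generation plus the least-squares fit); it is illustrative only and not the exact code that produced the plot above:

```python
# Minimal sketch of the setup described above, assuming NumPy; the seed and
# variable names are illustrative, not from the original experiment.
import numpy as np

rng = np.random.default_rng(0)

# 1000 points ~ N((0,0), 10*I)  -> standard deviation sqrt(10) per axis
X_pos = rng.normal(loc=0.0, scale=np.sqrt(10), size=(1000, 2))
# 20 points ~ N((15,15), I)
X_neg = rng.normal(loc=15.0, scale=1.0, size=(20, 2))

X = np.vstack([X_pos, X_neg])
X = np.hstack([np.ones((X.shape[0], 1)), X])        # prepend a bias column for theta_0
y = np.concatenate([np.ones(1000), -np.ones(20)])   # labels +1 and -1

# Least-squares solution theta = (X^T X)^{-1} X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # decision boundary: theta_0 + theta_1*x1 + theta_2*x2 = 0
```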



I contend that this is a feature, not a bug.

Going into the classification, not knowing the values of $x_1$ or $x_2$, it is much more likely that your point belongs to $+1$ than $-1$. Consequently, you shouldn’t just need decent evidence that a point is $-1$. You should need overwhelming evidence.

The red $+1$ group, loosely speaking, exists in the square $[-10,10]\times[-10,10]$. The closest blue $-1$ point is at about $(12,15)$, which is not all that far from the $+1$ zone. The decision boundary is telling you that $(12,15)$ is not sufficiently far from the $+1$ zone to overcome the high “prior” probability of being $+1$. To get sufficiently far from the $+1$ zone not to be classified as $+1$, you need to be above about $(15,17)$.

If you simulate $100$, then $200$, then $500$, and then $1000$ blue $-1$ points to go along with those same $1000$ red $+1$ points, you will see the decision boundary drift towards where you would expect it to be, in between the two groups.
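To see that drift concretely, here is a rough sketch (assuming NumPy) that refits the least-squares boundary for a growing $-1$ class and prints where the boundary crosses the diagonal $x_1 = x_2$:

```python
# Sketch: watch the least-squares boundary drift as the -1 class grows.
# The crossing point on the diagonal x1 = x2 is just one convenient summary
# of where the boundary sits.
import numpy as np

rng = np.random.default_rng(1)
X_pos = rng.normal(0.0, np.sqrt(10), size=(1000, 2))   # fixed +1 class

for n_neg in (20, 100, 200, 500, 1000):
    X_neg = rng.normal(15.0, 1.0, size=(n_neg, 2))
    X = np.hstack([np.ones((1000 + n_neg, 1)), np.vstack([X_pos, X_neg])])
    y = np.concatenate([np.ones(1000), -np.ones(n_neg)])
    t0, t1, t2 = np.linalg.solve(X.T @ X, X.T @ y)
    # On the diagonal x1 = x2 = c the boundary satisfies t0 + (t1 + t2)*c = 0
    print(n_neg, -t0 / (t1 + t2))
```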

You can do more with this idea of “prior” (and “posterior”) probability if you use a logistic regression to predict class membership probabilities. While this might warrant a new question, it might be more in line with the “mathematical” explanation that you want.
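For instance, a short sketch using scikit-learn's `LogisticRegression` (an assumption on my part; any logistic regression implementation would do) returns posterior class probabilities directly:

```python
# Sketch: logistic regression gives posterior class probabilities directly.
# Assumes scikit-learn and NumPy; data generated as in the question.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, np.sqrt(10), size=(1000, 2)),
               rng.normal(15.0, 1.0, size=(20, 2))])
y = np.concatenate([np.ones(1000), -np.ones(20)])

clf = LogisticRegression().fit(X, y)
# Posterior probability of each class for a point near the -1 cluster
print(clf.classes_, clf.predict_proba([[12.0, 15.0]]))
```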


For those data points, a threshold on the $x_1$ axis alone would perfectly separate the two distributions. You could fit a decision stump to calculate that single parameter of the decision boundary.
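A minimal sketch of such a stump (assuming NumPy; the brute-force threshold search here is just one simple way to fit it):

```python
# Sketch: a one-feature decision stump on x1.  Predicts +1 when x1 is below
# the threshold, -1 otherwise; the threshold is chosen to minimize errors.
import numpy as np

rng = np.random.default_rng(3)
x1 = np.concatenate([rng.normal(0.0, np.sqrt(10), 1000),
                     rng.normal(15.0, 1.0, 20)])
y = np.concatenate([np.ones(1000), -np.ones(20)])

candidates = np.unique(x1)
errors = [np.sum(np.where(x1 < t, 1, -1) != y) for t in candidates]
threshold = candidates[int(np.argmin(errors))]
print(threshold)  # the selected split between the two clusters
```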
