Comparison of classifier confusion matrices

I tried implementing Logistic Regression, Linear Discriminant Analysis (LDA), and KNN for the Smarket dataset provided in An Introduction to Statistical Learning, in Python.

Logistic Regression and LDA were pretty straightforward in terms of implementation. Here are the confusion matrices on a test dataset.

Both of them are pretty similar, with almost the same accuracy. I then tried finding a K for KNN by plotting the loss vs. K graph:

and chose a K of around 125 to get this confusion matrix (same test dataset):

Although KNN gave a higher accuracy of around 0.61, its confusion matrix is very different from the logistic regression and LDA matrices, with a much higher true-negative count and a low true-positive count. I can't really understand why this is happening. Any help would be appreciated.

Here is how I computed the loss for the KNN classifier (using scikit-learn). I could not use MSE since the Y values are qualitative.

import numpy as np
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# Try 50 evenly spaced K values between 1 and 200 and record the
# misclassification rate (1 - accuracy) on the test set for each.
k_set = np.linspace(1, 200, dtype=int)
knn_dict = {}

for k in k_set:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(train_X, train_Y)
    y_pred = model.predict(test_X)
    loss = 1 - metrics.accuracy_score(test_Y, y_pred)
    knn_dict[k] = loss

# Refit with the K chosen from the loss-vs-K plot (around 125).
K = 125
model = KNeighborsClassifier(n_neighbors=K)
model.fit(train_X, train_Y)
knn_y_pred = model.predict(test_X)

knn_cnf_matrix = metrics.confusion_matrix(test_Y, knn_y_pred)
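
For reference, the loss-vs-K curve mentioned above can be reproduced from knn_dict along these lines (a minimal sketch, assuming matplotlib is available):

import matplotlib.pyplot as plt

# Plot the recorded test error against each K value tried.
plt.plot(list(knn_dict.keys()), list(knn_dict.values()))
plt.xlabel("K (number of neighbors)")
plt.ylabel("Test error (1 - accuracy)")
plt.show()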

I'm very new to data science. I hope I have provided enough background/context; let me know if more info is needed.

A few comments:

  • I don't know this dataset, but it seems to be a difficult one to classify, since the performance is not much better than a random baseline (in binary classification a random baseline gives 50% accuracy, since it guesses right half the time).
  • If I'm not mistaken, the majority class (class 1) has 141 instances out of 252, i.e. 56% (by the way, the numbers are not easily readable in the matrices). This means that a classifier which always assigns class 1 would reach 56% accuracy. This is called the majority baseline, and it is usually the minimal performance one wants to reach with a binary classifier (a sketch of how to compute it is given after this list). The LR and LDA classifiers are worse than this, so practically they don't really work.
  • The k-NN classifier does indeed appear to give better results, and importantly above 56%, so it actually "learns" something useful.
  • It's a bit strange that the first two classifiers predict class 0 more often than class 1. It looks as if the training set and test set don't have the same class distribution (the sketch after this list also includes a quick check for this).
  • The k-NN classifier correctly predicts class 1 more often, and that's why it works better. k-NN is also much less sensitive to the data distribution: if it differs between the training and test sets, this could explain the difference with the first two classifiers.
  • However, it's rarely meaningful for the $k$ in $k$-NN to be this high (125). Normally it should be a low value, often a single digit. I'm not sure what this means in this case.
  • Suggestion: you could try some more robust classifiers like decision trees (or random forests) or SVM (a minimal starting point is sketched below).
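
To make the baseline and distribution points concrete, here is a minimal sketch, assuming the same train_X/train_Y/test_X/test_Y arrays as in the question, of how the majority baseline and the train/test class balance could be checked:

import numpy as np
from sklearn.dummy import DummyClassifier

# Majority baseline: always predict the most frequent training class.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(train_X, train_Y)
print("Majority baseline accuracy:", dummy.score(test_X, test_Y))

# Quick check that the train and test sets have a similar class balance.
for name, y in [("train", train_Y), ("test", test_Y)]:
    values, counts = np.unique(y, return_counts=True)
    print(name, dict(zip(values, np.round(counts / counts.sum(), 2))))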
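
And for the last suggestion, a rough starting point with default hyperparameters (these would of course need tuning):

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import metrics

# Fit each suggested classifier with default settings and compare
# its accuracy and confusion matrix on the same test set.
for name, clf in [("Random forest", RandomForestClassifier(random_state=0)),
                  ("SVM", SVC())]:
    clf.fit(train_X, train_Y)
    pred = clf.predict(test_X)
    print(name, "accuracy:", metrics.accuracy_score(test_Y, pred))
    print(metrics.confusion_matrix(test_Y, pred))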
