Massive difference in accuracy of KNN depending on random_state

Pardon the noob question, but I am baffled by the following behavior: my model gives MASSIVELY different results depending on the random seed. I want to train a KNN classifier on the famous Kaggle Titanic problem, where we attempt to predict whether a passenger survived. To make things easier, I focus only on the Sex feature.

The problem is that changing the random seed changes the accuracy dramatically. For example, one random seed gives me a score of 0.78 and another gives me a score of 0.17, and different random seeds give more or less everything in between. How can this significant change in score be explained? Also, why does the variation become less significant when n_neighbors is 2 or above? Thanks in advance! Here is the code in question.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df_sex = df[['Sex']]
y = df['Survived']

def training(state):
    X_train, X_test, y_train, y_test = train_test_split(df_sex, y, random_state=state)
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train, y_train)

    print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
    print("Train set score: {:.2f}".format(knn.score(X_train, y_train)))

training(0)
training(3)
training(10)
training(21)

gives output

Test set score: 0.78
Train set score: 0.79
Test set score: 0.22
Train set score: 0.21
Test set score: 0.17
Train set score: 0.23
Test set score: 0.77
Train set score: 0.79

Topic k-nn kaggle machine-learning

Category Data Science


Setting k=1 in the k-nearest neighbors algorithm (k-NN) makes it a high-variance machine learning algorithm: it looks at only a single neighbor to make its prediction. As you increase k, the variance goes down because the prediction uses majority voting across more examples.
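The effect is easy to reproduce without the Titanic data. The sketch below (scikit-learn and NumPy assumed; the data is synthetic and the 25% label-noise rate is an arbitrary choice) compares the spread of test scores across 20 random seeds for k=1 versus k=51 on a single binary feature:

```python
# Synthetic re-creation of the question's setup: one binary feature,
# with a label that correlates with it plus 25% label noise.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 800
sex = rng.integers(0, 2, size=n).reshape(-1, 1)                # single boolean feature
survived = (sex.ravel() ^ (rng.random(n) < 0.25)).astype(int)  # noisy label

def score_range(k, n_seeds=20):
    """Min and max test accuracy over several train/test random states."""
    scores = []
    for state in range(n_seeds):
        X_tr, X_te, y_tr, y_te = train_test_split(sex, survived, random_state=state)
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        scores.append(knn.score(X_te, y_te))
    return min(scores), max(scores)

print("k=1 :", score_range(1))   # wide spread: the single "nearest" point is an arbitrary tie
print("k=51:", score_range(51))  # narrow spread: majority vote over many tied points is stable
```

With k=1 the score bounces around from seed to seed, while with a large k the min and max are close together, mirroring the behavior in the question.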


$k$NN is the simplest supervised learning algorithm: given an input instance $x$, it finds the $k$ instances most similar to $x$ in the training set, and then predicts the majority class among these $k$ instances.

You use a single boolean feature to represent an instance, which means that an instance can only be true or false. The concept of "most similar" makes very little sense in this case: a true instance is equally "most similar" to every training instance that also has true for gender. This means that for every prediction $k$NN has to pick the $k$ "most similar" instances arbitrarily among all the training instances with the same gender value, and consequently the predicted class is practically arbitrary as well. This is why the performance is almost entirely determined by the random state.

This would be less likely to happen when using multiple and/or more complex features, since strict equality of all the features would not happen in general.
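The tie itself is easy to see directly (a minimal sketch, scikit-learn assumed; the five training rows are made up): with one boolean feature, every training row with the same value sits at distance 0 from the query, so the reported "nearest" neighbour is just a tie-break that depends on how the training data happens to be ordered.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1], [1], [1], [0], [0]])  # one boolean feature
y_train = np.array([1, 0, 1, 0, 0])            # mixed labels among the value-1 rows

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
dist, idx = knn.kneighbors([[1]])

# Rows 0, 1 and 2 are all at distance 0 from the query [1]; which one is
# returned (and hence which label gets predicted) depends on internal
# ordering, not on any meaningful notion of similarity.
print(dist, idx)
```

Shuffling the training rows, which is exactly what `train_test_split` does with each new `random_state`, can change which tied row wins and therefore flip the prediction.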
