Massive difference in accuracy of KNN depending on random_state
Pardon the noob question, but I am baffled by the following behavior: my model gives MASSIVELY different results depending on the random seed. I want to train a KNN classifier on the famous Kaggle Titanic problem, where we attempt to predict whether a passenger survived. To keep things simple, I use only the Sex feature.
The problem is that changing the random seed changes the accuracy dramatically. For example, one random seed gives me a score of 0.78, another gives 0.17, and different seeds give more or less everything in between. How can this huge swing in score be explained? Also, why does the swing become much smaller when n_neighbors is 2 or above? Thanks in advance! Here is the code in question.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df_sex = df[['Sex']]
y = df['Survived']

def training(state):
    X_train, X_test, y_train, y_test = train_test_split(df_sex, y, random_state=state)
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train, y_train)
    print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
    print("Train set score: {:.2f}".format(knn.score(X_train, y_train)))

training(0)
training(3)
training(10)
training(21)
gives output
Test set score: 0.78
Train set score: 0.79
Test set score: 0.22
Train set score: 0.21
Test set score: 0.17
Train set score: 0.23
Test set score: 0.77
Train set score: 0.79
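For reference, the same seed-to-seed swing can be reproduced without the Kaggle CSV on synthetic stand-in data. The survival rates and dataset size below are assumptions chosen only to mimic the Titanic setup (one binary feature, n_neighbors=1), not the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Titanic frame: one binary feature
# (0 = male, 1 = female); the survival probabilities are made up
# for illustration.
rng = np.random.default_rng(0)
sex = rng.integers(0, 2, size=891).reshape(-1, 1)
survived = np.where(sex.ravel() == 1,
                    rng.random(891) < 0.74,   # most "females" survive
                    rng.random(891) < 0.19).astype(int)

# Train a 1-NN classifier under many different random_state values
# and record how much the test accuracy moves around.
scores = []
for state in range(30):
    X_train, X_test, y_train, y_test = train_test_split(
        sex, survived, random_state=state)
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))

print("min={:.2f} max={:.2f}".format(min(scores), max(scores)))
```

With a single binary feature, every test point is at distance 0 from many training points, so the "nearest" neighbor is an arbitrary tie-break that depends on how the split shuffled the rows, which is why the scores spread out so much across seeds.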
Topic k-nn kaggle machine-learning
Category Data Science