random_state in train_test_split() appears to have a large effect on performance metrics?
To summarize the problem: I have a data set with ~1450 samples, 19 features and a binary outcome where classes are fairly balanced (0.51 to 0.49).
I split the data into a train set and a test set using train_test_split(X, Y, test_size = 0.30, random_state = 42).
I am using the train set to tune hyper-parameters, optimizing for specificity with GridSearchCV, a RepeatedStratifiedKFold (10 splits, 3 repeats) cross-validation, and scoring=make_scorer(recall_score, pos_label=0). I then use the predictions on the test set to evaluate several metrics, including specificity.
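For reference, the setup looks roughly like this (a sketch: the variable names X and Y, the C grid, and the cv random_state are placeholders, not my exact values):

```python
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     train_test_split)
from sklearn.metrics import make_scorer, recall_score
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=42)

# Specificity = recall of the negative class, hence pos_label=0
specificity = make_scorer(recall_score, pos_label=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(
    SVC(kernel='linear'),
    param_grid={'C': [0.01, 0.1, 1, 10, 100]},  # placeholder grid
    scoring=specificity,
    cv=cv,
)
grid_search.fit(X_train, y_train)

# CV best score vs. specificity of the refit model on the held-out test set
test_specificity = recall_score(
    y_test, grid_search.predict(X_test), pos_label=0)
print(grid_search.best_score_, test_specificity)
```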
I started with a linear SVM (adjusting C), just to test out the process, and to my surprise grid_search.best_score_ was ~0.67 while the specificity on the test set was ~0.55. That seemed like a large difference to me, and I was unsure what was happening. I did a lot of troubleshooting to try to rule out any PEBMAC, and as a last resort I took the same code and looped through the random_state/seed in train_test_split(), calculating the difference between .best_score_ and the test score. It turns out:
It appears that by using random_state=42, I unknowingly chose the random state (out of 0 to 99) with the largest difference between the grid search best score and the specificity on the test set predictions.
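The sweep was essentially the following (a sketch, reusing the grid_search object from above; the bookkeeping in my actual script differs slightly):

```python
import numpy as np

gaps = []
for seed in range(100):  # random_state 0..99
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.30, random_state=seed)
    grid_search.fit(X_train, y_train)
    test_spec = recall_score(y_test, grid_search.predict(X_test), pos_label=0)
    gaps.append(grid_search.best_score_ - test_spec)

# Seed with the largest CV-vs-test gap
worst_seed = int(np.argmax(gaps))
print(worst_seed, gaps[worst_seed])
```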
While this difference worries me, what worries me more is how much varies with the random state: the best value of C, as well as other metrics such as accuracy (10% spread between min and max), precision (14% spread), and recall (17% spread). This makes me doubt how to actually evaluate and report the best hyperparameters and the metrics for this model, since the choice of split seems to have such a major influence.
Is there a better way to get truly representative metrics for the model, ones that don't vary greatly with how the data is split? Is my method just flawed, or is there a gold standard for this that I am missing?
Sorry for the wall of text!
Thanks!
Edit:
Considering Erwan's comment that optimizing for specificity might be contributing to the problem (and it still might), I ran the same experiment optimizing for accuracy; striking differences are still apparent.
Coincidentally, 42 still seems to be a fairly unlucky random state in terms of the difference between the best accuracy score and the test accuracy score.
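The only change from the setup above was the scoring (sketch):

```python
from sklearn.metrics import accuracy_score

grid_search.set_params(scoring='accuracy')  # optimize for accuracy instead

# Same seed sweep as before, now comparing accuracy instead of specificity
gaps = []
for seed in range(100):
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.30, random_state=seed)
    grid_search.fit(X_train, y_train)
    test_acc = accuracy_score(y_test, grid_search.predict(X_test))
    gaps.append(grid_search.best_score_ - test_acc)
```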
Topic gridsearchcv scikit-learn svm python machine-learning
Category Data Science