What's the difference between the GridSearchCV cross-validation score and the score on the test set?

I'm doing classification in Python using the class GridSearchCV. This class has the attribute best_score_, defined as the "Mean cross-validated score of the best_estimator".

With this class I can also compute the score on the test set using score.

Now, I understand the theoretical difference between the two values (one is computed during cross-validation, the other on the test set), but how should I interpret them? For example, in case 1 I get best_score_ = 0.9236840458731027 and a test-set score of 0.8483477781024932, while in case 2 I get 0.8923046854943018 and 0.8733431353820776. Which case should I prefer, and why? Why can the difference between the two values vary so much?
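Roughly, my setup looks like the sketch below (the SVC model, the parameter grid and the dataset are just placeholders here, not my actual code):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_score_)            # mean cross-validated score of best_estimator_
print(search.score(X_test, y_test))  # score of best_estimator_ on the held-out test set
```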

Topic grid-search gridsearchcv keras classification python

Category Data Science


You should select a model based on GridSearchCV result.

You should not select based on the test dataset score. Selecting a model based on the test score lowers the chance that the model will generalize to unseen data. The test dataset should only be looked at once.

For the specific cases you list, case 1 has the higher GridSearchCV result, so that is the better model.

One possible reason the scores vary so much is that the model has relatively high variance. There are many ways to lower variance: increase the amount of data, apply dimensionality reduction or feature selection, change to a different algorithm, or increase regularization.
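As a rough sketch of that last lever, increasing regularization just means widening the grid toward stronger penalties. The logistic-regression pipeline and the dataset below are illustrative assumptions, not something implied by the question:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Smaller C = stronger L2 penalty = usually lower variance.
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```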


Whenever you (or your computer) make decisions based on scores, those scores can no longer be relied upon as unbiased. So the best_score_, while it is based on scoring models on unseen-to-them data, is still an optimistically biased estimate of future performance. (An easy way to see this: if your hyperparameters have no effect except randomness, then choosing the highest-scoring one is obviously not actually better than the others, nor is that maximum value a good estimate of later performance.)
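You can see this effect with a small experiment: grid-search a hyperparameter that only injects randomness, on pure-noise data. The randomized tree, the seed grid and the fake dataset below are all illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
y = rng.randint(0, 2, size=200)  # labels unrelated to the features

# The 20 candidates differ only in the seed of a randomized splitter,
# so none of them is genuinely better than another.
base = DecisionTreeClassifier(splitter="random", max_depth=3)
search = GridSearchCV(base, {"random_state": list(range(20))}, cv=5)
search.fit(X, y)

mean_scores = search.cv_results_["mean_test_score"]
print(search.best_score_)  # the maximum of the candidates' mean CV scores...
print(mean_scores.mean())  # ...sits above their average, purely by selection
```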

So your second option is better, having a higher (fresh) test score. Of course, if you had two choices up to this point and now you use the test score to select one, that is no longer an unbiased estimate of future performance!

All that said, usually best_score_ is reasonably close to test performance, especially if you don't have many hyperparameters to play with or if they have small impact on the modeling; your first option is a surprisingly large drop. One thing to consider is how large the test set is, and how representative it is. If your test set is too small to capture all the nuance of your population, but the training set is very large, then perhaps your test scores are more impacted by that noise and the cross-validated scores are actually more stable despite the selection bias.
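One rough way to gauge the test-set-size concern is the binomial standard error of an accuracy estimate; the accuracy value and test-set sizes below are made up for illustration:

```python
import math

def accuracy_std_error(acc: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test samples."""
    return math.sqrt(acc * (1 - acc) / n_test)

print(accuracy_std_error(0.85, 200))   # ~0.025: swings of a few points are plausible
print(accuracy_std_error(0.85, 5000))  # ~0.005: a large gap is unlikely to be noise alone
```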


Grid-search cross-validation can be used to learn the hyper-parameters of a prediction function. Bear in mind that learning and testing a model on the same data is a big mistake: the chance of having a perfect score but failing to predict anything useful on yet-unseen data (i.e., overfitting) is very high. It is therefore common practice, when training a model, to hold out part of the data as a test set in order to detect overfitting and measure the model's performance. Also note that while the best hyper-parameters can be determined by grid search, the score resulting from grid search should not be used as a criterion to measure the performance of the model. Please refer to this page for more information.
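As a minimal illustration of that warning (the dataset and the unpruned decision tree are just assumptions for the sketch), a flexible model can look perfect on the data it was fit on and noticeably worse on held-out data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(model.score(X_train, y_train))  # perfect on the data it was fit on
print(model.score(X_test, y_test))    # noticeably lower on held-out data
```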

That being said, best_score_ from GridSearchCV is the mean cross-validated score of the best_estimator. For example, with 5-fold cross-validation, GridSearchCV divides the training data into 5 folds and trains the model 5 times. Each time, it puts one fold aside, trains the model on the remaining 4 folds, and then measures the model's performance on the left-out fold. Finally, it returns the mean performance of the 5 models as the score for that parameter combination.
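Here is a small sketch of those mechanics, assuming an SVC and a toy dataset: re-running plain 5-fold cross-validation with the winning parameters on the same splits should give back best_score_ as the mean of the per-fold scores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # fixed splits for comparability

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=cv).fit(X, y)

# Plain 5-fold CV with the winning parameters on the same splits:
fold_scores = cross_val_score(SVC(**search.best_params_), X, y, cv=cv)
print(fold_scores)         # one score per left-out fold
print(fold_scores.mean())  # should equal search.best_score_
print(search.best_score_)
```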

Now, let's answer this question: what does the best estimator mean? It is the estimator chosen by the search, i.e. the one that gave the highest score (or smallest loss, if specified) on the left-out data. GridSearchCV's goal is to find the optimal hyperparameters: it receives a range of parameters as input and finds the best combination based on the mean score explained above. Grid search trains a model for each combination of the input parameters and finally returns the best model, the best estimator. Hence, best_score_ is the mean cross-validated score of that best estimator. It is notable that tuning hyperparameters with cross-validation in this way is itself one of the methods that helps you prevent overfitting.
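For reference, a short sketch of how these attributes are exposed (the dataset and the depth grid are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 8]},
    cv=5,
).fit(X, y)

print(search.best_params_)                    # the winning hyperparameter combination
print(search.best_estimator_)                 # that model, refitted on all of X, y
print(search.best_score_)                     # its mean cross-validated score
print(search.cv_results_["mean_test_score"])  # mean CV score of every combination
```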

In your second case, 0.8923046854943018 is the mean score of the best estimator; let's call it the cross-validation score. I would go with that second case, because its cross-validation and test scores are almost the same, so there is little sign of overfitting. In the first case, the cross-validation score is significantly higher than the score on the unseen test set, which suggests overfitting: the model works very well on the training data but not on unseen data.
