Ideal difference between training accuracy and testing accuracy

In a data classification problem (with supervised learning), what should be the ideal difference between the training set accuracy and the testing set accuracy? What is the ideal range? Is a difference of 5% between the training and testing accuracy okay, or does it signify overfitting?

Topic: training-data, supervised-learning, accuracy, classification

Category: Data Science


A difference of 5% is fine. Try using cross-validation and compare the mean accuracies across folds.

Empirically good settings for k-fold cross-validation are k = 10 with stratification on the target attribute; see the sketch below.

Also, check whether your dataset is balanced, i.e., whether the classes have roughly similar counts.
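
A minimal sketch of both checks, assuming scikit-learn is available; the dataset X, y and the logistic-regression model here are placeholders, not your actual data or estimator:

    # Minimal sketch: check class balance, then run stratified 10-fold CV
    # and look at the mean accuracy across folds. X, y, and the model are
    # placeholders for illustration only.
    import numpy as np
    from collections import Counter
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X = np.random.rand(500, 10)             # placeholder features
    y = np.random.randint(0, 2, size=500)   # placeholder binary labels

    # 1) Is the dataset balanced?
    print("Class counts:", Counter(y))

    # 2) Stratified 10-fold cross-validation with mean test accuracy.
    model = LogisticRegression(max_iter=1000)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"Mean CV accuracy: {scores.mean():.3f} (std {scores.std():.3f})")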


Theoretically speaking, in a perfect scenario, both the training and test data accurately represent the distribution of your problem. Therefore, in an ideal case, there should be no significant difference between training and testing accuracy. This becomes increasingly true as you have more data.

A difference of 5% is perfectly fine. In practice, it is common for training accuracy to be slightly better than test accuracy. That said, the difference alone may not be the best indicator; what you should look at is how the two move together. As long as training and testing accuracy improve together at a similar rate, you are in the clear, regardless of how far apart they are. You can investigate this by training and evaluating on increasingly bigger subsets of the data, as in the sketch below. Ideally, training and testing accuracy should both improve as you add data. If test accuracy starts decreasing while training accuracy keeps improving, you have overfitting.
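
A rough sketch of that check using scikit-learn's learning_curve helper; again, the estimator and the randomly generated X and y are stand-ins for your own model and data:

    # Rough sketch of the "train on increasingly bigger subsets" check using
    # scikit-learn's learning_curve. The estimator and the random X, y are
    # placeholders for illustration only.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, learning_curve

    X = np.random.rand(500, 10)
    y = np.random.randint(0, 2, size=500)

    sizes, train_scores, test_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5),
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        scoring="accuracy",
    )

    # Training and test accuracy should improve together; if test accuracy
    # starts dropping while training accuracy keeps rising, that divergence
    # is the overfitting signal described above.
    for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
        print(f"n={n:4d}  train={tr:.3f}  test={te:.3f}")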
