What is the minimum size of the test set?

The mean of a population of binary values can be estimated to within a few percentage points with about 1000 samples at 95% confidence, or about 3000 samples at 99% confidence.

Assuming a binary classification problem, why is the 80/20 split rule always used, rather than relying on the fact that with a few thousand samples the mean accuracy can be estimated with >95% confidence?
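For context, here is a minimal sketch of the standard worst-case sample-size calculation that figures like these come from (the margin of error m is an assumption; the exact counts depend on the margin you pick):

    # n = z^2 * p*(1-p) / m^2 with worst-case p = 0.5 (binary outcome).
    from math import ceil
    from statistics import NormalDist

    def min_samples(confidence, margin, p=0.5):
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # two-sided critical value
        return ceil(z ** 2 * p * (1 - p) / margin ** 2)

    for conf, m in ((0.95, 0.03), (0.99, 0.03)):
        print(f"{conf:.0%} confidence, +/-{m:.1%} margin -> n >= {min_samples(conf, m)}")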

Topic: ab-test, cross-validation, statistics

Category: Data Science


For classification problems, check https://pdfs.semanticscholar.org/b10d/33a34a21d8806cb35509d6a79ff7827a4b24.pdf

The number of test examples N needed depends on the error rate E you expect to get and on the error bar you want. The idea is that you need to have seen enough errors to get a small enough error bar (if you do not see any errors, you cannot get any precision). The rule of thumb (Formula 17):

 total number of errors seen = N * E ~= 100,
 so your test set size should be N = 100/E.

Qualitative explanation (see the paper for more details): if your error rate is E and your test set size is N, the error bar is proportional to:

 sigma = sqrt(E(1-E)/N)
       ~= sqrt(E/N) if the error rate is small.

Imagine that you want to use a 2-sigma error bar (~95% confidence interval), DeltaE = 2 sigma, and that you can accept a relative error DeltaE/E = 0.2 (roughly one significant digit of precision). Then:

 DeltaE/E ~= 2 sqrt(E/N)/E = 0.2

Solving for N yields the test set size:

 N = 100/E
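A small sketch of this rule of thumb (the helper name is just for illustration): it shows that N = 100/E corresponds to seeing about 100 errors and keeps the 2-sigma error bar at roughly 20% of E.

    from math import sqrt

    def test_set_size(expected_error):
        return round(100 / expected_error)

    for E in (0.01, 0.05, 0.10):
        N = test_set_size(E)
        sigma = sqrt(E * (1 - E) / N)                    # binomial standard error
        print(f"E={E:.2f}: N={N}, 2*sigma={2 * sigma:.4f}, relative error={2 * sigma / E:.2f}")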

This is a great question, since it is illuminating to examine how the best practices of traditional statistics and machine learning come together here.

Two Separate Rules/Best Practices -

First, the two items that you mentioned should be examined separately and not conflated; that is, they need to be combined carefully, as you suggest in your question.

Significance: You have estimated that you want more than 1000 cases for statistical significance, and that you would further benefit from more than 3000 test cases.

Cross validation: Cross validation is performed by splitting the data (often 80% train / 20% test) so that bias (~underfitting) and variance (~overfitting) can be assessed.

Combining significance with cross validation: Now, we know that we want significance in our tests, so we want more than 3000 records. We also want to perform cross validation, so for both the testing and training data to return significant results we want each to have a minimum of 3000 records. The best scenario would be to have 15,000 total records; that way the data can be split 80/20 and the test set is still significant.

Let's assume you only have 4000 records in your data set. In this case, I would opt to make my training data significant while allowing the test set to drop to lower significance.
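A quick back-of-the-envelope check of the record counts above (the helper name and the 3000-record threshold are taken from the reasoning above, not from any standard formula):

    # Minimum total records so that an 80/20 split still leaves a "significant" test set.
    def min_total_records(min_test_size, test_fraction):
        return int(min_test_size / test_fraction)

    print(min_total_records(3000, 0.20))   # 15000 records -> test set of exactly 3000
    print(int(4000 * 0.20))                # with only 4000 records, the 20% test set is just 800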

More rigor: Everything above has been quite hand-wavy and lacks statistical rigor. For that, I refer you to a couple of papers referenced in another Stack Exchange question:

Dietterich, Neural Computation, 10(7), 1895–1923 (1998), "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms".

Salzberg, Data Mining and Knowledge Discovery, 1, 317–327 (1997), "On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach".

Hope this helps!


Maybe some of your best predictors don't occur very often (they are sparse), and if you have 10 million rows and only take 3000 samples, you may not capture any rows with those really important sparse values. For example, you could have zip code as a predictor: with over 43,000 unique zip codes in the US spread across the 10 million rows, your sample of 3000 could cover about 7% of the unique zip codes at best. That would make it tough to get an accurate measure of error.
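To make the zip-code point concrete, here is a rough simulation (synthetic, uniformly distributed zip codes; real data would be skewed, so actual coverage would be at most this good):

    # How many of ~43,000 unique zip codes does a 3000-row sample out of 10 million cover?
    import numpy as np

    rng = np.random.default_rng(0)
    n_rows, n_zips, sample_size = 10_000_000, 43_000, 3_000
    zip_codes = rng.integers(0, n_zips, size=n_rows)               # one zip code per row
    sample_idx = rng.choice(n_rows, size=sample_size, replace=False)
    coverage = np.unique(zip_codes[sample_idx]).size / n_zips
    print(f"unique zip codes covered by the sample: {coverage:.1%}")   # roughly 7% at best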

The reason for using a test set whose size is relative to the data (be it 20% or 30% holdout, or 10-fold cross validation) is to have a standard and more robust measure of error than just a fixed number of samples.

The only reason why I think you would be tempted to use a few thousand samples as a test set, is to use more data in your model building procedure. If that is the case, cross validation like 10-fold would be more tempting since it uses all of your data, and gives you an idea of the variability of your model performance.

I'd play around with some open binary data set that has 100,000 or so samples and try out your proposal yourself. Run two simulations many times: one where you train on 80% and test on 20% and record the misclassification error each time, and another where you train on (n - 5000) and test on 5000 and record the misclassification error each time. Then look at the stored errors and compare their means and standard deviations. My bet is that the 20% test set will be much more stable.
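A possible sketch of that experiment, using a synthetic data set and logistic regression as stand-ins (the data set, model, and number of repeats are all assumptions):

    # Compare the spread of the misclassification error for a 20% test split
    # versus a fixed 5000-example test set, over many random splits.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

    def repeated_errors(test_size, n_repeats=30):
        errors = []
        for seed in range(n_repeats):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=test_size, random_state=seed)
            model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            errors.append(1 - model.score(X_te, y_te))   # misclassification error
        return np.array(errors)

    for test_size in (0.2, 5000):                        # 20% split vs fixed 5000
        e = repeated_errors(test_size)
        print(f"test_size={test_size}: mean error={e.mean():.4f}, std={e.std():.4f}")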


It is just a rule of thumb: the bigger your test set, the more accurate your performance measure. In practice, most people use k-fold cross validation to get a better performance estimate than a single 80/20 split. The higher k, the lower the variance of your estimate. Even better would be leave-one-out cross validation, but that gets computationally very expensive. If there is stochastic behavior in your model, an even better approach is to repeat the cross validation multiple times, since training on the same set will yield a different model each time. Making a choice here is simply a trade-off between the variance of your estimate and the computational cost of training your model.
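For illustration, a minimal k-fold sketch (synthetic data and logistic regression are assumptions; swap in your own model and data):

    # k-fold cross validation: the estimate is the mean score across folds,
    # and the computational cost is k model fits (leave-one-out would be n fits).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
    model = LogisticRegression(max_iter=1000)

    for k in (5, 10, 20):
        scores = cross_val_score(model, X, y, cv=k)      # k fits, one per held-out fold
        print(f"k={k}: estimated accuracy = {scores.mean():.4f} ({k} fits)")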
