Confidence intervals for evaluation on test set
I'm wondering what the best-practice approach is for finding confidence intervals when evaluating the performance of a classifier on the test set.
As far as I can see, there are two different ways of getting an uncertainty estimate for a metric like, say, accuracy:
Use the normal-approximation (Wald) interval, interval = z * sqrt(error * (1 - error) / n), where n is the test-set size, error is the classification error (i.e. 1 - accuracy) and z is the number of Gaussian standard deviations corresponding to the desired confidence level (e.g. 1.96 for 95%).
Split the training data into k folds and train k classifiers, leaving a different fold out each time. Then evaluate each of these classifiers on the test set and compute the mean and variance of the resulting scores (a rough sketch of both options is below).
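To make it concrete, here is a minimal sketch of what I mean by the two options, assuming scikit-learn and accuracy as the metric; the function names and the choice of k are just placeholders for illustration:

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def wald_interval(error, n, z=1.96):
    """Option 1: normal-approximation (Wald) half-width for a classification
    error estimated from n test examples."""
    return z * np.sqrt(error * (1.0 - error) / n)

def kfold_test_scores(model, X_train, y_train, X_test, y_test, k=5, seed=0):
    """Option 2: train k models, each leaving out a different training fold,
    and score every model on the same held-out test set."""
    scores = []
    for train_idx, _ in KFold(n_splits=k, shuffle=True, random_state=seed).split(X_train):
        m = clone(model).fit(X_train[train_idx], y_train[train_idx])
        scores.append(m.score(X_test, y_test))  # accuracy on the fixed test set
    return np.mean(scores), np.std(scores, ddof=1)
```

So for option 1 I would do something like half_width = wald_interval(1 - test_accuracy, len(y_test)), while for option 2 I would report the mean and standard deviation returned by kfold_test_scores.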
Intuitively, I feel like the latter would give me an estimate of how sensitive the performance is to changes in the training data, whereas the former would allow me to compare two different models directly.
I have to say I'm a bit confused...
Topic: uncertainty, confidence, classification, statistics, machine-learning
Category: Data Science