Interpreting confidence interval results for datasets

Question

dmnte

May 20, 2022 01:02

I have created a dataset automatically and wanted to clarify my interpretation of the amount of noise using the confidence interval.

I selected a random sample, annotated it manually, and found that 98% of the labels were correct. From these values I calculated a 99% confidence interval, which gave a lower bound of 0.9614 and an upper bound of 0.9949. Does this mean that the noise in the overall dataset lies between the complements of these bounds, i.e. from about 0.5% to 3.9%?

Topic confidence text-classification dataset statistics

Category Data Science

Robert Long · Accepted Answer · August 28, 2020 18:51

No, that isn't what it means.

For one thing, it is not clear which parameter your confidence interval is for.

In any case, some care is needed in the interpretation of (frequentist) confidence intervals.

In frequentist statistics, the confidence interval is random and the parameter it is for is fixed. For a 99% interval, this means that if the data were collected again many times and the interval recalculated each time, then in the long run about 99 out of every 100 of those intervals would contain the true value of the parameter. This is the only technically correct interpretation of a frequentist confidence interval. It is often interpreted, incorrectly, as an interval that contains the parameter with 99% probability, and that appears to be the interpretation you are using.
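To make the repeated-sampling interpretation concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration: the question does not state the sample size or the interval method, so the code uses a hypothetical true label accuracy of 0.98, a hypothetical sample size of 500, and the Wilson score interval for a binomial proportion. The function names `wilson_interval` and `coverage` are made up for this example.

```python
import math
import random
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.99):
    """Wilson score interval for a binomial proportion (one possible method)."""
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def coverage(true_p=0.98, n=500, repeats=2000, confidence=0.99, seed=0):
    """Fraction of simulated samples whose interval contains true_p.

    true_p and n are hypothetical values chosen for illustration only.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        # Draw a fresh sample: each label is correct with probability true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        lower, upper = wilson_interval(successes, n, confidence)
        # The parameter is fixed; the interval is what changes between samples.
        hits += lower <= true_p <= upper
    return hits / repeats


if __name__ == "__main__":
    print(f"fraction of intervals containing the true value: {coverage():.3f}")
```

The printed fraction should come out close to the nominal confidence level, though not exactly equal to it, because the binomial distribution is discrete and the number of repetitions is finite. The point is that the 99% describes the procedure across many samples, not any single interval.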
