Question on bootstrap sampling

I have a corpus of manually annotated (aka gold standard) documents and a collection of NLP systems' annotations on the text from the corpus. I want to do bootstrap sampling of the system output and the gold standard to approximate a mean and standard error for various measures, so that I can run a series of hypothesis tests, possibly using ANOVA.

The issue is how to do the sampling. I have 40 documents in the corpus with ~44K manual annotations in the gold standard. I was thinking of using each document as a sampling unit and taking 60% of the documents for each sample (24 documents per sample). However, each manually annotated document does not contain the same number of annotations, which violates the assumption of equal sample sizes across samples.

Any suggestions on how to achieve this bootstrap?

Topic bootstrapping nlp

Category Data Science


It simply depends on what you count as your object of interest: from your description, the unit can be either the document or the annotation. Your method uses the document as the unit, which is fine as long as the tests you plan to run are compatible with this.
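A minimal sketch of the document-as-unit approach, assuming you have already computed one metric value per document (the `doc_scores` list and the `bootstrap_by_document` name are hypothetical, not from the question). Note that the classical bootstrap resamples all 40 documents with replacement rather than taking a 60% subsample, which sidesteps the unequal-sample-size worry, since every replicate has the same number of documents:

```python
import random

def bootstrap_by_document(doc_scores, n_boot=1000, seed=0):
    """Bootstrap the mean of a metric using whole documents as the unit.

    doc_scores: one metric value per document (e.g. per-document F1).
    Each replicate resamples the documents with replacement, so every
    replicate has the same number of documents; the number of underlying
    annotations will still vary, which is fine for the bootstrap.
    """
    rng = random.Random(seed)
    n = len(doc_scores)
    replicate_means = []
    for _ in range(n_boot):
        sample = [doc_scores[rng.randrange(n)] for _ in range(n)]
        replicate_means.append(sum(sample) / n)
    mean = sum(replicate_means) / n_boot
    # standard error = standard deviation of the replicate means
    se = (sum((m - mean) ** 2 for m in replicate_means) / (n_boot - 1)) ** 0.5
    return mean, se

# toy example with hypothetical per-document scores
scores = [0.80, 0.85, 0.90, 0.75, 0.88]
mean, se = bootstrap_by_document(scores)
```

The replicate means (rather than just the final mean and SE) are what you would feed into downstream hypothesis tests.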

Another option is to use the annotation as the unit: in this case you would pick 60% of the 44K annotations each time, so each sample would mix annotations from multiple documents. Depending on what exactly you test, this might be an issue; in particular, I don't see how you would count false-negative cases this way.
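To make the limitation concrete, here is a hedged sketch of annotation-level resampling, assuming each gold annotation is flagged as matched or missed by the system (the `gold_matched` list and `bootstrap_recall` name are hypothetical). Resampling gold annotations supports a recall estimate, but false positives exist only on the system side, so precision cannot be recovered from this list alone:

```python
import random

def bootstrap_recall(gold_matched, n_boot=1000, seed=0):
    """Bootstrap recall using individual gold annotations as the unit.

    gold_matched: one boolean per gold annotation, True if the system
    found it. Each replicate resamples the annotations with replacement.
    Note: false positives (system annotations with no gold counterpart)
    are not in this list, so precision cannot be estimated this way.
    """
    rng = random.Random(seed)
    n = len(gold_matched)
    recalls = []
    for _ in range(n_boot):
        sample = [gold_matched[rng.randrange(n)] for _ in range(n)]
        recalls.append(sum(sample) / n)
    mean = sum(recalls) / n_boot
    se = (sum((r - mean) ** 2 for r in recalls) / (n_boot - 1)) ** 0.5
    return mean, se

# toy example: 80 of 100 gold annotations found by the system
gold = [True] * 80 + [False] * 20
recall_mean, recall_se = bootstrap_recall(gold)
```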

Since your documents vary in length (I assume), you could also consider intermediate units: sentence, paragraph, block of N sentences, etc.
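The intermediate-unit idea can be sketched as a simple block bootstrap: group per-item values (e.g. per-sentence scores) into blocks of N and resample whole blocks. Everything here is illustrative; `block_bootstrap` and the toy values are not from the question:

```python
import random

def block_bootstrap(values, block_size, n_boot=1000, seed=0):
    """Resample contiguous blocks of `block_size` items with replacement
    and return the mean of each replicate.

    values: per-item metric values (e.g. one score per sentence).
    Using blocks of N sentences keeps nearby items together while giving
    more, and more evenly sized, sampling units than whole documents.
    """
    rng = random.Random(seed)
    blocks = [values[i:i + block_size]
              for i in range(0, len(values), block_size)]
    replicate_means = []
    for _ in range(n_boot):
        sample = [v for _ in range(len(blocks))
                  for v in blocks[rng.randrange(len(blocks))]]
        replicate_means.append(sum(sample) / len(sample))
    return replicate_means

# toy example: hypothetical per-sentence scores, blocks of 2 sentences
sentence_scores = [0.6, 0.9, 0.8, 0.7, 1.0, 0.5, 0.8, 0.9]
replicates = block_bootstrap(sentence_scores, block_size=2)
```

The choice of N trades off independence between units against the number of units available for resampling.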
