Confidence Intervals for Multi-Categorical Votes

I have an n-gram-based language model that produces a long list of tags for a given sentence. For example, the previous sentence, broken into bigrams and run through the model, might produce something like:

{I have}=>C1 {have an}=>C2 {an ngram}=>C1 {ngram based}=>C3, etc.

resulting in the counts C1=2, C2=1, C3=1 (for the segment shown above).

It's easy enough to pick the winner by sorting either the counts or the percentages derived from them, which would control for sentence length. But I want a confidence interval (CI) on that winner; that is, I want to know when there is a statistical tie among the top N categories (by either count or percentage).

I'm sure there's an obvious way to do this. Pointers appreciated!

Topics: counts, multiclass-classification, nlp

Category: Data Science


I found a concise description of how to do this.

(But note that this doesn't correct for sentence length, so it's not a complete solution, because longer sentences will have more "votes".)
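
One way to make the "statistical tie" idea concrete is sketched below in Python. It is an illustration under stated assumptions, not necessarily the method from the description mentioned above: it puts a Wilson score interval on each category's share of the votes and treats a runner-up as tied with the winner when its upper bound reaches the winner's lower bound. The function names `wilson_interval` and `tied_with_winner` and the 95% level are hypothetical choices. Comparing separately computed intervals is only a rough, conservative heuristic, because the category shares within one sentence are not independent.

```python
from collections import Counter
from statistics import NormalDist


def wilson_interval(k, n, conf=0.95):
    """Wilson score interval for a proportion k/n (hypothetical helper)."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)  # two-sided critical value
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * ((p * (1 - p) / n + z * z / (4 * n * n)) ** 0.5) / denom
    return center - half, center + half


def tied_with_winner(tags, conf=0.95):
    """Return the top category plus every category whose interval
    overlaps the winner's interval (assumed tie rule, a heuristic)."""
    counts = Counter(tags)
    n = len(tags)
    ranked = counts.most_common()
    winner, k_win = ranked[0]
    win_lo, _ = wilson_interval(k_win, n, conf)
    tied = [winner]
    for cat, k in ranked[1:]:
        _, hi = wilson_interval(k, n, conf)
        if hi >= win_lo:  # rival's upper bound reaches the winner's lower bound
            tied.append(cat)
    return tied


# Tags for the four bigrams shown in the question.
segment = ["C1", "C2", "C1", "C3"]
print(tied_with_winner(segment))
```

With only the four votes from the segment in the question (C1=2, C2=1, C3=1), all three categories come out tied. The intervals are computed on shares rather than raw counts, and they narrow as the number of bigrams grows, so a longer sentence with the same shares can separate the winner from the rest.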
