Confidence Intervals for Multi-Categorical Votes
I have an ngram-based language model that produces a long list of tags for a given sentence. For example, the previous sentence, broken into bigrams and run through the model, might produce something like:
{I have}=>C1 {have an}=>C2 {an ngram}=>C1 {ngram based}=>C3, etc.
This results in the counts C1=2, C2=1, C3=1 (for the segment shown above).
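For concreteness, the tally could be computed roughly like this. It is only a sketch: the `tags` list is a hypothetical stand-in for the model output on the four bigrams shown, and the variable names are mine.

```python
from collections import Counter

# Hypothetical tag list: one tag per bigram, matching the segment shown above
tags = ["C1", "C2", "C1", "C3"]

counts = Counter(tags)                 # raw votes per category
total = sum(counts.values())           # number of bigrams tagged

# Shares control for sentence length (more bigrams means more votes)
shares = {tag: n / total for tag, n in counts.items()}

# Categories sorted by count, highest first
ranked = counts.most_common()
print(ranked, shares)
```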
It's easy enough to pick the winner by sorting the categories, either by raw count or by percentage (which would control for sentence length). But I want a confidence interval (CI) on that winner: that is, I want to know when there is a statistical tie among the top N categories (by either count or percent).
I'm sure there's an obvious way to do this. Pointers appreciated!
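To make the question concrete, here is a minimal sketch of one thing that could be tried: a percentile bootstrap on the gap between the top two shares, calling it a tie when the interval contains zero. This is an assumption on my part, not necessarily the right tool. The function name `top_two_gap_interval` and its parameters are made up for illustration, and it treats the tags as independent draws, which overlapping bigrams from one sentence probably are not (so the interval is likely too narrow).

```python
import random
from collections import Counter


def top_two_gap_interval(tags, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for share(winner) - share(runner-up).

    Assumes the tags are independent draws, which is a simplification.
    """
    ranked = Counter(tags).most_common(2)
    if len(ranked) < 2:
        raise ValueError("need at least two distinct categories")
    (winner, _), (runner_up, _) = ranked

    rng = random.Random(seed)          # fixed seed so runs are repeatable
    n = len(tags)
    gaps = []
    for _ in range(n_boot):
        # Resample the tag list with replacement, same length as the original
        resampled = Counter(rng.choices(tags, k=n))
        gaps.append((resampled[winner] - resampled[runner_up]) / n)

    gaps.sort()
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return winner, runner_up, lo, hi


# Hypothetical tag lists: one clear winner, one near tie
clear = ["C1"] * 60 + ["C2"] * 25 + ["C3"] * 15
close = ["C1"] * 51 + ["C2"] * 49

for name, tag_list in [("clear", clear), ("close", close)]:
    winner, runner_up, lo, hi = top_two_gap_interval(tag_list)
    verdict = "tie" if lo <= 0 <= hi else "distinct"
    print(name, winner, runner_up, round(lo, 3), round(hi, 3), verdict)
```

Extending this to the top N categories would mean checking each runner-up against the winner, which raises a multiple-comparisons issue that the sketch ignores.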
Topic counts multiclass-classification nlp
Category Data Science