How should labeled data from multiple annotators be prepared for ML text classification?
My specific question is how labeled NLP data from multiple human annotators should be aggregated, though general advice related to the question title is also appreciated. One critical step that I've seen in research is to assess inter-annotator agreement (IAA) with Cohen's kappa or some other suitable metric; I've also found research reporting agreement values for various datasets (e.g. here), which is helpful for baselining.
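For concreteness, here is a minimal sketch of the agreement check I have in mind, assuming every annotator labels the same items in the same order (the annotator names and labels below are made up). Cohen's kappa is defined for a pair of annotators, so for n > 2 one simple summary is the mean kappa over all pairs:

```python
# Minimal sketch: pairwise Cohen's kappa over n annotators (toy data).
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# One label per item per annotator; all annotators saw the same items.
annotations = {
    "ann_1": ["pos", "neg", "pos", "neg", "pos", "pos"],
    "ann_2": ["pos", "neg", "neg", "neg", "pos", "pos"],
    "ann_3": ["pos", "pos", "pos", "neg", "pos", "neg"],
}

# Cohen's kappa is pairwise, so summarize n annotators by the mean over pairs.
pair_scores = {
    (a, b): cohen_kappa_score(annotations[a], annotations[b])
    for a, b in combinations(annotations, 2)
}
for pair, kappa in pair_scores.items():
    print(pair, round(kappa, 3))
print("mean pairwise kappa:", round(sum(pair_scores.values()) / len(pair_scores), 3))
```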
How many annotators should work on each data point depends on time, personnel, and data-size requirements and constraints, among other factors (I may ask a follow-up question about how to find the optimal n). However, once n annotators have finished a dataset, how should their n sets of labels be unified into a "ground truth"? A few approaches that I have seen used, or that seem reasonable to me (a rough sketch combining some of them follows the list):
- Take averages over all annotators. Classification problems are sometimes hard to restate as graduated ones, although that seems necessary if an average is to be taken.
- Express some level of uncertainty in the data for controversial labels, or even omit those examples from training and evaluation.
- Add an arbitration step to unify or discard controversial labels. I am not sure this would be worth the annotators' time.
- Choose some "principal annotator(s)" (possibly determined by IAA scores) who get the final word in conflicts.
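To make the averaging / uncertainty / arbitration options concrete, here is the kind of aggregation I am picturing. This is only a rough sketch; the agreement threshold, the idea of keeping both a majority label and a "soft" label distribution, and the use of None to mark items for arbitration are my own assumptions rather than an established recipe:

```python
# Rough sketch: unify n annotators' labels per item into one record.
from collections import Counter

def aggregate_labels(labels_per_item, min_agreement=0.75):
    """labels_per_item: list of per-item label lists, one label per annotator."""
    aggregated = []
    for labels in labels_per_item:
        counts = Counter(labels)
        top_label, top_count = counts.most_common(1)[0]
        agreement = top_count / len(labels)
        # "Average over annotators" as a soft label: empirical class distribution.
        soft_label = {lab: c / len(labels) for lab, c in counts.items()}
        aggregated.append({
            # None marks a controversial item: drop it, down-weight it,
            # or route it to an arbitration / principal-annotator step.
            "label": top_label if agreement >= min_agreement else None,
            "agreement": agreement,
            "soft_label": soft_label,
        })
    return aggregated

items = [
    ["pos", "pos", "pos", "neg"],  # clear majority -> kept
    ["pos", "neg", "neg", "pos"],  # 50/50 split -> flagged as controversial
]
for record in aggregate_labels(items):
    print(record)
```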
Guidance/references for the above, and any other steps I can take to build a high-quality dataset, are much appreciated. I am mostly interested in efficiently removing individual annotator bias even when n is low.
Topic: annotation, dataset, nlp
Category: Data Science