How should labeled data from multiple annotators be prepared for ML text classification?

My specific question is how NLP data from multiple human annotators should be aggregated - though general advice related to the question title is appreciated. One critical step that I've seen in research is to assess inter-annotator agreement by Cohen's kappa or some other suitable metric; I've also found research reporting values for various datasets (e.g. here), which is helpful for baselining.
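For reference, with two annotators this agreement metric can be computed directly, e.g. with scikit-learn; a minimal sketch on made-up labels:

```python
# Sketch: Cohen's kappa between two annotators on the same items,
# using scikit-learn. The label arrays are illustrative placeholders.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pos", "neg", "neg", "pos", "neu", "pos"]
annotator_b = ["pos", "neg", "pos", "pos", "neu", "neg"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```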

How many annotators should label each data point depends on time, personnel, and dataset-size constraints, among other factors (I may ask a follow-up question about how to find an optimal n). But once n annotators have each finished the dataset, how should those n sets of labels be unified into a "ground truth"? A couple of approaches that I have seen used, or that seem reasonable to me:

  • Take an average over all annotators. Classification problems are sometimes hard to restate in graduated terms, yet that seems necessary if an average is to be taken (see the aggregation sketch after this list).

  • Express some level of uncertainty in the data for controversial labels, or even omit them from training and evaluation.

  • Add an arbitration step to unify or discard controversial labels. I am not sure this would be worth the annotators' time.

  • Choose some "principal annotator(s)" (possibly determined by IAA scores) who get the final word in conflicts.
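To make the first two bullets concrete, here is a rough sketch of what I have in mind, assuming labels are collected as one list per annotator over the same items (the data and the agreement threshold are invented for illustration):

```python
# Sketch: average per-item label distributions and flag/omit controversial items.
from collections import Counter

annotations = {
    "ann1": ["pos", "neg", "pos", "neu"],
    "ann2": ["pos", "neg", "neg", "neu"],
    "ann3": ["pos", "pos", "neg", "neu"],
}
AGREEMENT_THRESHOLD = 2 / 3  # project-specific cutoff, chosen arbitrarily here

aggregated = []
for item_labels in zip(*annotations.values()):
    counts = Counter(item_labels)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(item_labels)
    aggregated.append({
        "soft_label": {k: v / len(item_labels) for k, v in counts.items()},  # the "average"
        "hard_label": label if agreement >= AGREEMENT_THRESHOLD else None,   # None = controversial
        "agreement": agreement,
    })

print(aggregated[1])  # 2-vs-1 split; meets the 2/3 threshold, so it keeps a hard label
```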

Guidance or references for the above, and for any other steps I can take to build a high-quality dataset, are much appreciated. I am mostly interested in efficiently removing individual annotator bias even when n is low.

Tags: annotation, dataset, nlp

Category: Data Science


You ideally want a copy of The Handbook of Linguistic Annotation, which covers the issues you're up against in detail.

The basic idea is:

  • Create annotation guidelines as a training tool to increase interannotator agreement as far as possible
  • Measure interannotator agreement among people using your guidelines to get an idea of the irreducible error
  • Generate as much annotated data as you can

If you have created clear annotation guidelines, you should be able to aggregate data from multiple annotators with no additional processing; you just report the kappa achieved under that set of guidelines as a caveat attached to the training data.
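When more than two annotators label the same items, a chance-corrected statistic such as Fleiss' kappa serves the same purpose as Cohen's kappa; a minimal sketch with statsmodels (ratings below are made up) could look like this:

```python
# Sketch: Fleiss' kappa for >2 annotators using statsmodels (illustrative data).
# Rows are items, columns are annotators, values are category codes
# (0 = neg, 1 = neu, 2 = pos).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [2, 2, 2],
    [0, 0, 2],
    [1, 1, 1],
    [2, 0, 0],
])

table, _ = aggregate_raters(ratings)  # item x category count table
print(f"Fleiss' kappa: {fleiss_kappa(table):.2f}")
```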


Most machine learning algorithms are designed with complete trust in the labels; there is no standard way to model uncertainty in data labels. Thus, define a project-specific uncertainty threshold for omitting data points or labelers. For example, a classification label might be trusted only if it wins an n-of-m vote among the annotators.
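A minimal sketch of such an n-of-m trust rule, with an arbitrarily chosen threshold and made-up labels:

```python
# Keep an item only if at least N_REQUIRED of the m annotators gave the same label.
from collections import Counter

N_REQUIRED = 2  # project-specific threshold

def trusted_label(labels):
    """Return the majority label if it has >= N_REQUIRED votes, else None (omit)."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= N_REQUIRED else None

print(trusted_label(["pos", "pos", "neg"]))  # -> "pos"
print(trusted_label(["pos", "neg", "neu"]))  # -> None, omitted as untrusted
```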

One major issue is re-labeling: systems tend to evolve over time, and label definitions are refined. Mature data labeling systems have a notion of data lineage - "who labeled what data, when, and with what criteria".
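A lineage record might capture exactly those fields; the schema below is a hypothetical illustration, not a standard:

```python
# Sketch of a per-label lineage record: who labeled what data, when, with what criteria.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LabelRecord:
    item_id: str            # what data
    label: str
    annotator_id: str       # who
    labeled_at: datetime    # when
    guideline_version: str  # with what criteria

record = LabelRecord("doc-0042", "pos", "ann1", datetime(2023, 5, 1), "guidelines-v2")
print(record)
```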

The book "Human-in-the-Loop Machine Learning" by Robert Munro goes into greater detail.
