How should labeled data from multiple annotators be prepared for ML text classification?
My specific question is how labeled NLP data from multiple human annotators should be aggregated, though general advice related to the question title is also appreciated. One critical step that I've seen in research is to assess inter-annotator agreement (IAA) with Cohen's kappa or some other suitable metric; I've also found research reporting agreement values for various datasets (e.g. here), which is helpful for baselining.
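For concreteness, here is a minimal sketch of the agreement check I have in mind, assuming every annotator labels the same items in the same order (the annotator names and labels below are made up). Cohen's kappa is defined for a pair of annotators, so for n > 2 one simple summary is the mean kappa over all pairs:

```python
# Minimal sketch: pairwise Cohen's kappa over n annotators (toy data).
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# One label per item per annotator; all annotators saw the same items.
annotations = {
    "ann_1": ["pos", "neg", "pos", "neg", "pos", "pos"],
    "ann_2": ["pos", "neg", "neg", "neg", "pos", "pos"],
    "ann_3": ["pos", "pos", "pos", "neg", "pos", "neg"],
}

# Cohen's kappa is pairwise, so summarize n annotators by the mean over pairs.
pair_scores = {
    (a, b): cohen_kappa_score(annotations[a], annotations[b])
    for a, b in combinations(annotations, 2)
}
for pair, kappa in pair_scores.items():
    print(pair, round(kappa, 3))
print("mean pairwise kappa:", round(sum(pair_scores.values()) / len(pair_scores), 3))
```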
How many annotators should work on each data point depends on time, personnel, and data-size requirements and constraints, among other factors (I may ask a follow-up question about how to find the optimal n). However, once n annotators have finished a dataset, how should their n sets of labels be unified into a "ground truth"? A few approaches that I have seen used, or that seem reasonable to me (a rough sketch combining some of them follows the list):
- Take averages over all annotators. Classification problems are sometimes hard to restate as graduated ones, although that seems necessary if an average is to be taken.
- Express some level of uncertainty in the data for controversial labels, or even omit those examples from training and evaluation.
- Add an arbitration step to unify or discard controversial labels. I am not sure this would be worth the annotators' time.
- Choose some "principal annotator(s)" (possibly determined by IAA scores) who get the final word in conflicts.
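To make the averaging / uncertainty / arbitration options concrete, here is the kind of aggregation I am picturing. This is only a rough sketch; the agreement threshold, the idea of keeping both a majority label and a "soft" label distribution, and the use of None to mark items for arbitration are my own assumptions rather than an established recipe:

```python
# Rough sketch: unify n annotators' labels per item into one record.
from collections import Counter

def aggregate_labels(labels_per_item, min_agreement=0.75):
    """labels_per_item: list of per-item label lists, one label per annotator."""
    aggregated = []
    for labels in labels_per_item:
        counts = Counter(labels)
        top_label, top_count = counts.most_common(1)[0]
        agreement = top_count / len(labels)
        # "Average over annotators" as a soft label: empirical class distribution.
        soft_label = {lab: c / len(labels) for lab, c in counts.items()}
        aggregated.append({
            # None marks a controversial item: drop it, down-weight it,
            # or route it to an arbitration / principal-annotator step.
            "label": top_label if agreement >= min_agreement else None,
            "agreement": agreement,
            "soft_label": soft_label,
        })
    return aggregated

items = [
    ["pos", "pos", "pos", "neg"],  # clear majority -> kept
    ["pos", "neg", "neg", "pos"],  # 50/50 split -> flagged as controversial
]
for record in aggregate_labels(items):
    print(record)
```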
Guidance/references for the above, and any other steps I can take to build a high-quality dataset, are much appreciated. I am mostly interested in efficiently removing individual annotator bias even when n is low.
Topic: annotation, dataset, nlp
Category: Data Science