Identify outliers for annotation in text data

I read the book Human-in-the-Loop Machine Learning by Robert (Munro) Monarch about Active Learning. I don't understand the following approach to get a diverse set of items for humans to label:

  1. Take each item in the unlabeled data and count the average number of word matches it has with items already in the training data
  2. Rank the items by their average match
  3. Sample the item with the lowest average number of matches
  4. Add that item to the ‘labeled’ data and repeat 1-3 until we have sampled enough for one iteration of human review

It's not clear how to calculate the average number of word matches.

Tags: annotation, active-learning, nlp, machine-learning

Category: Data Science


The idea is to find the documents that are not well represented in the current labeled data. The first step is indeed a bit vague and can be interpreted in different ways. My interpretation is something like this:

  • For every document $d_u$ in the unlabeled data, count the number of words it has in common with every document $d_l$ in the labeled data. This value is the "match score" between $d_u$ and $d_l$.
    • Note: this value should probably be normalized so that long documents don't dominate, for example with the overlap coefficient $|A \cap B| / \min(|A|, |B|)$. Other similarity measures could be used as well, for instance cosine similarity over TF-IDF vectors.
  • The step above yields, for a single document $d_u$, one "match score" per labeled document. The average of these scores across all labeled documents is the "average match" for $d_u$.
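Under that interpretation, the whole procedure (match score, average match, then greedy selection of the lowest-scoring items) can be sketched roughly as follows. This is only one possible reading of the book's description, using the overlap coefficient as the normalized match score; the function names are my own:

```python
from typing import List, Set


def tokens(doc: str) -> Set[str]:
    """Lowercased word set of a document."""
    return set(doc.lower().split())


def overlap(a: Set[str], b: Set[str]) -> float:
    """Overlap coefficient: |A & B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def average_match(d_u: str, labeled: List[str]) -> float:
    """Average match score of one unlabeled doc against all labeled docs."""
    t_u = tokens(d_u)
    return sum(overlap(t_u, tokens(d_l)) for d_l in labeled) / len(labeled)


def sample_outliers(unlabeled: List[str], labeled: List[str], k: int) -> List[str]:
    """Greedily pick the k unlabeled docs least similar to the labeled pool.

    After each pick, the chosen doc is treated as labeled (steps 1-3 repeated),
    so the sample stays diverse rather than picking k near-duplicates.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    picked = []
    for _ in range(min(k, len(pool))):
        scores = [average_match(d, labeled) for d in pool]
        idx = min(range(len(pool)), key=scores.__getitem__)
        doc = pool.pop(idx)
        picked.append(doc)
        labeled.append(doc)
    return picked
```

For example, with labeled docs `["the cat sat", "the dog ran"]`, the unlabeled doc `"quantum entanglement physics"` shares no words with either and gets an average match of 0, so it is sampled before `"the cat ran fast"`, which overlaps heavily with both.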
