Identify outliers for annotation in text data

I read the book Human-in-the-Loop Machine Learning by Robert (Munro) Monarch about Active Learning. I don't understand the following approach to get a diverse set of items for humans to label:

  1. Take each item in the unlabeled data and count the average number of word matches it has with items already in the training data
  2. Rank the items by their average match
  3. Sample the item with the lowest average number of matches
  4. Add that item to the ‘labeled’ data and repeat 1-3 until we have sampled enough for one iteration of human review

It's not clear how to calculate the average number of word matches.

Tags: annotation, active-learning, nlp, machine-learning

Category: Data Science


The idea is to find the documents that are not well represented in the current labeled data. The first step is indeed a bit vague and can be interpreted in different ways. My interpretation is something like this:

  • For every document $d_u$ in the unlabeled data, count the number of words it has in common with every document $d_l$ in the labeled data. This value is the "match score" between $d_u$ and $d_l$.
    • Note: this value should probably be normalized so that long documents don't dominate, for example with the overlap coefficient $|A \cap B| / \min(|A|, |B|)$. Other similarity measures could be used as well, for instance cosine similarity over TF-IDF vectors.
  • The step above yields, for a single document $d_u$, one "match score" per labeled document. The average of these scores across all labeled documents is the "average match" for $d_u$.
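Under that interpretation, the whole procedure (match score, average match, then greedy selection of the lowest-scoring items) can be sketched roughly as follows. This is only one possible reading of the book's description, using the overlap coefficient as the normalized match score; the function names are my own:

```python
from typing import List, Set


def tokens(doc: str) -> Set[str]:
    """Lowercased word set of a document."""
    return set(doc.lower().split())


def overlap(a: Set[str], b: Set[str]) -> float:
    """Overlap coefficient: |A & B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def average_match(d_u: str, labeled: List[str]) -> float:
    """Average match score of one unlabeled doc against all labeled docs."""
    t_u = tokens(d_u)
    return sum(overlap(t_u, tokens(d_l)) for d_l in labeled) / len(labeled)


def sample_outliers(unlabeled: List[str], labeled: List[str], k: int) -> List[str]:
    """Greedily pick the k unlabeled docs least similar to the labeled pool.

    After each pick, the chosen doc is treated as labeled (steps 1-3 repeated),
    so the sample stays diverse rather than picking k near-duplicates.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    picked = []
    for _ in range(min(k, len(pool))):
        scores = [average_match(d, labeled) for d in pool]
        idx = min(range(len(pool)), key=scores.__getitem__)
        doc = pool.pop(idx)
        picked.append(doc)
        labeled.append(doc)
    return picked
```

For example, with labeled docs `["the cat sat", "the dog ran"]`, the unlabeled doc `"quantum entanglement physics"` shares no words with either and gets an average match of 0, so it is sampled before `"the cat ran fast"`, which overlaps heavily with both.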
