"Object" Detection in Textual Data

I have a task where the input is a parsed document (i.e., the full text as a single string or as a list of tokens), and I need to classify parts of the text into, say, 5 classes (i.e., 5 tokens from the entire text are each labeled with one of 5 different classes).

Example:

Document #1: ... cat ... (the token cat belongs to class 0 which is animals)

Document #2: ... fish ... (the token fish belongs to class 0 which is animals)

It is important to note that at inference time, I have the entire document (in text), and so most of the tokens from it do not belong to any of the classes.

What would be a good approach to this task? I thought about treating it as a simple classification problem where I take the labeled tokens from each document and feed them into an RNN classifier, but that ignores the rest of the document, and at test time irrelevant tokens can receive higher probabilities than the labeled tokens.

I also had an idea inspired by YOLO: apply a 1D CNN object detector (with the respective number of classes) to the entire text. Is this reasonable?

Thanks.

Topic document-understanding object-detection nlp

Category Data Science


This looks quite similar to Named Entity Recognition (NER), which is traditionally done with a sequence labeling model such as Conditional Random Fields. Normally NER is used when:

  • The list of possible entities is not predefined: the training data might contain "Mr James Smith" but the test data could contain "Mr John Doe". In other words, the classes are open.
  • It is assumed that the context of the text can help the model predict an entity. For example, in a sentence like "Today X said that ...", the word "said" after X should help the model predict that X is either a person or an organization, and not a location.
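
To make the sequence-labeling framing concrete, here is a minimal sketch in plain Python. It assumes the annotations come as a mapping from token position to class name; the helper names (label_tokens, token_features), the "O" label for tokens outside every class, the "ANIMAL" class name, and the sample sentence are all made up for illustration. Every token in the document gets a label, so the model also sees the many tokens that belong to no class, and each token is described together with its neighbours so that context can inform the prediction.

```python
from typing import Dict, List

OUTSIDE = "O"  # assumed label for tokens that belong to none of the classes


def label_tokens(tokens: List[str], annotations: Dict[int, str]) -> List[str]:
    """Return one label per token; unannotated tokens get OUTSIDE."""
    return [annotations.get(i, OUTSIDE) for i in range(len(tokens))]


def token_features(tokens: List[str], i: int) -> Dict[str, str]:
    """Describe token i by itself and its immediate neighbours."""
    return {
        "word": tokens[i].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


# Hypothetical document with a single annotated token at position 2.
tokens = "Today the cat said nothing".split()
labels = label_tokens(tokens, {2: "ANIMAL"})
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Per-token feature dictionaries paired with per-token labels like these are the kind of input a sequence labeling model such as a CRF is typically trained on; which features actually help would depend on the data.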

I'm not sure I fully understand the question, but if you have handwritten text, where the word 'cat' can be written in many different ways, you could train an object detector like YOLO or Faster R-CNN to detect this word (e.g. on your own data or on an open-source dataset like ICDAR2015-FST), or even the separate characters within it. If, on the other hand, you want to identify unseen words and classify them into one of the classes, I don't think it's possible.
