"Object" Detection in Textual Data
I have a task where the input is a parsed document (i.e., the full text as a single string or as a list of tokens), and I need to classify parts of the text into, say, 5 classes (i.e., 5 tokens from the entire text are each labeled with one of 5 different classes).
Example:
Document #1: ... cat ... (the token cat belongs to class 0 which is animals)
Document #2: ... fish ... (the token fish belongs to class 0 which is animals)
It is important to note that at inference time I have the entire document as text, so most of its tokens do not belong to any of the classes.
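To make the setup concrete, here is a minimal sketch of how I picture one training example. The sentence, the `label_tokens` helper, and the `NONE` marker for unlabeled tokens are made up for illustration; only "class 0 = animals" comes from the example above.

```python
# Hypothetical data layout: one label per token, NONE for tokens without a class.
NONE = -1  # assumed marker for "belongs to no class"

CLASS_NAMES = {0: "animals"}  # only class 0 is named in the question


def label_tokens(tokens, labeled_positions):
    """Return one label per token: the class id at labeled positions, NONE elsewhere."""
    return [labeled_positions.get(i, NONE) for i in range(len(tokens))]


tokens = "the black cat sat on the mat".split()  # made-up document
labels = label_tokens(tokens, {2: 0})  # position 2 ("cat") is class 0

for token, label in zip(tokens, labels):
    print(token, CLASS_NAMES.get(label, "none"))
```

Almost every position carries the `NONE` marker, which is the imbalance I am worried about.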
What would be a good approach to this task? I thought about treating it as a simple classification problem, where I take the labeled tokens from each document and feed them into an RNN classifier. However, that ignores the rest of the document, and at test time irrelevant tokens can receive higher class probabilities than the tokens that should be labeled.
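A small sketch of the concern with the plain classifier, assuming it ends in a softmax over exactly the 5 classes. The logits are invented numbers, not the output of a real model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Invented logits for an irrelevant token (e.g. "the") from a 5-class classifier.
irrelevant_logits = [0.2, 2.5, -0.3, 0.1, 0.0]
probs = softmax(irrelevant_logits)

# The probabilities always sum to 1, so some class is always "predicted",
# even though the token belongs to none of them.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class, round(probs[predicted_class], 2))
```

With only 5 outputs there is no way for the model to say "none of the above", so a confident score on an irrelevant token cannot be told apart from a confident score on a labeled one.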
I also had an idea inspired by YOLO: apply a 1D CNN object detector (with the respective number of classes) to the entire text. Is this reasonable?
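Roughly, this is the decoding step I have in mind, assuming the network outputs a per-token objectness score and per-token class probabilities, similar in spirit to YOLO's per-cell outputs. The `decode` function, the scores, and the second class are all hypothetical, and I use 2 classes instead of 5 to keep it short.

```python
def decode(objectness, class_probs):
    """For each class, pick the token position maximizing objectness * class probability.

    Returns a dict: class id -> (position, score).
    """
    num_classes = len(class_probs[0])
    best = {}
    for c in range(num_classes):
        scores = [o * p[c] for o, p in zip(objectness, class_probs)]
        pos = max(range(len(scores)), key=scores.__getitem__)
        best[c] = (pos, scores[pos])
    return best


# Invented per-token outputs for a 4-token document and 2 classes.
tokens = ["the", "cat", "ate", "rice"]
objectness = [0.05, 0.90, 0.10, 0.80]
class_probs = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]

detections = decode(objectness, class_probs)
for class_id, (pos, score) in detections.items():
    print(class_id, tokens[pos], round(score, 2))
```

The objectness term is what would let irrelevant tokens be suppressed regardless of their class probabilities.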
Thanks.
Topic document-understanding object-detection nlp
Category Data Science