Text classification with multiple documents per labeled datapoint
I have a dataset with a label TRUE or FALSE for each person, but each person has multiple documents associated with them (emails and documents).
Right now I use a Random Forest Classifier on a bag of words built from all words in all documents combined per person (so that I have one row per person, containing all of their words and a label). It performs reasonably well, but I was wondering if you have suggestions for how I can make use of the fact that the documents are separate.
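To make the current setup concrete, here is a minimal sketch of what I mean, assuming scikit-learn. The toy data and the names `docs_by_person` and `labels` are made up for illustration; my real data is larger and the details differ.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: each person maps to a list of their documents.
docs_by_person = {
    "alice": ["meeting moved to friday", "invoice attached for review"],
    "bob": ["win a free prize now", "free prize claim today"],
    "carol": ["project review on friday", "please review the invoice"],
    "dave": ["claim your free prize", "prize winner free offer"],
}
# Hypothetical labels: one TRUE/FALSE label per person, not per document.
labels = {"alice": False, "bob": True, "carol": False, "dave": True}

people = sorted(docs_by_person)

# One row per person: all of that person's documents joined into one string,
# so the boundaries between documents are lost at this point.
X_text = [" ".join(docs_by_person[p]) for p in people]
y = [labels[p] for p in people]

# Bag of words followed by a random forest, fitted on the per-person rows.
model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_text, y)
predictions = model.predict(X_text)
```

The join step is where the per-document structure gets thrown away, which is the part I would like to improve on.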
When I try to find information about this I only encounter multi-label classification, which is the exact opposite problem: multiple labels per document, instead of multiple documents per label.
Topic multi-instance-learning classification nlp
Category Data Science