Text classification with multiple documents per labeled datapoint

I have a dataset with a label TRUE or FALSE for each person, but each person has multiple documents associated with them (emails and documents).

Right now I use a Random Forest classifier on a bag of words built from all of a person's documents concatenated together (so I have one row per person, containing all their words and one label). It performs reasonably well, but I would like suggestions on how to use the information in the separate documents.
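For reference, the setup described above could look roughly like this. This is only a minimal sketch assuming scikit-learn; the toy data and the names `docs_by_person` and `labels` are made up for illustration and are not from my real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: person id -> list of that person's documents.
docs_by_person = {
    "p1": ["invoice attached please pay", "payment overdue invoice"],
    "p2": ["lunch on friday", "see you at the game"],
    "p3": ["final invoice reminder", "pay the invoice today"],
    "p4": ["game tickets for friday", "lunch was great"],
}
labels = {"p1": True, "p2": False, "p3": True, "p4": False}

persons = sorted(docs_by_person)
# One string per person: all of that person's documents joined together.
combined = [" ".join(docs_by_person[p]) for p in persons]
y = [labels[p] for p in persons]

# Bag of words: one row per person, one column per vocabulary word.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(combined)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
train_pred = clf.predict(X)  # one prediction per person
```

Note that after the concatenation step the model can no longer tell which words came from which document, which is exactly the information I would like to use.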

When I try to find information about this I only encounter multi-label classification, which is the exact opposite problem: multiple labels per document, instead of multiple documents per label.

Topic multi-instance-learning classification nlp

Category Data Science


Why don't you make a person id and add this to your model?

If I understand you correctly, you do:

$$y = \beta X,$$

where each row of $X$ holds the combined docs of one person and $y$ is a vector of true/false labels, right?

You could try:

$$y = \beta X + \gamma z,$$

where each row in $X$ is only one doc now and $z$ is a vector of ids per person (so a factor).

Might be worth a try.
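A rough sketch of what I mean, assuming scikit-learn and SciPy. The toy data, the variable names, and the choice of logistic regression (swapped in as a classification counterpart of the linear formula above) are my own assumptions. I also assume each document simply inherits the label of its person, and I average the document-level probabilities to get one score per person, which is just one possible way to aggregate.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one row per document, tagged with its owner's id.
rows = [
    ("p1", "invoice attached please pay"),
    ("p1", "payment overdue invoice"),
    ("p2", "lunch on friday"),
    ("p2", "see you at the game"),
    ("p3", "final invoice reminder"),
    ("p4", "game tickets for friday"),
]
labels = {"p1": True, "p2": False, "p3": True, "p4": False}

person_ids = [pid for pid, _ in rows]
texts = [text for _, text in rows]
# Assumption: every document inherits the label of its person.
y = [labels[pid] for pid in person_ids]

# X: word counts, one row per document.
X = CountVectorizer().fit_transform(texts)
# z: person id encoded as a factor (one indicator column per person).
encoder = OneHotEncoder(handle_unknown="ignore")
Z = encoder.fit_transform([[pid] for pid in person_ids])
# Design matrix [X | z], matching y = beta X + gamma z.
design = hstack([X, Z]).tocsr()

model = LogisticRegression().fit(design, y)

# Aggregate document-level probabilities to one mean score per person.
proba = model.predict_proba(design)[:, 1]
by_person = {}
for pid, p in zip(person_ids, proba):
    by_person.setdefault(pid, []).append(p)
person_score = {pid: sum(ps) / len(ps) for pid, ps in by_person.items()}
```

One caveat: the id columns can only carry information for people who were in the training data. For a new person, `handle_unknown="ignore"` sets all id columns to zero, so the prediction relies on the word counts alone.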
