How to handle an imbalanced NLP text dataset, e.g. some classes have only 2 records

I am working on a dataset with around 2000 records.

Around 80% of the records have categorical labels.

There are around 200 categories. Some categories have more than 20 records, whereas others have only two.

Since this is a text dataset, I cannot oversample the minority categories with the kinds of techniques I could use for images.

I am using fastai, which is built on PyTorch.

So what can I do about it?

Topic fastai pytorch class-imbalance nlp

Category Data Science


Honestly, it's unlikely that the model can handle these very small classes, because a handful of instances cannot be a sufficiently representative sample. Text varies a lot (there are many ways to express the same thing), so large samples are needed to cover the diversity of language.

Even the larger classes might be difficult to learn; it depends on how much they differ from each other in the data.

My advice:

  1. Start by implementing the system with only the sufficiently frequent classes, for instance by removing the instances of classes which have fewer than 10 instances.
  2. Evaluate the system to make sure that it works well enough in this case.
  3. Only when/if everything works, try to deal with the harder case of small classes.
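
As a rough illustration of step 1, here is a minimal sketch in plain Python. The function name `filter_rare_classes`, the toy data, and the default threshold are assumptions made for the example; they are not part of fastai. The resulting lists could then be fed into whatever data loading you already use.

```python
from collections import Counter

MIN_COUNT = 10  # threshold suggested above; an assumption to tune for your data


def filter_rare_classes(texts, labels, min_count=MIN_COUNT):
    """Keep only the examples whose label occurs at least `min_count` times."""
    counts = Counter(labels)
    kept = [(t, y) for t, y in zip(texts, labels) if counts[y] >= min_count]
    kept_texts = [t for t, _ in kept]
    kept_labels = [y for _, y in kept]
    return kept_texts, kept_labels


# Toy data (hypothetical): class "a" has 12 records, class "b" has only 2.
texts = [f"doc {i}" for i in range(14)]
labels = ["a"] * 12 + ["b"] * 2

kept_texts, kept_labels = filter_rare_classes(texts, labels)
print(Counter(kept_labels))
```

The filtering is done before any train/validation split, so every class that remains has enough records to appear on both sides of the split.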

In some cases it can make sense to merge all the small classes into one single "other" class, but it depends on what the classes represent.
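
If merging does make sense for your classes, one possible way to do it is sketched below. The helper `merge_rare_classes` and the label name `"other"` are hypothetical choices for the example, and the sketch assumes that no real category is already called `"other"`.

```python
from collections import Counter


def merge_rare_classes(labels, min_count=10, other_label="other"):
    """Replace every label seen fewer than `min_count` times with `other_label`."""
    counts = Counter(labels)
    return [y if counts[y] >= min_count else other_label for y in labels]


# Toy data (hypothetical): "a" is frequent, "b" and "c" are rare.
labels = ["a"] * 12 + ["b"] * 2 + ["c"] * 3

merged = merge_rare_classes(labels)
print(Counter(merged))
```

Note that the merged class can end up larger than some of the real classes, so it is worth checking its size before training.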
