How to downsample or upsample text?

I have a data set of 5566 samples: one column is the text of the receipt description and the other is its tax class.

I wish to make a classifier that would classify receipts using ML only.

I have a huge imbalance in the data:

What is a good method for dealing with this kind of data?

How should I downsample or upsample? From what I understood, SMOTE will not work.


Downsampling is easy: one can always select a random subset of instances for specific classes. Upsampling, however, means either repeating the same instances or generating artificial data, and this rarely works well with text data (see for instance this explanation). Additionally, resampling is rarely a good solution to class imbalance.
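To make the downsampling part concrete, here is a minimal sketch in plain Python. It assumes the data is a list of (text, label) pairs; the function name `downsample`, the `max_per_class` cap and the toy data are made up for illustration and are not from the question.

```python
import random
from collections import defaultdict

def downsample(samples, max_per_class, seed=0):
    """Randomly keep at most max_per_class (text, label) pairs per label."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    kept = []
    for items in by_label.values():
        # Only classes larger than the cap are reduced; small ones are kept whole.
        if len(items) > max_per_class:
            items = rng.sample(items, max_per_class)
        kept.extend(items)

    rng.shuffle(kept)  # avoid leaving the result grouped by class
    return kept

# Toy data (hypothetical): 50 samples of class "A", 5 of class "B".
toy = [(f"receipt {i}", "A") for i in range(50)]
toy += [(f"receipt {i}", "B") for i in range(50, 55)]

balanced = downsample(toy, max_per_class=5)
```

Note that this throws away most of the majority class, so it only makes sense when the capped classes still have enough instances to learn from.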

Practically, it's unlikely that anything can be done about the classes with very few instances. I'd suggest (as in the linked answer) starting by classifying only the 10 most frequent classes and discarding everything else. If this works decently, you can progressively improve later.
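Restricting the data to the most frequent classes could look roughly like the sketch below. Again this assumes a list of (text, label) pairs; `keep_top_classes` and the toy labels are hypothetical names chosen for the example, and ties between equally frequent classes are broken in whatever order `Counter.most_common` returns them.

```python
from collections import Counter

def keep_top_classes(samples, k=10):
    """Keep only the (text, label) pairs whose label is among the k most frequent."""
    counts = Counter(label for _, label in samples)
    top_labels = {label for label, _ in counts.most_common(k)}
    return [(text, label) for text, label in samples if label in top_labels]

# Toy data (hypothetical labels): 6 "food", 3 "fuel", 1 "books".
toy_receipts = (
    [(f"food item {i}", "food") for i in range(6)]
    + [(f"fuel item {i}", "fuel") for i in range(3)]
    + [("a novel", "books")]
)

# With k=2 the single "books" sample is discarded.
filtered = keep_top_classes(toy_receipts, k=2)
```

The classifier is then trained and evaluated on the filtered data only, and rarer classes can be added back one at a time to see how far performance holds up.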
