How to down\up sample text?

Question

JamseGoldman

March 27, 2022, 11:02

I have a dataset of 5566 samples: one column is the text of the receipt description and the other is its tax class.

I wish to make a classifier that would classify receipts using ML only.

I have a huge imbalance in the data:

What is a good approach for dealing with this kind of data?

How can I downsample or upsample? From what I understand, SMOTE will not work.

Topic text-classification text

Category Data Science

Erwan · Accepted Answer · March 27, 2022, 11:02

Downsampling is easy: one can always select a random subset of instances for specific classes. However, upsampling means either repeating the same instances or generating artificial data, and this rarely works well with text data (see for instance this explanation). Additionally, resampling is rarely a good solution to class imbalance.
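To illustrate the random-subset idea, here is a minimal sketch in plain Python. It assumes the data is a list of `(text, label)` pairs; the function name `downsample`, the `max_per_class` cap and the toy rows are made up for the example and are not from the question.

```python
import random
from collections import defaultdict


def downsample(rows, max_per_class, seed=0):
    """Keep at most `max_per_class` randomly chosen rows for each label."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group the rows by their label.
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    sampled = []
    for label, items in by_label.items():
        # Only classes larger than the cap are reduced; small ones are kept whole.
        if len(items) > max_per_class:
            items = rng.sample(items, max_per_class)
        sampled.extend(items)

    rng.shuffle(sampled)  # avoid leaving the rows grouped by class
    return sampled


# Hypothetical toy data: 8 rows of class "A" and 2 rows of class "B".
rows = [(f"item {i}", "A") for i in range(8)] + [(f"item {i}", "B") for i in range(8, 10)]
balanced = downsample(rows, max_per_class=3)
```

Note that this only trims the large classes. It does nothing for classes that have very few instances to begin with.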

Practically, it's unlikely that anything can be done about the classes with very few instances. I'd suggest (as in the linked answer) starting by classifying only the 10 most frequent classes and discarding everything else. If this works decently, you can progressively improve later.
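Restricting the data to the most frequent classes could look roughly like the sketch below. It again assumes a list of `(text, label)` pairs; `keep_top_classes` and the toy rows are hypothetical names for illustration only.

```python
from collections import Counter


def keep_top_classes(rows, k=10):
    """Return only the rows whose label is among the k most frequent labels."""
    counts = Counter(label for _, label in rows)
    top_labels = {label for label, _ in counts.most_common(k)}
    return [(text, label) for text, label in rows if label in top_labels]


# Hypothetical toy data: class "X" has 5 rows, "Y" has 3 and "Z" has 1.
rows = [("a", "X")] * 5 + [("b", "Y")] * 3 + [("c", "Z")]
filtered = keep_top_classes(rows, k=2)  # drops the single "Z" row
```

With the real data one would call it with `k=10` and train the classifier on the result, keeping the discarded rows aside for later experiments.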
