How to define minority/majority classes in a multi-class classification task

I am studying classification in imbalanced datasets and I am learning under/over sampling strategies as a way to address the issue. While the literature agrees one needs to oversample 'minority' classes and downsample 'majority' classes, I have not been able to find a clear definition of how minority/majority is defined/measured.

While this is not much of an issue in a binary classification task, my problem is a multi-class one with over 200 classes: some have tens of thousands of examples, while others have under a hundred. How do you scientifically decide which are majority and which are minority?

I'd appreciate some help on this, especially if you have any references that I can read.

Thanks



First, it's important to understand that class imbalance is not really a problem, and consequently resampling is not a very good approach (see for instance here or here). Resampling is easy to do, so many tutorials for beginners insist on it, but it almost never succeeds at solving the problem because it doesn't address the real underlying issue: if a classifier assigns the majority label to many instances, it's because it doesn't know how to distinguish the classes.

So my main answer is: don't resample :)

If you really want to try resampling in the multiclass case, there's no need to define the classes as minority or majority. Resampling just creates an artificially balanced distribution, so it can be done in the same way with multiple classes. For example, you might decide that you want 200 instances of every class, so you oversample the classes which have fewer than 200 and undersample the ones which have more.
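To make the "200 instances of every class" idea concrete, here is a minimal sketch using only NumPy. The function name `resample_to_fixed_size`, its parameters and the toy data are made up for illustration; it uses plain random sampling (with replacement for small classes, without replacement for large ones), which is only one of several possible resampling schemes.

```python
import numpy as np

def resample_to_fixed_size(X, y, n_per_class=200, seed=0):
    """Return a copy of (X, y) with exactly n_per_class rows per label.

    Hypothetical helper: classes with fewer than n_per_class rows are
    oversampled (drawn with replacement), the others are undersampled
    (drawn without replacement).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    chosen = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        # Replacement is needed only when the class is too small.
        replace = len(idx) < n_per_class
        chosen.append(rng.choice(idx, size=n_per_class, replace=replace))
    chosen = np.concatenate(chosen)
    rng.shuffle(chosen)  # avoid returning the rows grouped by class
    return X[chosen], y[chosen]

# Toy data (assumed): three classes with 1000, 50 and 200 examples.
y_demo = np.repeat(["a", "b", "c"], [1000, 50, 200])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_bal, y_bal = resample_to_fixed_size(X_demo, y_demo, n_per_class=200)
labels, counts = np.unique(y_bal, return_counts=True)
print(dict(zip(labels, counts)))  # every class now has 200 rows
```

Note that nothing here needs a minority/majority label: the target size alone decides whether a class is sampled up or down.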

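Remember that resampling must only be applied to the training set: the test set should keep the original class distribution, so that the evaluation reflects the data the model will actually see.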
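The order of operations matters here, so a small sketch may help: split first, then resample the training indices only. The toy data, the 80/20 split, the target of 100 per class and the helper `balanced_indices` are all assumptions made for illustration, and a plain random split is used instead of a stratified one to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data (assumed): 3 classes with 500, 60 and 20 examples.
y = np.repeat([0, 1, 2], [500, 60, 20])
X = rng.normal(size=(len(y), 4))

# 1. Hold out the test set BEFORE any resampling.
perm = rng.permutation(len(y))
n_test = len(y) // 5
test_idx, train_idx = perm[:n_test], perm[n_test:]

def balanced_indices(labels, indices, n_per_class, rng):
    """Hypothetical helper: draw n_per_class indices per label from `indices`."""
    out = []
    for label in np.unique(labels[indices]):
        pool = indices[labels[indices] == label]
        # Sample with replacement only when the class is too small.
        out.append(rng.choice(pool, size=n_per_class,
                              replace=len(pool) < n_per_class))
    return np.concatenate(out)

# 2. Resample the training indices only.
train_bal_idx = balanced_indices(y, train_idx, 100, rng)
X_train, y_train = X[train_bal_idx], y[train_bal_idx]

# 3. The test set is left untouched, with its original class distribution.
X_test, y_test = X[test_idx], y[test_idx]
```

Because the oversampled rows are drawn from `train_idx` alone, no duplicated test example can leak into training, and the test set stays imbalanced like the real data.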
