Class imbalance: Will transforming a multi-label (aka multi-task) problem into a multi-class one help?

I noticed this question and this one, but my problem is more about class imbalance. I have, say, 1000 targets and some input samples (each with a feature vector). An input sample can have label '1' for many targets (currently treated as tasks), meaning they interact; label '0' means they don't. So for each target, it is a binary classification problem.

Imbalanced data

My current issue: for most targets, only about 1% of the samples (perhaps just 1 or 2 in absolute terms) are labelled 1. Since I have to do a train-val-test split and compute AUROC, only 3 targets are actually left that can support classification under a modest threshold (say, at least 5% positive labels across all samples).

Transform or not?

Someone has suggested modeling this as a multi-class problem instead of a multi-task one: transform the label vector of each sample into the set of its label-1 targets and treat that set as a single class. For example, if sample A originally has label 1 for targets 12, 232, and 988 (and 0 for all others), the new class for sample A would simply be the set {12, 232, 988}, mapped to one class ID.

But this might make the situation worse, because a target (task) then no longer shares labels across samples. For example, if sample B interacts with targets 12 and 232 only, then originally targets 12 and 232 each had two positively labelled data points (A and B); after the transformation, A and B fall into two entirely different classes (see the sketch below).
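To make the concern concrete, here is a toy sketch of the transformation (a "label powerset" encoding; the two-sample label matrix and the target indices are made up for illustration):

```python
import numpy as np

# Hypothetical toy label matrix: 2 samples x 1000 targets.
n_targets = 1000
Y = np.zeros((2, n_targets), dtype=int)
Y[0, [12, 232, 988]] = 1   # sample A
Y[1, [12, 232]] = 1        # sample B

# Label-powerset transform: each distinct set of positive targets
# becomes one multi-class label.
combos = [tuple(np.flatnonzero(row)) for row in Y]
class_id = {c: i for i, c in enumerate(dict.fromkeys(combos))}
y_multiclass = [class_id[c] for c in combos]

print(combos, y_multiclass)  # [(12, 232, 988), (12, 232)] [0, 1]
# A and B end up in different classes even though they share targets
# 12 and 232, so the positive signal for those targets is no longer pooled.
```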


Would appreciate any suggestions! Side note: I'm using simple classifiers such as an MLP or SVM, one binary classifier per target (sketched below). If there are specific methods designed for imbalanced data (which I've never heard of), that would also be wonderful.
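For reference, my current setup is essentially one independent binary classifier per target, roughly like this (toy random data standing in for my real features and labels):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))   # toy feature vectors

# Toy labels for ONE target with 5% positives (the borderline usable case).
y = np.zeros(500, dtype=int)
y[:25] = 1
rng.shuffle(y)

# Stratified split so the few positives appear in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = SVC(kernel='rbf').fit(X_tr, y_tr)
print(roc_auc_score(y_te, clf.decision_function(X_te)))
```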

Topic imbalanced-data imbalanced-learn multilabel-classification multiclass-classification class-imbalance

Category Data Science


It's the opposite: if you had such a multi-class problem, where each class represents a subset of labels, it might help to transform it into a multi-label problem in order to get more instances per label. In your case, the multi-class setting will very likely make the imbalance worse, leaving very few instances for most classes. The exception would be if there are only a few possible combinations of labels (see the check below).
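A quick way to check whether you are in that exceptional case is to count the distinct label combinations before committing to the transform (a sketch; the random `Y` stands in for your binary label matrix):

```python
import numpy as np

# Y: (n_samples, n_targets) binary label matrix; random placeholder here.
Y = (np.random.default_rng(1).random((500, 1000)) < 0.01).astype(int)

combos, counts = np.unique(Y, axis=0, return_counts=True)
print(f'{len(combos)} distinct combinations for {Y.shape[0]} samples')
print('size of the smallest class:', counts.min())
# If the number of combinations approaches the number of samples, most
# multi-class classes will be singletons: worse imbalance, not better.
```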

Essentially, training a model requires a sufficiently large, representative sample for every class/label. I'll be blunt: every target which has only 1 or 2 positive instances is probably not usable at all. An obvious problem with those cases is that you can't test the model reliably with (at best!) a single positive instance in the test set, and you certainly can't build a regular ROC curve. A practical first step is to filter such targets out before evaluating anything (sketch below).
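For example (a sketch; `Y` and `scores` are random placeholders for your label matrix and model outputs, and `MIN_POS` is a hypothetical cut-off):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
Y = (rng.random((1000, 50)) < 0.02).astype(int)  # placeholder labels
scores = rng.random(Y.shape)                     # placeholder model scores

MIN_POS = 10  # enough positives to split into train/val/test and still test
usable = [t for t in range(Y.shape[1]) if Y[:, t].sum() >= MIN_POS]
print(f'{len(usable)} of {Y.shape[1]} targets have >= {MIN_POS} positives')

# AUROC only makes sense for the usable targets.
aucs = {t: roc_auc_score(Y[:, t], scores[:, t]) for t in usable}
```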
