Imbalanced Dataset (Transformers): How to Decide on Class Weights?

I'm using SimpleTransformers to train and evaluate a model.

Since the dataset I am using is severely imbalanced, it is recommended that I assign weights to each label. An example of assigning weights for SimpleTransformers is given here.

My question, however, is: How exactly do I choose the appropriate weight for each class? Is there a specific methodology, e.g., a formula that uses the ratio of the labels?

Follow-up question: Are the weights for a given dataset universal? I.e., if I use a totally different model, can I use the same weights, or should I assign different weights depending on the model?

p.s.1. If it makes any difference, I'm using RoBERTa.

p.s.2. There is a similar question here; however, I believe that my question is not a duplicate because a) the linked question is about Keras whereas my question is about Transformers, and b) I'm also asking for general recommendations on how weight values are decided, which the linked question is not.

Topic bert transfer-learning imbalance class-imbalance

Category Data Science


Assuming that your training dataset contains a target with four (4) classes, you can assign weights as follows:

from simpletransformers.classification import ClassificationModel

# One weight per label, in label order (the list length must equal num_labels)
model = ClassificationModel("roberta", "roberta-base", num_labels=4,
                            weight=[1, 0.5, 1, 2])

Check this link.


The point of setting class weights is to manipulate the loss function so that it puts more focus on the minority label. Each data point passed to your learning algorithm contributes information to the loss function. By making the weight of a minority-class instance bigger, you tell your loss function to put more focus on that particular (features, label) pair. The most intuitive way class weights have this effect is by multiplying the loss attributed to an observation by the corresponding class weight.
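
To make that multiplication concrete, here is a minimal PyTorch sketch (SimpleTransformers runs on PyTorch, which exposes this directly through the weight argument of CrossEntropyLoss); the logits, labels, and weight values below are made up:

import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5], [0.3, 1.7]])  # fake predictions for 2 samples
labels = torch.tensor([0, 1])                    # their true classes

plain = nn.CrossEntropyLoss()
weighted = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 10.0]))  # class 1 weighted 10x

print(plain(logits, labels).item())
print(weighted(logits, labels).item())
# With the default reduction="mean", each sample's loss is multiplied by its
# class weight, and the batch loss is divided by the sum of those weights.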

So, imagine you have 2 classes in your training data: class A with 100 observations, while class B has 1000 observations. To make up for the imbalance, you set the weight of class A to 10 times (1000 / 100) the weight of class B, which would be [1.0, 0.1].

In general, for a multi-class problem, you would like to set the class weights so that, for each class $i$ with $n_i$ observations:

$n_i \times w_i = A$ for some constant $A$.

If you choose $A = 1$, then the class weight for class $i$ is $w_i = 1 / n_i$.
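
As a minimal sketch in plain Python, using the made-up counts from the two-class example above:

counts = {0: 100, 1: 1000}   # class A: 100 observations, class B: 1000

weights = [1.0 / counts[label] for label in sorted(counts)]  # A = 1 -> [0.01, 0.001]

# Rescaling so the largest weight is 1.0 recovers the [1.0, 0.1] from above;
# with a weighted-mean loss, scaling all weights by a constant cancels out,
# so only the ratios between the weights matter.
max_w = max(weights)
weights = [w / max_w for w in weights]                       # [1.0, 0.1]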

Below is quoted from the docs:

weight (optional): A list of length num_labels containing the weights to assign to each label for loss calculation.


Regarding which particular weights to set, it's as simple as trying a few settings and evaluating what works best on your evaluation metrics.
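
For instance, a hypothetical search over a few candidate weight settings could look like the sketch below, assuming train_df and eval_df are pandas DataFrames with "text" and "labels" columns (neither comes from the question) and macro F1 is the selection metric:

from simpletransformers.classification import ClassificationModel
from sklearn.metrics import f1_score

candidate_weights = [[1.0, 0.1], [1.0, 0.5], [1.0, 1.0]]  # made-up candidates
best_f1, best_weights = -1.0, None

for weights in candidate_weights:
    model = ClassificationModel("roberta", "roberta-base", num_labels=2,
                                weight=weights)
    model.train_model(train_df)
    # eval_model returns (result, model_outputs, wrong_predictions)
    _, outputs, _ = model.eval_model(eval_df)
    preds = outputs.argmax(axis=1)
    f1 = f1_score(eval_df["labels"], preds, average="macro")
    if f1 > best_f1:
        best_f1, best_weights = f1, weights

print(best_weights, best_f1)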


I am not sure about the model you are using, but I can explain what the procedure is for ML in general. You have three "vanilla" solutions for coping with an imbalanced supervised dataset:

  1. Reweighing the class labels so that there is effectively the same number of samples per label (calculated as the sum of the weights for a given label). For example, if the label with the maximum number of samples has $n_{max}$ samples, and some other class has $n_i$ samples, then you would assign it a weight $w_i=\frac{n_{max}}{n_i}$; a sketch of this is given after the list.
  2. Undersampling - a basic procedure that gets rid of the extra samples from the majority classes so that you end up with a balanced dataset.
  3. Oversampling - creating copies of samples from the minority classes (those with fewer than $n_{max}$ samples).
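
A minimal Python sketch of option 1, assuming a hypothetical imbalanced label list:

from collections import Counter

labels = [0] * 1000 + [1] * 100 + [2] * 10   # made-up imbalanced labels
counts = Counter(labels)
n_max = max(counts.values())

# w_i = n_max / n_i for each class i
weights = [n_max / counts[label] for label in sorted(counts)]
print(weights)   # [1.0, 10.0, 100.0]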

Hope that helps,

Max
