Unbalanced multiclass data with XGBoost

I have 3 classes with this distribution:

Class 0: 0.1169
Class 1: 0.7668
Class 2: 0.1163

And I am using xgboost for classification. I know that there is a parameter called scale_pos_weight.

But how is it handled in the multiclass case, and how can I set it properly?

Tags: xgboost, multiclass-classification, class-imbalance, classification



The parameter scale_pos_weight only works for binary classification (two classes).

For three or more classes, use the weight parameter that is passed to the xgb.DMatrix constructor. Balanced per-class weights can be computed like this:

weights = total_samples / (n_classes * class_samples * 1.0)
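For illustration, here is a minimal sketch of that computation in Python (the label counts are made up to roughly match the distribution in the question):

import numpy as np

# hypothetical label vector roughly matching the question's distribution
y = np.array([0] * 117 + [1] * 767 + [2] * 116)

total_samples = len(y)
classes, class_samples = np.unique(y, return_counts=True)

# total_samples / (n_classes * class_samples) for each class
class_weights = total_samples / (len(classes) * class_samples)

# xgb.DMatrix expects one weight per instance, so map each label to its class weight
weight_of = dict(zip(classes, class_weights))
instance_weights = np.array([weight_of[label] for label in y])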

For sklearn version < 0.19

Assign each entry of your training data its class weight: first compute the class weights with sklearn's class_weight.compute_class_weight, then assign each row of the training data its appropriate weight.

I assume here that the training data has a column class containing the class number, and that there are nb_classes classes, numbered 1 to nb_classes.

import numpy as np
from sklearn.utils import class_weight

# one weight per class, ordered by np.unique(train_df['class'])
classes_weights = list(class_weight.compute_class_weight('balanced',
                                             np.unique(train_df['class']),
                                             train_df['class']))

# one weight per training instance, looked up by class label (labels run 1..nb_classes)
weights = np.ones(y_train.shape[0], dtype='float')
for i, val in enumerate(y_train):
    weights[i] = classes_weights[val - 1]

xgb_classifier.fit(X_train, y_train, sample_weight=weights)

Update for sklearn version >= 0.19

There is a simpler solution:

from sklearn.utils import class_weight

# compute_sample_weight returns one 'balanced' weight per training sample directly
classes_weights = class_weight.compute_sample_weight(
    class_weight='balanced',
    y=train_df['class']
)

xgb_classifier.fit(X, y, sample_weight=classes_weights)

Everyone stumbles upon this question when dealing with an unbalanced multiclass classification problem using XGBoost in R. I did too!

I was looking for an example to better understand how to apply it, and invested almost an hour to find the link mentioned below. For all those who are looking for an example, here it is.

Thanks wacax


This answer by @KeremT is correct. I provide an example for those who still have problems with the exact implementation.

The weight parameter in XGBoost is per instance, not per class. Therefore, we need to assign each class's weight to all of its instances, which amounts to the same thing.

For example, if we have three imbalanced classes with ratios

class A = 10%
class B = 30%
class C = 60%

Their weights would be (dividing the smallest class's share by each class's share):

class A = 1.000
class B = 0.333
class C = 0.167

Then, if the training data is

index   class
0       A
1       A
2       B
3       C
4       B

we build the weight vector as follows:

index   class    weight
0       A        1.000
1       A        1.000
2       B        0.333
3       C        0.167
4       B        0.333
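Following the tables above, here is a minimal Python sketch of this weight construction (the class shares and labels come from the toy example, not from any real dataset):

import numpy as np

# class shares from the example above
ratios = {'A': 0.10, 'B': 0.30, 'C': 0.60}

# divide the smallest class's share by each class's share
smallest = min(ratios.values())
class_weight = {c: smallest / share for c, share in ratios.items()}
# {'A': 1.0, 'B': 0.333..., 'C': 0.166...}

# training labels from the table above
labels = ['A', 'A', 'B', 'C', 'B']
weights = np.array([class_weight[c] for c in labels])
# array([1.0, 1.0, 0.333..., 0.166..., 0.333...])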

scale_pos_weight is used for binary classification, as you stated. It handles the imbalance at the level of the two classes as a whole, rather than per instance. A good approach when assigning a value to scale_pos_weight is:

sum(negative instances) / sum(positive instances)
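For example, a small sketch with a made-up 0/1 label vector:

import numpy as np

y_binary = np.array([0] * 7 + [1] * 3)  # made-up binary labels

# sum(negative instances) / sum(positive instances)
scale_pos_weight = (y_binary == 0).sum() / (y_binary == 1).sum()
# 7 negatives / 3 positives ≈ 2.33, passed e.g. as
# xgboost.XGBClassifier(scale_pos_weight=scale_pos_weight)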

For your specific case, there is another option: weight the individual data points, so that the booster takes their weights into account during optimization and each point is represented appropriately. You just need to use:

xgboost.DMatrix(..., weight=<array of per-instance weights>)

You can define the weights however you like; this way you can even handle imbalances within classes as well as imbalances across different classes.
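A minimal, self-contained sketch of that usage (the features, labels, and weights below are placeholders; in practice the weight array would come from one of the schemes above):

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 5)             # placeholder features
y = np.random.randint(0, 3, size=100)  # placeholder labels for 3 classes
instance_weights = np.ones(100)        # replace with your per-instance weights

# the weight array must have one entry per training row
dtrain = xgb.DMatrix(X, label=y, weight=instance_weights)

params = {'objective': 'multi:softprob', 'num_class': 3}
booster = xgb.train(params, dtrain, num_boost_round=10)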
