Sentiment Analysis Label Distribution

Question

shrish123 kumar

June 4, 2022 03:02

I am working on a sentiment analysis model. The dataset I have has three labels: positive, negative, and neutral.

The problem is that the labels are not equally represented. Say out of 100K examples, 75K are neutral, 15K are positive, and 10K are negative.

Is it necessary to have an equal distribution of labels for training, or can I go ahead with the unequal data, and if so, to what extent? Are there any ways to deal with such a problem?

Topic sentiment-analysis

Category Data Science

Shrinidhi M · Accepted Answer · August 25, 2021 06:08

Try this: compute class weights for the labels in the training set, then pass these weights to the loss function so that errors on the rarer classes are penalized more heavily. In PyTorch it can be done as shown below:

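import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

# train_labels is assumed to be a 1-D array of integer class ids for the training set
# compute the class weights (recent scikit-learn versions require keyword arguments here)
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(train_labels), y=train_labels)

print("Class Weights:", class_weights)

# convert the array of class weights to a tensor
weights = torch.tensor(class_weights, dtype=torch.float)

# define the loss function; NLLLoss expects log-probabilities (e.g. LogSoftmax output) as input
cross_entropy = nn.NLLLoss(weight=weights)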
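
To get a feel for what the 'balanced' setting produces, here is a minimal pure-Python sketch of the same heuristic, n_samples / (n_classes * count), applied to the label counts from the question (scaled down by 1000). The helper name balanced_class_weights is made up for this illustration and is not part of scikit-learn.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {label: n_samples / (n_classes * count)} for every label seen."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Counts from the question (75K / 15K / 10K), scaled down by 1000 for brevity.
labels = ["neutral"] * 75 + ["positive"] * 15 + ["negative"] * 10
weights = balanced_class_weights(labels)

for label in sorted(weights):
    print(label, round(weights[label], 3))
```

The rarest class (negative) ends up with the largest weight and the majority class (neutral) with a weight below 1, so a mistake on a negative example costs several times more than a mistake on a neutral one.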

Özkan D. · Accepted Answer · February 26, 2020 13:32

Your dataset is heavily imbalanced. There is one majority class (neutral) and two minority classes (positive and negative). If you train a machine learning model on this classification problem as is, there is a high risk that its predictions will be biased towards the majority class.

Common ways to mitigate this problem are:

Oversampling the minority classes, for example by creating synthetic data points (with a technique such as SMOTE).
Down-sampling the majority class.
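
Both options above can be sketched with the standard library alone. Note that this is plain random resampling (duplicating or dropping existing examples), not SMOTE, which synthesizes new points in a numeric feature space. The function name resample_to_size, the toy texts, and the target of 15 examples per class are assumptions made for this illustration.

```python
import random
from collections import Counter, defaultdict

def resample_to_size(examples, target_per_class, seed=0):
    """Return a list with exactly target_per_class (text, label) pairs per label.

    Classes larger than the target are down-sampled without replacement;
    smaller classes are oversampled by drawing duplicates with replacement.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    resampled = []
    for items in by_label.values():
        if len(items) >= target_per_class:
            resampled.extend(rng.sample(items, target_per_class))
        else:
            extra = rng.choices(items, k=target_per_class - len(items))
            resampled.extend(items + extra)
    rng.shuffle(resampled)
    return resampled

# Toy data with the same 75 / 15 / 10 proportions as in the question.
all_labels = ["neutral"] * 75 + ["positive"] * 15 + ["negative"] * 10
examples = [(f"text {i}", label) for i, label in enumerate(all_labels)]

balanced = resample_to_size(examples, target_per_class=15)
print(Counter(label for _, label in balanced))
```

Down-sampling throws away majority-class data, while oversampling repeats minority examples and can encourage overfitting to them, so the target size is a trade-off worth tuning on a validation set.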

Evaluate the model with metrics that reflect per-class performance, such as AUC, recall, precision, and F1 score, rather than with accuracy alone.
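
To see why these metrics matter on imbalanced data, here is a small sketch that scores a hypothetical model which always predicts "neutral" on data with the question's 75 / 15 / 10 split. The helper per_class_scores is written for this illustration; in practice a library routine such as scikit-learn's classification_report would do the same job.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed one-vs-rest per label."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# A degenerate "model" that ignores the input and always answers "neutral".
y_true = ["neutral"] * 75 + ["positive"] * 15 + ["negative"] * 10
y_pred = ["neutral"] * len(y_true)

scores = per_class_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

print("accuracy:", accuracy)
print("macro F1:", round(macro_f1, 3))
for label, (precision, recall, f1) in scores.items():
    print(label, round(precision, 3), round(recall, 3), round(f1, 3))
```

Accuracy comes out at 0.75 even though the model never finds a single positive or negative example; the per-class recall of 0 and the much lower macro-averaged F1 expose that.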

vipin bansal · Accepted Answer · February 26, 2020 11:02

For training, data that is close to equally distributed across the labels will generally give you better results.

The kind of data that you have generally produces a model biased towards the "neutral" class.

Are there any ways to deal with such a problem?

I generally oversample the minority classes, for the training set only, so that training sees a roughly uniform number of examples per label. SMOTE and ADASYN are two oversampling techniques.
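
The "training only" part is worth spelling out: split first, then oversample just the training portion, so the held-out data keeps the real label distribution and contains no copies of training examples. The sketch below uses simple random duplication as a stand-in for SMOTE or ADASYN; the function names, the 20% test fraction, and the toy data are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def group_by_label(examples):
    """Group (text, label) pairs into {label: [pairs]}."""
    groups = defaultdict(list)
    for example in examples:
        groups[example[1]].append(example)
    return groups

def stratified_split(examples, test_fraction=0.2, seed=0):
    """Split each label separately so train and test keep the same proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for items in group_by_label(examples).values():
        items = items[:]
        rng.shuffle(items)
        cut = int(len(items) * test_fraction)
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test

def oversample_minorities(train, seed=0):
    """Duplicate minority-class examples until every class matches the largest one."""
    rng = random.Random(seed)
    groups = group_by_label(train)
    target = max(len(items) for items in groups.values())
    balanced = []
    for items in groups.values():
        balanced.extend(items + rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced

all_labels = ["neutral"] * 75 + ["positive"] * 15 + ["negative"] * 10
examples = [(f"text {i}", label) for i, label in enumerate(all_labels)]

train, test = stratified_split(examples)          # split first
train_balanced = oversample_minorities(train)     # then resample the training part only

print("train:", Counter(label for _, label in train_balanced))
print("test: ", Counter(label for _, label in test))
```

Oversampling before the split would let duplicates of the same example land on both sides, which inflates the evaluation scores.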
