Sentiment Analysis Label Distribution

I am working on a sentiment analysis model. My dataset has three labels: positive, negative and neutral.

The problem is that the labels are not equally represented. Out of 100K examples, say, 75K are neutral, 15K positive and 10K negative.

I wanted to know whether it is necessary to have an equal distribution of labels for training, or whether I can go ahead with the unequal data and, if so, to what extent. Are there any ways to deal with this problem?



Try this: compute class weights for the labels in the training set and pass them to the loss function, so that errors on the rarer classes are penalized more heavily. In PyTorch it can be done as shown below:

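import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

# compute the class weights (recent scikit-learn versions require keyword arguments)
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(train_labels), y=train_labels)

print("Class Weights:", class_weights)

# convert the array of class weights to a tensor (ordered like np.unique(train_labels))
weights = torch.tensor(class_weights, dtype=torch.float)

# define the loss function; NLLLoss expects log-probabilities (e.g. LogSoftmax output)
cross_entropy = nn.NLLLoss(weight=weights)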
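
To see what the 'balanced' setting would produce for the distribution in the question, here is a minimal pure-Python sketch of the same heuristic, n_samples / (n_classes * class_count). The helper name balanced_class_weights is hypothetical, and the label list simply mirrors the 75K/15K/10K split mentioned in the question:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Label counts taken from the question: 75K neutral, 15K positive, 10K negative.
labels = ["neutral"] * 75_000 + ["positive"] * 15_000 + ["negative"] * 10_000
weights = balanced_class_weights(labels)

for cls, w in sorted(weights.items()):
    print(f"{cls}: {w:.3f}")
# negative: 3.333
# neutral: 0.444
# positive: 2.222
```

The rarest class gets the largest weight, so a mistake on a negative example costs roughly 7.5 times as much as a mistake on a neutral one.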

Your dataset is heavily imbalanced. There is one majority class (neutral) and two minority classes (positive and negative). If you train a classifier on it as is, there is a high risk that its predictions will be biased towards the majority class.

Ways to mitigate this problem include:

  • Oversampling the minority classes, for example by creating synthetic data points with a technique such as SMOTE.
  • Down-sampling the majority class.
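
As an illustration of the second option, here is a minimal sketch of random down-sampling in pure Python. The function name undersample, the max_per_class parameter and the toy data are assumptions made for this example; the toy labels only reproduce the 75/15/10 proportions from the question:

```python
import random
from collections import Counter, defaultdict

def undersample(samples, labels, max_per_class, seed=0):
    """Randomly keep at most max_per_class examples of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    kept = []
    for label, items in by_class.items():
        if len(items) > max_per_class:
            # Draw without replacement so no example is duplicated.
            items = rng.sample(items, max_per_class)
        kept.extend((item, label) for item in items)

    rng.shuffle(kept)  # avoid long runs of a single class
    return [s for s, _ in kept], [l for _, l in kept]

# Toy data with the same proportions as the question (hypothetical texts).
labels = ["neutral"] * 750 + ["positive"] * 150 + ["negative"] * 100
samples = [f"text {i}" for i in range(len(labels))]

new_samples, new_labels = undersample(samples, labels, max_per_class=150)
print(Counter(new_labels))  # neutral capped at 150, positive 150, negative 100
```

Down-sampling throws away majority-class examples, so it tends to be most attractive when the dataset is large enough that the discarded data is not missed.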

Evaluate the model with metrics that reflect per-class performance, such as AUC, recall, precision and F1 score, rather than accuracy alone.
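
To illustrate why these metrics matter on imbalanced data, the following sketch scores a hypothetical model that always predicts "neutral" on a 75/15/10 test set. The helper per_class_scores is written for this example and is not a library function:

```python
def per_class_scores(y_true, y_pred):
    """Return {class: (precision, recall, f1)} computed one-vs-rest."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores

# Test set with the proportions from the question; the "model" always says neutral.
y_true = ["neutral"] * 75 + ["positive"] * 15 + ["negative"] * 10
y_pred = ["neutral"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_scores(y_true, y_pred)
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.75
print(f"macro F1 = {macro_f1:.2f}")  # macro F1 = 0.29
```

Accuracy looks respectable at 0.75, while the macro-averaged F1 exposes that two of the three classes are never predicted.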


For training, data that is close to equally distributed across the labels will generally give you better results.

The kind of data you have generally produces a model biased towards the "neutral" class.

Are there any ways to deal with such problem?

I generally oversample the minority classes so that, for training only, the classes have roughly uniform counts. SMOTE and ADASYN are two such oversampling techniques.
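
The "training only" part can be sketched as follows: split first, then oversample just the training portion, so the test set keeps the original distribution and contains no duplicated training examples. This uses plain random duplication rather than SMOTE or ADASYN, and the helper names stratified_split and oversample, as well as the toy labels, are assumptions made for this example:

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), preserving class proportions in both."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = int(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

def oversample(indices, labels, rng):
    """Duplicate minority-class indices at random until all classes match the largest."""
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    target = max(len(idxs) for idxs in by_class.values())
    out = []
    for idxs in by_class.values():
        out.extend(idxs)
        out.extend(rng.choices(idxs, k=target - len(idxs)))  # sample with replacement
    rng.shuffle(out)
    return out

rng = random.Random(0)
labels = ["neutral"] * 750 + ["positive"] * 150 + ["negative"] * 100

train_idx, test_idx = stratified_split(labels, test_fraction=0.2, rng=rng)
train_idx = oversample(train_idx, labels, rng)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)  # 600 examples of every class
print(test_counts)   # original proportions: 150 neutral, 30 positive, 20 negative
```

Oversampling before the split would leak copies of the same example into both sets and inflate the test scores, which is why the order matters.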
