Creating a dataset for classification: how balanced should a good dataset be?

I am creating a dataset with 4 classes and 50K rows. I am already getting 86% accuracy, 0.85 precision, 0.86 recall and 0.71 F1-score with an SVM on an 80/20 split.

I have to publish this dataset in a research paper, but I am concerned about the class percentage distribution. For example, Class 1 has more data than Class 4. (Dataset annotation is already done.)

The dataset is scraped from Twitter. Technically, I cannot force users to post specifically about Class 4, but on the other hand, I think the skewed distribution may affect the results and reviewers may mention it.

So, in this case, what should I do?

Delete some rows and make the data equally distributed (25% each),

or

leave it as it is?

What should a data scientist do? (Consider that I am new to this field, so I need an expert opinion with reasoning.)

I also have another dataset for binary classification which has the same problem.

Topic: binary-classification twitter classification dataset

Category: Data Science


Broad questions like this always have the same answer: it depends. Since I don't know the exact ratio of the four classes, I'll mention a few important points that can help you decide how to move forward.

Different people set different thresholds for when a dataset counts as imbalanced, so I'm refraining from giving you an exact percentage. The general idea is that a dataset is called imbalanced when the model can't generalize 'good enough' on the minority classes.

'Good enough' can imply various things. For starters, if the accuracy on minority classes in the validation/test set is considerably lower than on the other classes, your model is having trouble learning their pattern. You can visualize this with a confusion matrix.
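As a minimal sketch of that check, here is a confusion matrix on a synthetic skewed 4-class dataset (the data, class weights, and linear SVM are all illustrative stand-ins, not your actual setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

# Synthetic stand-in: 4 classes with a skewed distribution.
X, y = make_classification(n_samples=5000, n_classes=4, n_informative=8,
                           weights=[0.45, 0.30, 0.15, 0.10], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LinearSVC().fit(X_train, y_train)
cm = confusion_matrix(y_test, clf.predict(X_test))

# Diagonal / row sum = per-class recall; a noticeably low value on a
# minority-class row is the warning sign described above.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)
```

If the last entries (the minority classes) sit well below the first, the overall accuracy is hiding a real problem.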

Alternatively, there are other metrics you can employ to measure the performance of your model. These metrics are far less misleading under imbalance than plain accuracy. Examples are per-class precision and recall, the macro-averaged F1 score, etc. Here's a short blog that introduces more metrics.

You can try to oversample the minority class, or even synthesize new minority examples using techniques such as SMOTE, among various others. Undersampling is usually not a great option unless you're running short on space/time, but it could be done as well. Imbalanced-learn is your friend here.

Finally, if you're not sure which option is best, you can pragmatically choose the one that performs best on your validation set.
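That comparison can be as simple as scoring candidate setups on one held-out split. The sketch below compares a plain linear SVM against one using `class_weight="balanced"` (a cheap reweighting option I'm adding for illustration; it isn't mentioned above), on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=4000, n_classes=4, n_informative=8,
                           weights=[0.45, 0.30, 0.15, 0.10], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Fit each candidate on the same training split and compare macro F1
# on the same validation split; keep whichever scores higher.
scores = {}
for name, clf in [("plain", LinearSVC()),
                  ("balanced", LinearSVC(class_weight="balanced"))]:
    clf.fit(X_tr, y_tr)
    scores[name] = f1_score(y_val, clf.predict(X_val), average="macro")
print(scores)
```

The same loop extends naturally to an oversampled or SMOTE-resampled variant as a third candidate.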

People may have more ideas on how to tackle imbalanced datasets, but more often than not, people misconstrue the 'degree of imbalance' of their datasets.
