Creating a dataset for classification: how balanced should a good dataset be?
I am creating a dataset with 4 classes and 50K rows. Using an SVM with an 80/20 train/test split, I am already getting 86% accuracy, 0.85 precision, 0.86 recall, and a 0.71 F1-score.
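With skewed classes, overall accuracy can look good while a rare class is mostly misclassified, so per-class and macro-averaged scores are worth checking alongside it. The following is a minimal sketch in plain Python on made-up labels (the classes "A" and "B", the counts, and the helper names `per_class_metrics` and `macro_f1` are all hypothetical, not taken from the dataset above). It is not a claim about how the scores quoted above were averaged.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)}, computed one-vs-rest."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    metrics = {}
    for c in classes:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = (precision, recall, f1)
    return metrics


def macro_f1(metrics):
    """Unweighted mean of per-class F1, so every class counts equally."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)


# Hypothetical toy labels: 90 rows of majority class "A", 10 of minority "B".
y_true = ["A"] * 90 + ["B"] * 10
# The classifier gets every "A" right but finds only 2 of the 10 "B" rows.
y_pred = ["A"] * 90 + ["A"] * 8 + ["B"] * 2

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)

print("accuracy:", accuracy)
for cls, (p, r, f1) in metrics.items():
    print(cls, "precision=%.2f recall=%.2f f1=%.2f" % (p, r, f1))
print("macro F1: %.2f" % macro_f1(metrics))
```

In this toy case the accuracy is high even though recall on the minority class is low, which is the kind of gap a reviewer is likely to ask about.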
I have to publish this dataset in a research paper, but I am concerned about the class distribution. For example, Class 1 has more data than Class 4. (The dataset annotation is already done.)
The dataset is scraped from Twitter. I cannot force users to post specifically about Class 4, but on the other hand I think the skewed distribution may affect the results, and reviewers may point it out.
So, in this case, what should I do?
Delete some rows so that the data is equally distributed, e.g. 25% per class?
or
Leave the dataset as it is?
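To make the first option concrete, here is a minimal sketch of random undersampling down to the size of the rarest class, using only the Python standard library. The toy labels, the 50/25/15/10 split, and the helper names `class_distribution` and `undersample` are assumptions for illustration; the real class counts are not given above.

```python
import random
from collections import Counter


def class_distribution(labels):
    """Return each class's share of the dataset as a fraction."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def undersample(rows, labels, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: 4 classes with a skewed distribution.
labels = [1] * 50 + [2] * 25 + [3] * 15 + [4] * 10
rows = ["tweet_%d" % i for i in range(len(labels))]

print("before:", class_distribution(labels))
balanced_rows, balanced_labels = undersample(rows, labels)
print("after: ", class_distribution(balanced_labels))
print("rows kept:", len(balanced_rows), "of", len(rows))
```

The sketch also shows the cost of this option: balancing to the rarest class discards every row above that class's count, whereas the second option keeps all annotated rows and leaves the natural distribution visible.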
What should a data scientist do? (I am new to this field, so I would appreciate an expert opinion with the reasoning behind it.)
I also have another dataset, for binary classification, that has the same problem.
Topic binary-classification twitter classification dataset
Category Data Science