Improving text classification and labeling on an imbalanced dataset
I am trying to classify text titles (NLP) into categories. Let us say I have 6K titles that should fall into four categories.
My questions:
I do not understand why some ML techniques convert the categories into numerical values (see "Transforming the prediction target"). Will this affect the model's accuracy compared with using the nominal values?
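To make the first question concrete, here is a minimal sketch of what I understand by converting categories into numbers. The category names are made up for illustration, and the mapping is done by hand rather than with any particular library. Each name gets an arbitrary integer id, and the ids can be mapped back to the names without loss.

```python
# Hypothetical category names, used only for illustration.
labels = ["automotive", "finance", "automotive", "sports", "health"]

# Assign each distinct category an integer id (alphabetical order here).
classes = sorted(set(labels))
to_id = {name: idx for idx, name in enumerate(classes)}

# Encode the nominal labels as integers for the classifier.
encoded = [to_id[name] for name in labels]

# Decode integer predictions back to the original names.
decoded = [classes[idx] for idx in encoded]

print(to_id)
print(encoded)
print(decoded == labels)
```

The ids are only identifiers for the classes; the round trip back to the names is exact.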
My data is severely imbalanced towards some categories, e.g. CAT A has 4K titles and CAT B has 500 titles. So oversampling or undersampling could hurt the accuracy, because under the original distribution a prediction has a higher chance of being correct when it falls in the biggest category. Am I correct?
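A rough sketch of the numbers behind this question, in plain Python. The counts for CAT A and CAT B are the ones given above; the counts for CAT C and CAT D are an assumption (the remaining 1,500 titles split evenly). It shows the accuracy of always predicting the biggest category, and what naive random oversampling does to the class counts.

```python
import random
from collections import Counter

# A and B come from the question; C and D are assumed for illustration.
counts = {"A": 4000, "B": 500, "C": 750, "D": 750}
labels = [cat for cat, n in counts.items() for _ in range(n)]

# Accuracy of a "model" that always predicts the biggest category.
majority = max(counts, key=counts.get)
baseline_accuracy = counts[majority] / len(labels)

# Naive random oversampling: duplicate random examples of each smaller
# category until every category matches the majority count.
rng = random.Random(0)
target = counts[majority]
oversampled = list(labels)
for cat, n in counts.items():
    pool = [i for i, y in enumerate(labels) if y == cat]
    extra = rng.choices(pool, k=target - n)
    oversampled.extend(labels[i] for i in extra)

oversampled_counts = Counter(oversampled)
print(round(baseline_accuracy, 3))
print(dict(oversampled_counts))
```

If resampling is used, it would normally be applied to the training split only, so that the test set keeps the original distribution.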
Finally, titles could contain brand names such as corporations, products, etc. Should these be cleaned and replaced before training the model? My concern is that the model could guess that a text falls into the automotive category simply because a brand name like Toyota appears in the title.
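For reference, this is roughly what I mean by cleaning and replacing brand names. It is only a sketch: the brand list, the `<BRAND>` placeholder, and the `mask_brands` helper are all hypothetical, and a real list would have to come from somewhere else.

```python
import re

# Hypothetical brand list; not taken from my actual data.
BRANDS = ["Toyota", "Apple"]

# Match any listed brand as a whole word, ignoring case.
BRAND_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(b) for b in BRANDS) + r")\b",
    re.IGNORECASE,
)

def mask_brands(title: str) -> str:
    """Replace every listed brand name with a generic placeholder token."""
    return BRAND_PATTERN.sub("<BRAND>", title)

print(mask_brands("New Toyota Corolla review"))
```

With this masking the classifier would only see that some brand is present, not which one.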
Topic imbalanced-data data-cleaning machine-learning
Category Data Science