Improving text classification & labeling in an imbalanced dataset

I am trying to classify text titles (NLP) into categories. Let us say I have 6K titles that should fall into four categories.

My questions:

  1. I do not understand why some ML techniques convert categories into numerical values (see "Transforming the prediction target"). Will this affect model accuracy compared with using the nominal values?

  2. My data is severely imbalanced towards some categories, e.g. CAT A has 4K titles and CAT B has 500 titles. So oversampling or undersampling could affect accuracy, since a prediction is more likely to be correct when it falls into the biggest category of the original distribution. Am I correct?

  3. Finally, titles can contain brand names (corporations, products, etc.). Should these be cleaned and replaced before training the model, given that the model could guess that a text falls into the automotive category just because a brand name like Toyota is in the title?



  1. Why are categories converted to numeric values? Simply because most machine learning models operate on numbers and do not accept categorical values directly. For this reason, the categories are encoded as numbers before training. For a classification target the numbers act only as class identifiers, so the encoding by itself should not change accuracy.
  2. Yes. For this reason there are techniques (like SMOTE) to rebalance the data. You can also use other metrics, such as the F1 score, which is better suited to imbalanced data than plain accuracy.
  3. It is ideal to clean and replace brand names before training the model (as in your example, where Toyota points to the automotive category).
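
As a rough illustration of the three points above, here is a minimal sketch in plain Python. It is not the code of any particular library: `encode_labels`, `macro_f1` and `mask_brands` are hypothetical helper names, the brand list is made up, and the `<BRAND>` placeholder is an arbitrary choice. The class sizes (4K vs 500) come from the question.

```python
import re

def encode_labels(labels):
    """Map each distinct category name to an integer id (ids follow sorted order)."""
    classes = sorted(set(labels))
    to_id = {name: i for i, name in enumerate(classes)}
    return [to_id[name] for name in labels], classes

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so a small class counts as much as a large one."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical brand list, for illustration only.
BRANDS = ["toyota", "honda"]
BRAND_RE = re.compile(r"\b(" + "|".join(map(re.escape, BRANDS)) + r")\b", re.IGNORECASE)

def mask_brands(title):
    """Replace known brand names with a generic placeholder token."""
    return BRAND_RE.sub("<BRAND>", title)

# Class sizes taken from the question: 4K titles in CAT A, 500 in CAT B.
labels = ["CAT A"] * 4000 + ["CAT B"] * 500
y_true, classes = encode_labels(labels)
y_pred = [0] * len(y_true)  # a degenerate model that always predicts CAT A

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(classes)                                 # ['CAT A', 'CAT B']
print(round(accuracy, 3))                      # 0.889
print(round(macro_f1(y_true, y_pred), 3))      # 0.471
print(mask_brands("New Toyota Corolla review"))  # New <BRAND> Corolla review
```

The always-CAT-A model looks good on accuracy but poor on macro F1, which is why F1 is the more informative metric here. Whether masking brands helps or hurts is worth checking on a validation set, since brand names can also be legitimate signal.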

A few techniques to remember when dealing with imbalanced text data:

  1. Remove duplicate data: drop duplicates of texts that have the same semantic meaning (e.g. "where is my product" and "where is the product" are one and the same).

  2. Merge minority classes.

  3. Resample the dataset:

    • undersample the majority class
    • oversample the minority class (e.g. with SMOTE)
  4. Data augmentation (using spaCy, spacy-wordnet, word embeddings).
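
Steps 1 and 3 of the list above could be sketched roughly as follows, again in plain Python. Two assumptions to flag: near-duplicates are detected with a crude normalization (lowercasing plus a made-up stop-word list), which only approximates "same semantic meaning"; and the resampling shown is simple random oversampling, not SMOTE, because SMOTE interpolates between numeric feature vectors rather than raw text. All function names and sample rows are hypothetical.

```python
import random
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "my", "your", "is"}

def dedup_key(text):
    """Normalize a title to its lowercase content words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return tuple(t for t in tokens if t not in STOP_WORDS)

def drop_near_duplicates(rows):
    """Keep the first (text, label) row for each normalized key and label."""
    seen, kept = set(), []
    for text, label in rows:
        key = (dedup_key(text), label)
        if key not in seen:
            seen.add(key)
            kept.append((text, label))
    return kept

def random_oversample(rows, seed=0):
    """Repeat randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Made-up sample rows of (title, category).
rows = [
    ("Where is my product?", "support"),
    ("where is the product", "support"),
    ("Track my order", "support"),
    ("Refund status", "support"),
    ("New sedan review", "automotive"),
]

unique = drop_near_duplicates(rows)
balanced = random_oversample(unique)
print(len(rows), len(unique))
print(Counter(label for _, label in balanced))
```

If resampling is used, it is usually applied to the training split only, so that the test set keeps the original class distribution.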
