Is my classification model overfitting?

Is it possible that this is just a bad draw on the 20%, or is it overfitting? I'd appreciate some tips on what's going on, thanks.

Topic oversampling

Category Data Science


A few comments:

  • You don't mention the number of classes or their distribution. Unless the classes are balanced, you should use precision, recall and F1-score instead of accuracy: if the majority class makes up 75% of the data, a model can reach 75% accuracy just by always predicting that class.
  • It's also unclear what your validation set is used for.
  • When your text feature is represented as a bag of words, it's not one feature anymore: there are as many features as there are words in the vocabulary. This matters because a very large vocabulary makes overfitting very likely. It is most likely also why performance improves when you remove some words.
  • Generally you should remove the rare words: a word that appears in only a handful of documents gives the model too little evidence to learn from, and such words often cause overfitting.
  • A drop from 78% on the validation set to 75% on the test set is not necessarily worrying, but that depends on other factors, such as the size of the two sets.
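
Three of the points above can be checked with a few lines of code. What follows is a minimal sketch in plain Python, not the asker's actual pipeline: the function names, the 75/25 label split, the set sizes of 500 and the `min_doc_freq` threshold are all assumptions made for illustration. It shows (1) what an "always predict the majority class" baseline scores on accuracy versus macro-F1, (2) how large the 78% vs 75% gap is relative to its sampling noise, assuming the two sets are independent samples, and (3) a simple filter that drops words seen in too few documents.

```python
import math
from collections import Counter


def majority_baseline_report(labels):
    """Accuracy and macro-F1 of a model that always predicts the most common label."""
    counts = Counter(labels)
    majority, majority_count = counts.most_common(1)[0]
    accuracy = majority_count / len(labels)
    f1_scores = []
    for cls in counts:
        if cls == majority:
            # Every prediction is `majority`, so precision equals its share
            # of the data and recall is 1.
            precision = majority_count / len(labels)
            recall = 1.0
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            # Never predicted, so recall (and therefore F1) is 0.
            f1_scores.append(0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


def accuracy_gap_in_std_errors(acc_a, n_a, acc_b, n_b):
    """Gap between two accuracies, in units of its binomial standard error.

    Assumes the two evaluation sets are independent samples.
    """
    se = math.sqrt(acc_a * (1 - acc_a) / n_a + acc_b * (1 - acc_b) / n_b)
    return (acc_a - acc_b) / se


def drop_rare_words(tokenized_docs, min_doc_freq=2):
    """Keep only words that occur in at least `min_doc_freq` documents."""
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return [[w for w in doc if doc_freq[w] >= min_doc_freq] for doc in tokenized_docs]


if __name__ == "__main__":
    # Hypothetical 75/25 class split: high accuracy, poor macro-F1.
    print(majority_baseline_report(["a"] * 75 + ["b"] * 25))
    # Hypothetical set sizes of 500 each; the real sizes are not given.
    print(accuracy_gap_in_std_errors(0.78, 500, 0.75, 500))
    print(drop_rare_words([["good", "movie"], ["good", "plot"]]))
```

As a rough rule of thumb, a gap smaller than about two standard errors is hard to distinguish from a lucky or unlucky split. With scikit-learn, the same ideas are usually expressed through `classification_report` and the `min_df` parameter of `CountVectorizer` rather than hand-written helpers.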
