Is my classification model overfitting?

Question

May 16, 2022, 20:38

Could this just be a bad draw on the 20%, or is it overfitting? I'd appreciate some tips on what's going on, thanks.

Erwan · Accepted Answer · May 15, 2022, 17:52

A few comments:

You don't mention the number of classes or their distribution. Unless the classes are balanced, you should use precision/recall/F1-score instead of accuracy (if your majority class is 75% of the data, a model can reach 75% accuracy just by always predicting that class).
It's also unclear what your validation set is used for.
When your feature is represented as a bag of words, it's not one feature anymore, it's as many features as the vocabulary size. This is important because a very large vocabulary makes overfitting very likely. By the way, this is certainly why performance improves when you remove some words.
Generally you should remove all the rare words, which are useless for the model and often cause overfitting.
A drop from 78% on the validation set to 75% on the test set is not necessarily worrying, but that depends on other factors.