Should bag of words in training set include test set data when doing text classification?

I'm doing text classification to identify 'attacks' from Wikipedia comments using a simple bag-of-words model and a linear SVM classifier. Because of class imbalance, I'm using the F1 score as my error measure. I'm wondering whether the tokens in my training data should also include words that exist only in the test data, or whether it doesn't matter. I was under the impression that it shouldn't matter, since the counts for these features would be zero in the training set anyway, which should make them irrelevant to the model during training. That's what some people on SO seemed to be saying as well (I didn't find any definitive answer, though).

To test this, I trained my model both ways and compared the results: once with only the features present in the training data, and once with features that also included test-data vocabulary. I used 10-fold cross-validation. The CV error was very similar for both, but when I generated predictions for my test data, the F1 score was 0.06 higher for the model that included features from the test data (0.64 vs 0.58). Because this is a Kaggle assignment, I cannot see the true labels for the test set. I'm inclined to believe that such a big difference can't simply be random. It seems like including all the features from the test data did improve my model, but how could this be? Can anyone give me an explanation?
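For reference, here is a rough sketch of the comparison I ran (assuming scikit-learn; data loading is omitted and the function/variable names are just illustrative):

```python
# Sketch of the two vocabulary setups being compared (assumes scikit-learn;
# data loading is omitted and names here are illustrative).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC


def compare_vocabularies(train_texts, train_labels, test_texts):
    # Variant A: vocabulary built from the training comments only.
    vec_a = CountVectorizer()
    X_a = vec_a.fit_transform(train_texts)

    # Variant B: vocabulary built from training + test comments, so words
    # that occur only in the test data become all-zero columns in training.
    vec_b = CountVectorizer()
    vec_b.fit(list(train_texts) + list(test_texts))
    X_b = vec_b.transform(train_texts)

    for name, X in [("train-only vocab", X_a), ("train+test vocab", X_b)]:
        scores = cross_val_score(LinearSVC(), X, train_labels,
                                 cv=10, scoring="f1")
        print(f"{name}: mean CV F1 = {scores.mean():.3f}")
```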

Topic bag-of-words text-classification text-mining svm machine-learning

Category Data Science


This is the problem of out-of-vocabulary (OOV) words.

As a rule, training should not use anything from the test set, for several reasons:

  • The risk of data leakage, which would lead to overestimated performance on the test set.
  • During training, the model cannot use these words to distinguish between classes anyway, since they never occur in the training documents, so including them is pointless (see the toy demonstration after this list).
  • In principle, the model is meant to be used on any new input text, not only the documents in the current test set. Building the feature space from test-set words ties the model to that particular test set instead of keeping it applicable to arbitrary new text.
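The second point can be checked directly: a bag-of-words column that is zero for every training document contributes nothing to the hinge loss, so a regularized linear SVM leaves its weight at zero. A toy sketch (assuming scikit-learn; the data is synthetic):

```python
# Demonstration: a feature that is zero in every training sample ends up with
# a zero coefficient in a regularized linear SVM (assumes scikit-learn).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(40, 5)).astype(float)   # counts for 5 training words
X = np.hstack([X, np.zeros((40, 1))])                 # column 6: a test-only word
y = (X[:, 0] > X[:, 1]).astype(int)                   # arbitrary binary labels

clf = LinearSVC(C=1.0).fit(X, y)
print(clf.coef_[0])   # the 6th weight is 0 (up to solver tolerance)
```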

The correct way to deal with OOV words is either:

  • To simply ignore them completely, i.e. filter them out before applying the model.
  • To account for this possibility in the model from the start. Typically this is done with a special UNKNOWN token. This option is often combined with filtering out rare words from the training set: every occurrence of a rare word is replaced by the UNKNOWN token (see the sketch below).
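Note that the first option is essentially what you get for free when the vectorizer is fit on the training data only; scikit-learn's CountVectorizer, for instance, silently drops tokens outside its learned vocabulary at transform time. For the second option, here is a minimal sketch of the UNKNOWN-token idea (the token name, frequency cutoff, whitespace tokenization and helper functions are illustrative simplifications, not a standard API):

```python
# Sketch of the UNKNOWN-token approach (assumes scikit-learn; the token name,
# frequency cutoff and whitespace tokenization are illustrative choices).
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

UNK = "unknowntoken"
MIN_COUNT = 2  # words seen fewer times than this in training become UNK


def build_vocab(train_texts):
    counts = Counter(w for text in train_texts for w in text.lower().split())
    return {w for w, c in counts.items() if c >= MIN_COUNT}


def replace_oov(text, vocab):
    # Map any word outside the kept vocabulary to the shared UNKNOWN token.
    return " ".join(w if w in vocab else UNK for w in text.lower().split())


def fit_vectorizer(train_texts):
    vocab = build_vocab(train_texts)
    vec = CountVectorizer()
    X_train = vec.fit_transform(replace_oov(t, vocab) for t in train_texts)
    return vec, vocab, X_train


# At prediction time, reuse the same vocab and vectorizer, so unseen words are
# mapped to the UNKNOWN feature instead of being silently dropped:
# X_test = vec.transform(replace_oov(t, vocab) for t in test_texts)
```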

The cause of the higher performance when these features are included is unclear. I suspect that including them changes the model in a way that, by chance, happens to have a positive effect on this particular test set, but such an outcome is quite unlikely in general, so I'm not sure.
