Should the bag-of-words vocabulary in the training set include test-set tokens when doing text classification?
I'm doing text classification to identify 'attacks' in Wikipedia comments using a simple bag-of-words model and a linear SVM classifier. Because of class imbalance, I'm using the F1 score as my error measure. I'm wondering whether the tokens in my training data should also include words that exist only in the test data, or whether it doesn't matter. I was under the impression that it shouldn't matter, since the counts for these features would be zero in the training set anyway, which should make them irrelevant to the model during training. That's apparently what some people on SO were saying as well (I didn't find any definitive answer, though).
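To illustrate why I expected it not to matter, here's a minimal sketch (not my actual pipeline; the documents are made up) using scikit-learn's CountVectorizer: a token that never appears in the training documents just produces an all-zero column in the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["you are wrong", "great point thanks"]
test_docs = ["you are an idiot"]  # "an" and "idiot" occur only in the test data

# Build the vocabulary from train + test (the "include test tokens" variant)
vec = CountVectorizer()
vec.fit(train_docs + test_docs)

# The training matrix gets a column for "idiot", but every count in it is zero
X_train = vec.transform(train_docs)
col = vec.vocabulary_["idiot"]
print(X_train[:, col].sum())  # prints 0
```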
To test this, I trained my model both ways and compared the results: once with features from the training data only, and once with a vocabulary that also included test-data features. I used 10-fold cross-validation. The CV error was very similar for both, but when I generated predictions for the test data, my F1 score was 0.06 higher for the model whose vocabulary included test-data features: 0.64 vs. 0.58. Because this is a Kaggle assignment, I cannot see the true labels for the test set. I'm inclined to believe that such a big difference can't simply be random. It seems that including the features from the test data did improve my model, but how can that be? Can anyone give me an explanation?
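For reference, here's roughly how I set up the comparison (a simplified sketch, not my actual code; `train_texts`, `test_texts`, and `y_train` are placeholder names):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def cv_f1(vocab_docs, train_texts, y_train):
    """Mean F1 over 10-fold CV, with the vocabulary built from vocab_docs."""
    vec = CountVectorizer()
    vec.fit(vocab_docs)
    X = vec.transform(train_texts)
    return cross_val_score(LinearSVC(), X, y_train, cv=10, scoring="f1").mean()

# Variant 1: vocabulary built from the training data only
# f1_train_only = cv_f1(train_texts, train_texts, y_train)

# Variant 2: vocabulary built from training + test data
# f1_with_test = cv_f1(train_texts + test_texts, train_texts, y_train)
```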
Topic bag-of-words text-classification text-mining svm machine-learning
Category Data Science