How to represent a document in test data with the Document-Term Matrix created from the training set?

I built a document classifier using the vector representation of each document in the training set (i.e., a row in the Document-Term Matrix). Now I need to test the model on the test data. But how can I represent a new document with that Document-Term Matrix, since some of its terms might not appear in the training data?

Topic vector-space-models lsi text-mining python

Category Data Science


If you choose to use scikit-learn's CountVectorizer, words that appear in the test dataset but not in the training dataset are automatically ignored.

Call the fit_transform method on the training data; it learns the vocabulary and creates the document-term matrix. Then call only the transform method on the test data; it maps those documents onto the columns learned during training, automatically dropping any new terms.
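A minimal sketch of that workflow, assuming scikit-learn is installed. The toy documents are made up for illustration. Note that with its default settings CountVectorizer lowercases text and drops single-character tokens, so "I" does not become a column here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora, invented for this example.
train_docs = ["I like cake", "I like tea"]
test_docs = ["I like coffee cake"]  # "coffee" never occurs in training

vectorizer = CountVectorizer()

# Learn the vocabulary from the training data and build its matrix.
X_train = vectorizer.fit_transform(train_docs)

# Reuse the learned vocabulary; unseen terms such as "coffee" are dropped.
X_test = vectorizer.transform(test_docs)

print(vectorizer.vocabulary_)  # term -> column index learned from training
print(X_test.toarray())        # same number of columns as X_train
```

Because both matrices share the same columns, a classifier fitted on X_train can score X_test directly.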


The easiest way is to treat all out-of-vocabulary terms as one dedicated term in your matrix (e.g. "OOV").

So for instance, if my training data contains 3 words ("I", "like", "cake"), my document-term matrix would contain 4 columns: "I", "like", "cake", and "OOV". Every unseen word in a test document is counted in the "OOV" column.
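The idea can be sketched in plain Python as follows. The helper names build_vocabulary and vectorize are hypothetical, tokenization is a simple whitespace split, and the sketch assumes the literal token "OOV" never occurs in the training text.

```python
OOV = "OOV"  # placeholder column for every unseen term (assumed name)

def build_vocabulary(train_docs):
    """Map each training term to a column index, with OOV as the last column."""
    terms = sorted({token for doc in train_docs for token in doc.split()})
    terms.append(OOV)
    return {term: idx for idx, term in enumerate(terms)}

def vectorize(doc, vocab):
    """Count terms in doc; anything outside the vocabulary goes to the OOV column."""
    row = [0] * len(vocab)
    for token in doc.split():
        row[vocab.get(token, vocab[OOV])] += 1
    return row

vocab = build_vocabulary(["I like cake"])
row = vectorize("I like chocolate fudge cake", vocab)

print(vocab)  # four columns: the three training words plus OOV
print(row)    # "chocolate" and "fudge" are both counted under OOV
```

One caveat with this approach: the OOV column is all zeros in the training matrix, so the classifier cannot learn a weight for it unless some rare training terms are also mapped to OOV before fitting.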
