How to represent a document in test data with the Document-Term Matrix created from the training set?

Question

How to represent a document in test data with the Document-Term Matrix created from the training set?

Paw in Data

April 10, 2022, 09:01

I built a document classifier using the vector representation of each document in the training set (i.e., a row in the Document-Term Matrix). Now I need to evaluate the model on the test data. But how can I represent a new document with that Document-Term Matrix, given that some of its terms might not appear in the training data?

Topic vector-space-models lsi text-mining python

Category Data Science

Brian Spiering · Accepted Answer · February 18, 2021, 17:11

If you choose to use scikit-learn's CountVectorizer, words that appear in the test dataset but not in the training dataset are automatically ignored.

Call the fit_transform method on the training data to learn the vocabulary and build the document-term matrix. Then call only the transform method on the test data: it maps those documents onto the columns learned during training and automatically drops any new terms.
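As a rough illustration of that fit/transform split, here is a minimal sketch. The toy documents and variable names are made up for the example, and it assumes CountVectorizer's default settings, which lowercase the text and skip single-character tokens such as "I".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora, invented for this illustration.
train_docs = ["I like cake", "I like tea"]
test_docs = ["I like coffee and cake"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training set and build its document-term matrix.
X_train = vectorizer.fit_transform(train_docs)

# Reuse the learned vocabulary on the test set; "coffee" and "and" were never
# seen during fitting, so they are silently dropped.
X_test = vectorizer.transform(test_docs)

# Column names in column order, recovered from the term -> index mapping.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(columns)
print(X_test.toarray())
```

Both matrices have the same number of columns, so a classifier fitted on X_train can score X_test directly.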

Valentin Calomme · Accepted Answer · May 5, 2020, 20:54

The easiest way is to map all out-of-vocabulary terms to one dedicated term in your matrix (e.g., "OOV").

So for instance, if my training data contains 3 words: "I", "like", "cake", my document-term matrix would contain 4 columns: "I", "like", "cake", and "OOV". Every unseen word in a test document is then counted in the "OOV" column.
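One way this could look in plain Python is sketched below. The helper names build_vocabulary and to_row are hypothetical, tokenization is a naive whitespace split, and the test sentence is invented; it is an illustration of the idea rather than a reference implementation.

```python
from collections import Counter

OOV = "OOV"  # reserved column name (an assumption; any reserved token works)


def build_vocabulary(train_docs):
    """Map each training term to a column index; the last column is OOV."""
    terms = []
    for doc in train_docs:
        for term in doc.split():  # naive whitespace tokenization
            if term != OOV and term not in terms:
                terms.append(term)
    terms.append(OOV)
    return {term: index for index, term in enumerate(terms)}


def to_row(doc, vocabulary):
    """Count the terms of one document, folding unseen terms into OOV."""
    counts = Counter(
        term if term in vocabulary else OOV for term in doc.split()
    )
    row = [0] * len(vocabulary)
    for term, count in counts.items():
        row[vocabulary[term]] = count
    return row


vocabulary = build_vocabulary(["I like cake"])
# "chocolate" and "daily" are unseen, so both land in the OOV column.
row = to_row("I like chocolate cake daily", vocabulary)

print(vocabulary)
print(row)
```

Note that if the "OOV" column is always zero in the training matrix, the classifier learns nothing about it; a common workaround is to also map rare training terms to "OOV" before fitting.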
