Bag of words: Prediction on new (out-of-sample) data

Question

Peter

June 29, 2020, 12:13

I'm working with a bag of words in R:

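library(tm)
# textsource is a tm Source object, e.g. VectorSource(texts)
corpus = VCorpus(textsource)
# one row per document, one column per term
dtm = DocumentTermMatrix(corpus)
# convert to a plain matrix for model fitting
dtm = as.matrix(dtm)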

I use the matrix dtm to train a lasso model.

Now I want to predict on new (unseen) text. The problem is that I need to generate a new dtm (for prediction) with the same matrix columns as the original dtm used for model training.

Essentially, I need to populate the original dtm (as used for training) with new text.

Example: the text "original text" would yield this dtm, used for training:

original | text
1          1

New (unseen) text, e.g. "new text", should instead yield this dtm for prediction (the term "new" is dropped because it is not a column of the training dtm):

original | text
0          1

Q: What is the most efficient way to populate an existing document term matrix / bag of words with new (text) data in R?
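
One possible approach, as a minimal sketch rather than a definitive answer: tm's DocumentTermMatrix accepts a dictionary entry in its control list, which restricts the tabulated terms to a given character vector. Passing the training terms as that dictionary should give a prediction dtm with the training columns. The names train_texts, new_texts, train_mat and new_mat are made up for this illustration, and the sketch assumes the new text gets the same preprocessing as the training text.

```r
library(tm)

# Hypothetical example data, mirroring the example in the question
train_texts <- c("original text")
new_texts <- c("new text")

# Training dtm, built as in the question
train_corpus <- VCorpus(VectorSource(train_texts))
train_dtm <- DocumentTermMatrix(train_corpus)
train_mat <- as.matrix(train_dtm)

# Prediction dtm: tabulate the new text against the training vocabulary only
new_corpus <- VCorpus(VectorSource(new_texts))
new_dtm <- DocumentTermMatrix(
  new_corpus,
  control = list(dictionary = Terms(train_dtm))
)
new_mat <- as.matrix(new_dtm)

# Defensive step: force the same column order as the training matrix
new_mat <- new_mat[, colnames(train_mat), drop = FALSE]

print(new_mat)
```

For the example above, new_mat should have the columns original and text, with a zero count for original. It can then be handed to the model's predict method, for instance as the newx argument if the lasso was fit with glmnet (an assumption, since the question does not name the package).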

Topic document-term-matrix bag-of-words text-classification r

Category Data Science
