Bag of words: Prediction on new (out-of-sample) data
I'm working with a bag of words in R:
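library(tm)
# textsource must be a tm Source object, e.g. VectorSource(character_vector)
corpus <- VCorpus(textsource)
dtm <- DocumentTermMatrix(corpus)
# convert the sparse document-term matrix to a dense matrix
dtm <- as.matrix(dtm)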
I use the matrix dtm to train a lasso model.
Now I want to predict new (unseen) text. The problem is that I need to generate a new dtm (for prediction) with the same matrix columns as the original dtm used for model training.
Essentially, I need to populate the original dtm (as used for training) with new text.
Example: the training text "original text" would yield this dtm used for training:
original | text
1        | 1
New (unseen) text, e.g. "new text", should instead yield this dtm for prediction:
original | text
0        | 1
Q: What is the most efficient way to populate an existing document term matrix / bag of words with new (text) data in R?
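One possible approach, sketched here under assumptions rather than as a definitive answer: tm's DocumentTermMatrix accepts a `dictionary` entry in its `control` list, which restricts tabulation to a given vocabulary. Passing the training column names as the dictionary should give a prediction matrix with the training columns. The variable names `train_text`, `new_text`, `train_dtm` and `new_dtm` are hypothetical and only mirror the example in the question.

```r
library(tm)

# Hypothetical training and new documents, taken from the example in the question
train_text <- c("original text")
new_text   <- c("new text")

# Build the training document-term matrix as in the question
train_corpus <- VCorpus(VectorSource(train_text))
train_dtm    <- as.matrix(DocumentTermMatrix(train_corpus))

# Tabulate the new text against the training vocabulary only;
# terms outside the dictionary (here "new") are dropped
new_corpus <- VCorpus(VectorSource(new_text))
new_dtm <- as.matrix(DocumentTermMatrix(
  new_corpus,
  control = list(dictionary = colnames(train_dtm))
))

# Defensive step: force the same column order as the training matrix
new_dtm <- new_dtm[, colnames(train_dtm), drop = FALSE]
```

Any preprocessing applied when building the training dtm (stemming, stop word removal, weighting) would have to be repeated with the same `control` settings for the new text, otherwise the counts are not comparable. If the lasso was fit with glmnet (an assumption, since the question does not name the package), `new_dtm` could then be passed as `newx` to `predict()`.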
Topic document-term-matrix bag-of-words text-classification r
Category Data Science