doc2vec - paragraph or article as document
I'm trying to train a doc2vec model on the German Wikipedia corpus. While looking for best practices, I've found different approaches to creating the training data.
Should I split each Wikipedia article into several documents, one per natural paragraph, or use one whole article as a single document to train my model?
EDIT: Is there an estimate of how many words per document work best for doc2vec?
Topic doc2vec wikipedia gensim nlp
Category Data Science