doc2vec - paragraph or article as document
I'm trying to train a doc2vec model on the German Wikipedia corpus. While looking for best practices, I've found different approaches to creating the training data.
Should I split each Wikipedia article into several documents, one per natural paragraph, or use one whole article as a single document to train my model?
EDIT: Is there an estimate of how many words per document work best for doc2vec?
Topic doc2vec wikipedia gensim nlp
Category Data Science