Is there a way to train Doc2Vec on a corpus of docs and be able to take a novel doc and see how similar it is to the trained corpus?

Question

sangstar

December 6, 2021, 00:26

I have a project idea: train Doc2Vec on a set of documents, then take a new input document and find out how similar it is to the training documents as a whole, or how well it fits in with them. Is there a way to do this?

Topic doc2vec semantic-similarity nlp

Category Data Science

Erwan · Accepted Answer · December 6, 2021, 00:26

There are many possible approaches:

Use a simple similarity measure (e.g. cosine) to compare the new document's vector against every training document.
Train a binary classifier to distinguish the reference documents from "anything else". This requires negative examples, and it is usually difficult to obtain a representative sample of them.
Use one-class classification, i.e. train a model using only the reference documents. The model tries to represent this class of documents and treats anything else as negative.
Treat it as a regression problem, i.e. score documents by how similar they are to the reference documents.
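
The first approach can be sketched as follows. This is only a minimal illustration, not a definitive implementation. The function names (cosine, score_against_corpus, embed_with_doc2vec), the hyperparameters, and the choice of max, mean and centroid similarity as summary scores are all assumptions made for the example. The embedding helper assumes gensim 4.x, where trained document vectors are exposed as model.dv; the scoring functions work on any list of vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_against_corpus(new_vec, train_vecs):
    """Summarise how similar one new vector is to a set of training vectors.

    Returns the highest and the average per-document cosine similarity,
    plus the similarity to the centroid (mean vector) of the training set.
    """
    sims = [cosine(new_vec, t) for t in train_vecs]
    n = len(train_vecs)
    centroid = [sum(t[i] for t in train_vecs) / n for i in range(len(new_vec))]
    return {
        "max": max(sims),
        "mean": sum(sims) / n,
        "centroid": cosine(new_vec, centroid),
    }


def embed_with_doc2vec(train_tokens, new_tokens):
    """Hypothetical helper: train Doc2Vec and embed one unseen document.

    train_tokens is a list of token lists, new_tokens a single token list.
    Assumes gensim 4.x; the hyperparameters are illustrative, not tuned.
    """
    # Imported here so the scoring functions above work without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(train_tokens)]
    model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
    train_vecs = [model.dv[i].tolist() for i in range(len(tagged))]
    new_vec = model.infer_vector(new_tokens).tolist()
    return new_vec, train_vecs
```

In use, one would call embed_with_doc2vec on the tokenized corpus and the new document, then pass the result to score_against_corpus. Which summary score is most useful depends on the corpus: the maximum answers "is there at least one close training document?", while the mean and centroid scores are closer to "how well does it fit the corpus as a whole?".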
