Is there a way to train Doc2Vec on a corpus of docs and be able to take a novel doc and see how similar it is to the trained corpus?

I have a project idea where I train Doc2Vec on a set of documents and then take a novel input doc. Ideally, I would be told how similar it is to the training docs as a whole, or how well it fits with them. Is there a way to do this?

Topic doc2vec semantic-similarity nlp

Category Data Science


There are many possible approaches:

  • Use a simple similarity measure (e.g. cosine) to compare the new document against every training document.
  • Train a binary classifier to distinguish the reference documents from "anything else". This requires negative examples, and it is usually difficult to get a representative sample of them.
  • Use one-class classification, i.e. train a model using only the reference documents. The model tries to represent this class of documents and considers anything else as negative.
  • You could even treat it as a regression problem, i.e. score documents by how similar they are to the reference documents.
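
As a rough illustration of the first option, with a very simple one-class-style cutoff added on top, here is a minimal sketch. It assumes you already have one vector per document, for example from gensim's `Doc2Vec` (`model.dv` for the training docs and `model.infer_vector(tokens)` for the new one). The toy 3-dimensional vectors and the helper names (`cosine`, `centroid`, `corpus_fit`, `fit_threshold`) are made up for the example and are not part of any library.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def corpus_fit(new_vec, train_vecs):
    """Summarise how close a new vector is to the training vectors."""
    sims = [cosine(new_vec, t) for t in train_vecs]
    return {
        "mean_similarity": mean(sims),
        "max_similarity": max(sims),
        "centroid_similarity": cosine(new_vec, centroid(train_vecs)),
    }


def fit_threshold(train_vecs):
    """Crude cutoff: the lowest centroid similarity seen among training docs."""
    c = centroid(train_vecs)
    return min(cosine(t, c) for t in train_vecs)


# Toy vectors standing in for Doc2Vec embeddings (assumption for the example).
train_vecs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.1]]
in_domain = [0.95, 0.1, 0.05]
off_domain = [0.0, 0.1, 1.0]

threshold = fit_threshold(train_vecs)
for name, vec in [("in_domain", in_domain), ("off_domain", off_domain)]:
    scores = corpus_fit(vec, train_vecs)
    fits = scores["centroid_similarity"] >= threshold
    print(name, round(scores["centroid_similarity"], 3), "fits corpus:", fits)
```

The centroid similarity gives a single "how well does it fit the corpus as a whole" score, while the mean and max similarities tell you whether the new doc is close to many training docs or just to one. The min-based cutoff is only a placeholder. On real data a percentile of the training similarities, or a proper one-class model, would probably be more robust.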
