How to find similar documents (using Gensim) given two or more input documents?

I am developing a similarity program to compare documents. I have successfully trained my model with Gensim (TF-IDF and LSI) to compare two documents against each other, and it works well: I can give it document A and get back a list of documents that are similar to it.
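
For reference, my single-document pipeline looks roughly like this (a minimal sketch; `texts` stands for my already-preprocessed, tokenized corpus, and `num_topics=200` is just an illustrative choice):

```python
from gensim import corpora, models, similarities

# `texts` is the preprocessed, tokenized corpus (list of token lists).
dictionary = corpora.Dictionary(texts)
corpus_bow = [dictionary.doc2bow(tokens) for tokens in texts]

tfidf = models.TfidfModel(corpus_bow)
lsi = models.LsiModel(tfidf[corpus_bow], id2word=dictionary, num_topics=200)

# Similarity index over the whole corpus in LSI space.
index = similarities.MatrixSimilarity(lsi[tfidf[corpus_bow]])

def similar_to(tokens, topn=10):
    """Single-document query: document tokens -> ranked list of (doc_id, score)."""
    vec = lsi[tfidf[dictionary.doc2bow(tokens)]]
    sims = index[vec]  # cosine similarity against every document in the corpus
    return sorted(enumerate(sims), key=lambda x: -x[1])[:topn]
```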

I wonder: is there a way to take multiple input documents and get a list of documents that are similar to them? That is, can I give it documents A and B and get back a list of documents that are similar to both?

I can think of a couple of strategies:

  1. Simply combine the text of the documents (which is already cleaned and preprocessed, including lemmatization), create a new TF-IDF vector for that combined “document,” and compare it against the database (sketched after this list).
  2. Take the list of similar documents for A and for B separately, and intersect the two lists to see which documents match both.
  3. There might be some other math voodoo to create a composite document vector, instead of just combining the raw text and recalculating, and use that to identify matches.
  4. Alternatively, I could load the match values into a graph. If I have cached the top 100 matches for each document, I could build a weighted graph of all documents, where each edge carries the similarity score between two documents, and run a graph analysis to find documents that are “between” the two (or more) documents I want to composite (also sketched after this list).
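
To make strategies 1–3 concrete, here are rough, untested sketches of what I have in mind, reusing `dictionary`, `tfidf`, `lsi`, `index`, and `similar_to` from the snippet above (`tokens_a` / `tokens_b` are the preprocessed token lists for documents A and B):

```python
import numpy as np
from gensim import matutils

# Strategy 1: concatenate the token lists and query with the combined "document".
def similar_to_combined(tokens_a, tokens_b, topn=10):
    vec = lsi[tfidf[dictionary.doc2bow(tokens_a + tokens_b)]]
    sims = index[vec]
    return sorted(enumerate(sims), key=lambda x: -x[1])[:topn]

# Strategy 2: intersect the top-N result sets for A and B,
# ranking shared hits by (for example) the sum of their similarity scores.
def similar_to_both(tokens_a, tokens_b, topn=100):
    hits_a = dict(similar_to(tokens_a, topn))
    hits_b = dict(similar_to(tokens_b, topn))
    shared = set(hits_a) & set(hits_b)
    return sorted(((d, hits_a[d] + hits_b[d]) for d in shared), key=lambda x: -x[1])

# Strategy 3: average the dense LSI vectors (a centroid) instead of combining raw text.
def similar_to_centroid(tokens_a, tokens_b, topn=10):
    dense = [matutils.sparse2full(lsi[tfidf[dictionary.doc2bow(t)]], lsi.num_topics)
             for t in (tokens_a, tokens_b)]
    centroid = np.mean(dense, axis=0)
    sims = index[matutils.full2sparse(centroid)]  # query the index with the centroid
    return sorted(enumerate(sims), key=lambda x: -x[1])[:topn]
```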

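For strategy 4, here is a rough sketch of the graph idea using networkx; the `top_matches` structure is just an assumption about how I would cache the precomputed top-100 lists:

```python
import networkx as nx

# Assumed cache shape: {doc_id: [(other_doc_id, similarity), ...]} with the top 100 per document.
def candidates_between(top_matches, doc_a, doc_b):
    G = nx.Graph()
    for doc, matches in top_matches.items():
        for other, sim in matches:
            G.add_edge(doc, other, weight=sim)
    # One possible "betweenness" heuristic: documents adjacent to both A and B,
    # ranked by the combined edge weight of their connections to the pair.
    shared = set(G[doc_a]) & set(G[doc_b])
    scored = [(d, G[doc_a][d]["weight"] + G[doc_b][d]["weight"]) for d in shared]
    return sorted(scored, key=lambda x: -x[1])
```
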
Is there a way to do this, either in general or through the Gensim API specifically? What would be the best way to put this together?

Tags: gensim, tfidf, nlp, python, similarity

Category: Data Science
