How to find similar documents (using Gensim) given two or more other documents?
I am developing a program that compares documents for similarity. I've trained a Gensim model (TF-IDF and LSI) to compare documents with each other, and it works great: I can give it document A and get a list of documents that are similar to it.
I wonder: is there a way to take multiple input documents and get a list of documents that are similar to all of them? For example, could I give it documents A and B and get back a list of documents similar to both?
I can think of a few strategies:
- Simply combine the text of the documents (which is already cleaned and preprocessed, including lemmatization), create a new TF-IDF vector for that new “document,” and compare it with the database.
- I could take the lists of similar documents for A and for B, and intersect them to see which documents match both.
- There might be some other math voodoo to create a composite document vector, instead of just combining the text and recalculating, and use that to identify matches.
- Another idea is to load the match values into a graph. If I have cached the top 100 matches for each document, I could build a weighted graph of all documents, where an edge A-B carries the similarity between A and B as its weight, and then do a graph analysis to find other documents that lie “between” the two (or more) documents I want to composite.
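To make the composite-vector and intersection ideas concrete, here is a minimal sketch in plain Python. It does not use Gensim: the short hand-written vectors only stand in for the LSI vectors a trained model would produce, and the names `composite`, `top_matches`, and the toy `corpus` are made up for illustration. Averaging unit-normalised vectors is just one possible way to build a composite, not an established Gensim feature.

```python
import math


def unit(vec):
    """Scale a dense vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def composite(vectors):
    """Average the unit-normalised vectors so every input document counts equally."""
    units = [unit(v) for v in vectors]
    return [sum(column) / len(units) for column in zip(*units)]


def top_matches(query, corpus, k, exclude=()):
    """Return the k (doc_id, score) pairs most similar to query, best first."""
    scored = [
        (doc_id, cosine(query, vec))
        for doc_id, vec in corpus.items()
        if doc_id not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional "topic" vectors standing in for LSI output.
corpus = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.0, 1.0, 0.0],
    "C": [0.7, 0.7, 0.0],  # sits between A and B
    "D": [0.9, 0.1, 0.0],  # close to A only
    "E": [0.0, 0.0, 1.0],  # unrelated to both
}
inputs = {"A", "B"}

# Composite-vector idea: one query built from A and B.
query = composite([corpus[doc_id] for doc_id in inputs])
composite_hits = top_matches(query, corpus, k=2, exclude=inputs)

# Intersection idea: separate top-k lists for A and B, then intersect them.
hits_a = {d for d, _ in top_matches(corpus["A"], corpus, k=1, exclude=inputs)}
hits_b = {d for d, _ in top_matches(corpus["B"], corpus, k=1, exclude=inputs)}
shared = hits_a & hits_b

print(composite_hits)
print(shared)
```

On this toy data the composite query ranks C first, while the intersection of the two top-1 lists is empty, because A's best match is D and B's best match is C. The intersection result depends heavily on the cutoff `k`, so the two strategies can give quite different answers.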
Is there a way to do this, generally or through the Gensim API specifically? What would be the best way to put this together?
Topic gensim tfidf nlp python similarity
Category Data Science