Matching documents from different sets with tfidf and cosine distance

Question

forgotten_novel_char

April 16, 2021, 10:14

I have two different sets of documents, S1 and S2, with 30 text documents each. Using a text representation method such as tfidf and a similarity measure such as cosine similarity, I want to match similar documents across the two sets.

For example, D1 from S1 is similar (say, with a score of 0.36) to D28 from S2.

My problem is that TfidfVectorizer() creates an array of shape (30, 5000) for S1 and (30, 4500) for S2: one row per document, and one column per word found anywhere in that set.
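To make the shape mismatch concrete, here is a minimal sketch, assuming scikit-learn's TfidfVectorizer and two toy sets of three documents each (the documents are made up for illustration). Fitting one vectorizer per set gives each set its own vocabulary, so the column counts differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for S1 and S2 (hypothetical data, 3 documents each).
S1 = ["the cat sat on the mat", "dogs chase cats", "birds sing at dawn"]
S2 = ["a cat on a mat", "the stock market fell sharply today", "dawn chorus of birds"]

# A separate vectorizer per set learns a separate vocabulary per set.
X1 = TfidfVectorizer().fit_transform(S1)
X2 = TfidfVectorizer().fit_transform(S2)

# Same number of rows (documents), different number of columns (vocabulary size).
print(X1.shape, X2.shape)
```

Because column k in X1 and column k in X2 generally stand for different words, the two matrices are not directly comparable.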

If I call cosine_similarity(S1, S2) on the two vectorized matrices, I will simply get the similarity between the sets as a whole, which is not what I am after. I am also not interested in finding similarity between documents inside the same set.

Question is:

Is there a way to vectorize each document in each set on its own and then calculate pairwise distances?

Or is there a way to keep the approach above and then find which documents are similar, based on each set's tfidf matrix as described?
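One possible way to get per-document matches, sketched here under assumptions rather than as a definitive answer: fit a single TfidfVectorizer on both sets together so they share one vocabulary, transform each set separately, and compute the cross-set cosine similarity matrix. The toy documents and variable names below are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for S1 and S2 (hypothetical data, 3 documents each).
S1 = ["the cat sat on the mat", "the stock market rallied", "birds sing at dawn"]
S2 = ["a cat on a mat", "the stock market fell sharply today", "dawn chorus of birds"]

# Fit once on the union so both sets are mapped onto the same columns.
vectorizer = TfidfVectorizer()
vectorizer.fit(S1 + S2)
X1 = vectorizer.transform(S1)
X2 = vectorizer.transform(S2)

# sim[i, j] is the cosine similarity between document i of S1 and document j of S2.
sim = cosine_similarity(X1, X2)

# For each document in S1, pick the most similar document in S2.
best = sim.argmax(axis=1)
for i, j in enumerate(best):
    print(f"D{i + 1} from S1 best matches D{j + 1} from S2 (similarity {sim[i, j]:.2f})")
```

With 30 documents per set, sim would be a 30 x 30 matrix that contains only cross-set pairs, so similarities within the same set never enter the result. Note that fitting on the union also means the idf weights are computed over all 60 documents, which may or may not be what you want.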

Topic document-term-matrix similar-documents cosine-distance tfidf dataset

Category Data Science
