Matching documents from different sets with tfidf and cosine distance
I have two different sets of documents, S1 and S2, with 30 text documents each.
Using a text representation such as tf-idf and a similarity measure such as cosine similarity, I want to match similar documents across the two sets S1 and S2.
For example, D1 from S1 is similar (say, with a score of 0.36) to D28 from S2.
My problem is that TfidfVectorizer() creates an array of shape (30, 5000) for S1 and one of shape (30, 4500) for S2, with one row per document and one column per word in that set's vocabulary.
If I calculate cosine_similarity(S1, S2) on the vectorized matrices, I think I will simply get the similarity between the sets as a whole, which is not what I am after. I am also not interested in finding similarity between documents inside the same set.
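As a side note on what a pairwise cosine computation produces: when two matrices share the same columns, the result is generally one score per pair of rows, not a single set-level number. The following is only a minimal NumPy sketch of that idea. The helper name `pairwise_cosine` and the toy vectors are hypothetical, and it assumes no row is all zeros.

```python
import numpy as np

def pairwise_cosine(A, B):
    """Cosine similarity between every row of A and every row of B.

    Assumes A and B have the same number of columns and no all-zero rows.
    """
    A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_norm = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_norm @ B_norm.T

# Hypothetical toy vectors: 2 documents in one set, 3 in the other, 4 shared terms.
A = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 1.0]])
B = np.array([[2.0, 0.0, 4.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 1.0]])

sim = pairwise_cosine(A, B)
print(sim.shape)  # one score per (row of A, row of B) pair
```

Here `sim[i, j]` compares row `i` of the first matrix with row `j` of the second, so no within-set comparisons are involved.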
My question is:
Is there a way to vectorize each document in each set on its own and then calculate pairwise distances?
Or is there a way to keep the approach above and then find which documents are similar, based on the tf-idf matrix of each set?
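One possible way to approach this, sketched under assumptions rather than offered as a definitive answer: fit a single scikit-learn `TfidfVectorizer` on both sets together so that they share one vocabulary (and therefore the same columns), transform each set separately, and then compare the two matrices with `cosine_similarity`. The toy documents and the variable names (`docs_s1`, `docs_s2`, `best_match`) are hypothetical stand-ins for the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpora standing in for the two sets of 30 documents.
docs_s1 = [
    "the cat sat on the mat",
    "stock markets fell sharply today",
    "python is a programming language",
]
docs_s2 = [
    "programming in python is fun",
    "a cat was sitting on a mat",
    "markets and stock prices fell",
]

# Fit ONE vectorizer on both sets so they share a vocabulary (same columns).
vectorizer = TfidfVectorizer()
vectorizer.fit(docs_s1 + docs_s2)
X1 = vectorizer.transform(docs_s1)  # shape: (len(docs_s1), n_terms)
X2 = vectorizer.transform(docs_s2)  # shape: (len(docs_s2), n_terms)

# sim[i, j] is the cosine similarity between docs_s1[i] and docs_s2[j].
sim = cosine_similarity(X1, X2)

# For each document in S1, the index and score of its closest match in S2.
best_match = sim.argmax(axis=1)
best_score = sim.max(axis=1)

for i, (j, score) in enumerate(zip(best_match, best_score)):
    print(f"S1[{i}] best matches S2[{j}] (similarity {score:.2f})")
```

Note that taking the row-wise maximum gives the closest S2 document for each S1 document independently, so two documents in S1 may end up matched to the same document in S2. A strict one-to-one pairing would need an additional assignment step.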