How to match a corpus with a string of words using a TF-IDF matrix?

Question

sangstar

April 26, 2022, 17:00

I am trying to match strings of words to the website whose bullet points contain the most similar text. My plan is to collect the text of every bullet point from each website I want to match against into one corpus per website, discard stop words, and lemmatize everything. Then, for each string of text, I create a TF-IDF sparse matrix in which each row holds the text of a single bullet point from a single website, so that the matrix contains the text of all bullet points from all websites, plus one row for the string of words I want to match.

How should I then decide which row my string of words is most similar to? Should I compute the cosine similarity between my string's row and every other row and simply take the row with the highest cosine similarity (I will have a way of mapping each row back to the website it was scraped from)? Or is there a more formal way to go about this once I have my sparse matrix?

Topic text-classification tfidf nlp

Category Data Science

Erwan · Accepted Answer · July 14, 2021, 11:15

The problem you describe looks very close to standard information retrieval: given a predefined set of documents $D$ and an input string $s$, find the most similar document $d\in D$ to $s$ (alternatively find the top $n$ documents $d$ most similar to $s$).

The approach you describe is good, except that in general the input string $s$ is not part of the TF-IDF matrix. The full set of predefined documents is encoded as a TF-IDF matrix, and any input string $s$ is then simply encoded using the same vocabulary and weights. The advantage is that you don't need to recompute the matrix for every new string $s$: the matrix can be precomputed and stored for efficiency. There is no disadvantage, because any word in $s$ that is not in the vocabulary cannot contribute to the similarity calculation anyway.
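To make the "same vocabulary and weights" idea concrete, here is a minimal sketch in plain Python. It assumes the texts are already tokenized and lemmatized, uses one common smoothed IDF variant (other weighting schemes are equally valid), and the helper names `fit_idf` and `encode` as well as the sample bullet points are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn the vocabulary and IDF weights from the documents only."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

def encode(tokens, idf):
    """Encode tokens as a sparse TF-IDF vector; unknown words are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

# Hypothetical bullet-point texts, already tokenized and lemmatized.
bullets = [
    ["fast", "delivery", "free", "return"],
    ["organic", "cotton", "shirt"],
    ["free", "shipping", "order"],
]

idf = fit_idf(bullets)                           # computed once, can be stored
matrix = [encode(doc, idf) for doc in bullets]   # precomputed TF-IDF rows

# The query is encoded with the stored weights; the matrix is not recomputed.
query = ["free", "delivery", "tomorrow"]         # "tomorrow" is out of vocabulary
query_vec = encode(query, idf)
```

Here `query_vec` only has entries for "free" and "delivery": "tomorrow" never appears in the bullet points, so it has no weight and is ignored.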

Indeed, the standard method for matching or ranking the documents with respect to $s$ is to calculate a similarity score (e.g. cosine) between $s$ and every $d$, and then pick the document with the highest score (or the top $n$ documents).
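The ranking step could look roughly like the following sketch, again in plain Python with tokenized input. The function names (`tfidf_rows`, `cosine`, `rank`), the smoothed IDF variant, and the sample data are assumptions for illustration, not a reference implementation.

```python
import math
from collections import Counter

def tfidf_rows(docs, query):
    """Weight docs and query with IDF learned from the docs only."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def enc(tokens):
        return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

    return [enc(d) for d in docs], enc(query)

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(docs, query, top_n=3):
    """Return (doc_index, score) pairs sorted by decreasing similarity."""
    rows, q = tfidf_rows(docs, query)
    scores = [(i, cosine(row, q)) for i, row in enumerate(rows)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical bullet-point texts, already tokenized and lemmatized.
bullets = [
    ["fast", "delivery", "free", "return"],
    ["organic", "cotton", "shirt"],
    ["free", "shipping", "order"],
]

ranking = rank(bullets, ["free", "delivery", "tomorrow"])
best_index, best_score = ranking[0]   # index of the most similar bullet point
```

With this sample data the first bullet point ranks highest, since it shares both "free" and "delivery" with the query. In practice `best_index` would be looked up in whatever table maps each row back to the website it was scraped from.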
