How to match a corpus with a string of words using a TF-IDF matrix?
I am trying to match a string of words to the website whose bullet points contain the most similar text. My plan is to collect the text of every bullet point on each website I want to match against into one corpus per website, discard stop words, and then lemmatize everything. Then, for each string I want to match, I build a TF-IDF sparse matrix in which each row is the text of a single bullet point from a single website, so that the matrix covers the bullet points from all the websites, plus one row for the string of words itself.
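To make the setup concrete, here is a minimal sketch of how I picture building the rows, using only the Python standard library. The website names, bullet-point texts, stop-word list, and the helper names `tokenize` and `tfidf_rows` are all made up for illustration. Lemmatization is left out, and the smoothed IDF is just one common weighting choice, not necessarily the right one.

```python
import math
from collections import Counter

# Hypothetical data: (website, bullet-point text) pairs and the string to match.
bullets = [
    ("site-a.example", "fast delivery of fresh organic vegetables"),
    ("site-a.example", "secure online payment and refunds"),
    ("site-b.example", "machine learning courses for beginners"),
]
query = "organic vegetables delivery"

STOP_WORDS = {"of", "and", "for", "the", "a"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split on whitespace, drop stop words (no lemmatization here)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_rows(texts):
    """Return one {term: weight} dict per text, i.e. one sparse row each."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    # Document frequency: in how many rows each term appears.
    df = Counter(term for d in docs for term in d)
    # Smoothed IDF, so a term present in every row still gets a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


# The last row is the query string; the rows before it are the bullet points.
rows = tfidf_rows([text for _, text in bullets] + [query])
```

Keeping the `(website, text)` pairs in the same order as the rows is what lets me trace a row back to the website it came from.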
How should I then decide which row my string of words is most similar to? Should I compute the cosine similarity between my string's row and every other row and simply take the row with the highest score (I will have a way of mapping each row back to the website it was scraped from)? Or is there a more formalized way to go about this once I have my sparse matrix?
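The cosine-similarity option I am asking about would look roughly like the sketch below. It assumes each row is stored as a sparse `{term: weight}` dict; the weights, website names, and the helper name `cosine` are hypothetical placeholders rather than real output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty row is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical TF-IDF rows, each tagged with the website it came from.
site_rows = [
    ("site-a.example", {"organic": 1.5, "vegetables": 1.5, "delivery": 1.5, "fresh": 1.9}),
    ("site-a.example", {"secure": 1.9, "payment": 1.9}),
    ("site-b.example", {"machine": 1.9, "learning": 1.9}),
]
query_row = {"organic": 1.5, "vegetables": 1.5}

# Score every bullet-point row against the query row, then take the best one.
scores = [cosine(query_row, row) for _, row in site_rows]
best_index = max(range(len(scores)), key=scores.__getitem__)
best_site = site_rows[best_index][0]
best_score = scores[best_index]
```

Sorting `scores` instead of taking only the maximum would give a ranked list of candidate bullet points, which may be more useful when several rows score similarly.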
Topic text-classification tfidf nlp
Category Data Science