Building a graph out of a large text corpus
I'm given a large number of documents on which I should perform various kinds of analysis. Since the documents will serve as the foundation of a final product, I thought about building a graph out of this text corpus, with each document corresponding to a node.
One way to build the graph would be to use a model such as the Universal Sentence Encoder (USE) to compute text embeddings, and then form a link between two nodes (texts) whose similarity exceeds a given threshold. However, I believe it would be better to use an algorithm based on plain-text similarity measures, i.e., one that does not convert the texts into embeddings. As before, I would form a link between two nodes (texts) if their text similarity exceeds a given threshold.

Now, the question is: what is the simplest way to measure the similarity of two texts, and what would be more sophisticated ways? I thought about first extracting the keywords from the two texts and then calculating the Jaccard index.
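To make the plain-text idea concrete, here is a minimal sketch of what I have in mind. The keyword extraction is just a stand-in (lowercased tokens minus a small stop-word list), the 0.2 threshold is an arbitrary placeholder, and using networkx for the graph is simply my own choice:

```python
import re
import itertools
import networkx as nx

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "for", "on", "with"}

def keywords(text):
    """Very naive keyword extraction: unique lowercased alphabetic tokens minus stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def jaccard(a, b):
    """Jaccard index of two keyword sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def build_graph(documents, threshold=0.2):
    """Link two documents if the Jaccard index of their keyword sets exceeds the threshold."""
    graph = nx.Graph()
    keyword_sets = {doc_id: keywords(text) for doc_id, text in documents.items()}
    graph.add_nodes_from(keyword_sets)
    for (id_a, kw_a), (id_b, kw_b) in itertools.combinations(keyword_sets.items(), 2):
        sim = jaccard(kw_a, kw_b)
        if sim > threshold:
            graph.add_edge(id_a, id_b, weight=sim)
    return graph

docs = {
    "doc1": "Graphs can represent relations between documents in a corpus.",
    "doc2": "A corpus of documents can be modelled as a graph of relations.",
    "doc3": "Stock prices rose sharply after the earnings report.",
}
g = build_graph(docs, threshold=0.2)
print(g.edges(data=True))
```

I'm mainly looking for better alternatives to the `keywords` and `jaccard` steps in this sketch.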
Any ideas on how this could be achieved are highly welcome. Feel free to post links to papers that address the issue.
NB: I would also appreciate links to Python libraries that might be helpful in this regard.
Topic: similar-documents graphs text-mining nlp similarity
Category: Data Science