Building a graph out of a large text corpus

I have been given a large number of documents on which I need to perform various kinds of analysis. Since the documents are to serve as the foundation of a final product, I thought about building a graph out of this text corpus, with each document corresponding to a node.

One way to build the graph would be to use a model such as USE to compute text embeddings, and then form a link between two nodes (texts) whose similarity is beyond a given threshold. However, I believe it would be better to use an algorithm based on plain text similarity measures, i.e., one that does not convert the texts into embeddings. As before, I would form a link between two nodes (texts) if their text similarity is beyond a given threshold. Now, the question is: what is the simplest way to measure the similarity of two texts, and what would be the more sophisticated ways? I thought about first extracting the keywords from the two texts and then calculating the Jaccard index.
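
For the keyword-plus-Jaccard idea, here is a minimal sketch of what I have in mind (the tokenizer and the tiny stop-word list are just placeholders for a real keyword extractor):

```python
import re

# Crude "keyword" extraction: lowercased word tokens minus a tiny stop-word list.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are", "for"}

def keywords(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def jaccard(text_a, text_b):
    a, b = keywords(text_a), keywords(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

doc1 = "Graphs built from a text corpus, one node per document."
doc2 = "Each document in the corpus becomes a node of the graph."
print(jaccard(doc1, doc2))  # link the two nodes if this exceeds the threshold
```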

Any idea on how this could be achieved is highly welcome. Feel free to post links to papers that address the issue.

NB: I would also appreciate links to Python libraries that might be helpful in this regard.

Tags: similar-documents, graphs, text-mining, nlp, similarity

Category: Data Science


It looks to me like topic modeling methods would be a good candidate for this problem. This option has several advantages: it's very standard with many libraries available, and it's very efficient (at least the standard LDA method) compared to calculating pairwise similarity between documents.

A topic model is made of:

  • a set of topics, each represented as a probability distribution over the words. This is typically used to represent each topic as a list of its top representative words.
  • for each document, a distribution over the topics. This can be used to assign the most likely topic to each document and to group documents into clusters by topic, but it's also possible to use a more fine-grained similarity measure between the distributions (see the sketch after this list).
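
As an illustration, here is a small sketch using gensim's LdaModel (one possible library choice; the toy corpus and parameters are made up) that prints both parts of the model:

```python
from gensim import corpora, models

# Toy corpus, already tokenized; in practice you would tokenize and clean your documents.
texts = [
    ["graph", "node", "edge", "similarity", "threshold"],
    ["topic", "model", "lda", "word", "distribution"],
    ["document", "corpus", "similarity", "jaccard", "keyword"],
]

dictionary = corpora.Dictionary(texts)               # word <-> id mapping
bow_corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors

lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, random_state=0)

# 1) topics as distributions over words (top representative words per topic)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)

# 2) per-document distribution over topics
for bow in bow_corpus:
    print(lda.get_document_topics(bow, minimum_probability=0.0))
```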

The typical difficulty with LDA is choosing the number of topics. A better but less well-known alternative is HDP, which infers the number of topics itself. It's less standard, but there are a few implementations (like this one) apparently. There are also more recent neural topic models that use embeddings (for example ETM).
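
For HDP, gensim also provides an HdpModel; a minimal sketch on the same kind of made-up toy data (note that there is no num_topics argument, since the model infers it):

```python
from gensim import corpora, models

texts = [
    ["graph", "node", "edge", "similarity"],
    ["topic", "model", "word", "distribution"],
    ["document", "corpus", "keyword", "similarity"],
]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# HDP infers the number of topics from the data.
hdp = models.HdpModel(bow_corpus, id2word=dictionary)
for topic_id, words in hdp.print_topics(num_words=5):
    print(topic_id, words)
```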


Update

Actually, I'm not really convinced by the idea of converting the data into a graph: unless there is a specific goal behind it, analyzing the graph version of a large amount of text data is not necessarily simpler. In particular, it should be noted that any form of clustering on the graph is unlikely (in general) to produce better results than topic modelling: the latter produces a probabilistic clustering based on the words in the documents, and this usually offers a quite good way to summarize and group the documents.

In any case, it would be possible to produce a graph based on each document's distribution over topics (this is the most natural way; there might be others). Calculating a pairwise similarity between these distributions would connect closely related pairs of documents with a high-weight edge, and conversely. Naturally, a threshold can be used to remove edges corresponding to low similarity.
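
As a sketch of that last step, assuming you already have a document-topic matrix (here a made-up array, rows = documents, columns = topics), you could compute a Jensen-Shannon-based similarity and keep only the edges above a threshold with networkx:

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import jensenshannon

# Made-up document-topic distributions for three documents.
doc_topic = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
])
threshold = 0.7  # keep only sufficiently similar pairs

g = nx.Graph()
g.add_nodes_from(range(len(doc_topic)))
for i in range(len(doc_topic)):
    for j in range(i + 1, len(doc_topic)):
        # Jensen-Shannon distance with base 2 lies in [0, 1], so 1 - distance
        # gives a similarity in [0, 1].
        sim = 1.0 - jensenshannon(doc_topic[i], doc_topic[j], base=2)
        if sim >= threshold:
            g.add_edge(i, j, weight=sim)

print(g.edges(data=True))  # here, only documents 0 and 1 end up linked
```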
