Choice of the number of topics (clusters) in textual data

I have a social science background and I'm working on a text mining project. I'm looking for advice about choosing the number of topics/clusters when analyzing textual data. Specifically, I'm analyzing a dataset of more than 200,000 tweets and fitting a Latent Dirichlet Allocation (LDA) model to them to find clusters that represent the main topics in the dataset. However, when I tried to determine the optimal number of clusters, the results shown in the picture seemed inconsistent.

I'm struggling with the choice of the number of clusters. So the question is: what number would you choose based on the plot? And are there other methods and/or conventional rules one can rely on to choose the number of clusters?

Topic: text-mining, lda, topic-model, nlp, clustering

Category: Data Science


You can use perplexity or the coherence score to choose the number of topics.

  1. Perplexity is a statistical measure of how well a probability model predicts a sample. Applied to LDA: for a given value of k, estimate the LDA model, then compare the theoretical word distributions represented by the topics to the actual distribution of words in your documents. The lower, the better.
  2. Coherence score is defined as the average/median of the pairwise word-similarity scores of the words in a topic. The higher, the better.
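A minimal sketch of the perplexity-based approach, using scikit-learn's `LatentDirichletAllocation` (the original post does not name a library, and the toy documents below stand in for the actual tweet dataset): fit an LDA model for each candidate k and compare held-out perplexity, preferring the k with the lowest value.

```python
# Compare perplexity across candidate topic counts to pick k.
# Toy corpus stands in for the real tweet dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the economy and jobs report",
    "football match tonight on tv",
    "new jobs in the tech economy",
    "watching the football game tonight",
]
X = CountVectorizer().fit_transform(docs)

scores = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    scores[k] = lda.perplexity(X)  # lower is better

best_k = min(scores, key=scores.get)
print(scores, best_k)
```

For the coherence score, scikit-learn has no built-in metric; gensim's `CoherenceModel` is a common choice there. In practice you would also hold out a validation set rather than scoring perplexity on the training documents, as done above for brevity.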
