Choice of the number of topics (clusters) in textual data

I have a social science background and I'm working on a text mining project. I'm looking for advice about choosing the number of topics/clusters when analyzing textual data. Specifically, I'm analyzing a dataset of more than 200,000 tweets and fitting a Latent Dirichlet Allocation (LDA) model to them to find clusters that represent the main topics in the dataset. However, when I tried to determine the optimal number of clusters, the results shown in the picture seemed inconsistent.

I'm struggling with the choice of the number of clusters. So the question is: what number would you choose based on the plot? And are there other methods and/or conventional rules one can rely on to choose the number of clusters?

Topic: text-mining, lda, topic-model, nlp, clustering

Category: Data Science


You can use perplexity or the coherence score to choose the number of topics.

  1. Perplexity is a statistical measure of how well a probability model predicts a sample. Applied to LDA: for a given value of k, estimate the LDA model, then compare the theoretical word distributions represented by the topics to the actual distribution of words in your documents. The lower, the better.
  2. Coherence score is defined as the average/median of the pairwise word-similarity scores of the words in a topic. The higher, the better.
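A minimal sketch of the perplexity-based approach, using scikit-learn's `LatentDirichletAllocation` (the original post does not name a library, and the toy documents below stand in for the actual tweet dataset): fit an LDA model for each candidate k and compare held-out perplexity, preferring the k with the lowest value.

```python
# Compare perplexity across candidate topic counts to pick k.
# Toy corpus stands in for the real tweet dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the economy and jobs report",
    "football match tonight on tv",
    "new jobs in the tech economy",
    "watching the football game tonight",
]
X = CountVectorizer().fit_transform(docs)

scores = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    scores[k] = lda.perplexity(X)  # lower is better

best_k = min(scores, key=scores.get)
print(scores, best_k)
```

For the coherence score, scikit-learn has no built-in metric; gensim's `CoherenceModel` is a common choice there. In practice you would also hold out a validation set rather than scoring perplexity on the training documents, as done above for brevity.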
