Topic modelling on only 24 documents gives the same "topic" for any K
Description:
I have 24 documents, each around 2.5K tokens long. They are public speeches.
My text preprocessing pipeline is a generic one: punctuation removal, expansion of English contractions, stopword removal, and tokenization.
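For reference, a minimal sketch of such a pipeline (the contraction map is abbreviated, and NLTK's English stopword list stands in for whatever list is actually used):

```python
import re
from nltk.corpus import stopwords

# Assumes NLTK's stopword corpus is available: nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))

# Abbreviated contraction map; a real pipeline would use a fuller list
CONTRACTIONS = {"can't": "cannot", "won't": "will not",
                "it's": "it is", "don't": "do not"}

def preprocess(text):
    """Lowercase, expand contractions, strip punctuation, tokenize, drop stopwords."""
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    # Replace anything that is not a word character or whitespace
    text = re.sub(r"[^\w\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]
```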
I have implemented and analyzed both Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) in Python with gensim. I am selecting the optimal number of topics by topic coherence.
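Roughly, the model-selection loop looks like this (a minimal sketch with gensim; `docs` stands for the list of preprocessed token lists, and the K values shown are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# docs: list of token lists, one per speech (24 in total)
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

scores = {}
for k in (2, 3, 5, 10):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=20, random_state=0)
    cm = CoherenceModel(model=lda, texts=docs,
                        dictionary=dictionary, coherence="c_v")
    scores[k] = cm.get_coherence()  # higher c_v = more coherent topics

print(scores)
```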
Issue:
For any number of topics K (I have tried many values, e.g. 10, 50, 100, 200), I always get the same combination of top words for every topic, so the topics are completely uninformative.
I have tried removing "useless" words by thresholding on their TF-IDF values, but it made no difference.
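The filtering step, roughly (a sketch using gensim's TfidfModel; the cutoff value is illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
tfidf = TfidfModel(bow_corpus, id2word=dictionary)

THRESHOLD = 0.05  # illustrative cutoff, tuned by hand

# Keep only terms whose TF-IDF weight in the document exceeds the cutoff
filtered_corpus = []
for bow in bow_corpus:
    weights = dict(tfidf[bow])  # term_id -> TF-IDF weight in this document
    filtered_corpus.append(
        [(tid, cnt) for tid, cnt in bow if weights.get(tid, 0.0) > THRESHOLD]
    )
```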
Diagnostic:
Trying to understand the cause, I ran an SVD on the TF-IDF matrix. The matrix is 24 × 8115, which yields at most 24 singular values. The scree plot of these values shows no clear knee point.
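A minimal sketch of this diagnostic (using scikit-learn's TfidfVectorizer, NumPy, and matplotlib; `raw_docs` is an illustrative name for the 24 speeches as plain strings):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

# raw_docs: the 24 speeches as plain strings
X = TfidfVectorizer().fit_transform(raw_docs)   # shape (24, vocab_size)

# With only 24 rows, the SVD yields at most 24 singular values
singular_values = np.linalg.svd(X.toarray(), compute_uv=False)

plt.plot(singular_values, marker="o")
plt.xlabel("component index")
plt.ylabel("singular value")
plt.title("Scree plot of TF-IDF singular values")
plt.show()
```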
Is it simply that 24 documents are too few for this to work?
Or am I overlooking something fundamental about topic modelling on such a small dataset?
Tags: lsi, gensim, lda, topic-model
Category: Data Science