Topic modelling on only 24 documents gives the same "topic" for any K
Description:
I have 24 documents, each around 2.5K tokens long. They are public speeches.
My text preprocessing pipeline is a generic one: punctuation removal, expansion of English contractions, stopword removal, and tokenization.
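For reference, a minimal sketch of such a pipeline (the contraction map is abbreviated, and NLTK's English stopword list stands in for whatever list is actually used):

```python
import re
from nltk.corpus import stopwords

# Assumes NLTK's stopword corpus is available: nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))

# Abbreviated contraction map; a real pipeline would use a fuller list
CONTRACTIONS = {"can't": "cannot", "won't": "will not",
                "it's": "it is", "don't": "do not"}

def preprocess(text):
    """Lowercase, expand contractions, strip punctuation, tokenize, drop stopwords."""
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    # Replace anything that is not a word character or whitespace
    text = re.sub(r"[^\w\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]
```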
I have implemented and analyzed both Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) in Python with gensim. I am selecting the optimal number of topics by topic coherence.
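Roughly, the model-selection loop looks like this (a minimal sketch with gensim; `docs` stands for the list of preprocessed token lists, and the K values shown are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# docs: list of token lists, one per speech (24 in total)
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

scores = {}
for k in (2, 3, 5, 10):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=20, random_state=0)
    cm = CoherenceModel(model=lda, texts=docs,
                        dictionary=dictionary, coherence="c_v")
    scores[k] = cm.get_coherence()  # higher c_v = more coherent topics

print(scores)
```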
Issue:
For any number of topics K (I have tried many values, e.g. 10, 50, 100, 200), I always get the same combination of top words for every topic, so the topics are completely uninformative.
I have tried removing "useless" words by thresholding on their TF-IDF values, but it made no difference.
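The filtering step, roughly (a sketch using gensim's TfidfModel; the cutoff value is illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
tfidf = TfidfModel(bow_corpus, id2word=dictionary)

THRESHOLD = 0.05  # illustrative cutoff, tuned by hand

# Keep only terms whose TF-IDF weight in the document exceeds the cutoff
filtered_corpus = []
for bow in bow_corpus:
    weights = dict(tfidf[bow])  # term_id -> TF-IDF weight in this document
    filtered_corpus.append(
        [(tid, cnt) for tid, cnt in bow if weights.get(tid, 0.0) > THRESHOLD]
    )
```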
Diagnostic:
Trying to understand the cause, I ran an SVD on the TF-IDF matrix. The matrix is 24 × 8115, which yields at most 24 singular values. The scree plot of these values shows no clear knee point.
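A minimal sketch of this diagnostic (using scikit-learn's TfidfVectorizer, NumPy, and matplotlib; `raw_docs` is an illustrative name for the 24 speeches as plain strings):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

# raw_docs: the 24 speeches as plain strings
X = TfidfVectorizer().fit_transform(raw_docs)   # shape (24, vocab_size)

# With only 24 rows, the SVD yields at most 24 singular values
singular_values = np.linalg.svd(X.toarray(), compute_uv=False)

plt.plot(singular_values, marker="o")
plt.xlabel("component index")
plt.ylabel("singular value")
plt.title("Scree plot of TF-IDF singular values")
plt.show()
```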
Is it simply that 24 documents are too few for this to work?
Or am I overlooking something fundamental about topic modelling on such a small dataset?
Tags: lsi, gensim, lda, topic-model
Category: Data Science