Measuring coherence score for Top2Vec models

Question


Teefs

February 26, 2022, 17:03

I am building a number of Top2Vec models on Reddit threads. I am basically changing the HDBSCAN cluster sizes to get different clusterings of the Doc2Vec embeddings, each representing a different number of topics.

I am trying to compare different models using their coherence score. I have tried using Gensim's coherence score but failed. I got an error message indicating that a word in the topics is not included in the dictionary.

I also tried using tmtoolkit. While I could get the document-term matrix (DTM) easily, I have not been able to get the topic-word distribution from Top2Vec.

Questions:

Can I resolve either of the issues above (getting the dictionary to include all of the necessary terms, or producing the topic-word distribution)?
Are there other metrics that can be used to compare Top2Vec models?

Topic coherence topic-model nlp

Category Data Science

Diana Guzman · Accepted Answer · January 25, 2022, 21:19

I ran into the same issue when I lowered Top2Vec's min_count from 50 to 5. In my case the words missing from the dictionary (usually just one per topic) almost always showed up near the end of each topic's word list. It was enough to give Gensim only the top 20 words of each topic: it then computed the metric, because the other 30 words are not considered in the coherence calculation by default anyway. If that does not work in your case, I think you would need to define the functions from scratch.
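As a rough sketch of that workaround (not a tested recipe for any particular corpus): the helper below keeps, for each topic, only the first few words that actually occur in the dictionary, and then hands the shortened lists to Gensim's CoherenceModel. The function names `trim_topics` and `gensim_coherence` are made up for this example, and `tokenized_docs` is assumed to be your documents as lists of tokens, ideally produced with the same tokenizer the Top2Vec model used.

```python
def trim_topics(topic_words, vocabulary, topn=20):
    """Keep, per topic, the first `topn` words that occur in `vocabulary`.

    `topic_words` is a sequence of ranked word lists (best word first);
    `vocabulary` is any container supporting `in`, e.g. a set or a
    Gensim `Dictionary.token2id` mapping.
    """
    trimmed = []
    for words in topic_words:
        kept = [str(w) for w in words if str(w) in vocabulary][:topn]
        trimmed.append(kept)
    return trimmed


def gensim_coherence(topic_words, tokenized_docs, topn=20, measure="c_v"):
    """Hypothetical wrapper: coherence of trimmed topics via Gensim."""
    # Imported lazily so trim_topics can be used without Gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models.coherencemodel import CoherenceModel

    dictionary = Dictionary(tokenized_docs)
    topics = trim_topics(topic_words, dictionary.token2id, topn=topn)
    # Assumption: every topic still has at least a few words after trimming.
    shortest = min(len(t) for t in topics)
    model = CoherenceModel(
        topics=topics,
        texts=tokenized_docs,
        dictionary=dictionary,
        coherence=measure,
        topn=min(topn, shortest),
    )
    return model.get_coherence()
```

With Top2Vec, the ranked lists would typically come from `topic_words, word_scores, topic_nums = model.get_topics()`, and `topic_words` can be passed in directly. Filtering before truncating means a single out-of-dictionary word no longer costs you a slot in the top 20.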

Other metrics used to evaluate topic models are perplexity and diversity, but coherence metrics are the ones closest to human judgement, which is itself a very expensive way to evaluate topic models.
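For diversity, one common definition is the fraction of unique words among the top words of all topics. A minimal sketch under that assumption follows; the name `topic_diversity` and the default of 25 words per topic are choices made for this example, not part of Top2Vec or Gensim.

```python
def topic_diversity(topic_words, topn=25):
    """Fraction of unique words among the top `topn` words of all topics.

    A value close to 1.0 means topics share few words; a value close
    to 0.0 means the topics largely repeat the same words.
    """
    all_words = [str(w) for words in topic_words for w in list(words)[:topn]]
    if not all_words:
        return 0.0
    return len(set(all_words)) / len(all_words)
```

Because it only needs the ranked word lists, it works on the `topic_words` array returned by Top2Vec's `get_topics()` without any dictionary or topic-word distribution.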

Leland McInnes · Accepted Answer · December 12, 2021, 18:57

It is possible to compute coherence scores, but I am afraid you will need to implement them yourself from the definitions of coherence. Top2Vec doesn't have topic-word distributions. Instead, you will be looking at a ranking of topic words by their distance from the topic vector in the joint topic/word/document embedding space. Such a ranking is sufficient for many types of coherence score.
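To illustrate how a ranking alone can be enough, here is a minimal sketch of one such score, NPMI coherence with document-level co-occurrence counts. It is one possible definition among several, not the method Top2Vec or Gensim uses internally; the function name `npmi_coherence`, the default `topn=10`, and the handling of the edge cases are assumptions made for this example.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, tokenized_docs, topn=10):
    """Return (mean NPMI, per-topic NPMI) computed from ranked topic words.

    Two words co-occur if they appear in the same document.
    """
    n_docs = len(tokenized_docs)
    if n_docs == 0:
        raise ValueError("need at least one document")

    # Inverted index: word -> set of ids of the documents containing it.
    doc_ids = {}
    for i, doc in enumerate(tokenized_docs):
        for word in set(doc):
            doc_ids.setdefault(word, set()).add(i)

    def npmi(w1, w2):
        d1 = doc_ids.get(w1, set())
        d2 = doc_ids.get(w2, set())
        p12 = len(d1 & d2) / n_docs
        if p12 == 0.0:
            return -1.0  # never co-occur, or a word is not in the corpus
        if p12 == 1.0:
            return 0.0  # degenerate case: both words are in every document
        p1, p2 = len(d1) / n_docs, len(d2) / n_docs
        return math.log(p12 / (p1 * p2)) / -math.log(p12)

    per_topic = []
    for words in topic_words:
        top = [str(w) for w in list(words)[:topn]]
        if len(top) < 2:
            raise ValueError("each topic needs at least two words")
        pairs = list(combinations(top, 2))
        per_topic.append(sum(npmi(a, b) for a, b in pairs) / len(pairs))
    return sum(per_topic) / len(per_topic), per_topic
```

Scores range from -1 (top words never share a document) to 1 (they always appear together), so models with different numbers of topics can be compared on the mean. Words that are absent from `tokenized_docs` simply score -1 against everything instead of raising an error, which sidesteps the dictionary problem described in the question.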
