Calculate an ambiguity score based on topic models and Hellinger distance
I am trying to calculate some sort of ambiguity score from text based on topic probabilities from a Latent Dirichlet Allocation model and the Hellinger distance between the topic distributions.
Let’s say I built my LDA model with 3 topics, related to basketball, football, and banking, respectively. I would like a score that says a document with topic probabilities Basketball: $0.33$, Football: $0.33$, and Banking: $0.33$ is more ambiguous than a document with topic probabilities Basketball: $0.98$, Football: $0.01$, and Banking: $0.01$, because the second document consists predominantly of a single topic.
My first try (without considering Hellinger distance) was simply adding the three squared probabilities, for example:
- $(0.99)^2 + (0.01)^2 + (0.01)^2 = 0.9803$
- $(0.90)^2 + (0.05)^2 + (0.05)^2 = 0.815$
- $(0.50)^2 + (0.25)^2 + (0.25)^2 = 0.375$
- $(0.333)^2 + (0.333)^2 + (0.333)^2 = 0.333$
In my head this result makes sense: a document with $0.333$ for every topic would be the most ambiguous document possible, and a document with $1.0$ for one topic and $0.0$ for the other two would be the least ambiguous. Therefore, a lower score means more ambiguity and a higher score means less ambiguity; in this case the lower and upper bounds are $0.333$ and $1.0$, respectively.
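For reference, the first-try score can be sketched in a few lines of Python. The function name `concentration_score` is only illustrative, and the inputs are the example probabilities rather than real model output. For $K$ topics the uniform distribution gives $K \cdot (1/K)^2 = 1/K$, so the lower bound would be $0.1$ with ten topics.

```python
import numpy as np

def concentration_score(topic_probs):
    """Sum of squared topic probabilities; a lower value means more ambiguous."""
    p = np.asarray(topic_probs, dtype=float)
    return float(np.sum(p ** 2))

# Example documents: one dominant topic, a mixed one, and a uniform one.
for probs in ([0.90, 0.05, 0.05], [0.50, 0.25, 0.25], [1 / 3, 1 / 3, 1 / 3]):
    print(round(concentration_score(probs), 3))
```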
However, this number does not consider how similar the topics are to each other. I would also like the score to reflect that a document with topic probabilities of, for example, Basketball: $0.45$, Football: $0.45$, and Banking: $0.10$ is less ambiguous than a document with Basketball: $0.45$, Football: $0.10$, and Banking: $0.45$, since Basketball and Football are more closely related than Basketball and Banking. Without considering topic similarity, the ambiguity scores of these two examples would be the same.
To compare the topics, I have tried using the Hellinger distance (lower meaning more similar) and obtained the following pairwise distances:
- Basketball-Football: $0.3$
- Basketball-Banking: $0.7$
- Football-Banking: $0.35$
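For context, the Hellinger distance between two discrete distributions $P$ and $Q$ is $H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_i (\sqrt{p_i} - \sqrt{q_i})^2}$, which lies between $0$ and $1$. Below is a minimal sketch of that formula. The three topic-word vectors are made-up placeholders over a four-word vocabulary, not the distributions behind the distances listed above.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Hypothetical topic-word distributions (each sums to 1), for illustration only.
basketball = [0.50, 0.30, 0.10, 0.10]
football = [0.40, 0.40, 0.10, 0.10]
banking = [0.05, 0.05, 0.50, 0.40]

print(hellinger(basketball, football))
print(hellinger(basketball, banking))
print(hellinger(football, banking))
```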
How can I make sure that the score from my first try takes these Hellinger distances into consideration? Is my first way of calculating the score appropriate or is there a more suitable way?
In reality, I have ten topics and thus 45 pairwise Hellinger distances to consider, but for simplicity I am using 3 here.
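One candidate that folds the pairwise distances into the score, offered only as a sketch and not as a validated answer, is Rao's quadratic entropy, $Q = \sum_{i}\sum_{j} d_{ij}\, p_i\, p_j$, where $d_{ij}$ is the distance between topics $i$ and $j$ and $d_{ii} = 0$. Note that the direction is reversed compared with the sum of squares: a higher $Q$ means more ambiguity. If every off-diagonal distance is set to $1$, $Q$ reduces to $1 - \sum_i p_i^2$, so it can be seen as a distance-weighted version of the first try. The function name and matrix layout below are assumptions made for illustration; the distances are the three example values from above, and the same code works unchanged for a $10 \times 10$ matrix.

```python
import numpy as np

def distance_weighted_ambiguity(topic_probs, distances):
    """Rao's quadratic entropy p^T D p; a higher value means more ambiguous."""
    p = np.asarray(topic_probs, dtype=float)
    d = np.asarray(distances, dtype=float)
    return float(p @ d @ p)

# Symmetric distance matrix, order: Basketball, Football, Banking.
D = np.array([
    [0.00, 0.30, 0.70],
    [0.30, 0.00, 0.35],
    [0.70, 0.35, 0.00],
])

doc_a = [0.45, 0.45, 0.10]  # mass split between the two related topics
doc_b = [0.45, 0.10, 0.45]  # mass split between the two distant topics

print(round(distance_weighted_ambiguity(doc_a, D), 3))  # 0.216
print(round(distance_weighted_ambiguity(doc_b, D), 3))  # 0.342
```

With these numbers the Basketball/Football document gets the lower (less ambiguous) value, whereas the plain sum of squares gives both documents the same score.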
Topic lda topic-model nlp
Category Data Science