Conditional Entropy and Mutual Information - Clustering evaluation
I am doing clustering and I have the true labels for my data. For evaluation, I use the weighted average of the entropy values of the true labels within each predicted cluster. While going over alternatives, I also came across Mutual Information as a similar approach. On my data, the two seem to give similar results.
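For reference, this is roughly how I compute the two numbers (a minimal sketch on toy labels, assuming scikit-learn and NumPy; my real pipeline only differs in the data):

```python
import numpy as np
from sklearn.metrics import mutual_info_score


def weighted_average_entropy(true_labels, pred_clusters):
    """Weighted average of the true-label entropy within each predicted cluster,
    i.e. the empirical conditional entropy H(V|U), in nats."""
    true_labels = np.asarray(true_labels)
    pred_clusters = np.asarray(pred_clusters)
    n = len(true_labels)
    total = 0.0
    for cluster in np.unique(pred_clusters):
        # true-label distribution inside this predicted cluster
        members = true_labels[pred_clusters == cluster]
        counts = np.bincount(members)
        probs = counts[counts > 0] / len(members)
        cluster_entropy = -np.sum(probs * np.log(probs))
        # weight by cluster size
        total += (len(members) / n) * cluster_entropy
    return total


# toy example labels (not my real data)
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])
pred_clusters = np.array([0, 0, 1, 1, 1, 1, 2, 2])

h_v_given_u = weighted_average_entropy(true_labels, pred_clusters)
mi = mutual_info_score(true_labels, pred_clusters)  # I(U, V), also in nats
print(h_v_given_u, mi)
```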
However, there is one issue that puzzles me.
Given the predicted clusters $U$ and the true clusters $V$, mutual information is defined as $$ I(U,V) = H(U) - H(U|V) $$ or, equivalently, $$ I(U,V) = H(V) - H(V|U). $$ If my math is correct, the weighted average entropy that I'm using corresponds to the conditional entropy term $H(V|U)$, and minimizing it aligns with maximizing the mutual information.
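To spell out that step (in case my reasoning is off somewhere), writing $n_u$ for the size of predicted cluster $u$ and $n$ for the total number of points: $$ H(V \mid U) = \sum_{u} p(u)\, H(V \mid U = u) = -\sum_{u} \frac{n_u}{n} \sum_{v} p(v \mid u) \log p(v \mid u), $$ which, as far as I can tell, is exactly the weighted average of the per-cluster entropies, with weights proportional to cluster size.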
What I cannot see is how the weighted average entropy would differ from mutual information, and why we would need the entropy terms $H(U)$ or $H(V)$ at all. It feels like minimizing one of the conditional entropies should suffice.
To put it another way, as far as I understand from the equations, a high entropy of the true or predicted cluster assignments in itself also results in higher mutual information. Does this mean that mutual information favors equally-sized clusters?
Thanks in advance.
Topic mutual-information information-theory evaluation clustering
Category Data Science