Entropy loss from collapsing/merging two categories

Suppose I am counting occurrences in a sequence. As a classic example, say I'm counting how many cars of each model come down a highway.

After keeping a tally for a while, I see there are thousands of models. Only a handful show up frequently, while many show up once or only a few times (in other words, the histogram resembles exponential decay). Statistically, it hardly seems to matter that I saw one particular obscure car rather than another; neither observation seems to convey much information. So what if I collapse all the rare models into a single category like "others" to make the data easier to store? How much information will I lose?
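
For concreteness, here is a minimal sketch of the collapse itself (the model names, counts, and cutoff are all made up for illustration):

```python
from collections import Counter

# Hypothetical tallies; names and numbers are invented for illustration
counts = Counter({"Corolla": 412, "F-150": 388, "Civic": 301,
                  "DeLorean": 1, "Isetta": 2, "Trabant": 1})

CUTOFF = 5  # models seen fewer than this many times get collapsed

collapsed = Counter()
for model, n in counts.items():
    collapsed["others" if n < CUTOFF else model] += n

print(collapsed)
# Counter({'Corolla': 412, 'F-150': 388, 'Civic': 301, 'others': 4})
```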

I've gotten as far as reducing the problem to a smaller one and finding an upper bound.

  • Collapsing three categories A, B, and C into one category D is the same as first collapsing A and B into a category E, then collapsing E and C into F; F is identical to D. So the information loss is not path-dependent, and it suffices to solve the case of collapsing two categories. The result should generalize easily to n categories.
  • For two categories, we can re-encode the sequence so that each occurrence of A or B is instead recorded as C, together with one extra bit recording whether that C came from A or B. This re-encoding is lossless, and erasing the extra bits is exactly the collapse of A and B into C. Therefore the average information loss from collapsing A and B is at most (1 bit) * ((number of occurrences of A) + (number of occurrences of B)); a numeric sanity check follows this list.
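
To sanity-check both points numerically, here is a small sketch (the distribution is arbitrary, and `merge` is just a helper I introduce for collapsing two categories):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability categories contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def merge(dist, a, b, new):
    """Collapse categories a and b of dist into a single category new."""
    out = {k: v for k, v in dist.items() if k not in (a, b)}
    out[new] = dist[a] + dist[b]
    return out

p = {"A": 0.05, "B": 0.03, "C": 0.02, "D": 0.90}

# Path-independence: A+B -> E, then E+C -> F, matches collapsing
# A, B, C into F in a single step.
two_step = merge(merge(p, "A", "B", "E"), "E", "C", "F")
one_step = {"F": p["A"] + p["B"] + p["C"], "D": p["D"]}
assert math.isclose(entropy(two_step.values()), entropy(one_step.values()))

# Upper bound for one pairwise merge: at most 1 bit per occurrence of A or B.
loss = entropy(p.values()) - entropy(merge(p, "A", "B", "E").values())
bound = p["A"] + p["B"]  # expected extra bits per symbol in the re-encoding
print(f"loss = {loss:.4f} bits <= bound = {bound:.4f} bits")
# loss = 0.0764 bits <= bound = 0.0800 bits
```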

Is my logic above correct? Is my upper bound accurate? What is the lower bound/exact solution?

Tags: information-theory, categorical-data

I ended up coming up with my own solution. I'm not sure if it's correct, so I won't mark it as the accepted answer.

Let's call the original distribution $P$. Suppose we collapse A and B into a new category X. The entropy of $P$ is $S_P = -\sum_i P_i \ln P_i$, where $i$ ranges over the categories A, B, C, ... Notably, $P_A > 0$, $P_B > 0$, and $P_X = 0$.

After collapsing, we get a new distribution $Q$, with $Q_A = Q_B = 0$ and $Q_X = P_A + P_B$; all other categories keep their probabilities. This has entropy $S_Q$.

The entropy we lost by collapsing is $S_P - S_Q = Q_X \ln Q_X - P_A \ln P_A - P_B \ln P_B$. All other terms, such as $P_C \ln P_C$ for an unaffected category $C$, appear in both sums and cancel out.
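
As a sanity check, this formula can also be rewritten as $(P_A+P_B)\,H_b\!\left(\frac{P_A}{P_A+P_B}\right)$, where $H_b$ is the binary entropy function. Since $H_b$ never exceeds 1 bit, this recovers the upper bound from the question, with equality exactly when $P_A = P_B$. A small verification sketch (the probabilities are arbitrary):

```python
import math

def binary_entropy(x):
    """Binary entropy H_b(x) in nats."""
    return -(x * math.log(x) + (1 - x) * math.log(1 - x))

p_a, p_b = 0.05, 0.03  # arbitrary probabilities of the merged categories
q_x = p_a + p_b

# Exact loss from the formula above (natural log, so units are nats)
loss = q_x * math.log(q_x) - p_a * math.log(p_a) - p_b * math.log(p_b)

# Equivalent form: weight of the merged bucket times the binary entropy
# of the within-bucket split between A and B
assert math.isclose(loss, q_x * binary_entropy(p_a / q_x))

# In bits, the loss stays below the 1-bit-per-merged-occurrence bound
print(f"{loss / math.log(2):.4f} bits <= {q_x:.4f} bits")
```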
