Entropy loss from collapsing/merging two categories
Suppose I am counting occurrences in a sequence. For a classical example, let's say I'm counting how many cars of each model come down a highway.
After keeping a tally for a while, I see there are thousands of models, but only a handful show up frequently, while many show up once or only a few times (in other words, the histogram resembles exponential decay). When thinking about the statistics of this situation, it hardly seems to matter that I saw one particular obscure car that one time rather than some other obscure car; either way it doesn't convey much information. So what if I collapse all the rare models into a single category like "others" to make the data easier to store? How much information will I lose?
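To make the setup concrete, here is a toy simulation of what I mean. The decay rate, sample size, and rarity threshold are arbitrary choices just for illustration, and I'm measuring information as the Shannon entropy (bits per symbol) of the empirical distribution:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Toy version of the tally: model frequencies decay roughly exponentially,
# so a handful of models dominate and most are rare.
n_models = 1000
weights = np.exp(-0.05 * np.arange(n_models))
probs = weights / weights.sum()
sequence = rng.choice(n_models, size=100_000, p=probs)
counts = Counter(sequence)

def entropy_bits(counts):
    """Shannon entropy (bits/symbol) of the empirical distribution."""
    n = sum(counts.values())
    p = np.array([c / n for c in counts.values()])
    return float(-(p * np.log2(p)).sum())

# Collapse every model seen fewer than, say, 10 times into "others".
threshold = 10
collapsed = Counter()
for model, c in counts.items():
    collapsed["others" if c < threshold else model] += c

h_full = entropy_bits(counts)
h_collapsed = entropy_bits(collapsed)
print(f"entropy before: {h_full:.4f} bits/symbol")
print(f"entropy after:  {h_collapsed:.4f} bits/symbol")
print(f"loss per symbol: {h_full - h_collapsed:.4f} bits")
```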
I've gotten as far as reducing the problem to a smaller one and finding an upper bound.
- Collapsing 3 categories A, B and C into one category D is the same as first collapsing A and B into category E, then collapsing E and C into F. F will be exactly the same as D. So the total information loss is path-independent, and it is sufficient to solve the case of collapsing 2 categories; the result should easily generalize to n categories.
- For 2 categories, we can re-encode the sequence so that each occurrence of category A or B is instead recorded as C. However, for each occurrence of C, an additional bit is recorded that shows whether that C came from A or B. This re-encoding incurs zero information loss, and erasing these bits is exactly the same as collapsing A and B into C. Since each erased bit carries at most 1 bit of information, the total information loss from collapsing categories A and B is at most (1 bit) * ((number of occurrences of A) + (number of occurrences of B)). A small numerical sanity check of this bound is sketched after this list.
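Here is the sanity check I had in mind. I treat the total information as N times the per-symbol Shannon entropy of the empirical distribution, and compare the entropy drop from merging A and B against the claimed 1-bit-per-occurrence bound (the counts are arbitrary, just for illustration):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector (ignoring zeros)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Arbitrary example counts; A and B are the first two categories.
counts = np.array([7, 3, 50, 25, 15], dtype=float)
n_total = counts.sum()
p = counts / n_total

# Merge A and B into a single category C.
merged = np.concatenate(([counts[0] + counts[1]], counts[2:])) / n_total

loss_per_symbol = entropy_bits(p) - entropy_bits(merged)
total_loss = loss_per_symbol * n_total   # bits over the whole sequence
bound = counts[0] + counts[1]            # claimed bound: 1 bit per A-or-B occurrence

print(f"total information loss: {total_loss:.3f} bits")
print(f"claimed upper bound:    {bound:.0f} bits")
assert total_loss <= bound + 1e-9
```

In this example the measured total loss comes out below the bound, consistent with the argument above, but I'd like to know whether that bound is ever attained and what the exact value is in general.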
Is my logic above correct? Is my upper bound accurate? What is the lower bound/exact solution?
Topic: information-theory, categorical-data
Category: Data Science