Categories with the same mean in target encoding

While doing target encoding it can happen that two categories have the same target mean.

This is bad because there will be no difference in the new feature in it and we will lose some information.

Also, this is potentially harmful to the model, choosing this split in the feature can produce some incongruences.

Is there any way to fix this problem?

Topic target-encoding categorical-encoding categorical-data machine-learning

Category Data Science


I assume tree-based models in this answer

Bayensian Mean Encoding can help

The main problem is that the model will be unable to split between the merged categories. So you implicitly accept that there is no interaction between the two merged categories and other variables. (succession of splits will be the same for cat1 and cat2)

If you want to enable the tree to split anywhere (because you suspect interactions), you have to create a space between the values of cat1 and cat2 for the tree to split. Bayesian Mean Encoding can help you. It takes the frequency of the categories into accounts in the calculation of the target means.

The formula is the following :

\begin{equation} \mu = \frac{n * \bar{x} + m*w}{n+m} \end{equation} where :

  • $\mu$ is the category mean
  • $n$ is the frequency of this category
  • $\bar{x}$ is the target mean for this category
  • $m$ is the weight you want to give to the overall mean
  • $w$ is the overall mean

nb: with small $m$ you will have a result very similar to simple target encoding while moving slightly the means (enough to enable the split). The main goal of Bayesian Mean is to limit overfitting, which is also a problem when doing target encoding.

There is no point of doing that

The idea behind target encoding is to make the assumption that the category has no interaction with other variables. So you convert the categorical space (high dimensional) into a simple continuous space, where only the value counts. If you think that the algorithm might confuse two categories and that this is bad, it means target encoding is not suited. If different categories should follow a different decision path, then it means there are interactions between the categorical variable and the others.

If you want every category to have different decisional paths why don't you keep categories? Because implicitly, the tree will compute mean of the splits (explained in this video: https://www.youtube.com/watch?v=g9c66TUylZ4)

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.