In which cases shouldn't we drop the first level of categorical variables?
As a beginner in machine learning, I'm looking into the concept of one-hot encoding.
Unlike in statistics, where you always want to drop the first level to obtain k-1 dummies (as discussed here on SE), it seems that some models need to keep it and use all k dummies.
I know that keeping all k dummies can lead to collinearity problems (the k columns always sum to one, which duplicates the intercept), but I'm not aware of any problem caused by having only k-1 dummies.
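To make that concrete, here is a minimal sketch of the collinearity issue (the "color" column and its values are made up for illustration):

    import numpy as np
    import pandas as pd

    colors = pd.Series(["red", "green", "blue", "green"])

    # All k = 3 dummies: the columns always sum to 1, so adding an
    # intercept column makes the design matrix rank-deficient
    X_full = pd.get_dummies(colors, dtype=float)
    X_with_intercept = np.column_stack([np.ones(len(X_full)), X_full])
    print(np.linalg.matrix_rank(X_with_intercept))  # 3, not 4 -> perfect collinearity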
But since pandas.get_dummies() has its drop_first argument set to False by default, keeping all k dummies definitely has to be useful sometimes.
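For reference, here is what the two settings produce (a toy example, the column name and values are arbitrary):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue"]})

    # Default (drop_first=False): k dummies, one column per level
    print(pd.get_dummies(df, columns=["color"]))

    # drop_first=True: k-1 dummies, the first level becomes the reference
    print(pd.get_dummies(df, columns=["color"], drop_first=True))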
In which cases (algorithms, parameters...) would I want to keep the first level and fit with all k dummies for each categorical variable?
EDIT: @EliasStrehle's comment on the above-mentioned link states that this is only true if the model has an intercept. Is this rule generalizable? What about algorithms like KNN or trees, which are not models in the strict statistical sense?
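A small sketch of that comment's point (same toy data as above): without an intercept column, the k dummy columns are linearly independent, so the collinearity argument no longer applies on its own:

    import numpy as np
    import pandas as pd

    X_full = pd.get_dummies(pd.Series(["red", "green", "blue", "green"]), dtype=float)
    # Rank equals k: no intercept column, hence no perfect collinearity
    print(np.linalg.matrix_rank(X_full.to_numpy()))  # 3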
Topic dummy-variables encoding algorithms machine-learning
Category Data Science