In which cases shouldn't we drop the first level of categorical variables?
As a beginner in machine learning, I'm looking into the concept of one-hot encoding.
Unlike in statistics, where you always want to drop the first level to obtain k-1 dummies (as discussed here on SE), it seems that some models need to keep it and use all k dummies.
I know that keeping all k dummies can lead to collinearity problems (the k columns always sum to one, which duplicates the intercept), but I'm not aware of any problem caused by having only k-1 dummies.
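To make that concrete, here is a minimal sketch of the collinearity issue (the "color" column and its values are made up for illustration):

    import numpy as np
    import pandas as pd

    colors = pd.Series(["red", "green", "blue", "green"])

    # All k = 3 dummies: the columns always sum to 1, so adding an
    # intercept column makes the design matrix rank-deficient
    X_full = pd.get_dummies(colors, dtype=float)
    X_with_intercept = np.column_stack([np.ones(len(X_full)), X_full])
    print(np.linalg.matrix_rank(X_with_intercept))  # 3, not 4 -> perfect collinearity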
But since pandas.get_dummies() has its drop_first argument set to False by default, keeping all k dummies definitely has to be useful sometimes.
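For reference, here is what the two settings produce (a toy example, the column name and values are arbitrary):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue"]})

    # Default (drop_first=False): k dummies, one column per level
    print(pd.get_dummies(df, columns=["color"]))

    # drop_first=True: k-1 dummies, the first level becomes the reference
    print(pd.get_dummies(df, columns=["color"], drop_first=True))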
In which cases (algorithms, parameters...) would I want to keep the first level and fit with all k dummies for each categorical variable?
EDIT: @EliasStrehle's comment on the above-mentioned link states that this is only true if the model has an intercept. Is this rule generalizable? What about algorithms like KNN or trees, which are not models in the strict statistical sense?
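A small sketch of that comment's point (same toy data as above): without an intercept column, the k dummy columns are linearly independent, so the collinearity argument no longer applies on its own:

    import numpy as np
    import pandas as pd

    X_full = pd.get_dummies(pd.Series(["red", "green", "blue", "green"]), dtype=float)
    # Rank equals k: no intercept column, hence no perfect collinearity
    print(np.linalg.matrix_rank(X_full.to_numpy()))  # 3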
Topic dummy-variables encoding algorithms machine-learning
Category Data Science