What exactly is a dummy trap? Is dropping one dummy feature really a good practice?

Question

UchuuStranger

November 24, 2020 23:35

So I'm going through a Machine Learning course, and this course explains that to avoid the dummy trap, a common practice is to drop one column. It also explains that since the info on the dropped column can be inferred from the other columns, we don't really lose anything by doing that.

This course does not explain what exactly the dummy trap is, however. Nor does it give any examples of how the trap manifests itself. At first I assumed that the dummy trap simply makes the model less accurate due to multicollinearity. But then I read this article. It does not mention the dummy trap explicitly, but it does discuss how an attempt to use OHE with OLS results in an error (since the model attempts to invert a singular matrix). Then it shows how the practice of dropping one dummy feature fixes this. But then it goes on to demonstrate that this measure is unnecessary in practical cases, as apparently regularization fixes this issue just as well, and algorithms that are iterative (as opposed to a closed-form solution) don't have this issue in the first place.

So I'm confused right now as to what exactly stands behind the term "dummy trap". Does it refer specifically to this matrix inversion error? Or is it just an effect that allows the model to get trained but makes its performance worse, and the issue described in that article is totally unrelated? I tried training an sklearn LinearRegression model on an OHE-encoded dataset (I used pd.get_dummies() with the drop_first=False parameter) to try to reproduce the dummy trap, and the latter seems to be the case: the model got trained successfully, but its performance was noticeably worse compared to the identical model trained on the set with drop_first=True. But I'm still confused about why my model got trained at all, since if the article is to be believed, the inversion error should have prevented that.

Topic dummy-variables one-hot-encoding

Category Data Science

10xAI · Accepted Answer · November 24, 2020 08:38

There are two main problems:

One feature is a linear combination of all the others (multicollinearity).
If you try to solve using the closed-form solution, the following happens:

$y = w_0 + w_1X_1 + w_2X_2 + w_3X_3$
$w_0$ is the $y$-intercept; to complete the matrix form, set $X_0 = 1$. Hence,
$y = w_0X_0 + w_1X_1 + w_2X_2 + w_3X_3$

The solution for $w$ is $w = (X^{T}X)^{-1}X^{T}y$.

So $X^{T}X$ must be an invertible matrix, which requires the columns of $X$ to be linearly independent. But,
if the model contains dummy variables for all values, then the encoded columns add up (row-wise) to the intercept column ($X_0$ here; see the table below), and this linear dependence prevents the inverse from being computed (as $X^{T}X$ is singular).

\begin{array} {|r|r|r|r|} \hline X_0 & X_1 & X_2 & X_3 \\ \hline 1 & 1 & 0 & 0 \\ \hline 1 & 0 & 1 & 0 \\ \hline 1 & 0 & 0 & 1 \\ \hline \end{array}
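The singularity can be checked numerically. The following is a minimal sketch, assuming NumPy is available; it repeats the three rows of the table twice (an arbitrary choice, so that there are more rows than columns) and compares the rank of $X^{T}X$ with and without one dummy column. The variable names are illustrative only.

```python
import numpy as np

# The three rows of the table: intercept column X_0 plus one dummy per category.
table = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
], dtype=float)

# Assumption: repeat the rows so there are more observations (6) than columns (4).
X_full = np.tile(table, (2, 1))

# The dummy columns sum, row by row, to the intercept column.
dummies_sum_to_intercept = np.array_equal(X_full[:, 1:].sum(axis=1), X_full[:, 0])

# X^T X is 4x4, but its rank is lower than 4, so it has no inverse.
gram_full = X_full.T @ X_full
rank_full = np.linalg.matrix_rank(gram_full)

# Dropping one dummy (X_1 here, an arbitrary choice) removes the dependency.
X_drop = np.delete(X_full, 1, axis=1)
gram_drop = X_drop.T @ X_drop
rank_drop = np.linalg.matrix_rank(gram_drop)

print("all dummies: ", gram_full.shape, "rank", rank_full)
print("one dropped: ", gram_drop.shape, "rank", rank_drop)
```

With all dummies present the rank is smaller than the number of columns; with one dummy dropped the two match, so the closed-form inverse exists.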

"why my model got successfully trained at all, since if the article is to be believed, the inversion error should have prevented it from being successfully trained."

Valid question! Aurelien Geron (author of "Hands-On Machine Learning") has answered it here:
- The LinearRegression(Scikit-Learn) class actually performs SVD decomposition, it does not directly try to compute the inverse of X.T.dot(X). The singular values of X are available in the singular_ instance variable, and the rank of X is available as rank_
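To see why an SVD-based solver does not fail on a rank-deficient design matrix, here is a small sketch. It does not call scikit-learn; as a stand-in it uses np.linalg.lstsq, which also solves least squares through an SVD-based routine and reports the rank and singular values. The toy data (six rows, three categories, made-up targets) is purely hypothetical.

```python
import numpy as np

# Hypothetical toy data: category index per row and a made-up target.
categories = np.array([0, 1, 2, 0, 1, 2])
y = np.array([1.0, 2.0, 3.0, 1.2, 2.2, 2.8])

# Intercept column plus all three dummies (nothing dropped).
dummies = np.eye(3)[categories]
X = np.column_stack([np.ones(len(y)), dummies])

# lstsq never forms the inverse of X.T @ X; it works from the SVD of X
# and returns the minimum-norm least-squares solution.
coef, _, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
predictions = X @ coef

print("rank:", rank, "of", X.shape[1], "columns")
print("smallest singular value:", singular_values[-1])
print("predictions:", predictions)
```

The fit goes through even though the rank is below the number of columns; the near-zero singular value is where the redundancy shows up.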

On Performance

For a large dataset in practice, the closed-form solution is not preferred; an iterative algorithm such as gradient descent is typically used instead.
Multicollinearity, too, does not affect predictive performance; it affects the interpretability of the features.
The coefficients change depending on which dummy is removed. This is expected, as each remaining dummy is now a feature with a different level of contribution (based on the data). The only thing that is certain is that their combined effect is the same with all dummies as with one fewer.
The inconsistent results you are getting are most likely due to some other issue.
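The points above can be checked on a toy example. This is a sketch under stated assumptions: the data is hypothetical, np.linalg.lstsq stands in for whatever solver you actually use, and the gradient-descent loop (step size 0.5, 2000 iterations) is a hand-rolled illustration rather than any library's implementation.

```python
import numpy as np

# Hypothetical toy data: category index per row and a made-up target.
categories = np.array([0, 1, 2, 0, 1, 2])
y = np.array([1.0, 2.0, 3.0, 1.2, 2.2, 2.8])
dummies = np.eye(3)[categories]
ones = np.ones((len(y), 1))


def fit_predict(X, y):
    """Least-squares fit; returns (coefficients, fitted values)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X @ coef


X_all = np.hstack([ones, dummies])                # all dummies kept
X_drop_first = np.hstack([ones, dummies[:, 1:]])  # first dummy dropped
X_drop_last = np.hstack([ones, dummies[:, :-1]])  # last dummy dropped

coef_all, pred_all = fit_predict(X_all, y)
coef_first, pred_first = fit_predict(X_drop_first, y)
coef_last, pred_last = fit_predict(X_drop_last, y)

# Plain gradient descent on the least-squares loss, with all dummies kept.
# No matrix is inverted, so the singular X^T X is never a problem.
w = np.zeros(X_all.shape[1])
for _ in range(2000):
    gradient = X_all.T @ (X_all @ w - y) / len(y)
    w -= 0.5 * gradient
pred_gd = X_all @ w

print("coefficients, first dummy dropped:", coef_first)
print("coefficients, last dummy dropped: ", coef_last)
print("predictions agree:",
      np.allclose(pred_all, pred_first),
      np.allclose(pred_first, pred_last),
      np.allclose(pred_all, pred_gd))
```

The coefficients differ depending on which dummy is dropped, while the fitted values coincide across all four fits.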

Dummy-variable Trap

I have never come across this term outside the Udemy course "A-Z ML". So I don't think the word "trap" carries any special meaning, provided you understand the individual points (i.e. singularity, multicollinearity, and interpretability) separately.

References -
www.feat.engineering - Sec#5.1
Sebastian Raschka
stats.stackexchange
