What exactly is a dummy trap? Is dropping one dummy feature really a good practice?
So I'm going through a Machine Learning course, and this course explains that to avoid the dummy trap, a common practice is to drop one column. It also explains that since the info on the dropped column can be inferred from the other columns, we don't really lose anything by doing that.
This course does not explain what the dummy trap exactly is, however. Neither it gives any examples on how the trap manifests itself. At first I assumed that dummy trap simply makes the model performance less accurate due to multicollinearity. But then I read this article. It does not mention dummy trap explicitly, but it does discuss how an attempt to use OHE with OLS results in an error (since the model attempts to invert a singular matrix). Then it shows how the practice of dropping one dummy feature fixes this. But then it goes on to demonstrate that this measure is unnecessary in practical cases, as apparently regularization fixes this issue just as well, and algorithms that are iterative (as opposed to closed-form solution) don't have this issue in the first place.
So I'm confused right now in regards to what exactly stands behind the term dummy trap. Does it refer specifically to this matrix inversion error? Or is it just an effect that allows the model to get trained but makes its performance worse, and the issue described in that article is totally unrelated? I tried training an sklearn LinearRegression model on a OHE-encoded dataset (I used pd.get_dummies()
with the drop_first=False
parameter) to try to reproduce the dummy trap, and the latter seems to be the case: the model got trained successfully, but its performance was noticeably worse compared to the identical model trained on the set with drop_first=True
. But I'm still confused about why my model got successfully trained at all, since if the article is to be believed, the inversion error should have prevented it from being successfully trained.
Topic dummy-variables one-hot-encoding
Category Data Science