Possible harm in standardizing one-hot encoded features

Question

thereandhere1

November 4, 2020, 14:44

While there may not be any added value in standardizing one-hot encoded features prior to applying linear models, is there any harm in doing so (i.e., does it affect model performance)?

Standardizing definition: applying (x - mean) / std so that the feature has mean 0 and standard deviation 1.

I prefer applying standardization to my entire training dataset after one-hot encoding, rather than applying it only to the numerical features. I feel it would significantly simplify my pipeline.

For example, if I have a binary feature then the vector that will be provided to the model is [1,1,0,0,0,1,1].

If standardization is applied to this binary feature prior to fitting the model (subtract the mean, ~0.57, and divide by the std, ~0.49), the vector will become:

[ 0.8660254 , 0.8660254 , -1.15470054, -1.15470054, -1.15470054, 0.8660254 , 0.8660254 ]
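The arithmetic above can be checked with a short NumPy sketch. It assumes the population standard deviation (`ddof=0`, NumPy's default), which is what reproduces the ~0.49 quoted in the question; the variable names are only illustrative.

```python
import numpy as np

# The binary feature from the question
x = np.array([1, 1, 0, 0, 0, 1, 1], dtype=float)

mean = x.mean()   # 4/7, about 0.57
std = x.std()     # population std (ddof=0), about 0.49

# Standardize: subtract the mean, divide by the std
z = (x - mean) / std

print(round(mean, 2), round(std, 2))
print(z)  # ones map to about 0.866, zeros to about -1.155
```

Using the sample standard deviation (`ddof=1`) instead would give slightly different values, so it is worth knowing which convention a scaler uses.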

Topic collinearity pipelines one-hot-encoding linear-regression

Category Data Science

Ben Reiniger · Accepted Answer · November 4, 2020, 14:44

With unpenalized linear models, there is no difference. The coefficients will just scale to counteract the new scale of the variables, and the intercept will shift to compensate for the centering.
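A minimal sketch of this claim, using ordinary least squares via `np.linalg.lstsq` on synthetic data. The data-generating process, the helper `ols_fit`, and all variable names are assumptions made up for illustration, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
d = rng.integers(0, 2, size=n).astype(float)  # binary dummy (hypothetical data)
x = rng.normal(size=n)                        # numeric feature
y = 1.5 * d - 2.0 * x + 0.5 + rng.normal(scale=0.1, size=n)

def ols_fit(features, target):
    """Least squares with an intercept column; returns (coefficients, predictions)."""
    A = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef, A @ coef

d_std = (d - d.mean()) / d.std()  # standardized dummy

coef_raw, pred_raw = ols_fit(np.column_stack([d, x]), y)
coef_std, pred_std = ols_fit(np.column_stack([d_std, x]), y)

# Fitted values agree; only the parameterization changes
print(np.allclose(pred_raw, pred_std))
# Dummy coefficient is rescaled by the std of the dummy
print(np.isclose(coef_std[1], coef_raw[1] * d.std()))
# Intercept absorbs the centering
print(np.isclose(coef_std[0], coef_raw[0] + coef_raw[1] * d.mean()))
```

The two fits span the same column space, so the predictions coincide and the coefficients differ only by the rescaling and shift described above.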

With penalized linear models though, there will be a difference. Since the standard deviation of a binary variable is at most $1/2$, you'll be increasing the overall scale of the variable by standardizing. That will cause the coefficient an unpenalized fit would produce to decrease in magnitude, which will change the balance of how the penalty applies to different features. I suspect there is no "better" approach then: sometimes standardizing the dummies improves performance under the penalty, and sometimes it degrades it.
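To illustrate, here is a rough sketch with a closed-form ridge fit in NumPy. A binary column with a fraction $p$ of ones has standard deviation $\sqrt{p(1-p)} \le 1/2$, so dividing by it stretches the column. The synthetic data, the helper `ridge_fit`, and the choice `alpha = 10.0` are all assumptions for illustration; a library implementation may differ in details such as how the intercept is handled.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
d = (rng.random(n) < 0.1).astype(float)  # rare dummy, so its std is well below 1/2
x = rng.normal(size=n)                   # numeric feature, already on unit scale
y = 1.5 * d - 2.0 * x + rng.normal(scale=0.5, size=n)

def ridge_fit(features, target, alpha):
    """Ridge regression with an unpenalized intercept, handled by centering."""
    Xc = features - features.mean(axis=0)
    yc = target - target.mean()
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)
    return beta, target.mean() + Xc @ beta

d_std = (d - d.mean()) / d.std()  # standardized dummy
alpha = 10.0                      # arbitrary penalty strength for the demo

beta_raw, pred_raw = ridge_fit(np.column_stack([d, x]), y, alpha)
beta_std, pred_std = ridge_fit(np.column_stack([d_std, x]), y, alpha)

print(d.std() <= 0.5)                   # binary std never exceeds 1/2
print(np.allclose(pred_raw, pred_std))  # unlike OLS, the two fits disagree
# Dummy effect expressed in the original 0/1 units for both fits
print(beta_raw[0], beta_std[0] / d.std())
```

Comparing the last two printed numbers shows how the same penalty shrinks the dummy's effect by a different amount depending on its scale. Which of the two generalizes better depends on the data, which is the point made above.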

Brian Spiering · Accepted Answer · October 3, 2020, 18:44

It makes no sense to standardize one-hot encoded features. One-hot encoding implies the level of measurement for a feature is nominal / categorical. Standardization implies the level of measurement for a feature is at least interval.

For example, consider a feature for country of origin. Since that feature is categorical, one-hot encoding makes sense: a person is either from a country or not. Taking the mean of country of origin yields numbers that do not make sense.
