Possible harm in standardizing one-hot encoded features

While there may not be any added value in standardizing one-hot encoded features prior to applying linear models, is there any harm in doing so (i.e., does it affect model performance)?

Standardizing definition: applying (x - mean) / std so that the feature has mean 0 and standard deviation 1.

I prefer applying standardization to my entire training dataset after one-hot encoding, rather than applying it only to the numerical features. I feel it would significantly simplify my pipeline.

For example, if I have a binary feature then the vector that will be provided to the model is [1,1,0,0,0,1,1].

If standardization is applied to this binary feature prior to fitting the model (subtract mean ≈ 0.57 and divide by std ≈ 0.49), the vector will become:

[ 0.8660254 , 0.8660254 , -1.15470054, -1.15470054, -1.15470054, 0.8660254 , 0.8660254 ]
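These numbers can be reproduced with a short sketch that uses only Python's standard library. It assumes the population standard deviation (dividing by n), since that is the convention that yields 0.49 here; the sample standard deviation (dividing by n - 1) would give a slightly different result.

```python
import statistics

x = [1, 1, 0, 0, 0, 1, 1]

mean = statistics.fmean(x)    # 4/7
std = statistics.pstdev(x)    # population std (divides by n)

# Standardize: subtract the mean, divide by the standard deviation
z = [(v - mean) / std for v in x]

print(round(mean, 2), round(std, 2))   # 0.57 0.49
print([round(v, 4) for v in z])        # [0.866, 0.866, -1.1547, -1.1547, -1.1547, 0.866, 0.866]
```

The standardized vector still takes only two distinct values, so it carries the same information as the original 0/1 column.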

Topic collinearity pipelines one-hot-encoding linear-regression

Category Data Science


With unpenalized linear models, there is no difference. The coefficients will just scale to counteract the new scale of the variables, and the intercept will shift to compensate for the centering.
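As a minimal sketch of that claim, here is a hand-rolled one-feature least-squares fit on the binary vector from the question. The target values `y` and the helper `fit_ols` are made up for illustration; they are not from the question.

```python
import statistics

def fit_ols(x, y):
    """Least-squares slope and intercept for a single feature."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

x = [1, 1, 0, 0, 0, 1, 1]                  # binary feature from the question
y = [3.1, 2.9, 1.2, 0.8, 1.0, 3.3, 2.7]    # hypothetical target values

mean, std = statistics.fmean(x), statistics.pstdev(x)
z = [(v - mean) / std for v in x]          # standardized feature

slope_raw, icpt_raw = fit_ols(x, y)
slope_std, icpt_std = fit_ols(z, y)

# Predictions from both fits, each evaluated on its own version of the feature
pred_raw = [icpt_raw + slope_raw * v for v in x]
pred_std = [icpt_std + slope_std * v for v in z]

print(slope_raw, slope_std)   # slope_std equals slope_raw * std
print(icpt_raw, icpt_std)     # icpt_std equals the mean of y
```

The two fits give identical predictions: the slope is rescaled by `std` and the intercept absorbs the centering.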

With penalized linear models, though, there will be a difference. Since the standard deviation of a binary variable is at most $1/2$, standardizing increases the overall scale of the variable. The coefficient needed to produce the same fit therefore decreases in magnitude, which changes how the penalty is balanced across features. I suspect there is no "better" approach: sometimes the penalty improves performance when the dummies are scaled, and sometimes it degrades performance.
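To make the penalized case concrete, here is a rough sketch with a hand-rolled one-feature ridge fit (squared-error loss plus `lam * slope**2`, intercept unpenalized). The helper `fit_ridge`, the target `y`, and the value of `lam` are assumptions chosen for illustration. With a single feature, standardizing a dummy with standard deviation `std` works out to the same fit as penalizing the raw dummy with `lam * std**2`, i.e. a weaker penalty.

```python
import statistics

def fit_ridge(x, y, lam):
    """One-feature ridge: minimizes sum((y - b0 - b1*x)^2) + lam * b1^2.
    The intercept b0 is not penalized."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / (sxx + lam)
    return slope, my - slope * mx

x = [1, 1, 0, 0, 0, 1, 1]                  # binary feature from the question
y = [3.1, 2.9, 1.2, 0.8, 1.0, 3.3, 2.7]    # hypothetical target values
lam = 1.0                                  # hypothetical penalty strength

mean, std = statistics.fmean(x), statistics.pstdev(x)
z = [(v - mean) / std for v in x]

slope_raw, _ = fit_ridge(x, y, lam)
slope_std, _ = fit_ridge(z, y, lam)

# Express the standardized fit in original units: the predicted gap
# between x = 1 and x = 0 is slope_std / std.
effect_std = slope_std / std

# Same effect obtained by fitting the raw dummy with a rescaled penalty
slope_equiv, _ = fit_ridge(x, y, lam * std ** 2)

print(slope_raw, effect_std, slope_equiv)
```

Here the standardized dummy is shrunk less than the raw one for the same `lam`, while any feature that is standardized in both setups keeps its penalty unchanged. That shift in relative shrinkage is the change in balance described above; whether it helps or hurts depends on the data.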


It makes no sense to standardize one-hot encoded features. One-hot encoding implies that the feature's level of measurement is nominal/categorical. Standardization implies that the feature's level of measurement is at least interval.

For example, suppose the feature is country of origin. Since that feature is categorical, one-hot encoding makes sense: a person is either from a given country or not. Taking the mean of country of origin yields a number that does not make sense.
