How do standardization and normalization impact the coefficients of linear models?

One benefit of creating a linear model is that you can look at the coefficients the model learns and interpret them. For example, you can see which features have the most predictive power and which do not.

How, if at all, does feature interpretability change if we normalize (scale all features to 0-1) all our features versus standardizing (subtract the mean and divide by the standard deviation) them before fitting the model?

I have read elsewhere that you 'lose feature interpretability if you normalize your features' but could not find an explanation as to why. If that is true, could you please explain?

Here are two screenshots of the coefficients for two multiple linear regression models I built. They use Gapminder 2008 data and statistics about each country to predict its fertility rate.

In the first, I scaled features using StandardScaler. In the second, I used MinMaxScaler. The Region_ features are categorical and were one-hot encoded and not scaled.

Not only did the coefficients change based on different scaling, but their ordering (of importance?) did too! Why is this the case? What does it mean?
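For what it's worth, here is a minimal sketch on made-up data (not the Gapminder set; all names and numbers are illustrative) that reproduces the same behaviour:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler

rng = np.random.default_rng(0)
# Two made-up features whose range-to-std ratios differ,
# so the two scalers can rank them differently.
x1 = rng.normal(0, 1, 500)        # bell-shaped: range is roughly 6x its std
x2 = rng.uniform(0, 2, 500)       # flat: range is roughly 3.5x its std
X = np.column_stack([x1, x2])
y = 0.9 * x1 + 1.73 * x2 + rng.normal(0, 0.1, 500)

for scaler in (StandardScaler(), MinMaxScaler()):
    coef = LinearRegression().fit(scaler.fit_transform(X), y).coef_
    print(type(scaler).__name__, np.round(coef, 2))
# The coefficients differ, and their ordering can flip between the two scalers.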

Tags: interpretation, lasso, ridge-regression, linear-regression, feature-scaling


I believe that with scaling, the coefficients are rescaled by the same factor used to scale each feature, i.e. the coefficient is multiplied by the standard deviation under standardization and by (max - min) under normalization.

Looking at each feature individually, we are shifting it and then scaling it down by a constant, while $y$ is unchanged.

So if we imagine a line in 2-D space, we are keeping $y$ the same and squeezing the $x$-axis by a constant (call it $C$).

This implies (since the coefficient is the slope, $\tan\theta = dy/dx$) that the slope increases by the same factor $C$: $dx$ has been divided by the constant $C$ while $dy$ is unchanged, so the new slope is $C$ times the old slope (the slope prior to scaling).
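A tiny numeric check of this slope argument (a sketch with made-up numbers):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * x.ravel() + 5                # a line with slope 2

C = 4.0                              # squeeze x by a constant C
m_old = LinearRegression().fit(x, y).coef_[0]
m_new = LinearRegression().fit(x / C, y).coef_[0]
print(m_old, m_new, m_new / m_old)   # 2.0  8.0  4.0 -> new slope = C * old slope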

We can observe in the snippet below that the standardized and normalized coefficients stand in the ratio of the standard deviation and (max - min), respectively, to the unscaled coefficients.

import os
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Kaggle credentials (fill in your own key)
os.environ['KAGGLE_USERNAME'] = "10xAI"
os.environ['KAGGLE_KEY'] = "<<Your Key>>"

import kaggle
!kaggle datasets download -d camnugent/california-housing-prices  # notebook shell command

dataset = pd.read_csv("/content/california-housing-prices.zip")
y = dataset.pop('median_house_value')
x = dataset.iloc[:, :4]              # use the first four numeric features

# Fit on the unscaled features
model = LinearRegression()
model.fit(x, y)
old_coef = model.coef_

# Fit on standardized features: (x - mean) / std
x_s = (x - x.mean()) / x.std()
model.fit(x_s, y)
std_coef = model.coef_

# Each ratio should be 1: standardized coeff = unscaled coeff * std
print("### Ratio of standardized coeff to (unscaled coeff * std. deviation)")
print(std_coef / (old_coef * x.std()))

# Fit on normalized features: (x - min) / (max - min)
x_n = (x - x.min()) / (x.max() - x.min())
model.fit(x_n, y)
nor_coef = model.coef_

# Each ratio should be 1: normalized coeff = unscaled coeff * (max - min)
print("### Ratio of normalized coeff to (unscaled coeff * (max - min))")
print(nor_coef / (old_coef * (x.max() - x.min())))

[Output screenshot: every printed ratio is 1.0.]

So you can recover the unscaled coefficients from the standardized and normalized ones.
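Concretely, continuing the snippet above:

# Invert the scaling to get the unscaled coefficients back
recovered_from_std = std_coef / x.std()              # divide by each feature's std
recovered_from_nor = nor_coef / (x.max() - x.min())  # divide by each feature's range
print(np.allclose(recovered_from_std, old_coef))     # True
print(np.allclose(recovered_from_nor, old_coef))     # True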

On Importance

The ordering (since the screenshots show sorted values) might change because the ratio of (max - min) to the standard deviation differs from feature to feature.

But this should not impact the importance itself. Importance should be measured in the original data space, or in units of the standard deviation (as Peter explains in the other answer), or in units of (max - min), though that may not be intuitive for every user.
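One way to see the invariance, reusing the variables from the snippet above: the fitted coefficient times the standard deviation of the feature as it entered that model comes out the same regardless of the preprocessing.

# Importance in standard-deviation units is invariant to the preprocessing:
imp_raw = old_coef * x.std()      # unscaled fit
imp_std = std_coef * x_s.std()    # standardized fit (x_s.std() is 1)
imp_nor = nor_coef * x_n.std()    # normalized fit
print(np.allclose(imp_raw, imp_std), np.allclose(imp_raw, imp_nor))  # True True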


When you have a linear regression (without any scaling, just the plain numbers) with one explanatory variable $x$ and coefficients $\beta_0=0$ and $\beta_1=1$, you essentially have an (estimated) function:

$$y = 0 + 1x .$$

This tells you that when $x$ goes up (down) by one unit, $y$ goes up (down) by one unit. In this case it is just a linear function with slope 1.

Now when you scale $x$ (the plain numbers), e.g. in R:

scale(c(1,2,3,4,5))
           [,1]
[1,] -1.2649111
[2,] -0.6324555
[3,]  0.0000000
[4,]  0.6324555
[5,]  1.2649111

you essentially have different units or a different scale (with mean=0, sd=1).

However, the way OLS works stays the same: it still tells you "if $x$ goes up (down) by one unit, $y$ will change by $\beta_1$ units." So in this case (given a different scale of $x$), $\beta_1$ will be different.

The interpretation here would be "if $x$ changes by one standard deviation...". This is very handy when you have several $x$ with different units. When you standardise all the different units, you make them comparable to some extent, i.e. the $\beta$ coefficients of your regression become comparable in terms of how strongly the variables impact $y$. These are sometimes called beta coefficients or standardised coefficients.

A very similar thing happens when you normalise. In this case you also change the scale of $x$, i.e. the way $x$ is measured: the coefficient now tells you how much $y$ changes when $x$ moves across its full observed range (from min to max).
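A minimal sketch contrasting the two readings (made-up data; the numbers are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.normal(50, 10, size=(300, 1))        # a feature in some arbitrary unit
y = 0.3 * x.ravel() + rng.normal(size=300)   # y changes by 0.3 per raw unit of x

x_std = (x - x.mean()) / x.std()             # standardized
x_nor = (x - x.min()) / (x.max() - x.min())  # normalized

b_std = LinearRegression().fit(x_std, y).coef_[0]  # change in y per one sd of x (~3)
b_nor = LinearRegression().fit(x_nor, y).coef_[0]  # change in y across x's full range
print(b_std, b_nor)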

Also see this handout.
