Understanding which variables impact your variable of interest the most (correlation, linear regression) and correctly interpreting results

Question

Learning_and_xbox

February 11, 2022, 18:12

How do you ascertain which variables lead to the greatest increase in another variable of interest?

Let's say you have a correlation matrix. You look at the row of the variable you are particularly curious about, retention, and see that income is the most correlated with it out of all the variables in the matrix.

I would then expect the highest-income cities in my dataset to have the highest retention, but I am not finding that to be the case. Why is that?

I am having a similar issue with weighted coefficients in a linear regression as well.

I am trying to isolate which variables impact retention the most, and I do not understand why the highest-income areas don't have the highest retention (if income is the most correlated variable or has the highest weighted coefficient). I'm not trying to build any predictive models.

Any assistance would be greatly appreciated.

Topic feature-importances linear-regression correlation feature-selection

Category Data Science

Peter · Accepted Answer · February 11, 2022, 17:25

Suppose you have data like in this R code:

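library(ISLR)        # provides the Auto dataset
df = ISLR::Auto
df = df[,1:4]        # keep mpg, cylinders, displacement, horsepower

summary(df)          # note the different scales of the variables
round(cor(df), 2)    # pairwise correlations, rounded to two digits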

If you look at the correlation...

               mpg cylinders displacement horsepower
mpg           1.00     -0.78        -0.81      -0.78
cylinders    -0.78      1.00         0.95       0.84
displacement -0.81      0.95         1.00       0.90
horsepower   -0.78      0.84         0.90       1.00

... it hints that displacement has the "highest" impact on mpg, and that mpg and displacement are negatively related. However, keep in mind that correlation only measures linear association. It does not tell you "how strongly" mpg changes when a variable changes, but rather "how well" a linear fit can "explain" the relation.
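To illustrate the difference between "how well" and "how strongly", here is a small sketch on simulated data. The variables `x1`, `x2`, `y_steep` and `y_flat` are made up for this example and have nothing to do with the Auto data. One relation has a large slope but a lot of noise, the other has a small slope but very little noise.

```r
# Hypothetical simulated data, only to illustrate the point
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)

y_steep <- 5.0 * x1 + rnorm(n, sd = 10)   # large effect, very noisy
y_flat  <- 0.5 * x2 + rnorm(n, sd = 0.1)  # small effect, almost no noise

# Correlation: how well a straight line describes the relation
cor_steep <- cor(x1, y_steep)
cor_flat  <- cor(x2, y_flat)

# Slope: how much y changes per unit change in x
slope_steep <- unname(coef(lm(y_steep ~ x1))[2])
slope_flat  <- unname(coef(lm(y_flat ~ x2))[2])

print(round(c(cor_steep = cor_steep, cor_flat = cor_flat), 2))
print(round(c(slope_steep = slope_steep, slope_flat = slope_flat), 2))
```

In this setup the variable with the smaller slope should show the higher correlation, so ranking variables by correlation is not the same as ranking them by the size of their effect.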

So in order to see which variable leads to the "greatest increase", you could use a linear regression (which seems OK in this case according to the correlations).

reg1 = lm(mpg~cylinders+displacement+horsepower,data=df)
summary(reg1)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  39.305268   1.324633  29.673  < 2e-16 ***
cylinders    -0.719431   0.434180  -1.657 0.098331 .  
displacement -0.029120   0.008623  -3.377 0.000807 ***
horsepower   -0.059935   0.013498  -4.440 1.17e-05 ***

Based on the regression coefficients, you would suspect that cylinders has the "greatest" impact. However, you need to keep a few things in mind:

The effect of cylinders is conditional on the remaining $x$-variables (so "controlled for" the rest of the $x$)
The effect of cylinders reads as: "one additional cylinder changes mpg by -0.719, all other things equal (!!), on average (!!)".
The effect of cylinders is not statistically different from zero at the 5% level. So the true effect could be zero or possibly even positive (see the p-value).
When you summarize the data, you will find that the variables are on different scales. So cylinders is measured in a different "unit" than displacement etc.
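The point about statistical significance can also be checked with confidence intervals: if the interval of a coefficient covers zero, the sign of the effect is not pinned down by the data. A minimal sketch follows. Note the assumption: it uses the built-in `mtcars` data (columns `cyl`, `disp`, `hp`) as a stand-in so that it runs without the ISLR package. With the Auto data, `confint(reg1)` would do the same job.

```r
# Assumption: mtcars (ships with R) stands in for ISLR::Auto here
fit <- lm(mpg ~ cyl + disp + hp, data = mtcars)

# 95% confidence intervals for all coefficients
ci <- confint(fit, level = 0.95)
print(round(ci, 3))

# TRUE where the interval contains zero, i.e. the sign is uncertain
covers_zero <- ci[, 1] < 0 & ci[, 2] > 0
print(covers_zero)
```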

You need to bring all variables onto the same scale (mean = 0 and standard deviation = 1) in order to compare the coefficients in terms of "size".

df_scaled = data.frame(scale(df))
reg2 = lm(mpg~cylinders+displacement+horsepower,data=df_scaled)
summary(reg2)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.497e-16  2.927e-02   0.000 1.000000    
cylinders    -1.572e-01  9.489e-02  -1.657 0.098331 .  
displacement -3.904e-01  1.156e-01  -3.377 0.000807 ***
horsepower   -2.956e-01  6.657e-02  -4.440 1.17e-05 ***

With the scaled data, you come to a different conclusion, namely that displacement has the "strongest" impact. These standardized coefficients (sometimes called beta coefficients) are probably the closest you can get to what "greatest impact" means in the context of regression analysis.

Here the interpretation is: "a one standard deviation increase in $x$ leads to a change of $\beta$ standard deviations in $y$". So all $x$ are comparable now (because of scaling to mean = 0, standard deviation = 1).
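You do not actually have to refit the model on scaled data: each standardized coefficient is the raw coefficient multiplied by sd($x$) / sd($y$). The sketch below checks this. As an assumption for the sake of a self-contained example, it again uses the built-in `mtcars` data instead of Auto; the names `raw`, `std` and `beta_manual` are just illustrative.

```r
# Assumption: mtcars (ships with R) stands in for ISLR::Auto here
d <- mtcars[, c("mpg", "cyl", "disp", "hp")]
x_names <- c("cyl", "disp", "hp")

# Regression on the original scale and on the standardized scale
raw <- lm(mpg ~ cyl + disp + hp, data = d)
std <- lm(mpg ~ cyl + disp + hp, data = data.frame(scale(d)))

# Convert raw coefficients by hand: b * sd(x) / sd(y)
sds <- sapply(d, sd)
beta_manual <- coef(raw)[x_names] * sds[x_names] / sds["mpg"]

# Both columns should agree up to numerical precision
print(round(cbind(scaled_fit = coef(std)[x_names],
                  manual     = beta_manual), 4))
```

This also makes clear why the t values and p-values are identical in both regression outputs: scaling only changes the units, not the fit.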
