Suppose you have data like in this R code:
library(ISLR)
df = ISLR::Auto
df = df[,1:4]
summary(df)
round(cor(df), 2)
If you look at the correlation...
               mpg cylinders displacement horsepower
mpg           1.00     -0.78        -0.81      -0.78
cylinders    -0.78      1.00         0.95       0.84
displacement -0.81      0.95         1.00       0.90
horsepower   -0.78      0.84         0.90       1.00
... it hints that displacement has the "highest" impact on mpg. You may also expect that mpg and displacement are negatively related. However, keep in mind that correlation only measures linear association: it does not tell you how much mpg changes when a variable changes (the size of the effect), but rather "how well" a straight line can "explain" the relation.
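To illustrate the difference between "how well a line fits" and "how large the effect is", here is a minimal sketch on simulated data (the variables `x`, `e`, `y1` and `y2` are made up for illustration and are not part of the Auto data). Both responses have the same correlation with `x`, but the slope differs by a factor of ten.

```r
set.seed(1)
x  <- rnorm(200)
e  <- rnorm(200)

y1 <- x + e          # small effect of x
y2 <- 10 * (x + e)   # effect ten times larger, noise scaled up the same way

# The correlation is the same for both responses ...
cor_y1 <- cor(x, y1)
cor_y2 <- cor(x, y2)
print(c(cor_y1, cor_y2))

# ... but the regression slopes differ by a factor of ten
slope_y1 <- unname(coef(lm(y1 ~ x))[2])
slope_y2 <- unname(coef(lm(y2 ~ x))[2])
print(c(slope_y1, slope_y2))
```

So a high correlation alone does not say which variable moves mpg the most.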
So in order to see which variable leads to the "greatest increase", you could use a linear regression (a linear model seems reasonable in this case, judging by the correlations).
reg1 = lm(mpg~cylinders+displacement+horsepower,data=df)
summary(reg1)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  39.305268   1.324633  29.673  < 2e-16 ***
cylinders    -0.719431   0.434180  -1.657 0.098331 .
displacement -0.029120   0.008623  -3.377 0.000807 ***
horsepower   -0.059935   0.013498  -4.440 1.17e-05 ***
Based on the regression coefficients you would suspect that cylinders has the "greatest" impact. However, you need to keep some things in mind.
- The effect of cylinders is conditional on the remaining $x$-variables (so "controlled for" the rest of the $x$).
- The effect of cylinders is read as "one additional cylinder changes mpg by -0.719, all other things equal (!!), on average (!!)".
- The effect of cylinders is not statistically different from zero at the 5% level (p-value 0.098). So the effect could be zero or possibly even positive.
- When you summarize the data, you will find that the variables are on different scales. So cylinders is measured in a different "unit" than displacement, etc.
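The point about cylinders not being distinguishable from zero can be checked by hand from the regression output. The following is a rough sketch: the estimate and standard error are copied from the output above, while the 388 residual degrees of freedom are an assumption (392 rows in the Auto data minus 4 estimated coefficients).

```r
# Estimate and standard error copied from the summary(reg1) output
est <- -0.719431
se  <-  0.434180

# Assumption: 392 observations and 4 estimated coefficients
df_resid <- 392 - 4

# Two-sided 95% confidence interval: estimate +/- t-quantile * standard error
ci <- est + c(-1, 1) * qt(0.975, df_resid) * se
print(round(ci, 3))

# TRUE when the interval contains zero
covers_zero <- ci[1] < 0 && ci[2] > 0
print(covers_zero)
```

The interval covers zero, which is the same message as the p-value of 0.098. With the fitted model at hand, `confint(reg1)` should report these intervals directly.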
You need to bring all variables onto the same scale (mean = 0 and standard deviation = 1) in order to compare the coefficients in terms of "size".
df_scaled = data.frame(scale(df))
reg2 = lm(mpg~cylinders+displacement+horsepower,data=df_scaled)
summary(reg2)
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.497e-16  2.927e-02   0.000 1.000000
cylinders    -1.572e-01  9.489e-02  -1.657 0.098331 .
displacement -3.904e-01  1.156e-01  -3.377 0.000807 ***
horsepower   -2.956e-01  6.657e-02  -4.440 1.17e-05 ***
With the scaled data, you come to a different conclusion, namely that displacement has the "strongest" impact. These standardized coefficients (sometimes called beta coefficients) are probably the closest you can get to what "greatest impact" means in the context of regression analysis.
Here the interpretation is: "a one standard deviation increase in $x$ leads to a change of $\beta$ standard deviations in $y$". So all $x$ are comparable now (because of scaling to mean = 0, standard deviation = 1).
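You do not have to refit the model on scaled data to get these coefficients: each standardized slope equals the raw slope times sd(x) / sd(y). Below is a minimal sketch of that identity. The helper `std_coefs` is a hypothetical name, and the built-in `mtcars` data (with `mpg`, `cyl`, `disp`, `hp`) is only a stand-in so the snippet runs without the ISLR package.

```r
# Hypothetical helper: standardized ("beta") coefficients from a fitted lm.
# Assumes numeric predictors without transformations or interactions.
std_coefs <- function(model) {
  mf <- model.frame(model)       # response in column 1, predictors after it
  b  <- coef(model)[-1]          # raw slopes, intercept dropped
  sx <- sapply(mf[, names(b), drop = FALSE], sd)
  sy <- sd(mf[[1]])
  b * sx / sy
}

# Stand-in data: mtcars instead of ISLR::Auto
fit_raw    <- lm(mpg ~ cyl + disp + hp, data = mtcars)
fit_scaled <- lm(mpg ~ cyl + disp + hp, data = data.frame(scale(mtcars)))

# Both routes should give the same standardized slopes
same <- all.equal(std_coefs(fit_raw), coef(fit_scaled)[-1])
print(same)
```

Applied to the models above, `std_coefs(reg1)` should reproduce the slopes of `reg2`.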