How to interpret Variance Inflation Factor (VIF) results?

From various books and blog posts, I understand that the Variance Inflation Factor (VIF) is used to detect collinearity. They say that a VIF of up to 10 is acceptable. But I have a question.

As the output below shows, tax has the highest VIF (7.28), closely followed by rad (6.86), and both are below the commonly cited threshold of 10.

How does VIF measure collinearity when we pass an entire linear fit to the function? How should the results be interpreted? And which variables are collinear with which?

library(MASS)  # Boston data
library(car)   # vif()
lm.fit2 = lm(medv~.+log(lstat)-age-indus-lstat, data=Boston)
summary(lm.fit2)

Call:
lm(formula = medv ~ . + log(lstat) - age - indus - lstat, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.3764  -2.5604  -0.3867   1.8456  25.2255 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  53.942455   4.823309  11.184  < 2e-16 ***
crim         -0.126273   0.029185  -4.327 1.83e-05 ***
zn            0.021993   0.012238   1.797 0.072934 .  
chas          2.270669   0.768911   2.953 0.003296 ** 
nox         -13.959428   3.187365  -4.380 1.45e-05 ***
rm            2.619831   0.378737   6.917 1.43e-11 ***
dis          -1.374045   0.166350  -8.260 1.35e-15 ***
rad           0.286993   0.057004   5.035 6.72e-07 ***
tax          -0.010756   0.003033  -3.546 0.000428 ***
ptratio      -0.840540   0.116431  -7.219 1.99e-12 ***
black         0.008015   0.002402   3.336 0.000913 ***
log(lstat)   -8.672865   0.530188 -16.358  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.258 on 494 degrees of freedom
Multiple R-squared:  0.7904,    Adjusted R-squared:  0.7857 
F-statistic: 169.3 on 11 and 494 DF,  p-value: < 2.2e-16

vif(lm.fit2)
      crim         zn       chas        nox         rm        dis 
  1.755719   2.269767   1.062622   3.800515   1.972845   3.418391 
       rad        tax    ptratio      black log(lstat) 
  6.863674   7.279426   1.770146   1.340023   2.827687 



The Variance Inflation Factor (VIF) measures how well a single regressor $x_i$ is explained by all the other regressors in your model (jointly).

How does the VIF work?

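  1. For each regressor in your model, you run an auxiliary linear regression of that regressor on all the others. For $x_1$, for example: $$ x_{1,i} = \beta_1 + \beta_2 x_{2,i} + \dots + \beta_n x_{n,i} + u_i , $$ where $i$ indexes the observations.
  2. You retrieve the $R^2$ of each of these regressions and calculate the $VIF$, e.g. for $x_1$: $$ VIF_1 = \frac{1}{1-R^2_1}. $$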

Example in R:

Calculate VIF:

library(car)  # provides vif()
# mtcars ships with base R (datasets), so no extra data package is needed
reg = lm(mpg~disp+wt+qsec+hp, data=mtcars)
vif(reg)

Result:

    disp       wt     qsec       hp 
7.985439 6.916942 3.133119 5.166758 

Do this manually (for disp):

# R^2 of the auxiliary regression of disp on the other predictors
rsq = summary(lm(disp~wt+qsec+hp, data=mtcars))$r.squared
1/(1-rsq)

Result:

7.985439
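
The manual check for disp can be repeated for every predictor. The following is only a minimal sketch in base R (no car needed); the names `predictors` and `manual_vif` are made up for illustration. It runs one auxiliary regression per predictor and should give the same numbers as `vif(reg)` above.

```r
# Sketch: VIF for every predictor via its own auxiliary regression (base R only).
predictors <- c("disp", "wt", "qsec", "hp")

manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Regress predictor p on all remaining predictors
  aux <- lm(reformulate(others, response = p), data = mtcars)
  # VIF = 1 / (1 - R^2) of that auxiliary regression
  1 / (1 - summary(aux)$r.squared)
})

manual_vif
```

Note that the response mpg never enters these regressions: the VIF depends only on the predictors, not on what you are trying to explain.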

What about the $VIF=10$ rule of thumb?

$VIF = 10$ corresponds to an $R^2=0.9$ in the auxiliary regression from step 1 above (because $1/(1-0.9)=10$). In other words, the other regressors in the model explain 90% of the variance of the regressor under consideration. This is, of course, just a rule of thumb.
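
To get a feeling for the scale, the relation $R^2 = 1 - 1/VIF$ can be tabulated for a few values. This is just an illustrative sketch: the helper `vif_to_r2` and the chosen values are assumptions, not part of any package. The last column uses the fact that the coefficient's variance is multiplied by the VIF, so its standard error grows by $\sqrt{VIF}$ compared with a regressor that is uncorrelated with the others.

```r
# R^2 of the auxiliary regression implied by a given VIF
vif_to_r2 <- function(vif) 1 - 1 / vif

example_vifs <- c(2, 5, 10)  # arbitrary values chosen for illustration
data.frame(
  vif          = example_vifs,
  aux_r2       = vif_to_r2(example_vifs),  # 0.5, 0.8, 0.9
  se_inflation = sqrt(example_vifs)        # factor by which the std. error grows
)
```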

In essence, the $VIF$ boils down to the question: "How well is one of my $x_i$ explained by all the other $x$ jointly?"

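In your example, tax has the highest $VIF$ (7.279426). This means that the auxiliary regression (step 1) for tax has an $R^2 = 1 - 1/7.279426 \approx 0.862627$. So tax is well explained by the other $x$ jointly, which may indicate a problem with multicollinearity. Note that the $VIF$ alone does not tell you which of the other regressors are responsible; for that you have to look at the auxiliary regression itself.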
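
To see where the number for tax comes from, you can run its auxiliary regression yourself. The sketch below assumes that Boston is the data set from the MASS package (the one that contains the black column used in the question); `aux_tax` and `r2_tax` are illustrative names. The response medv is dropped from the right-hand side because the VIF only involves the predictors.

```r
library(MASS)  # assumed source of the Boston data used in the question

# Auxiliary regression: tax on the other predictors of lm.fit2 (medv excluded)
aux_tax <- lm(tax ~ . - medv - age - indus - lstat + log(lstat), data = Boston)

r2_tax <- summary(aux_tax)$r.squared
r2_tax            # should be close to 1 - 1/7.279426
1 / (1 - r2_tax)  # should reproduce the VIF reported for tax

# Coefficient table: which predictors carry the explanation of tax
round(coef(summary(aux_tax)), 4)
```

The coefficient table of this regression (or simply `cor()` on the predictors) is where you would look to find out which variables tax is collinear with.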
