Multicollinearity (Variance Inflation Factor): variables to remove before fitting a model

I am doing an exercise for a Machine Learning System module in Python. It takes a dataset of cars (cylinders, year, consumption, ...) and asks for a model in which the variable to predict is gasoline consumption. Since the dataset has three categorical variables, I have generated dummy variables for them.

In the exercise I need to eliminate the variables with multicollinearity, so I used the method shown in my course notes:

import pandas as pd
from sklearn.linear_model import LinearRegression

def calculateVIF(data):
    features = list(data.columns)
    num_features = len(features)

    model = LinearRegression()

    # One-row frame that will hold the VIF of every feature
    result = pd.DataFrame(0.0, index = ['VIF'], columns = features)

    for ite in range(num_features):
        # Regress feature `ite` on all the other features
        x_features = features[:]
        y_feature = features[ite]
        x_features.remove(y_feature)

        x = data[x_features]
        y = data[y_feature]
        model.fit(x, y)

        # VIF = 1 / (1 - R^2) of that regression
        result[y_feature] = 1/(1 - model.score(x, y))

    return result

When I run this function, it calculates a coefficient (the VIF) for each variable.

My course notes say:

  • $VIF > 5$ is a high value.
  • $VIF > 10$ is a very high value.

What should I do? Do I need to remove the variables that have $VIF > 10$ before fitting the model?

The problem I see is that, for my categorical variable cylinders, only cylinders_5 has a VIF under 10. Should I remove the other dummies and keep only cylinders_5?

Here is some code I have written to handle multicollinearity in a dataset. It does the following:

  • Detects multicollinearity using the Variance Inflation Factor (VIF), with a default threshold of 5.0.
  • You just need to pass the dataframe, containing only the columns you want to test for multicollinearity.
  • The function drops any column that contains just 1 unique value. For more details on this point, please have a look at my answer on How to run a multicollinearity test on a pandas dataframe?.
  • The calculation of VIF is parallelized over multiple cores.
    from joblib import Parallel, delayed
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def removeMultiColl(data, vif_threshold = 5.0):
        # Constant columns carry no information, so drop them first
        # (note: this modifies the dataframe that was passed in)
        for i in data.columns:
            if data[i].nunique() == 1:
                print(f"Dropping {i} due to just 1 unique value")
                data.drop(columns = i, inplace = True)
        drop = True
        col_list = list(data.columns)
        # Keep dropping the column with the highest VIF until none exceeds the threshold
        while drop:
            drop = False
            # VIF of every remaining column, computed in parallel
            vif_list = Parallel(n_jobs = -1, verbose = 5)(delayed(variance_inflation_factor)(data[col_list].values, i) for i in range(data[col_list].shape[1]))
            max_index = vif_list.index(max(vif_list))
            if vif_list[max_index] > vif_threshold:
                print(f"Dropping column : {col_list[max_index]} at index - {max_index}")
                del col_list[max_index]
                drop = True
        print("Remaining columns :\n", list(data[col_list].columns))
        return data[col_list]
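
To see the effect of this kind of greedy loop on a small case, here is a minimal serial sketch on synthetic data. It is not the function above: `vif_scores` and `drop_high_vif` are hypothetical helper names, the column names are made up, and the VIF is computed by hand with NumPy as 1/(1 - R²) from a regression that includes an intercept, so the numbers may differ from what `variance_inflation_factor` returns on data without a constant column.

```python
import numpy as np
import pandas as pd

def vif_scores(df):
    """VIF per column: 1 / (1 - R^2) of that column regressed on the others."""
    X = df.to_numpy(dtype=float)
    scores = {}
    for j, name in enumerate(df.columns):
        y = X[:, j]
        # Design matrix: an intercept plus every other column
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - (y - A @ coef).var() / y.var()
        scores[name] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(scores)

def drop_high_vif(df, threshold=5.0):
    """Drop the column with the largest VIF until all are <= threshold."""
    cols = list(df.columns)
    while len(cols) > 1:
        scores = vif_scores(df[cols])
        worst = scores.idxmax()
        if scores[worst] <= threshold:
            break
        cols.remove(worst)
    return df[cols]

# Synthetic data: "c" is almost exactly a + b, "d" is unrelated
rng = np.random.default_rng(0)
a, b, d = rng.normal(size=(3, 200))
demo = pd.DataFrame({"a": a, "b": b, "c": a + b + 0.05 * rng.normal(size=200), "d": d})

print(vif_scores(demo).round(1))   # a, b and c are all inflated, d is not
reduced = drop_high_vif(demo)
print(list(reduced.columns))
```

Note that a single drop is enough here: once one member of the collinear group is gone, the VIFs of the remaining columns fall back to roughly 1. This is why the loop recomputes the VIFs after each removal instead of dropping every column above the threshold at once.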

Good luck!

Never remove features from your dataset; always try to make use of them. Try a dimensionality-reduction (DR) technique such as PCA to eliminate the multicollinearity between the features. Removing features means you are losing some information, unless the multicollinearity is perfect, i.e. the correlation between the features is exactly 1, in which case you can delete one of them safely. Tree-based models can capture the small differences between correlated features.
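
As a rough illustration of the PCA suggestion, the sketch below builds three synthetic features, two of which are nearly identical, and projects them onto their principal components. It uses a plain NumPy SVD rather than any particular library's PCA class, and the data and variable names are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + 0.1 * rng.normal(size=300)   # nearly a copy of x1
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

# Standardise the features, then take the principal axes from the SVD
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T                       # the data expressed in principal components

explained = s**2 / np.sum(s**2)         # share of variance per component
corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)

print(np.round(corr_before, 2))         # x1 and x2 are strongly correlated
print(np.round(corr_after, 2))          # components are mutually uncorrelated
print(np.round(explained, 3))
```

The components are uncorrelated by construction, so each has a VIF of 1. The trade-off is interpretability: each component is a mixture of the original features, so the coefficients of a model fitted on them no longer refer to individual variables.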

1) First, run a univariate regression: for each column in your dataset, fit a simple linear regression against the target variable and calculate the p-value. This gives you an idea of the significance of each column with respect to the target.

2) Draw an influence plot and check the cooks_d (Cook's distance) values:

  import statsmodels.api as sm
  # model1: a fitted statsmodels OLS model
  infl = model1.get_influence()
  sm_fr = infl.summary_frame()

3) You will get the cooks_d values from the sm_fr dataframe.

4) Select the rows with a cooks_d value > 1 and remove them from your dataframe. You have now removed the influential points.

5) Now check the VIF values on the new dataframe and remove the variables with VIF > 5, as they are largely redundant with the other predictors. You can also check their significance by calculating the p-value.
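
The Cook's distance step can be sketched without a fitted `model1` by computing the statistic by hand. The example below uses only NumPy on synthetic data with one deliberately planted outlier; the data, the variable names and the by-hand formula (the standard definition based on residuals and leverage) are assumptions for illustration, and the result should correspond to the cooks_d column, not replace it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(np.linspace(0, 10, 30), 30.0)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
y[-1] = 0.0                              # last row: gross outlier at high leverage

X = np.column_stack([np.ones_like(x), x])   # intercept + one predictor
n, p = X.shape

# Ordinary least squares fit and residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage: diagonal of the hat matrix X (X'X)^-1 X'
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s2 = resid @ resid / (n - p)             # residual variance estimate

# Cook's distance: D_i = e_i^2 * h_i / (p * s^2 * (1 - h_i)^2)
cooks_d = resid**2 * h / (p * s2 * (1 - h) ** 2)

influential = np.flatnonzero(cooks_d > 1)   # threshold from step 4
X_clean, y_clean = X[cooks_d <= 1], y[cooks_d <= 1]
print(influential, len(y_clean))
```

Only the planted row exceeds the threshold here. On real data it is worth inspecting the flagged rows before deleting them, since a point can be influential without being an error.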

For the overall procedure of building a multiple linear regression model that satisfies all the assumptions of multiple linear regression (linearity, homoscedasticity, multivariate normality and no multicollinearity), see the example below on predicting the profit of start-ups.


