Multicollinearity (Variance Inflation Factor): variables to remove before fitting a model
I am doing an exercise for a Machine Learning System module in Python. It takes a dataset of cars (cylinders, year, consumption, ...) and asks for a model whose target variable is gasoline consumption. Since the dataset has three categorical variables, I generated dummy variables for them.
In the exercise I need to eliminate the variables with multicollinearity, so I used the method shown in my course notes:
import pandas as pd
from sklearn.linear_model import LinearRegression

def calculateVIF(data):
    features = list(data.columns)
    num_features = len(features)
    model = LinearRegression()
    # One-row table that will hold the VIF of each feature
    result = pd.DataFrame(index = ['VIF'], columns = features)
    result = result.fillna(0)
    for ite in range(num_features):
        # Regress feature `ite` on all the remaining features
        x_features = features[:]
        y_featue = features[ite]
        x_features.remove(y_featue)
        x = data[x_features]
        y = data[y_featue]
        model.fit(x, y)
        # VIF = 1 / (1 - R^2) of that auxiliary regression
        result[y_featue] = 1/(1 - model.score(x, y))
    return result
When I run the function, it calculates a VIF value for each variable.
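To illustrate what such a per-variable table looks like, here is a minimal sketch on synthetic data. It is not the car dataset: the column names (`a`, `b`, `a_plus_noise`) and the helper name `calculate_vif` are made up for this example, and the helper is a compact rewrite of the same idea rather than the course code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def calculate_vif(data):
    """Return a one-row DataFrame with the VIF of every column in `data`."""
    vifs = {}
    for target in data.columns:
        others = data.drop(columns=[target])
        # R^2 of predicting `target` from all the other columns
        r2 = LinearRegression().fit(others, data[target]).score(others, data[target])
        vifs[target] = 1.0 / (1.0 - r2)
    return pd.DataFrame(vifs, index=["VIF"])


# Synthetic data (assumption: invented purely for illustration)
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "b": b,
    "a_plus_noise": a + 0.1 * rng.normal(size=n),  # nearly a copy of "a"
})

vif_table = calculate_vif(toy)
print(vif_table)
```

In this toy setup, `a` and `a_plus_noise` explain each other almost perfectly and should get large VIFs, while the independent column `b` should stay close to 1.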
My course notes say:
- $VIF > 5$ is a high value.
- $VIF > 10$ is a very high value.
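For reference, the quantity that `calculateVIF` computes for feature $j$ is the usual definition of the VIF:

$$VIF_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the $R^2$ obtained when feature $j$ is regressed on all the other features. Read this way, a VIF above 5 corresponds to $R_j^2 > 0.8$, and a VIF above 10 corresponds to $R_j^2 > 0.9$.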
What should I do? Do I need to remove every variable with $VIF > 10$ before fitting the model?
The problem I see is that, for my categorical variable cylinders, only cylinders_5 has a VIF under 10. Should I remove the other dummies and keep only cylinders_5?
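One possible reading of the threshold rule is to drop one variable at a time and recompute, since the VIF of every remaining variable changes after each removal. The sketch below is a hedged illustration of that idea, not the course's prescribed procedure. The helper names `vif_series` and `drop_high_vif`, the threshold default, and the synthetic `cyl_*` and `weight` columns are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def vif_series(data):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    out = {}
    for target in data.columns:
        others = data.drop(columns=[target])
        r2 = LinearRegression().fit(others, data[target]).score(others, data[target])
        # Treat a (numerically) perfect fit as an infinite VIF
        out[target] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(out)


def drop_high_vif(data, threshold=10.0):
    """Drop the single worst column, recompute, repeat until all VIFs <= threshold."""
    data = data.copy()
    while data.shape[1] > 1:
        vifs = vif_series(data)
        if vifs.max() <= threshold:
            break
        data = data.drop(columns=[vifs.idxmax()])
    return data


# Synthetic data (assumption: invented for illustration, not the car dataset)
rng = np.random.default_rng(0)
n = 300
cyl = pd.Series(rng.choice([4, 6, 8], size=n))
toy = pd.get_dummies(cyl, prefix="cyl", dtype=float)  # keeps ALL levels on purpose
toy["weight"] = rng.normal(size=n)

kept = drop_high_vif(toy, threshold=10.0)
print(list(kept.columns))
```

In this toy case the three dummy columns sum to 1 in every row, so each one is perfectly predictable from the other two. Dropping a single dummy is enough to bring the rest under the threshold, which is why removing every column above 10 in one pass can discard more than necessary.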