Does PCA help to include all the variables even if there is high collinearity among them?

I have a dataset with high collinearity among the variables. When I built a linear regression model, I could not include more than five variables (I eliminated a feature whenever its VIF exceeded 5). But I need to keep all the variables in the model and find their relative importance. Is there any way around this? I was thinking about running PCA and building the model on the principal components. Would that help?
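For reference, here is a minimal sketch of the VIF screening step described above, on synthetic data (the column names, the data, and the threshold of 5 are illustrative):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    X = pd.DataFrame({
        "x1": x1,
        "x2": x1 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
        "x3": rng.normal(size=200),                  # independent feature
    })

    # Add an intercept so each VIF regression includes a constant term;
    # VIF > 5 is a common (but arbitrary) rule of thumb for collinearity.
    Xc = sm.add_constant(X)
    vif = pd.Series(
        [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
        index=X.columns,
    )
    print(vif)  # x1 and x2 should show VIFs well above 5, x3 near 1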

Topic collinearity pca linear-regression

Category Data Science


When using PCA, you should not try to interpret the individual original features anymore. Each principal component is a linear combination of all of your variables, so a component generally cannot be attributed to any single original feature.
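As an illustration, here is a minimal sketch of regressing on principal components (principal component regression) with synthetic data; the data and the component count are illustrative, not a recommendation:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))
    X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=200)  # collinear pair
    y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=200)

    # Standardize, project onto orthogonal components, then regress.
    # The fitted coefficients belong to the components, not to the raw
    # features, which is why per-feature interpretation is lost.
    model = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
    model.fit(X, y)
    print(model.named_steps["linearregression"].coef_)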

When you want to work on feature importance, you can use random forests or decision trees instead. You can do something similar with neural networks by shuffling the values of one feature, re-evaluating (or re-training) the model, and comparing the drop in performance (permutation importance).
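A minimal sketch of that shuffling idea, assuming held-out data and using scikit-learn's permutation_importance (which shuffles one feature at a time and measures the score drop without re-training; re-training after the shuffle is a stricter variant of the same idea):

    import numpy as np
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    y = 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.3, size=300)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000,
                         random_state=0).fit(X_tr, y_tr)

    # Shuffle each feature n_repeats times and record the mean score drop.
    result = permutation_importance(model, X_te, y_te, n_repeats=20,
                                    random_state=0)
    print(result.importances_mean)  # features 0 and 3 should dominate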


PCA will generate "new" (transformed) features which are orthogonal (uncorrelated). However, since the original features are transformed, you can hardly say much about the importance of the original features based on PCA.
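To make this concrete, here is a small sketch (on synthetic data) inspecting the PCA loadings: every component typically mixes several original features, so a component's weight in a downstream model cannot be credited to any one feature:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=200)  # collinear pair

    pca = PCA().fit(StandardScaler().fit_transform(X))

    # Rows are components, columns are original features; the non-zero
    # entries show how each component blends the original variables.
    print(np.round(pca.components_, 2))
    print(pca.explained_variance_ratio_)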

One obvious alternative would be to use a random forest (RF) to determine feature importance. With tree-based models (like RF or tree-based boosting) you do not need to worry about collinearity in the feature space.
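A minimal sketch of impurity-based importance from a random forest, on synthetic data. One caveat worth knowing: with strongly correlated features the trees can pick either one at a split, so the importance tends to be shared between them rather than assigned to a single feature:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=300)  # collinear pair
    y = 3 * X[:, 0] + X[:, 4] + rng.normal(scale=0.3, size=300)

    rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
    print(rf.feature_importances_)  # credit is shared between x0 and x1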
