Multicollinearity vs. perfect multicollinearity in linear regression

I have been trying to understand how multicollinearity among the independent variables affects a linear regression model. The Wikipedia page suggests that only when there is perfect multicollinearity does one of the independent variables need to be removed before training.

Now my question is: should we only remove one of the columns if the correlation is exactly +/- 1, or should we use a threshold (say 0.90) above which the correlation is treated as perfect multicollinearity?

Topic: collinearity, linear-regression

Category: Data Science


The following paper explains the trade-offs involved in removing variables very well.

See the 2005 paper "Graphical Views of Suppression and Multicollinearity in Multiple Linear Regression", The American Statistician, Vol. 59, No. 2, pp. 127-136.

Addendum: the paper studies the balancing act between collinearity effects and model fit, i.e., whether suppression and enhancement effects in regression offset collinearity issues.


Multicollinearity is not necessarily a problem for regression; it is just a fact of life. Unless you can use a designed experiment, variables "in nature" are often correlated and you have to live with that. Of course, if nature were orthogonal, statistical life would be simpler ...

All of this is much discussed at our sister site Cross Validated, see for instance Is multicollinearity really a problem?

An economics blog making fun of unnecessary preoccupation with collinearity is Multicollinearity and Micronumerosity.


This depends on context. Computationally, only a correlation of exactly +/- 1 is problematic, because then there is no unique solution to the OLS criterion. Very strong (but imperfect) correlation between predictor variables may inflate standard errors, which means the parameter estimates become less precise. Predictive accuracy is often not hurt much by this, but if you want to do inference (e.g., significance tests), it can be more of a problem. If predictors are very strongly correlated, you may be better off picking only the best predictors for your regression model, or doing some kind of dimension reduction first (e.g., PCA).
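To make the distinction concrete, here is a minimal sketch (assuming numpy and statsmodels are available; the variable names are made up for illustration). It contrasts an exactly collinear design, where the design matrix loses rank and OLS has no unique solution, with a strongly correlated one, where the fit still works but the standard errors grow:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

x1 = rng.normal(size=n)                     # baseline predictor
x2_perfect = 2 * x1                         # perfectly collinear copy (r = +1)
x2_strong = x1 + 0.3 * rng.normal(size=n)   # strongly but not perfectly correlated
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

# Perfect collinearity: the design matrix is rank-deficient, so X'X is
# singular and the OLS criterion has no unique minimizer.
X_perfect = sm.add_constant(np.column_stack([x1, x2_perfect]))
print("rank:", np.linalg.matrix_rank(X_perfect), "of", X_perfect.shape[1])

# Strong (imperfect) correlation: the fit is still unique, but the standard
# errors of the two correlated coefficients are inflated relative to a
# model that uses x1 alone.
X_strong = sm.add_constant(np.column_stack([x1, x2_strong]))
print(sm.OLS(y, X_strong).fit().bse)
print(sm.OLS(y, sm.add_constant(x1)).fit().bse)
```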


Often the Variance Inflation Factor (VIF) is used to determine whether some variables are (too) strongly correlated. A VIF of 10 is a widely accepted threshold for including/excluding variables. So variables with "too high" correlation should also be excluded from linear regression (OLS), as described here and here.
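As a rough illustration (assuming pandas and statsmodels are installed; the columns x1, x2, x3 are simulated for the example), VIFs can be computed with statsmodels' variance_inflation_factor:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["x2"] = df["x1"] + 0.2 * rng.normal(size=n)   # strongly correlated with x1

# Add the intercept, then compute a VIF for each predictor (skip the constant).
X = sm.add_constant(df[["x1", "x2", "x3"]])
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)   # x1 and x2 come out well above the usual cut-off of 10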
