Understanding one of the assumptions of linear regression: Multicollinearity
I've read that the absence of multicollinearity is one of the main assumptions of multiple linear regression: multicollinearity occurs when the independent variables are too highly correlated with one another.
However, a key topic when learning linear regression is the idea of adding interaction terms to the model, to capture an interaction effect: when the effect of one independent variable on the dependent variable changes depending on the value(s) of one or more other independent variables.
Aren't these two statements contradictory? If there really were an interaction between $X_1$ and $X_2$ in the model $y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$, surely we should remove either $X_1$ or $X_2$ so that the independent variables in the regression model are no longer correlated, and the no-multicollinearity assumption holds. Adding an interaction term seems to ignore this assumption and instead introduces a further term that complicates the model.
From a modeling standpoint this makes sense, but doesn't the mathematics break down if we do this?
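To make the question concrete, here is a minimal simulation sketch (using NumPy; the data-generating coefficients are made up for illustration) in which two predictors are statistically independent of each other, yet the outcome depends on their product, i.e. there is an interaction effect without any correlation between $X_1$ and $X_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two independent (hence uncorrelated) predictors
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Outcome with a genuine interaction effect (the x1*x2 term),
# even though x1 and x2 themselves are uncorrelated
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 3.0 * x1 * x2 + rng.normal(scale=0.5, size=n)

# Sample correlation between x1 and x2 is near zero
corr = np.corrcoef(x1, x2)[0, 1]
print("corr(x1, x2):", round(corr, 3))

# Ordinary least squares on [1, x1, x2, x1*x2] recovers all four coefficients
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", np.round(beta, 2))
```

This is the scenario I find confusing: the regression with the interaction term fits without any numerical trouble here, which seems to suggest that an interaction is not the same thing as multicollinearity.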
Topic collinearity linear-regression regression predictive-modeling
Category Data Science