How to do backward feature elimination when considering interactions between features

I have a multiple linear regression problem,

$Y$ is my target and $X_1, X_2, X_3$ are my features.

In my regression, I consider the pairwise interactions between $X_1, X_2, X_3$ and I add a bias (intercept) term.

So my model is given by: $Y \sim X_1 + X_2 + X_3 + X_1 X_2 + X_1 X_3 + X_2 X_3 + \text{bias}$

Now, I fit my model with statsmodels (`import statsmodels.api as sm`) and I want to recursively eliminate the feature with the highest p-value.
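
Here is a minimal sketch of the loop I have in mind, assuming the data sit in a pandas DataFrame `df` with (hypothetical) columns `X1`, `X2`, `X3` and a target series `y`:

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X, y, threshold=0.05):
    """Refit OLS, dropping the regressor with the highest p-value,
    until every remaining p-value is below the threshold.
    The intercept ("const") is never dropped."""
    X = sm.add_constant(X)  # adds the bias term as a "const" column
    while True:
        model = sm.OLS(y, X).fit()
        pvalues = model.pvalues.drop("const")  # keep the intercept
        if pvalues.empty or pvalues.max() < threshold:
            return model
        X = X.drop(columns=pvalues.idxmax())

# Design matrix with all pairwise interactions (df and y are hypothetical):
# X = df[["X1", "X2", "X3"]].copy()
# X["X1X2"] = X["X1"] * X["X2"]
# X["X1X3"] = X["X1"] * X["X3"]
# X["X2X3"] = X["X2"] * X["X3"]
# final_model = backward_eliminate(X, y)
# print(final_model.summary())
```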

  • My first question is: for example, if the highest p-value is for the $X_1 X_2$ feature, is it okay to eliminate this feature even when $X_1$ and $X_2$ are statistically significant?
  • My second question: in the case where all the interactions of some feature have a p-value greater than 0.05 in the first iteration, could I eliminate this feature and all of its interactions?

Thank you for your help

Topic statsmodels linear-regression feature-selection

Category Data Science


My first question is: for example, if the highest p-value is for the $X_1 X_2$ feature, is it okay to eliminate this feature even when $X_1$ and $X_2$ are statistically significant?

Yes, that is fine: an interaction can carry no information about the target even when its main effects do. For example, if the target is fully determined by $X_1$ and $X_2$ alone, the interaction $X_1 \cdot X_2$ won't add anything to the model.
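
A quick (hypothetical) simulation makes this concrete: if $Y$ is generated from $X_1$ and $X_2$ alone, the fitted interaction term typically comes out insignificant.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({"X1": rng.normal(size=n), "X2": rng.normal(size=n)})
# Y depends only on X1 and X2; the interaction is irrelevant by construction.
y = 2.0 * X["X1"] - 1.0 * X["X2"] + rng.normal(scale=0.5, size=n)

X["X1X2"] = X["X1"] * X["X2"]
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.pvalues)  # the p-value for "X1X2" is typically large
```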

My second question: in the case where all the interactions of some feature have a p-value greater than 0.05 in the first iteration, could I eliminate this feature and all of its interactions?

I would try a more experimental approach: remove them only if they do not improve the model's accuracy, rather than basing the decision on p-values alone.
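
One way to make that check concrete is to compare cross-validated scores with and without the candidate terms. A sketch with scikit-learn, reusing the hypothetical `df` and `y` from above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_r2(X, y):
    # Mean 5-fold cross-validated R^2 for a plain linear regression.
    return np.mean(cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2"))

# with_term = df[["X1", "X2", "X3", "X1X2"]]
# without_term = df[["X1", "X2", "X3"]]
# if cv_r2(without_term, y) >= cv_r2(with_term, y):
#     print("dropping X1X2 does not hurt accuracy")
```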

As a further recommendation, I would suggest scikit-learn (sklearn).
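
For example (one possible sketch, not the only route): `PolynomialFeatures` can build the pairwise interaction columns and `RFE` can do the recursive elimination, though note that RFE ranks features by coefficient magnitude rather than by p-value.

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

pipeline = Pipeline([
    # degree=2 with interaction_only=True yields X1, X2, X3, X1X2, X1X3, X2X3
    ("interactions", PolynomialFeatures(degree=2, interaction_only=True,
                                        include_bias=False)),
    # Recursively drop the weakest features until 3 remain.
    ("select", RFE(LinearRegression(), n_features_to_select=3)),
])
# pipeline.fit(df[["X1", "X2", "X3"]], y)  # df and y are hypothetical
# print(pipeline.named_steps["select"].support_)
```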
