Adding high p-value, low $R^2$ features to a linear regression model to improve results

I am working on a linear regression problem. The features for my analysis were selected using p-values and domain knowledge. After adding these features, the model's $R^2$ improved from 0.25 to 0.85, and the $RMSE$ improved as well. But here is the issue: the features selected using domain knowledge have very high p-values (0.7, 0.9) and very low individual $R^2$ (0.002, 0.0004). Does it make sense to add such features even if the model shows improved performance? As far as I know, in linear regression it is preferable to keep only features with low p-values.

Can anyone share their experience? If yes, how can I back up my proposal to include new features with high p-values?

Topic feature-engineering linear-regression feature-selection statistics machine-learning

Category Data Science


In general, adding more features will increase the quality of the in-sample model fit: ordinary least squares can always set a new coefficient to zero, so the training $R^2$ never decreases when a feature is added.

If your goal is purely the best-fitting model on the training data, add as many features as possible (regardless of p-value).
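
As a quick illustration of that point, here is a minimal sketch (synthetic data, not from the question) showing that appending even a pure-noise column cannot lower the in-sample $R^2$ of an OLS fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                    # three informative features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=n)

base = LinearRegression().fit(X, y)
print(f"R^2 with 3 real features: {base.score(X, y):.4f}")

# Append a feature that is pure noise (high p-value by construction).
X_noise = np.hstack([X, rng.normal(size=(n, 1))])
full = LinearRegression().fit(X_noise, y)
print(f"R^2 with a noise column:  {full.score(X_noise, y):.4f}")  # >= the value above
```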

Sometimes, though, people care about parsimonious models: they are willing to accept a lower overall fit because they also value a simpler model. In that case they apply a p-value threshold to the features.
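
If you need to defend keeping the high p-value features, one way (my suggestion, not something the p-value-threshold convention prescribes) is to show that they improve out-of-sample error, not just training fit. A sketch of that comparison, where the data frame and the column names such as `domain_feat` are placeholders for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for your real dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(300, 4)),
                  columns=["x1", "x2", "x3", "domain_feat"])
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(size=300)

# Fit OLS with statsmodels to inspect per-feature p-values.
X = sm.add_constant(df[["x1", "x2", "x3", "domain_feat"]])
ols = sm.OLS(df["y"], X).fit()
print(ols.pvalues)

# Parsimonious subset: keep only features passing a p-value threshold.
keep = [c for c in ols.pvalues.index
        if c != "const" and ols.pvalues[c] < 0.05]

def cv_rmse(cols):
    """Cross-validated RMSE for a linear model on the given columns."""
    scores = cross_val_score(LinearRegression(), df[cols], df["y"],
                             scoring="neg_root_mean_squared_error", cv=5)
    return -scores.mean()

print("CV RMSE, all features :", cv_rmse(["x1", "x2", "x3", "domain_feat"]))
print("CV RMSE, p < 0.05 only:", cv_rmse(keep))
```

If the cross-validated RMSE with the high p-value features is clearly lower, that is stronger evidence for keeping them than the training $R^2$ alone.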
