Why is linear regression not doing worse with a low-weighted attribute?

I've been able to build a few linear regression models that predict material strength quite well: the best achieves an RMSE of 17.95 using 11 attributes selected from the 159 original ones. The target is distributed with mean = 234.4 and stdev = 19.9. I am working in Orange3.

When using only the highest-weighted attribute (weight 8.013), the model achieves an RMSE of 18.767. If I use only the lowest-weighted attribute (weight 0.051), the RMSE is 20.007. The difference is 1.24, or roughly 7% of the best RMSE. Why is there not a bigger difference? I would have thought that using only the attribute with almost no weight would cause the model to predict a completely incorrect value for the target variable.

The input data consists of 3700 instances (cleaned and validated), and I am using 10-fold cross-validation. All of these RMSE values are close to the standard deviation of the data -- is it just a case of luck, or what is the reason for the quite small difference in RMSE?
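
For reference, here is roughly my setup, sketched with scikit-learn rather than the Orange3 GUI (the random arrays are only placeholders standing in for my actual table):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for my 3700 cleaned instances:
# one attribute and a target with mean ~234.4, stdev ~19.9.
rng = np.random.default_rng(0)
x = rng.normal(size=(3700, 1))
y = 234.4 + 19.9 * rng.normal(size=3700)

def cv_rmse(model, X, y):
    # 10-fold CV; sklearn reports negative MSE, so negate and take the root
    mse = -cross_val_score(model, X, y, cv=10,
                           scoring="neg_mean_squared_error").mean()
    return np.sqrt(mse)

print("single attribute:", cv_rmse(LinearRegression(), x, y))
# A model that just predicts the mean scores RMSE ~= stdev of the target,
# which is the baseline I am comparing my RMSE values against.
print("mean baseline:   ", cv_rmse(DummyRegressor(strategy="mean"), x, y))
```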

Tags: rmse, machine-learning-model, linear-regression, machine-learning

Category: Data Science


It is possible that your highest- and lowest-weighted attributes are highly correlated, which is why the difference in RMSE scores is small. Both features are good predictors of the target; one is simply better, and that is reflected in its higher weight.
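
You can check this hypothesis directly by measuring the correlation between the two attributes. A minimal sketch, assuming the two columns are available as NumPy arrays (the synthetic arrays below are only illustrative):

```python
import numpy as np

# x_hi: the highest-weighted attribute, x_lo: the lowest-weighted one
# (placeholder arrays -- substitute the actual columns from your table)
rng = np.random.default_rng(1)
x_hi = rng.normal(size=3700)
x_lo = 0.9 * x_hi + 0.4 * rng.normal(size=3700)  # deliberately correlated

r = np.corrcoef(x_hi, x_lo)[0, 1]
print(f"Pearson correlation: {r:.3f}")
# |r| close to 1 means the two attributes carry nearly the same information,
# so single-attribute models built on either will score similar RMSEs.
```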

Try running a full-rank PCA to transform your d correlated features into d uncorrelated components, then run your regression on those components. Doing this will not improve your regression model, but you should begin to see a much starker difference between the RMSE for the highest- and lowest-weighted features.
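
A sketch of that transformation with scikit-learn (keeping all components makes the PCA full rank, so no information is discarded; the random data is again a placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# X: (n, d) feature matrix, y: target (placeholders for your data)
rng = np.random.default_rng(2)
X = rng.normal(size=(3700, 11))
y = X @ rng.normal(size=11) + rng.normal(size=3700)

# Full-rank PCA: n_components=None keeps all d components, so a regression
# fit on all of them is equivalent to the original regression.
pca = PCA(n_components=None)
Z = pca.fit_transform(StandardScaler().fit_transform(X))

for j in range(Z.shape[1]):
    scores = cross_val_score(LinearRegression(), Z[:, [j]], y, cv=10,
                             scoring="neg_mean_squared_error")
    print(f"component {j}: RMSE = {np.sqrt(-scores.mean()):.3f}")
# The components are uncorrelated, so the single-component RMSEs should
# spread out much more than the original single-attribute RMSEs did.
```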
