How many features do I select when doing feature selection for regression algorithms? Are R2 and RMSE good measures of success for overfitting?

Context: I'm currently building and comparing machine learning models to predict housing prices. I have around 32,000 data points and 42 features. I'm comparing a Random Forest Regressor, a Decision Tree Regressor, and Linear Regression. I can tell there is some overfitting going on, as my initial (unvalidated) values vs cross-validated values are as follows (a sketch of how such scores can be computed appears after the list):

RF: 10-fold R2 = 0.758, neg RMSE = -540.2 vs unvalidated R2 = 0.877, RMSE = 505.6
DT: 10-fold R2 = 0.711, neg RMSE = -576.4 vs unvalidated R2 = 0.829, RMSE = 595.8
LR: 10-fold R2 = 0.695, neg RMSE = -596.5 vs unvalidated R2 = 0.823, RMSE = 603.7
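
For reference, here is a minimal sketch of how 10-fold cross-validated scores and "unvalidated" (fit-and-score-on-the-same-data) scores of this kind can be computed with scikit-learn. The synthetic data and default random forest are placeholders, not the actual setup:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the ~32,000-row, 42-feature housing set
# (smaller here so the example runs quickly).
X, y = make_regression(n_samples=5000, n_features=42, noise=10.0, random_state=0)

model = RandomForestRegressor(random_state=0)

# 10-fold cross-validated scores, averaged over the held-out folds.
cv_r2 = cross_val_score(model, X, y, cv=10, scoring="r2").mean()
cv_neg_rmse = cross_val_score(model, X, y, cv=10,
                              scoring="neg_root_mean_squared_error").mean()

# "Unvalidated" scores: fit and score on the same data.
model.fit(X, y)
pred = model.predict(X)
unvalidated_r2 = r2_score(y, pred)
unvalidated_rmse = np.sqrt(mean_squared_error(y, pred))

print(cv_r2, cv_neg_rmse, unvalidated_r2, unvalidated_rmse)
```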

I have already tuned the hyperparameters for RF and DT, so I was thinking about doing feature selection as a next step to cut down on some of this overfitting (especially since I already know my feature importances/coefficients). I want to do feature selection with a filter method (e.g. Pearson's correlation), because I want to keep the features going into each model consistent.
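
As a reference point, here is a minimal sketch of a Pearson-correlation filter in pandas, assuming the data lives in a DataFrame named `df` with the target in a column called `price` (both names, and the synthetic data, are placeholders):

```python
import numpy as np
import pandas as pd

# Placeholder DataFrame standing in for the housing data.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 5)),
                  columns=[f"feat_{i}" for i in range(5)])
df["price"] = 3 * df["feat_0"] - 2 * df["feat_1"] + rng.normal(size=1000)

# Rank features by absolute Pearson correlation with the target
# (df.corr() uses Pearson by default).
corr_with_target = (
    df.corr()["price"]
      .drop("price")          # drop the target's correlation with itself
      .abs()                  # magnitude matters, not sign
      .sort_values(ascending=False)
)
print(corr_with_target)
```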

Question: How would I decide on the number of features to keep when doing feature selection? Is it arbitrary? Or do I basically just remove all the features that don't have much correlation with the target? Is there a way to produce an optimized set of features without doing a grid search or random search?
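
One common way to answer the "how many features" question without a full grid or random search is to sweep k for a correlation-based filter and keep the k with the best cross-validated score. Below is a minimal sketch using `SelectKBest` with `f_regression` (whose univariate F-statistic ranks features in the same order as absolute Pearson correlation); the synthetic data and the linear model are placeholders:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder data standing in for the housing set.
X, y = make_regression(n_samples=2000, n_features=42, n_informative=15,
                       noise=10.0, random_state=0)

best_k, best_score = None, -np.inf
for k in range(5, X.shape[1] + 1, 5):
    # Filter down to the k best-correlated features, then fit the model.
    pipe = make_pipeline(SelectKBest(f_regression, k=k), LinearRegression())
    score = cross_val_score(pipe, X, y, cv=10, scoring="r2").mean()
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)
```

Keeping the selector inside the pipeline means the filter is re-fit on each training fold, so no information from the held-out fold leaks into the feature selection.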

Follow-up question: Are the cross-validated R2 and RMSE values good measures of success for an overfitting comparison?



You have overfitting when your model corresponds too closely to the training data and may therefore fail to fit additional data or predict future observations reliably. Basically, it shows up when the performance on the training (or validation) set is much better than on the test set. You have the opposite case: the performance on the test set is much better than on the validation set.

This can happen if the two sets do not come from the same distribution (how did you divide the train/validation/test sets?). In that case, the data in the test set could be much easier to predict.

Another possibility is that the size of the test set is too small.

My suggestion is: shuffle your dataset, split it into a 70% training set and a 30% test set, run cross-validation on the training set, and compute the R2 on both sets.
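
A minimal sketch of that workflow in scikit-learn, with synthetic data and a random forest standing in for your own data and models:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder data standing in for the housing set.
X, y = make_regression(n_samples=5000, n_features=42, noise=10.0, random_state=0)

# Shuffle and split: 70% training, 30% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0)

model = RandomForestRegressor(random_state=0)

# Cross-validation on the training set only.
cv_r2 = cross_val_score(model, X_train, y_train, cv=10, scoring="r2").mean()

# Fit on the full training set, then compare R2 on both sets.
model.fit(X_train, y_train)
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))

print(cv_r2, r2_train, r2_test)
```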
