How many features do I select when doing feature selection for regression algorithms? Are R² and RMSE good measures of success for overfitting?
Context: I'm currently building and comparing machine learning models to predict housing prices. I have around 32,000 data points and 42 features, and I'm predicting house price. I'm comparing a Random Forest Regressor, a Decision Tree Regressor, and Linear Regression. I can tell there is some overfitting going on, as my initial (unvalidated) values vs. cross-validated values are as follows:
RF: 10-fold CV R² = 0.758, neg RMSE = -540.2 vs. unvalidated R² = 0.877, RMSE = 505.6
DT: 10-fold CV R² = 0.711, neg RMSE = -576.4 vs. unvalidated R² = 0.829, RMSE = 595.8
LR: 10-fold CV R² = 0.695, neg RMSE = -596.5 vs. unvalidated R² = 0.823, RMSE = 603.7
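(For reference, these numbers come from a setup roughly like the sketch below; the data loading and preprocessing are omitted, and `X`/`y` are placeholders for my 42-feature matrix and price target.)

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_validate

# X, y are placeholders for my feature matrix and housing-price target
models = {
    "RF": RandomForestRegressor(random_state=0),
    "DT": DecisionTreeRegressor(random_state=0),
    "LR": LinearRegression(),
}

for name, model in models.items():
    # 10-fold cross-validated scores
    cv = cross_validate(model, X, y, cv=10,
                        scoring=("r2", "neg_root_mean_squared_error"))
    # "unvalidated" scores: fit and evaluate on the same data
    model.fit(X, y)
    pred = model.predict(X)
    print(f"{name}: CV R2 = {cv['test_r2'].mean():.3f}, "
          f"CV neg RMSE = {cv['test_neg_root_mean_squared_error'].mean():.1f}, "
          f"unvalidated R2 = {r2_score(y, pred):.3f}, "
          f"unvalidated RMSE = {np.sqrt(mean_squared_error(y, pred)):.1f}")
```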
I have already tuned the hyperparameters for RF and DT, so I was thinking of doing feature selection as a next step to cut down on some of this overfitting (especially since I already know my feature importances/coefficients). I want to do the feature selection with a filter method (e.g., Pearson correlation), because I want the set of features going into each model to stay consistent.
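What I have in mind for the filter step is something like the sketch below: rank features by absolute Pearson correlation with price and keep the top K. (This assumes `X` is a pandas DataFrame of numeric features and `y` a Series; the cutoff K = 20 is arbitrary and only for illustration.)

```python
# Rank features by absolute Pearson correlation with the target price
K = 20  # arbitrary cutoff, chosen only for illustration

corr = X.corrwith(y).abs().sort_values(ascending=False)
selected = corr.head(K).index.tolist()

X_selected = X[selected]  # the same feature subset would then go into RF, DT, and LR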
Question: How would I decide on the number of features to keep when doing feature selection? Is it essentially arbitrary? Or do I basically just remove all of the features that have little correlation with the target? Is there a way to produce an optimized set of features without doing a grid search or random search?
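To be concrete, the brute-force approach I'd like to avoid is sweeping the number of selected features k and cross-validating the whole pipeline at every value, along these lines (again with the placeholder `X`/`y`; the step size of 5 is arbitrary):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Brute-force sweep over k: cross-validate a filter + model pipeline at each value
for k in range(5, 43, 5):
    pipe = make_pipeline(
        SelectKBest(f_regression, k=k),  # univariate F-test, i.e. ranking by squared Pearson correlation
        RandomForestRegressor(random_state=0),
    )
    score = cross_val_score(pipe, X, y, cv=10, scoring="r2").mean()
    print(f"k={k}: mean CV R2 = {score:.3f}")
```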
Follow-up question: Are the cross-validated R² and RMSE values good measures of success for comparing overfitting across models?