Feature selection before modeling with boosted trees
I have read in some papers that the subset of features chosen for a boosted-tree algorithm can make a big difference to performance, so I have been trying RFE, Boruta, variable clustering, correlation filtering, WoE/IV, and Chi-square tests.
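For reference, here is roughly how I ran one of them, Boruta (a minimal sketch; `df` and the `target` column are placeholder names):

```r
library(Boruta)

# df: data frame holding the predictors plus a binary factor column `target`
# (both names are placeholders for illustration)
set.seed(42)
boruta_res <- Boruta(target ~ ., data = df, doTrace = 0)

# Keep only attributes confirmed as important; tentative ones are dropped
selected <- getSelectedAttributes(boruta_res, withTentative = FALSE)
print(selected)
```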
Say I have a classification problem with over 40 variables. After a long, long time of testing, the best results were:
- all variables for LightGBM (except for one highly collinear variable)
- removing correlated variables for XGBoost (around 8 correlated ones; see the sketch after this list)
- removing variables based on an elastic-net model for CatBoost (around 7 of them)
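The correlation filter I used before XGBoost looked roughly like this with caret (a sketch; the 0.9 cutoff and the `df`/`target` names are placeholders):

```r
library(caret)

# Numeric predictors only; `df` and `target` are placeholder names
num_cols <- setdiff(names(df)[sapply(df, is.numeric)], "target")
cor_mat  <- cor(df[, num_cols], use = "pairwise.complete.obs")

# Indices of columns caret recommends dropping at |r| > 0.9
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9)
df_xgb   <- df[, setdiff(names(df), num_cols[drop_idx])]
```

And the elastic-net screen I used before CatBoost was along these lines with glmnet (again a sketch, with alpha = 0.5 as an arbitrary L1/L2 mix):

```r
library(glmnet)

# Model matrix without the intercept column; `target` is a binary factor
x <- model.matrix(target ~ ., data = df)[, -1]
y <- df$target

set.seed(42)
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)

# Keep predictors with nonzero coefficients at the 1-SE lambda
coefs    <- coef(cv_fit, s = "lambda.1se")
nonzero  <- rownames(coefs)[which(as.matrix(coefs) != 0)]
selected <- setdiff(nonzero, "(Intercept)")
```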
My question: what is the proper way to choose candidate variables for a boosted-tree model (especially for LightGBM)?
I'm using R, in case anyone has package suggestions.
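For context, this is the kind of LightGBM baseline I can already fit in R, from which I read off gain-based importances as a rough screen (a sketch; `x` and `y` are placeholder training data, and the parameters are arbitrary):

```r
library(lightgbm)

# x: numeric predictor matrix, y: 0/1 labels (placeholders)
dtrain <- lgb.Dataset(data = as.matrix(x), label = y)
params <- list(objective = "binary", metric = "auc", learning_rate = 0.05)
model  <- lgb.train(params = params, data = dtrain, nrounds = 200)

# Gain/Cover/Frequency importance per feature, sorted by Gain
imp <- lgb.importance(model)
head(imp, 20)
```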
Tags: catboost, lightgbm, xgboost, feature-selection, r
Category: Data Science