From logistic regression to XGBoost - selecting features to run the model with
I have been asked to look at XGBoost (as implemented in R, and with a maximum of around 50 features) as an alternative to an existing logistic regression model (not developed by me) built from a very large set of credit-risk data containing a few thousand predictors.
The documentation surrounding the logistic regression is very well prepared, and the reasons for excluding each variable have been tracked. Among those are:
- automated data audit (through an internal tool) - e.g. an excessive share of missing values, near-zero variance, etc.;
- lack of monotonic trend - variables (e.g. U-shaped ones) that remained non-monotonic after attempts at coarse classing;
- high pairwise correlation (above 70%) - on the raw level or after binning;
- low Gini / Information Value - on the raw level or after binning;
- low representativeness - assessed through the population stability index (PSI; a sketch of the computation follows this list);
- business logic / expert judgement.
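For concreteness, the PSI here is the standard population stability index as I understand it, the sum over bins of (actual% − expected%) × ln(actual% / expected%). A minimal R sketch, with made-up bin counts (the function name and all numbers are purely illustrative):

```r
# Population stability index between a baseline (expected) and a
# recent (actual) distribution over the same bins
psi <- function(expected, actual, eps = 1e-6) {
  e <- pmax(expected / sum(expected), eps)  # convert counts to shares,
  a <- pmax(actual / sum(actual), eps)      # guarding against empty bins
  sum((a - e) * log(a / e))
}

# Hypothetical bin counts for one predictor: development vs. recent sample
psi(expected = c(300, 250, 200, 150, 100),
    actual   = c(280, 240, 230, 160,  90))
```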
A huge number of the variables are derived (including aggregates such as the min / max / average of the standard deviation of other predictors), and some have been deemed too synthetic for inclusion. We have decided not to use those in XGBoost either.
The regression was initially run with 44 predictors (the output of a stepwise procedure), whereas the final approved model includes only 10.
Because I am rather new to XGBoost, I was wondering whether the feature-selection process differs substantially from what has already been done in preparation for the logistic regression, and what the relevant rules / good practices would be.
Based on what I have been reading, missing values are handled natively in XGBoost (each split learns a default direction for them), and high or even perfect correlation between features does not break the model, though it does dilute feature-importance measures. I suspect monotonicity of trend should not be a concern (the focus, unlike in regression, is on non-linear relations), hence binning is likely out; I am still a bit unsure about the handling of U-shaped variables, though. Since Gini impurity is used to decide on the best split in decision trees under the CART (“Classification and Regression Trees”) approach - although XGBoost itself scores splits with a gradient-based gain - a univariate Gini screen is perhaps one criterion worth keeping.
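To check my understanding, below is a minimal sketch of how I picture the fit in R (the data are simulated and all parameter values are purely illustrative): NAs go in as-is, and the `monotone_constraints` parameter would, as I read the documentation, let me force a monotonic effect on chosen features rather than pre-binning them.

```r
library(xgboost)

set.seed(42)
n <- 1000; n_feat <- 5
x <- matrix(rnorm(n * n_feat), ncol = n_feat)
x[sample(length(x), 200)] <- NA              # XGBoost accepts NAs natively
y <- rbinom(n, 1, plogis(ifelse(is.na(x[, 1]), 0, x[, 1])))

dtrain <- xgb.DMatrix(data = x, label = y)   # NA is the default 'missing' marker

params <- list(
  objective = "binary:logistic",
  eta = 0.1,
  max_depth = 3,
  # +1 forces a non-decreasing effect, -1 a non-increasing one, 0 leaves it
  # free; here feature 1 is constrained upward, the rest are unconstrained
  monotone_constraints = "(1,0,0,0,0)"
)

bst <- xgb.train(params = params, data = dtrain, nrounds = 100)
xgb.importance(model = bst)                  # gain-based importance table
```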
I have been entertaining the idea of running the candidate variables through our internal automated data audit tool, removing the std aggregates (too synthetic, as per above), removing low-Gini and low-PSI variables, potentially treating for very high (95%+) correlation, and then applying lasso / elastic net and taking it from there; a sketch of that pipeline follows. I am aware that Boruta is relevant here, but as of now I still have no solid opinion on it.
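In pipeline form, the shortlisting step I have in mind would look roughly like this (a sketch only: `audited_df` and the target name `default` are placeholders for the output of the internal audit tool, and the 0.95 cutoff is the one mentioned above):

```r
library(caret)    # for findCorrelation()
library(glmnet)   # lasso / elastic net

# audited_df: placeholder for the post-audit data set, with std aggregates
# and low-Gini / low-PSI variables already removed; 'default' is the target
x <- as.matrix(audited_df[, setdiff(names(audited_df), "default")])
y <- audited_df$default

# Drop one member of each pair with absolute correlation above 0.95
high_cor <- findCorrelation(cor(x, use = "pairwise.complete.obs"), cutoff = 0.95)
if (length(high_cor) > 0) x <- x[, -high_cor, drop = FALSE]

# Cross-validated lasso (alpha = 1); any alpha in (0, 1) gives elastic net.
# Unlike XGBoost, glmnet cannot handle missing values, hence complete cases
cc <- complete.cases(x)
cv_fit <- cv.glmnet(x[cc, ], y[cc], family = "binomial", alpha = 1)

# Features surviving at the 1-SE lambda
coefs <- coef(cv_fit, s = "lambda.1se")
setdiff(rownames(coefs)[which(coefs[, 1] != 0)], "(Intercept)")

# The alternative I have not yet formed an opinion on:
# Boruta::Boruta(default ~ ., data = audited_df)
```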
Tags: boruta, xgboost, logistic-regression, feature-selection
Category: Data Science