From logistic regression to XGBoost - selecting features to run the model with

I have been asked to evaluate XGBoost (as implemented in R, and with a maximum of around 50 features) as an alternative to an existing logistic regression model, not developed by me, that was built from a very large set of credit risk data containing a few thousand predictors.

The documentation surrounding the logistic regression is very thorough, and track has been kept of the reason for excluding each variable. Among those reasons are:

  • automated data audit (through internal tool) - i.e. detected excessive number of missings, or incredibly low variance, etc.;
  • lack of monotonic trend - for u-shaped variables after attempts at coarse classing;
  • high correlation (>70%) - on raw level or after binning;
  • low GINI / Information Value - on raw level or after binning;
  • low representativeness - assessed through population stability index, PSI;
  • business logic / expert judgement.
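Of these screens, PSI is the one most likely to carry over unchanged to a boosted model. As a reference, it can be computed per bin as in this minimal sketch (pure Python; the bin proportions are hypothetical):

```python
import math

def psi(expected_props, actual_props, eps=1e-6):
    """Population stability index between two binned distributions.

    Both inputs are lists of bin proportions that each sum to 1.
    """
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

# Identical distributions give PSI = 0; a common rule of thumb treats
# PSI > 0.25 as a significant population shift.
print(psi([0.25, 0.25, 0.25, 0.25], [0.10, 0.20, 0.30, 0.40]))
```

The thresholds (0.1 "small shift", 0.25 "significant shift") are conventions rather than hard rules, so the cut-off used for "low representativeness" should match whatever the existing model documentation used.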

A huge number of the variables are derived (incl. aggregates like min / max / avg of the standard deviation of other predictors) and some have been deemed too synthetic for inclusion. We have decided to not use those in XGBoost either.

The regression was initially run with 44 predictors (the output of a stepwise procedure), whereas the final approved model includes only 10.

Because I am rather new to XGBoost, I was wondering whether the feature selection process differs substantially from what has already been done in preparation for the logistic regression, and what some rules / good practices would be.

Based on what I have been reading, perfect correlation and missing values are both handled automatically in XGBoost. I suspect monotonicity of trend should not be a concern (the focus, unlike in regression, is on non-linear relations), so binning is likely out; beyond that, I am a bit unsure about the handling of u-shaped variables. Since GINI is used to decide on the best split in decision trees in general under the CART ("Classification and Regression Trees") approach, that may be one criterion worth keeping.

I have been entertaining the idea of running our internal automated data audit tool, removing the std aggregates (too synthetic, as above), removing low-GINI and low-PSI variables, possibly treating very high (95%+) correlation, and then applying lasso / elastic net and taking it from there. I am aware that Boruta is relevant here, but as of now I have no firm opinion on it.
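The correlation screen in that plan can be sketched as follows (pure Python, with hypothetical toy features; in practice it would run over the real predictor matrix):

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_highly_correlated(features, threshold=0.95):
    """Greedy screen: keep a feature unless its absolute correlation
    with an already-kept feature exceeds the threshold."""
    kept = []
    for name, col in features.items():
        if all(abs(pearson(col, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy data: x2 is a scaled copy of x1, x3 is unrelated.
feats = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],   # perfectly correlated with x1
    "x3": [5, 1, 4, 2, 3],
}
print(drop_highly_correlated(feats))  # x2 is dropped
```

Note the greedy order matters: of a correlated pair, the feature seen first is kept, so it can help to sort the features by univariate strength (e.g. GINI) before screening.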



One of the main advantages of tree-based methods is that high correlation is much less of a problem than in linear models. Another advantage is that there is no parametric assumption behind the model. So there is a good chance that some of the variables you dropped for the logit (high correlation, u-shaped) can be used with boosting.

Regarding raw levels and/or low representativeness: it could be that some of these features are actually useful in the boosting process. I would at least give them a try, then check the feature importance and exclude features with little predictive power.

Features that are mostly missing carry little information, so consider excluding those with a very high share of missings. XGBoost does handle missing values natively (it learns a default split direction for them), but that does not conjure signal out of a nearly empty column. The same goes for features excluded by expert advice: you could still see whether some of them help. This is a matter of testing.

Note that xgboost in R requires a numeric matrix, so factors must be encoded as dummies (e.g. via model.matrix or sparse.model.matrix). This is highly relevant when you work with R.
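For illustration, the dummy expansion such a matrix interface expects looks like this (a pure-Python sketch; in R, model.matrix performs the equivalent expansion and by default drops one reference level via treatment contrasts):

```python
def one_hot(values):
    """Expand a categorical column into 0/1 dummy columns, one per level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values]
            for level in levels}

print(one_hot(["A", "B", "A"]))  # prints {'A': [1, 0, 1], 'B': [0, 1, 0]}
```

For tree models the reference-level column can be kept or dropped; unlike in linear models, perfect multicollinearity among the dummies is not a numerical problem here.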

You can add early stopping so that boosting stops once no more progress is made on a validation set.
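The logic behind early stopping (xgboost exposes it as early_stopping_rounds) can be sketched as: keep the round with the best validation score and stop once a fixed number of rounds pass without improvement. A minimal illustration with a made-up loss curve:

```python
def rounds_until_stop(val_losses, patience=10):
    """Number of boosting rounds kept under early stopping: track the
    best validation loss and stop once `patience` rounds pass with no
    improvement (mirrors xgboost's early_stopping_rounds)."""
    best, best_round = float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_round = loss, i
        elif i - best_round >= patience:
            break
    return best_round + 1

# Validation loss stops improving after the third round:
print(rounds_until_stop([0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.39],
                        patience=3))  # prints 3
```

With a low learning rate you would typically set the round budget generously and let early stopping pick the effective number of trees.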

Regarding your list, my feeling (not knowing the data in detail) would be...

Can be used with xgboost:

  • lack of monotonic trend - for u-shaped variables after attempts at coarse classing;
  • high correlation (>70%) - on raw level or after binning;

Worth a try:

  • low GINI / Information Value - on raw level or after binning;
  • low representativeness - assessed through population stability index, PSI;

Exclude:

  • business logic / expert judgement.
  • automated data audit (through internal tool) - i.e. detected excessive number of missings, or incredibly low variance, etc.;

Some ideas for the model:

There are many hyperparameters that can or must be tuned. Make sure you pick the right "Learning Task Parameters" for your problem (for credit default this is likely a binary objective such as binary:logistic). Have a look at the docs; there are some good hints. Also try regularisation (alpha, lambda). Play with max_depth (values between 5 and 8 usually work well): trees that are too small do not fit well, trees that are too large overfit. Look at subsample to fight overfitting by making the tree booster "stochastic". Also look at the learning rate eta; often a lower learning rate works fine with larger data.
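As a purely illustrative starting point (the values are assumptions, not tuned for any particular data set; the parameter names are standard xgboost ones), a parameter set along those lines could look like:

```python
# Illustrative starting values only; tune via cross-validation.
params = {
    "objective": "binary:logistic",  # binary default / non-default target
    "eval_metric": "auc",            # GINI = 2 * AUC - 1
    "eta": 0.05,                     # low learning rate, more rounds
    "max_depth": 6,                  # typically 5-8 works well
    "subsample": 0.8,                # stochastic boosting vs. overfitting
    "colsample_bytree": 0.8,         # feature subsampling per tree
    "alpha": 0.1,                    # L1 regularisation
    "lambda": 1.0,                   # L2 regularisation
}
print(params["objective"])
```

This dictionary would be passed to xgb.train (or the equivalent argument list in the R interface), with early stopping choosing the number of rounds.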

You can also add custom loss functions, which can be helpful. I played around with that in R, and you can find some code (the Python version is not so different) here: https://github.com/Bixi81/R-ml

As an alternative to xgboost, you could also look at LightGBM https://github.com/Bixi81/Python-ml/blob/master/boosting_regression_boston.py
