Why is HistGradientBoostingRegressor in sklearn so fast and low on memory?

I trained multiple models for my problem, and most ensemble algorithms had lengthy fit times and a huge model size on disk (approx 10 GB for RandomForest), but when I tried HistGradientBoostingRegressor from sklearn, fitting took only around 10 sec and the model size was also small (approx 1 MB), with fairly accurate predictions. I came across this histogram-based approach while trying out GradientBoostingRegressor, and it outperforms the other algorithms in both time and memory. I understand it is inspired by LightGBM from Microsoft, which is gradient boosting optimized for time and memory, but I would like to know why it is faster and lower on memory, in plainer English than the docs explain it. If you could post some resources that explain this better, that would help too.

Tags: gradient-boosting-decision-trees, lightgbm, supervised-learning, memory, scikit-learn

In case you hadn't seen the User Guide section for this method, the explanation there is pretty good:

These fast estimators first bin the input samples X into integer-valued bins (typically 256 bins) which tremendously reduces the number of splitting points to consider, and allows the algorithm to leverage integer-based data structures (histograms) instead of relying on sorted continuous values when building the trees.
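To make that concrete, here is a minimal numpy sketch of the binning idea. This is a toy illustration, not sklearn's actual implementation (internally, sklearn derives its bin edges from feature quantiles, which this sketch mimics with np.quantile):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)  # one continuous feature

# Exact split finding: every midpoint between consecutive unique
# values is a candidate, so ~999,999 candidates for this feature.
n_exact_candidates = np.unique(x).size - 1

# Histogram approach: map values to at most 256 integer bins using
# (approximately) quantile-based bin edges, as the docs describe.
edges = np.quantile(x, np.linspace(0, 1, 257)[1:-1])  # 255 interior edges
x_binned = np.searchsorted(edges, x).astype(np.uint8)  # fits in one byte

# Now there are at most 255 split candidates, no matter how many
# samples there are.
n_binned_candidates = np.unique(x_binned).size - 1

print(n_exact_candidates)   # ~999999
print(n_binned_candidates)  # <= 255
```

Note the dtype: the binned copy of the data needs one byte per value instead of eight for a float64, which is where much of the memory saving comes from.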

In the usual exact tree-building algorithm, every midpoint between consecutive sorted values of a continuous feature is a candidate split, so the number of candidates grows with the number of distinct values. Binning caps that at the number of bins: with 256 bins there are at most 255 thresholds to evaluate per feature, regardless of how many samples you have. Memory needs shrink as well, because each binned value fits in a single byte, and split finding only needs per-bin statistics (sample counts and sums of gradients and hessians), never the sorted raw feature values.
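Here is a toy sketch of why that speeds up split finding itself: once per-bin counts and gradient sums are accumulated in one pass (np.bincount below), the best threshold for a node is found by scanning at most 255 bin boundaries. This illustrates the technique for squared-error loss (where every sample's hessian is 1); it is not sklearn's internal code:

```python
import numpy as np

def best_split_from_histogram(x_binned, gradients, n_bins=256):
    """Toy histogram-based split finding for squared-error loss.

    Only per-bin counts and gradient sums are needed; the raw
    (sorted) feature values never enter the computation.
    """
    # Build the histogram in one pass over the data.
    counts = np.bincount(x_binned, minlength=n_bins)
    grad_sums = np.bincount(x_binned, weights=gradients, minlength=n_bins)

    total_count, total_grad = counts.sum(), grad_sums.sum()

    best_gain, best_bin = -np.inf, None
    left_count = left_grad = 0.0
    # Scan bins left to right: each boundary is one split candidate.
    for b in range(n_bins - 1):
        left_count += counts[b]
        left_grad += grad_sums[b]
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        right_grad = total_grad - left_grad
        # Gradient-boosting gain with unit hessians; the constant
        # parent term is dropped since it doesn't affect the argmax.
        gain = left_grad**2 / left_count + right_grad**2 / right_count
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

# Usage on synthetic binned data and gradients:
rng = np.random.default_rng(0)
xb = rng.integers(0, 256, size=10_000).astype(np.uint8)
g = rng.normal(size=10_000)
print(best_split_from_histogram(xb, g))
```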
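Finally, the speed difference is easy to observe directly. A quick benchmark sketch (exact timings depend on your machine and library version):

```python
from time import perf_counter

from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              HistGradientBoostingRegressor)

X, y = make_regression(n_samples=100_000, n_features=20, random_state=0)

# Fit the exact and the histogram-based estimator on the same data.
for Model in (GradientBoostingRegressor, HistGradientBoostingRegressor):
    t0 = perf_counter()
    Model(random_state=0).fit(X, y)
    print(f"{Model.__name__}: {perf_counter() - t0:.1f}s")
```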
