How does XGBoost's exact greedy split-finding algorithm determine candidate split values for different feature types?

Based on the paper by Chen and Guestrin (2016), "XGBoost: A Scalable Tree Boosting System", XGBoost's "exact split finding algorithm enumerates over all the possible splits on all the features to find the best split" (page 3). My understanding was therefore that XGBoost enumerates over all features and uses the unique values of each feature as candidate split points, then chooses the split value that maximizes the splitting criterion (gain).

My question, then, is why the split values chosen for float-type features are often not one of the unique values of that feature. For example, suppose a feature in the data has floating-point values such as 966.0, 1234.0, 2350.0, 4567.0, .... If XGBoost chooses to split on that feature, the split value can be something like 1800.5, which is not one of the unique values for that feature but is often the mid-point between two of them.

Also, assuming no missing values in the data, how does XGBoost enumerate candidate split points for ordinal-type features with values like {1, 2, 3, 4, 5, 6, 7, 8}, declared as integers? Does the algorithm treat such a feature as categorical?
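(One way to check this empirically is to fit a small model and dump the learned trees. Below is a minimal sketch with made-up data and feature names, assuming the scikit-learn wrapper XGBRegressor with tree_method="exact" and the trees_to_dataframe() dump; it only shows where the chosen thresholds fall relative to the observed values.)

```python
import numpy as np
import pandas as pd
import xgboost as xgb

# Made-up data: one float feature (values as in the example above) and one
# ordinal integer feature with values 1..8.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "float_feat": rng.choice([966.0, 1234.0, 2350.0, 4567.0], size=200),
    "int_feat": rng.integers(1, 9, size=200),
})
y = (X["float_feat"] > 2000).astype(float) + 0.1 * X["int_feat"] + rng.normal(0, 0.1, 200)

model = xgb.XGBRegressor(tree_method="exact", max_depth=3, n_estimators=5)
model.fit(X, y)

# Dump the learned trees and look at the thresholds actually used.
trees = model.get_booster().trees_to_dataframe()
print(trees.loc[trees["Feature"] != "Leaf", ["Feature", "Split"]].drop_duplicates())
# Expect thresholds that fall between adjacent observed values (e.g. 1792.0
# between 1234.0 and 2350.0, or 4.5 between 4 and 5) for both features.
```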

I've read a lot of good answers on the forum, like here and here, and have gone through the xgboost documentation and tutorials, but they don't seem to explain the specific procedure for enumerating candidate split points in the exact greedy algorithm.

Could someone familiar with the source code perhaps shed light on these questions?

Topic boosting xgboost decision-trees machine-learning

Category Data Science


I dug into this a bit over at this answer, including a Colab notebook. I'll mostly refer you to that answer for the references, but I'll summarize here:

At some point in time, xgboost had different behavior for low-cardinality discrete features, splitting at the actual data values, whereas continuous features have always(?) been split at midpoints between training data values. The current behavior, though, is to always split at midpoints, with no special treatment of discrete features. (Unordered categoricals now have some experimental handling, but that's an entirely different matter.)
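For example, here is a rough way to check the current midpoint behavior on a toy discrete feature (synthetic data, not taken from the linked notebook; it assumes the native xgb.train API):

```python
import numpy as np
import xgboost as xgb

# Synthetic check: a single integer-valued feature 1..8 and a target that
# jumps at 5, so the best cut lies between the observed values 4 and 5.
rng = np.random.default_rng(42)
x = rng.integers(1, 9, size=500).astype(float)
y = (x >= 5).astype(float) + rng.normal(0, 0.05, size=500)

booster = xgb.train(
    {"tree_method": "exact", "max_depth": 1},
    xgb.DMatrix(x.reshape(-1, 1), label=y),
    num_boost_round=1,
)

threshold = booster.trees_to_dataframe()["Split"].dropna().iloc[0]
midpoints = (np.unique(x)[:-1] + np.unique(x)[1:]) / 2
print(threshold, midpoints)  # threshold should be one of the midpoints, here 4.5
```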


In your quote from the paper:

exact split finding algorithm enumerates over all the possible splits on all the features to find the best split

"all possible splits" can be understood to mean splits of the training data into two pieces (left and right), and so doesn't need to mean anything about whether the actual feature values or some point between values is used.

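To make that concrete, here is a rough single-feature sketch of the enumeration (an illustration of Algorithm 1 and the regularized gain formula from the paper, not the actual C++ implementation; the function name is mine): every position between two distinct sorted values is a candidate partition, and the threshold that gets recorded can be the midpoint of the two straddling values.

```python
import numpy as np

def exact_greedy_best_split(x, grad, hess, reg_lambda=1.0, gamma=0.0):
    """Illustrative exact greedy split search for one feature."""
    order = np.argsort(x)
    x, grad, hess = x[order], grad[order], hess[order]
    G, H = grad.sum(), hess.sum()

    best_gain, best_threshold = 0.0, None
    G_L = H_L = 0.0
    for i in range(len(x) - 1):
        G_L += grad[i]
        H_L += hess[i]
        # Positions between equal values give the same left/right partition,
        # so only boundaries between distinct values are real candidates.
        if x[i] == x[i + 1]:
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + reg_lambda)
                      + G_R**2 / (H_R + reg_lambda)
                      - G**2 / (H + reg_lambda)) - gamma
        if gain > best_gain:
            best_gain = gain
            # Any threshold strictly between x[i] and x[i+1] yields the same
            # partition; reporting the midpoint is one natural choice.
            best_threshold = (x[i] + x[i + 1]) / 2
    return best_threshold, best_gain
```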

The splitting criterion for float-type features is covered in the paper in Section 3.3, "Weighted Quantile Sketch", and in the Appendix.

The quantile sketch divides the data into weighted quantiles and proposes splits based on that approximation.

The split value does not have to be a unique value of the feature because the algorithm uses approximations.
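As a rough illustration (a simplified stand-in, not the paper's actual weighted quantile sketch data structure; the function name is made up), candidate thresholds for the approximate variant can be taken at hessian-weighted quantiles of the feature, so they need not coincide with observed values:

```python
import numpy as np

def propose_candidates_by_quantiles(x, hess, num_bins=8):
    """Simplified quantile-based candidate proposal for one feature."""
    order = np.argsort(x)
    x, hess = x[order], hess[order]
    # The cumulative hessian weight plays the role of the rank function r_k
    # in the paper; evenly spaced ranks give the candidate positions.
    cum_w = np.cumsum(hess) / hess.sum()
    targets = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]
    idx = np.searchsorted(cum_w, targets)
    # Use points between the chosen value and its neighbor as thresholds,
    # so candidates generally are not observed feature values themselves.
    nxt = np.minimum(idx + 1, len(x) - 1)
    return (x[idx] + x[nxt]) / 2
```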
