Need help understanding xgboost's approximate split points proposal

background:

in xgboost, the $t$-th iteration tries to fit a tree $f_t$ over all $n$ examples which minimizes the following objective:

$$\sum_{i=1}^n[g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)]$$

where $g_i, h_i$ are the first- and second-order derivatives of the loss with respect to our previous best estimate $\hat{y}$ (from iteration $t-1$):

  • $g_i=d_{\hat{y}}l(y_i, \hat{y}) $
  • $h_i=d^2_{\hat{y}}l(y_i, \hat{y}) $

and $l$ is our loss function.
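
For concreteness, here is how I compute $g_i$ and $h_i$ for two common losses (a little snippet of my own, not from the paper, so correct me if the derivatives are off):

```python
import numpy as np

# Made-up toy data, just to make g_i and h_i concrete.
y = np.array([1.0, 0.0, 2.0])        # true labels
y_hat = np.array([0.5, 0.2, 1.5])    # predictions from iteration t-1

# Squared-error loss l(y, y_hat) = 0.5 * (y_hat - y)^2:
g = y_hat - y                        # first derivative w.r.t. y_hat
h = np.ones_like(y_hat)              # second derivative w.r.t. y_hat (constant 1)

# Logistic loss on the logit y_hat, with p = sigmoid(y_hat):
p = 1.0 / (1.0 + np.exp(-y_hat))
g_logistic = p - y                   # first derivative
h_logistic = p * (1.0 - p)           # second derivative
```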


The question (finally):

When building $f_t$ and considering a specific feature $k$ for a specific split, they use the following heuristic to assess only some split candidates: they sort all examples by their $x_k$, pass over the sorted list, and accumulate their second derivatives $h_i$. They consider a split candidate only when the sum changes by more than $\epsilon$. Why is that???

The explanation they give eludes me:

They claim we can rewrite the previous equation like so:

$$\sum_{i=1}^n\frac{1}{2}h_i[f_t(x_i) - g_i/h_i]^2 + constant$$

and I fail to follow the algebra - can you show why it is equal?

And then they claim that "this is exactly weighted squared loss with labels $g_i/h_i$ and weights $h_i$" - a statement I agree with, but I don't understand how it relates to the split candidate algorithm they are using...

Thanks and sorry if this is too long for this forum.



I came across these equations too and tried to derive them, in addition to @ihadanny's answer:

\begin{align*} \tilde{\mathcal{L}}^{(t)} &= \sum_{i=1}^n \left[g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right ] \\ &= \frac{1}{2} \sum_{i=1}^n h_i \left[ f_t^2(\mathbf{x}_i) + 2\frac{g_i}{h_i} f_t(\mathbf{x}_i) \right ] \\ &= \frac{1}{2} \sum_{i=1}^n h_i \left[ \left ( f_t(\mathbf{x}_i) + \frac{g_i}{h_i} \right)^2 - \left( \frac{g_i}{h_i} \right )^2 \right ] \\ &= \frac{1}{2} \sum_{i=1}^n h_i \left ( f_t(\mathbf{x}_i) + \frac{g_i}{h_i} \right)^2 - \frac{1}{2} \sum_{i=1}^n h_i \left( \frac{g_i}{h_i} \right )^2 \\ &= \frac{1}{2} \sum_{i=1}^n h_i \left ( f_t(\mathbf{x}_i) + \frac{g_i}{h_i} \right)^2 + constant \\ \end{align*}

It looks like $- \frac{g_i}{h_i}$ is a typo in the paper; it should have been $+\frac{g_i}{h_i}$ (please let me know if I'm wrong).
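
A quick numerical sanity check of this identity (a throwaway snippet of mine with made-up $g_i$, $h_i$ and leaf values, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=5)               # first-order stats from iteration t-1
h = rng.uniform(0.1, 2.0, size=5)    # second-order stats (assumed positive)
f = rng.normal(size=5)               # candidate values f_t(x_i)

lhs = np.sum(g * f + 0.5 * h * f**2)
rhs = 0.5 * np.sum(h * (f + g / h)**2) - 0.5 * np.sum(h * (g / h)**2)

print(np.allclose(lhs, rhs))         # True: the two forms differ only by the constant term
```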


And then they claim that "this is exactly weighted squared loss with labels $g_i/h_i$ and weights $h_i$" - a statement I agree with, but I don't understand how it relates to the split candidate algorithm they are using...

  1. If there is only one sample and you are optimizing $w$ at the $t$-th iteration, it is easy to see that the optimal value would be $w^* = -g_i/h_i$, which explains the $(f_t - (-g_i/h_i))^2$ term.

  2. Now you have an entire data set. In the case where the loss function has an identical second derivative for every sample, $w^*$ would become $-\mathrm{avg}(g_i)/\mathrm{const}$ instead of $-\sum_i g_i/\sum_i h_i$. I wrote it this way because in that case $w^*$ would be unaffected by differences in $h_i$ among samples, since there are none. In reality, however, keeping the $g_i$ unchanged, $w^*$ fluctuates with the distribution of the $h_i$.

I think this explains why it works: the loss is weighted by $h_i$.
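
To make the "weighted by $h_i$" point concrete (a toy sketch of my own, with made-up numbers): the leaf value minimizing $\sum_i [g_i w + \frac{1}{2} h_i w^2]$ is $w^* = -\sum_i g_i / \sum_i h_i$, which is exactly the $h_i$-weighted average of the "labels" $-g_i/h_i$:

```python
import numpy as np

g = np.array([0.4, -1.2, 0.7, 0.1])
h = np.array([0.2, 0.9, 0.5, 0.1])               # assumed positive

w_star = -g.sum() / h.sum()                      # closed-form optimal leaf value
w_weighted_avg = np.average(-g / h, weights=h)   # h-weighted mean of the labels -g/h

print(np.isclose(w_star, w_weighted_avg))        # True
```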


Just adding the algebraic part to @Winks' answer:

The second equation should have its sign reversed, as in:

$$\sum_{i=1}^n\frac{1}{2}h_i[f_t(x_i) - (-g_i/h_i)]^2 = \sum_{i=1}^n\frac{1}{2}h_i[f_t^2(x_i) + 2\frac{f_t(x_i)g_i}{h_i} + (g_i/h_i)^2] = \sum_{i=1}^n[g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i) + \frac{g_i^2}{2h_i}]$$

The last term is indeed constant: remember that the $g_i$ and $h_i$ are determined by the previous iteration, so they're constant when trying to set $f_t$.

So, now we can claim "this is exactly weighted squared loss with labels $-g_i/h_i$ and weights $h_i$"

Credit goes to Yaron and Avi from my team for explaining this to me.


I won't go into details but the following should help you grasp the idea.

They use Quantiles (Wikipedia) to determine where to split. If you have 100 possible split points, $\{x_1, \cdots, x_{100}\}$ (sorted), you can try the $10$-quantile split points $\{x_{10}, x_{20}, \cdots, x_{90}\}$ and already have a good approximation. This is what the $\epsilon$ parameter controls. They consider a split point when it has $\sim \epsilon N$ more points under it than the last split point. If $\epsilon = 0.01$, you will end up with $\sim 100$ split points, each larger than $\{1\%, 2\%, ..., 99\%\}$ of the other points. They do not consider a new split when "the sum changes by more than $\epsilon$", but when the fraction of points below the current point has grown by roughly $\epsilon$ since the last chosen split point.
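
A rough sketch of this unweighted version (toy code of my own, not xgboost's actual implementation):

```python
import numpy as np

def quantile_candidates(x, eps=0.1):
    """Unweighted version: emit a candidate every time the fraction of points
    passed so far grows by about eps."""
    x_sorted = np.sort(x)
    n = len(x_sorted)
    candidates, last_rank = [], 0.0
    for i, value in enumerate(x_sorted):
        rank = (i + 1) / n               # fraction of points <= value
        if rank - last_rank >= eps:
            candidates.append(value)
            last_rank = rank
    return candidates

x = np.random.default_rng(1).normal(size=1000)
print(len(quantile_candidates(x, eps=0.01)))     # roughly 1/eps = 100 candidates
```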

Now, if you have a lot of contiguous points that are already well classified, it is probably useless to split between them. You want to split the parts of your data set that are very wrong, the ones that are difficult to learn. To do so, they use weighted quantiles. This is where the weights play a role: the first $10$-quantile is no longer the first point that is larger than $10\%$ of the points, but the first point that is larger than $10\%$ of the weights.
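
The weighted variant is the same idea with the plain rank replaced by the cumulative share of the $h_i$ weights (again just my own sketch of the idea, not the paper's exact weighted quantile sketch data structure):

```python
import numpy as np

def weighted_quantile_candidates(x, h, eps=0.1):
    """Weighted version: emit a candidate every time the cumulative share of
    h-weight grows by about eps, so hard-to-fit regions get more candidates."""
    order = np.argsort(x)
    x_sorted, h_sorted = x[order], h[order]
    total = h_sorted.sum()
    candidates, last_share, cum = [], 0.0, 0.0
    for value, weight in zip(x_sorted, h_sorted):
        cum += weight
        share = cum / total              # h-weighted rank of `value`
        if share - last_share >= eps:
            candidates.append(value)
            last_share = share
    return candidates

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
h = rng.uniform(0.01, 1.0, size=1000)            # pretend second-order stats
print(len(weighted_quantile_candidates(x, h, eps=0.01)))   # close to 1/eps candidates
```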
