Output value of a gradient boosting decision tree node that has just a single example in it

The general gradient boosting algorithm for tree-based classifiers is as follows:

Input: training set $\{(x_{i},y_{i})\}_{i=1}^{n}$, a differentiable loss function $L(y,F(x))$, and a number of iterations $M$.

Algorithm:

  1. Initialize model with a constant value: $\displaystyle F_{0}(x)={\underset {\gamma }{\arg \min }}\sum _{i=1}^{n}L(y_{i},\gamma )$.

  2. For m = 1 to M:

    1. Compute so-called pseudo-residuals: $$r_{im}=-\left[{\frac {\partial L(y_{i},F(x_{i}))}{\partial F(x_{i})}}\right]_{F(x)=F_{m-1}(x)}\quad {\mbox{for }}i=1,\ldots ,n.$$
    2. Fit a base learner (or weak learner, e.g. tree) $\displaystyle h_{m}(x)$ to pseudo-residuals, i.e. train it using the training set $\{(x_{i},r_{im})\}_{i=1}^{n}$.
    3. Update the model by solving for the output value of each leaf; here $\gamma$ denotes the output value of a leaf: $$F_{m}(x)=F_{m-1}(x)+\sum _{j=1}^{J_{m}}\gamma _{jm}\mathbf {1} _{R_{jm}}(x),\quad \gamma _{jm}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{x_{i}\in R_{jm}}L(y_{i},F_{m-1}(x_{i})+\gamma ).$$
  3. Output $F_{M}(x)$.
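To make the loop above concrete, here is a minimal sketch of steps 1–3 for the binary log-loss case. This is not code from the question or the video; the use of `DecisionTreeRegressor`, and names such as `learning_rate` and `n_trees`, are my own assumptions. Each tree is fit to the pseudo-residuals, and each leaf gets the one-step Newton value derived below.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gradient_boosting(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Gradient boosting with log-loss; F holds log-odds for each training sample."""
    # Step 1: initialize with a constant value, the log-odds of the base rate.
    p0 = np.clip(y.mean(), 1e-12, 1 - 1e-12)
    F = np.full(len(y), np.log(p0 / (1 - p0)))
    trees, leaf_values = [], []

    for _ in range(n_trees):                                # step 2
        p = sigmoid(F)
        residuals = y - p                                   # 2.1: pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)   # 2.2: fit weak learner
        tree.fit(X, residuals)
        leaves = tree.apply(X)
        gammas = {}                                         # 2.3: leaf output values
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            # One Newton step per leaf: sum of residuals / sum of p(1 - p).
            gammas[leaf] = residuals[mask].sum() / max((p[mask] * (1 - p[mask])).sum(), 1e-12)
        F = F + learning_rate * np.array([gammas[l] for l in leaves])
        trees.append(tree)
        leaf_values.append(gammas)

    return trees, leaf_values
```

A full implementation would also keep the initial log-odds and the per-tree leaf values around for prediction; this sketch only shows the training loop.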

Now the loss function they use is the log-loss: $-\sum_i \left(y_i\log(p_i)+(1-y_i)\log(1-p_i)\right)$, where $i$ indexes the samples in the training set. The output of a tree leaf in the algorithm is assumed to be the log-odds of the sample belonging to the positive class.
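For reference, writing $p = \sigma(F) = \frac{1}{1+e^{-F}}$ for the probability implied by the log-odds $F$ (this intermediate step is my own, spelled out for later use), the per-sample log-loss has

$$\frac{\partial L(y, F)}{\partial F} = p - y, \qquad \frac{\partial^2 L(y, F)}{\partial F^2} = p(1-p),$$

so the pseudo-residual is $r = -\frac{\partial L}{\partial F} = y - p$. These two derivatives are all that is needed for the Taylor-series step described next.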

In practice, the algorithm computes $\gamma _{jm}$ by expanding the loss function in a second-order Taylor series around $F_{m-1}$ and then differentiating that expansion with respect to $\gamma$.

I will write out the formulation for the case where there is just one sample ($x_1$) in a terminal node.

$$L(y_1,F_{m-1}(x_1)+\gamma)\approx L(y_1,F_{m-1}(x_1)) + \frac{\partial L(y_1,F_{m-1}(x_1))}{\partial F_{m-1}}\gamma + \frac{1}{2}\frac{\partial^2 L(y_1,F_{m-1}(x_1))}{\partial F_{m-1}^2}\gamma^2$$

This approximation is then differentiated with respect to $\gamma$ and set equal to zero to obtain the optimal value of $\gamma$:

$$\gamma = -\frac{\frac{\partial L(y_1,F_{m-1}(x_1))}{\partial F_{m-1}}}{\frac{\partial^2 L(y_1,F_{m-1}(x_1))}{\partial F_{m-1}^2}}$$

This evaluates to $$\gamma = \frac{\mathrm{Residual}_1}{P_0(1-P_0)} \tag{1}$$ where $P_0$ is the most recently predicted probability for $x_1$ and $\mathrm{Residual}_1 = y_1 - P_0$ is the pseudo-residual.
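As a quick numerical sanity check of equation (1) (my own sketch, with made-up numbers rather than anything from the question), one can minimize the second-order Taylor approximation over a fine grid of $\gamma$ values and compare against the closed form:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y1, F_prev = 1.0, 0.2            # made-up label and previous log-odds prediction
p0 = sigmoid(F_prev)
grad = p0 - y1                   # dL/dF   = p - y
hess = p0 * (1 - p0)             # d2L/dF2 = p(1 - p)

def taylor_loss(gamma):
    # Second-order Taylor approximation of the loss around F_prev (constant term dropped).
    return grad * gamma + 0.5 * hess * gamma ** 2

grid = np.linspace(-10, 10, 200001)
gamma_grid = grid[np.argmin(taylor_loss(grid))]
gamma_closed = (y1 - p0) / (p0 * (1 - p0))   # equation (1): Residual / (P(1 - P))
print(gamma_grid, gamma_closed)              # agree up to the grid resolution
```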

The example below is taken from this YouTube video, linked at the exact time (16:43).

For an example case where $F_0(x_1) = 0.69$ (this is a log-odds value):

Converting the log-odds to a probability gives $P_0 = 0.67$, and so $\mathrm{Residual}_1 = y_1-P_0 = 0.33$ (taking $y_1 = 1$).

They then compute $\gamma$ (the output value) for the leaf containing $x_1$ by plugging these values into equation (1), which gives $$\gamma = \frac{0.33}{(0.67)(0.33)} = 1.49.$$
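Plugging the example's numbers into equation (1) directly (a minimal check, assuming $y_1 = 1$):

```python
import numpy as np

F0 = 0.69                          # initial log-odds from the example
p0 = 1.0 / (1.0 + np.exp(-F0))     # ~0.67, the implied probability
residual = 1.0 - p0                # y1 = 1, so Residual_1 = y1 - P0 ~ 0.33
gamma = residual / (p0 * (1 - p0))
print(round(p0, 2), round(residual, 2), round(gamma, 2))
# 0.67 0.33 1.5  -- the video's 1.49 comes from using the already-rounded 0.67 and 0.33
```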

Questions:

  1. How can we justify approximating the loss function $L$ by this Taylor expansion for every $\gamma$? I think we would need to know the radius of convergence around the expansion point, and the $\gamma$ computed at the end should then lie within that radius of convergence?
  2. The log-loss (binary cross-entropy) is monotonic in the predicted probability: it decreases with $P$ if $y = 1$ and increases with $P$ if $y = 0$, and it is therefore also monotonic in the log-odds, since log-odds and probability are monotonic in each other. In that light, why is $\gamma$ not as high as numerical stability permits when $y=1$, and as low as possible (driving $P$ toward $0$) when $y=0$?

Tags: natural-gradient-boosting, mathematics, xgboost, decision-trees, machine-learning

Category: Data Science


For Q1, check out these two stats.SE posts:

They raise questions about the goodness of the second-order expansion, which is an even stricter concern than the radius of convergence, and I think that discussion answers this reasonably well. I suspect the common loss functions also have fairly nice Taylor-expansion properties, but I haven't worked any of them out.

For Q2, the fact that the selected $\gamma^*$ is not infinite is a consequence of the Taylor approximation, and in this context that's actually a very useful thing: it helps prevent overfitting. Again, have a look at the answer to the second question linked above; it has additional information about other possible descent options.
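To make the Q2 point concrete, here is a small sketch of my own (not from the linked posts): for a single example with $y = 1$, the exact log-loss keeps shrinking toward $0$ as $\gamma$ grows, so an exact line search would push $\gamma \to \infty$; the second-order Taylor approximation, by contrast, turns back up and has a finite minimizer near the value given by equation (1).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y1, F_prev = 1.0, 0.69
p0 = sigmoid(F_prev)

def exact_loss(gamma):
    # Log-loss for a single example with y = 1: -log(p).
    return -np.log(sigmoid(F_prev + gamma))

def taylor_loss(gamma):
    grad, hess = p0 - y1, p0 * (1 - p0)
    return exact_loss(0.0) + grad * gamma + 0.5 * hess * gamma ** 2

for g in [1.5, 5.0, 10.0]:
    print(g, exact_loss(g), taylor_loss(g))
# exact_loss keeps decreasing toward 0 (no finite argmin),
# while taylor_loss passes its minimum near gamma ~ 1.5 and grows again.
```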
