Output value of a gradient boosting decision tree node that has just a single example in it
The general gradient boosting algorithm for tree-based classifiers is as follows:
Input: training set $\{(x_{i},y_{i})\}_{i=1}^{n}$, a differentiable loss function $L(y,F(x))$, and a number of iterations $M$.
Algorithm:
Initialize model with a constant value: $\displaystyle F_{0}(x)={\underset {\gamma }{\arg \min }}\sum _{i=1}^{n}L(y_{i},\gamma )$.
For m = 1 to M:
- Compute so-called pseudo-residuals: $$r_{im}=-\left[{\frac {\partial L(y_{i},F(x_{i}))}{\partial F(x_{i})}}\right]_{F(x)=F_{m-1}(x)}\quad {\mbox{for }}i=1,\ldots ,n.$$
- Fit a base learner (or weak learner, e.g. tree) $\displaystyle h_{m}(x)$ to pseudo-residuals, i.e. train it using the training set $\{(x_{i},r_{im})\}_{i=1}^{n}$.
- For each leaf (terminal region) $R_{jm}$ of that tree, solve for its output value $\gamma_{jm}$ ($\gamma$ here is the output value of a leaf), and update the model: $$F_{m}(x)=F_{m-1}(x)+\sum _{j=1}^{J_{m}}\gamma _{jm}\mathbf {1} _{R_{jm}}(x),\quad \gamma _{jm}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{x_{i}\in R_{jm}}L(y_{i},F_{m-1}(x_{i})+\gamma ).$$
Output $F_{M}(x)$.
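For concreteness, here is a minimal sketch of this loop for the binary log-loss case. It assumes scikit-learn regression trees as the base learners; the function and variable names (e.g. `fit_gbdt_logloss`) are purely illustrative, not taken from any particular implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def fit_gbdt_logloss(X, y, M=10, learning_rate=1.0, max_depth=2):
    """Sketch of the boosting loop above for binary log-loss (y in {0, 1})."""
    y = np.asarray(y, dtype=float)
    # F_0: the constant log-odds value that minimizes log-loss on the training set.
    f0 = np.log(y.mean() / (1.0 - y.mean()))
    F = np.full(len(y), f0)
    trees, leaf_values = [], []
    for m in range(M):
        p = sigmoid(F)
        r = y - p  # pseudo-residuals: negative gradient of log-loss w.r.t. F
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        leaves = tree.apply(X)  # terminal region R_{jm} each sample falls into
        gammas = {}
        for leaf in np.unique(leaves):
            idx = leaves == leaf
            # Leaf output: sum of pseudo-residuals / sum of p(1-p), which is the
            # arg-min for gamma_{jm} under the second-order expansion discussed below.
            gammas[leaf] = r[idx].sum() / (p[idx] * (1.0 - p[idx])).sum()
        # Update F_m; learning_rate = 1.0 matches the pseudo-code, < 1 gives the usual shrinkage.
        F = F + learning_rate * np.array([gammas[leaf] for leaf in leaves])
        trees.append(tree)
        leaf_values.append(gammas)
    return f0, trees, leaf_values
```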
Now the loss function they use is the log-loss: $-\sum_i \left(y_i\log(p_i)+(1-y_i)\log(1-p_i)\right)$, where $i$ runs over all the samples in the training set. The output of a tree leaf in the algorithm is taken to be the log-odds of the sample belonging to the positive class.
In practice, the algorithm computes $\gamma_{jm}$ by expanding the loss function as a second-order Taylor series around $F_{m-1}(x_i)$ and then differentiating that with respect to $\gamma$.
I will write out this expansion for the case where there is just one sample ($x_1$) in a terminal node.
$$L(y_1,F_{m-1}(x_1)+\gamma)\approx L(y_1,F_{m-1}(x_1)) + \frac{\partial L(y_1,F_{m-1}(x_1))}{\partial F_{m-1}}\gamma + \frac{1}{2}\frac{\partial^2 L(y_1,F_{m-1}(x_1))}{\partial F_{m-1}^2}\gamma^2$$
This approximation is then differentiated w.r.t. $\gamma$ and set equal to $0$ to get the optimal value of $\gamma$:
$$\gamma = -\frac{\frac{\partial L(y_1,F_{m-1}(x_1))}{\partial F_{m-1}}}{\frac{\partial^2 L(y_1,F_{m-1}(x_1))}{\partial F_{m-1}^2}}$$
This works out to $$\gamma = \frac{\text{Residual}_1}{P_0(1-P_0)} \tag{1}$$ where $P_0$ is the most recent predicted probability for $x_1$ (i.e. $\sigma(F_{m-1}(x_1))$) and $\text{Residual}_1 = y_1 - P_0$.
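For completeness, here is how the two derivatives in (1) come out when the per-sample loss is written in terms of the log-odds $F$, with $p = \sigma(F) = 1/(1+e^{-F})$ and $L(y_1, F) = -\left(y_1\log p + (1-y_1)\log(1-p)\right)$:
$$\frac{\partial L(y_1,F)}{\partial F} = \left(-\frac{y_1}{p}+\frac{1-y_1}{1-p}\right)\frac{\partial p}{\partial F} = \left(-\frac{y_1}{p}+\frac{1-y_1}{1-p}\right)p(1-p) = p - y_1,$$
$$\frac{\partial^2 L(y_1,F)}{\partial F^2} = \frac{\partial (p - y_1)}{\partial F} = p(1-p).$$
Evaluating at $F = F_{m-1}(x_1)$, where $p = P_0$, gives $\gamma = -\frac{P_0 - y_1}{P_0(1-P_0)} = \frac{\text{Residual}_1}{P_0(1-P_0)}$.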
The example below is taken from this YouTube video (linked at the exact time, 16:43).
For the example case where $F_0(x_1) = 0.69$ (a log-odds value):
Converting the log-odds to a probability gives $P_0 = 0.67$, and with $y_1 = 1$ the residual is $\text{Residual}_1 = y_1 - P_0 = 0.33$.
They then compute $\gamma$ (the output value) for the leaf containing $x_1$ by plugging these values into equation (1), getting $$\gamma = \frac{0.33}{(0.67)(0.33)} = 1.49.$$
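A quick numeric check of that arithmetic (a throwaway snippet, assuming $y_1 = 1$ as in the video):

```python
import math

F0 = 0.69                           # initial log-odds F_0(x_1)
p0 = 1 / (1 + math.exp(-F0))        # convert log-odds to probability: ~0.666
y1 = 1                              # x_1 is a positive example
residual = y1 - p0                  # pseudo-residual: ~0.334
gamma = residual / (p0 * (1 - p0))  # equation (1)
print(round(p0, 2), round(residual, 2), round(gamma, 2))
# 0.67 0.33 1.5  -- the video's 1.49 comes from rounding P_0 and the residual before dividing
```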
Questions:
- How can we justify approximating the loss function $L$ with this expansion for every $\gamma$? I think we would need to know the radius of convergence of the Taylor series around the expansion point, and the $\gamma$ computed at the end should then lie within that radius of convergence?
- The log-loss (binary cross-entropy) is monotonic in the predicted probability: it decreases as $P$ increases when $y = 1$ and increases when $y = 0$, and so it is also monotonic in the log-odds, since log-odds and probability are monotonic in each other. In that light, why is $\gamma$ not as high as numerical stability permits when $y = 1$, and as low as possible (driving the probability toward $0$) when $y = 0$?
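To make the second question concrete, here is a small throwaway snippet comparing the true log-loss with its second-order approximation as a function of $\gamma$ for this single-sample leaf with $y_1 = 1$; the true loss keeps decreasing as $\gamma$ grows, while the quadratic approximation has a finite minimizer near $1.5$:

```python
import math

F0 = 0.69
p0 = 1 / (1 + math.exp(-F0))

def true_loss(gamma):
    # log-loss of the updated prediction F0 + gamma, for y_1 = 1
    p = 1 / (1 + math.exp(-(F0 + gamma)))
    return -math.log(p)

def quadratic_approx(gamma):
    # second-order Taylor expansion of the loss around gamma = 0
    g = p0 - 1            # first derivative (p - y) at gamma = 0
    h = p0 * (1 - p0)     # second derivative p(1-p) at gamma = 0
    return true_loss(0) + g * gamma + 0.5 * h * gamma ** 2

for gamma in [0, 1, 1.5, 3, 10]:
    print(gamma, round(true_loss(gamma), 3), round(quadratic_approx(gamma), 3))
```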