XGBoost quantile regression via custom objective

I am new to GBM and xgboost, and am currently using xgboost_0.6-2 in R. The modeling runs well with the standard objective function objective = "reg:linear". After reading this NIH paper, I wanted to run a quantile regression using a custom objective function, but it iterates exactly 11 times and the metric does not change.

I simply switched out the 'pred' statement following the GitHub xgboost demo, but I am afraid it is more complicated than that, and I cannot find any other examples of using a custom objective function. Do I need to take it a step further and take derivatives for the 'grad' and 'hess' parts?

Or could it be a problem with xgboost (doubtful)?

qntregobj <- function(preds, dtrain) {
  qr_alpha = .5
  labels <- getinfo(dtrain, "label")
  preds <- ifelse( preds - labels >= 0
                 , (1-qr_alpha)*abs(preds - labels)
                 , qr_alpha*abs(preds - labels)
                 )
  grad <- preds - labels
  hess <- preds * (1 - preds)
  return(list(grad = grad, hess = hess))
}

step1.param <- list( objective = qntregobj
                   , booster = "gbtree"
                   , eval.metric = "rmse"
                   , 'nthread' = 16
                   )
set.seed(123)
step1.xgbTreeCV <- xgb.cv(param = step1.param
              , data = xgb.train
              , nrounds  = nrounds
              , nfold = 10
              , scale_pos_weight = 1
              
              , stratified = T
              , watchlist = watchlist
              
              , verbose = F
              , early_stopping_rounds = 10
              , maximize = FALSE
              
              ## set default parameters here - baseline
              , max_depth = 6
              , min_child_weight = 1
              , gamma = 0
              , subsample = 1
              , colsample_bytree = 1
              , lambda = 1
              , alpha = 0
              , eta = 0.3
  )
  print(Sys.time() - start.time)

  step1.dat <- step1.xgbTreeCV$evaluation_log
  step1.dat

Which produces:

iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std nround
 1:    1        122.6362     0.04268346       122.6354     0.3849658      1
 2:    2        122.6362     0.04268346       122.6354     0.3849658      2
 3:    3        122.6362     0.04268346       122.6354     0.3849658      3
 4:    4        122.6362     0.04268346       122.6354     0.3849658      4
 5:    5        122.6362     0.04268346       122.6354     0.3849658      5
 6:    6        122.6362     0.04268346       122.6354     0.3849658      6
 7:    7        122.6362     0.04268346       122.6354     0.3849658      7
 8:    8        122.6362     0.04268346       122.6354     0.3849658      8
 9:    9        122.6362     0.04268346       122.6354     0.3849658      9
10:   10        122.6362     0.04268346       122.6354     0.3849658     10
11:   11        122.6362     0.04268346       122.6354     0.3849658     11




I realize that this question is old, but it may still be of interest, as XGBoost still doesn't provide quantile regression out of the box. You tried to solve this with a user-defined loss function, which is the obvious approach here. To employ a user-defined loss function in XGBoost, you have to provide its first and second derivatives (called grad and hess in your code, presumably for gradient and Hessian). On this point, XGBoost differs from the implementations of gradient boosted trees discussed in the NIH paper you cited.

Unfortunately, the derivatives in your code are not correct. The correct ones are as follows:

grad <- ifelse(preds - labels >= 0, 1 - qr_alpha, -qr_alpha)
hess <- rep(0, length(preds))
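
Spelled out, writing $e = \text{preds} - \text{labels}$ and $\alpha$ for qr_alpha as in your code, the loss and its derivatives with respect to the prediction are

$$
L(e) = \begin{cases} (1-\alpha)\,|e|, & e \ge 0 \\ \alpha\,|e|, & e < 0 \end{cases}
\qquad
\frac{\partial L}{\partial \text{preds}} = \begin{cases} 1-\alpha, & e > 0 \\ -\alpha, & e < 0 \end{cases}
\qquad
\frac{\partial^2 L}{\partial \text{preds}^2} = 0,
$$

so the gradient is piecewise constant and the Hessian is zero wherever it exists.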

But even these are slightly wrong, because both derivatives are undefined when preds = labels. Moreover, the fact that the second derivative is constant is also a problem: a constant second derivative doesn't contain any information that XGBoost's optimization algorithm could use. Both problems can be solved, but that requires more than just a custom objective function. That is probably the reason quantile regression has never been implemented in XGBoost, although the corresponding feature request is already five years old at the time of writing this.

If you're looking for a modern implementation of quantile regression with gradient boosted trees, you might want to try LightGBM. It supports quantile regression out of the box. Their solution to the problems mentioned above is explained in more detail in this nice blog post.
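
As a minimal sketch (assuming the lightgbm R package is installed; the data below is only a toy example), fitting the conditional median looks roughly like this:

library(lightgbm)

## Toy data so the sketch is self-contained
X <- as.matrix(mtcars[, -1])
y <- mtcars$mpg
dtrain <- lgb.Dataset(data = X, label = y)

## The built-in "quantile" objective; alpha = 0.5 fits the median
fit <- lgb.train(params = list(objective = "quantile",
                               alpha = 0.5,
                               learning_rate = 0.1),
                 data = dtrain,
                 nrounds = 100)

pred_median <- predict(fit, X)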


Perhaps the blog below provides an answer to your question.

https://www.bigdatarepublic.nl/regression-prediction-intervals-with-xgboost/

Without going through the code in much detail, your problem can probably be described as follows (from the blog):

In the case that the quantile value q is relatively far apart from the observed values within the partition, then because of the Gradient and Hessian both being constant for large difference x_i-q, the score stays zero and no split occurs.

Then the following solution is suggested:

An interesting solution is to force a split by adding randomization to the Gradient. When the differences between the observations x_i and the old quantile estimates q within partition are large, this randomization will force a random split of this volume.
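
To make that concrete, here is a rough sketch of the idea as an XGBoost custom objective. This is not the blog's exact code: the residual threshold, the noise range, and the constant Hessian are illustrative choices only.

qntregobj_rnd <- function(preds, dtrain) {
  qr_alpha <- 0.5
  labels <- getinfo(dtrain, "label")
  err <- preds - labels

  ## Piecewise-constant gradient of the quantile (pinball) loss
  grad <- ifelse(err >= 0, 1 - qr_alpha, -qr_alpha)

  ## Where the current estimate is far from the observations, the constant
  ## gradient yields a zero split gain; jitter it there to force a split.
  far <- abs(err) > 1                                  # illustrative threshold
  grad[far] <- grad[far] * runif(sum(far), 0.5, 1.5)   # illustrative noise range

  ## Constant positive Hessian as a stand-in, since the true one is zero
  hess <- rep(1, length(preds))

  list(grad = grad, hess = hess)
}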


Yes,

grad <- preds - labels

is specific to the logistic loss. See this question for a derivation.
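
For context, the demo objective those lines were copied from is (roughly) the binary logistic loss on the raw margin:

logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  p <- 1 / (1 + exp(-preds))   # sigmoid of the raw score
  grad <- p - labels           # first derivative of the log loss w.r.t. the score
  hess <- p * (1 - p)          # second derivative
  return(list(grad = grad, hess = hess))
}

so reusing grad and hess from that demo only makes sense for that loss, not for a quantile loss.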
