Correct theoretical regularized objective function for XGB/LGBM (regression task)
I am writing an academic paper on the application of Machine Learning methods to Time Series Forecasting and I am unsure about how to write down the theoretical part on the regularized objective function for XGBoost. Below is the equation given by the developers of the XGB algorithm for the regularized objective function (equation 2). The paper is XGBoost: A Scalable Tree Boosting System by Chen and Guestrin (2016).
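Reproducing it here for reference (notation as in the paper):

$$
\mathcal{L}(\phi) \;=\; \sum_i l(\hat{y}_i, y_i) \;+\; \sum_k \Omega(f_k),
\qquad \text{where } \Omega(f) \;=\; \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2 .
$$

Here $l$ is a differentiable convex loss, $f_k$ is the $k$-th regression tree, $T$ is the number of leaves of a tree, $w$ is its vector of leaf weights, and $\gamma$ and $\lambda$ are the regularization parameters.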
In the Python API of the xgboost library there is a reg_lambda parameter (L2 regularization parameter; ridge regression equivalent) and a reg_alpha parameter (L1 regularization parameter; lasso regression equivalent), and I am a bit confused about the way the authors set up the regularized objective function. According to my understanding, the ridge penalty is given by $\lambda \sum_{j=1}^p \beta_j^2$ and the lasso penalty is given by $\lambda \sum_{j=1}^p |\beta_j|$. However, the authors of the paper seem to use a combination of the two that is equivalent to neither, since they appear to both take the absolute value of the weight term and square it (the $\lVert w \rVert^2$ part).
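For concreteness, this is roughly how I set both penalties at the moment, as a minimal sketch via xgboost's scikit-learn wrapper; the data below is only a random placeholder for my actual time series features and target.

```python
import numpy as np
from xgboost import XGBRegressor

# placeholder data standing in for the real time-series features/target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# reg_lambda controls the L2 penalty, reg_alpha the L1 penalty on the leaf weights
model = XGBRegressor(
    n_estimators=200,
    learning_rate=0.05,
    reg_lambda=1.0,  # L2 (ridge-like) term
    reg_alpha=0.5,   # L1 (lasso-like) term
)
model.fit(X, y)
```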
My questions are therefore the following:
(a) What does the regularized objective function look like if you want to allow for both an L1 and an L2 regularization term in the same model?
(b) How does the objective function of the LGBM (LightGBM) algorithm differ from that of XGB, or is it identical? The developers of the LGBM algorithm do not provide any theoretical details about the objective function or the boosting iterations in their paper LightGBM: A Highly Efficient Gradient Boosting Decision Tree (Ke et al., 2017). According to my understanding, the only difference is computational: LGBM is faster because it computes the information gain of the regression tree splits more efficiently and bundles mutually exclusive features. Am I misinterpreting or overlooking anything here? (See the LightGBM sketch after the questions for how I set the corresponding parameters.)
(c) Does it make sense to include both an L1 and an L2 regularization term in the same boosting model, from a theoretical and a practical perspective?
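Regarding (b) and (c), this is the analogous setup I would use in LightGBM, again only a sketch via the scikit-learn wrapper (which exposes the same reg_alpha/reg_lambda names; the native parameters are lambda_l1 and lambda_l2), reusing the placeholder data from the XGBoost snippet above.

```python
from lightgbm import LGBMRegressor

# same placeholder X, y as in the XGBoost sketch above
lgbm = LGBMRegressor(
    n_estimators=200,
    learning_rate=0.05,
    reg_lambda=1.0,  # L2 penalty on the leaf weights (native name: lambda_l2)
    reg_alpha=0.5,   # L1 penalty on the leaf weights (native name: lambda_l1)
)
lgbm.fit(X, y)
```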
Topic: lightgbm, regularization, xgboost
Category: Data Science