Using the LightGBM regressor, I trained my model and used grid search to find the best parameters, but when testing with those best parameters I get different results each time; the model produces different results on every test iteration. I ran LightGBM twice with the same parameters and got different validation results. The only random seed parameter I found was baggingSeed, but even after fixing baggingSeed the problem still occurred. Should I fix any …
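In case it helps, here is a minimal sketch of pinning every source of randomness I'm aware of in the LightGBM Python API (names may differ slightly in other wrappers); with bagging or feature subsampling enabled, a bagging seed alone is not enough, and multi-threading can also change floating-point summation order.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] * 2.0 + rng.normal(size=1000)

# Pin every seed, not just the bagging one (parameter names per the Python API).
params = {
    "objective": "regression",
    "seed": 42,                   # master seed
    "bagging_seed": 42,
    "feature_fraction_seed": 42,
    "data_random_seed": 42,
    "deterministic": True,        # trade some speed for reproducibility
    "force_row_wise": True,       # stop the row/col-wise auto-selection from varying
    "num_threads": 1,             # threaded histogram sums can differ in float order
    "verbose": -1,
}

booster_a = lgb.train(params, lgb.Dataset(X, y), num_boost_round=50)
booster_b = lgb.train(params, lgb.Dataset(X, y), num_boost_round=50)
assert np.allclose(booster_a.predict(X), booster_b.predict(X))
```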
Page 359 of The Elements of Statistical Learning, 2nd edition, says the following. Can someone explain the intuition and simplify it in layman's terms? Questions: What is the reason/intuition and math behind fitting each successive tree in GBM on the negative gradient of the loss function? Is it done to make GBM generalize better to unseen test data? If so, how does fitting on the negative gradient achieve this generalization on test data?
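For what it's worth, here is the functional-gradient-descent reading in compact form (my own sketch, not ESL's exact notation). The key observation is that for squared error the negative gradient is just the residual, so "fit the negative gradient" generalizes "fit the residuals" to arbitrary differentiable losses.

```latex
r_{im} = -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}},
\qquad
h_m \approx \arg\min_{h}\sum_{i=1}^{n}\bigl(r_{im} - h(x_i)\bigr)^2,
\qquad
F_m(x) = F_{m-1}(x) + \nu\, h_m(x).
% For L(y,F) = \tfrac{1}{2}(y-F)^2 this gives r_{im} = y_i - F_{m-1}(x_i), the plain residual.
```

As I understand it, the negative-gradient step is about descending the training loss in function space; any generalization to test data comes from the usual regularizers (shallow trees, shrinkage $\nu$, subsampling), not from the gradient step itself.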
I was going through Section 4.1 of the CatBoost paper, where they discuss the 'Analysis of prediction shift' using an example with 2 features that are Bernoulli random variables. I am unable to wrap my head around the experimental setup. Since there are only 2 indicator features, there can be only 4 distinct data points; everything else is duplication. They mention that for training data points the output of the first estimator of the boosting model is biased, …
I have built a model using transaction data, trying to predict the value of future transactions. The main algorithm is the Gradient Boosting Machine. The overall accuracy on the test set is fine and there is no sign of overfitting. However, a small change in the training set causes a radical change in the model and in the predictions, yet even when the test set changes a little, the overall accuracy is stable. The time period is from 2005 to today and when a …
For professional reasons I want to learn and understand random forests. I am unsure whether my understanding is correct or whether I am making logical errors. I have a data set with 15 million entries and want to run a regression for a numerical target (time). The data structure is: 7 categorical variables, 1 date, and 4 numerical features. After data preparation I split the data into training and test sets. Then I defined a gradient …
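For what it's worth, a minimal sketch of the kind of setup described, with hypothetical column names and a small synthetic frame standing in for the 15 million rows; LightGBM is used here because it accepts pandas categorical columns directly and scales to data of that size.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Small synthetic stand-in: 7 categoricals, 1 date, 4 numericals, numeric target "time".
n = 10_000
rng = np.random.default_rng(0)
df = pd.DataFrame({f"cat{i}": rng.integers(0, 20, n).astype(str) for i in range(7)})
for i in range(4):
    df[f"num{i}"] = rng.normal(size=n)
df["date"] = pd.Timestamp("2020-01-01") + pd.to_timedelta(rng.integers(0, 365, n), unit="D")
df["time"] = 10 + 3 * df["num0"] + rng.gamma(2.0, 1.0, n)

# The raw date is not fed to the model; derive simple numeric parts instead.
df["month"] = df["date"].dt.month
df["dayofweek"] = df["date"].dt.dayofweek

cat_cols = [f"cat{i}" for i in range(7)]
df[cat_cols] = df[cat_cols].astype("category")   # LightGBM consumes these natively
features = cat_cols + [f"num{i}" for i in range(4)] + ["month", "dayofweek"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["time"], test_size=0.2, random_state=0
)
model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05, random_state=0)
model.fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))
```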
I have more of a conceptual question that I was hoping to get some feedback on. I am trying to run a boosted regression ML model to identify a subset of important predictors for some clinical condition. The dataset includes over 100,000 rows and close to 1,000 predictors. Now, the etiology of the disease we are trying to predict is largely unknown, so we likely don't have data on many important predictors for the condition. That is to say, as a …
I am looking for a machine learning textbook that gives a detailed derivation of gradient boosting, with all the mathematics behind it. I would be happy to receive recommendations.
I am new to GBM and xgboost, and am currently using xgboost_0.6-2 in R. The modeling runs well with the standard objective function "objective" = "reg:linear", and after reading this NIH paper I wanted to run a quantile regression using a custom objective function, but it iterates exactly 11 times and the metric does not change. I simply switched out the 'pred' statement following the GitHub xgboost demo, but I am afraid it is more complicated than that and I …
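Not the R code from the question, but a minimal Python sketch of a pinball-loss custom objective built on the same idea; a common gotcha is that the true second derivative of the quantile loss is zero, so a constant Hessian has to be substituted or the updates stall. The helper name and parameter choices below are illustrative only.

```python
import numpy as np
import xgboost as xgb

def make_quantile_objective(alpha: float):
    """Custom pinball-loss objective for a target quantile `alpha`."""
    def objective(preds, dtrain):
        errors = dtrain.get_label() - preds
        # d(pinball)/d(pred): -alpha when under-predicting, (1 - alpha) otherwise.
        grad = np.where(errors > 0, -alpha, 1.0 - alpha)
        # The true second derivative is 0; use a constant so steps are not degenerate.
        hess = np.ones_like(preds)
        return grad, hess
    return objective

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X[:, 0] + rng.normal(scale=0.5, size=2000)
dtrain = xgb.DMatrix(X, label=y)

booster = xgb.train(
    {"max_depth": 3, "eta": 0.1},
    dtrain,
    num_boost_round=200,
    obj=make_quantile_objective(0.9),
)
print(np.mean(booster.predict(dtrain) >= y))  # roughly 0.9 if the fit worked
```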
Scikit-learn GradientBoostingRegressor: I was looking at the scikit-learn documentation for GradientBoostingRegressor. It says that we can use 'ls' as a loss function, meaning least squares regression. But I am confused, since least squares regression is a method that minimizes the SSE loss function. So shouldn't they mention SSE here?
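A one-line way to see why the naming does not matter much (my own note, not from the docs): scaling the objective by a constant does not change the minimizer, so "least squares", SSE, and MSE all describe the same fit.

```latex
\arg\min_{F}\sum_{i=1}^{n}\bigl(y_i - F(x_i)\bigr)^2
\;=\; \arg\min_{F}\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - F(x_i)\bigr)^2
\;=\; \arg\min_{F}\frac{1}{2}\sum_{i=1}^{n}\bigl(y_i - F(x_i)\bigr)^2 .
```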
In the tutorial on boosting from an existing prediction in the LightGBM R package, there is an init_score parameter in the setinfo function. I am wondering what init_score means. The help page says: init_score: initial score is the base prediction lightgbm will boost from. Another question: what does "boost" mean in LightGBM?
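I'm more familiar with the Python API, but the idea is the same as setinfo in R: init_score supplies a per-row starting margin that the boosting rounds then correct ("boost from"), and, as far as I understand, predict() later returns only the boosted part, so the init_score has to be added back. A sketch:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 3.0 + X[:, 0] + rng.normal(scale=0.1, size=1000)

# Pretend these margins came from an earlier model; boosting continues from them.
base_margin = np.full(len(y), y.mean())

dtrain = lgb.Dataset(X, label=y, init_score=base_margin)
booster = lgb.train({"objective": "regression", "verbose": -1}, dtrain, num_boost_round=50)

# As far as I understand, predict() returns only the boosted correction,
# so the init_score has to be added back by hand.
preds = base_margin + booster.predict(X)
print("MSE:", np.mean((preds - y) ** 2))
```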
As far as I know, to train learning-to-rank models, you need three things in the dataset: a label or relevance, a group or query id, and a feature vector. For example, the Microsoft Learning to Rank dataset uses this format (label, group id, and features). 1 qid:10 1:0.031310 2:0.666667 ... 0 qid:10 1:0.078682 2:0.166667 ... I am trying out XGBoost, which utilizes GBMs to do pairwise ranking. They have an example for a ranking task that uses the C++ program …
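Not the C++ demo, but in case it helps, a minimal Python sketch of the same idea: the qid/group information enters XGBoost's Python API through DMatrix.set_group, and rank:pairwise produces per-document scores that you sort within each query. The group sizes and labels below are made up.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
# Three queries with 4, 5 and 3 documents; rows must be grouped/sorted by query.
group_sizes = [4, 5, 3]
X = rng.normal(size=(sum(group_sizes), 10))
y = rng.integers(0, 3, size=sum(group_sizes))       # graded relevance labels

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(group_sizes)                        # the "qid" information in the Python API

params = {"objective": "rank:pairwise", "eta": 0.1, "max_depth": 4}
ranker = xgb.train(params, dtrain, num_boost_round=50)

scores = ranker.predict(dtrain)                      # per-document scores; sort within a query
print(scores[:4])                                    # scores for the first query's documents
```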
I'm currently using XGBoost on a dataset with 21 features (selected from a list of some 150 features), which I then one-hot encoded to obtain ~98 features. A few of these 98 features are somewhat redundant; for example, a variable (feature) $A$ also appears as $\frac{B}{A}$ and $\frac{C}{A}$. My questions are: How (if at all) do boosted decision trees handle multicollinearity? How would the existence of multicollinearity affect prediction if it is not handled? From what I understand, the model is learning more …
Is there an algorithm out there that creates a random forest but then prunes all the leaves whose impurity measure is above a threshold that I determine? In other words, if I set the minimum samples per leaf to 500 and require leaves to be at least 90% pure, for example, the algorithm would keep only the leaves that respect these parameters. My dataset is extremely noisy, so most leaves have a Gini impurity around 0.5 but …
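As far as I know there is no built-in "prune leaves above a Gini threshold" option in scikit-learn, but something close can be bolted on after training: look up which leaf each row lands in, check that leaf's impurity, and let only sufficiently pure leaves vote. A rough sketch (the threshold and the abstain logic are illustrative choices, not a standard API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Noisy binary problem with large leaves, as in the question.
X, y = make_classification(n_samples=5000, n_features=20, flip_y=0.3, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=500, random_state=0
).fit(X, y)

max_gini = 0.18          # ~90% purity for two classes: 1 - (0.9**2 + 0.1**2) = 0.18
leaf_ids = forest.apply(X)                     # (n_samples, n_trees) leaf index per tree

votes = np.zeros(len(X))
counts = np.zeros(len(X))
for j, est in enumerate(forest.estimators_):
    leaf_impurity = est.tree_.impurity[leaf_ids[:, j]]   # Gini of the leaf each row fell into
    keep = leaf_impurity <= max_gini                     # let only "pure enough" leaves vote
    proba = est.predict_proba(X)[:, 1]
    votes[keep] += proba[keep]
    counts[keep] += 1

confident = counts > 0                         # rows that got at least one accepted vote
preds = (votes[confident] / counts[confident] > 0.5).astype(int)
acc = (preds == y[confident]).mean() if confident.any() else float("nan")
print(f"kept {confident.mean():.0%} of rows; accuracy on kept rows: {acc:.3f}")
```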
I'm currently studying GBDT and started reading LightGBM's research paper. In Section 4 they explain the Exclusive Feature Bundling algorithm, which aims to reduce the number of features by grouping mutually exclusive features into bundles and treating each bundle as a single feature. The researchers emphasize that one must be able to retrieve the original values of the features from the bundle. Question: if we have a categorical feature that has been one-hot encoded, won't this algorithm simply reverse the …
Background: in xgboost, the $t$-th iteration fits a tree $f_t$ over all $n$ examples by minimizing the following objective: $$\sum_{i=1}^n\left[g_i f_t(x_i) + \frac{1}{2}h_i f_t^2(x_i)\right]$$ where $g_i, h_i$ are the first- and second-order derivatives of the loss with respect to our previous best estimate $\hat{y}$ (from iteration $t-1$): $g_i=\partial_{\hat{y}} l(y_i, \hat{y})$, $h_i=\partial^2_{\hat{y}} l(y_i, \hat{y})$, and $l$ is our loss function. The question (finally): when building $f_t$ and considering a specific feature $k$ at a specific split, they use the following heuristic to assess only some …
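Before getting to the split heuristic, it may help to note what the quoted objective gives per leaf; this is a short derivation of my own from the formula above (the actual xgboost objective also adds a regularizer, which puts a $\lambda$ in the denominators).

```latex
% Fix the tree structure: leaves I_1, ..., I_T with constant outputs w_j, so the
% objective decomposes per leaf:
\sum_{i=1}^{n}\Bigl[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\Bigr]
= \sum_{j=1}^{T}\Bigl[G_j w_j + \tfrac{1}{2} H_j w_j^2\Bigr],
\qquad G_j = \sum_{i \in I_j} g_i,\quad H_j = \sum_{i \in I_j} h_i,
\qquad
w_j^{*} = -\frac{G_j}{H_j}
\;\Longrightarrow\;
\text{objective} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j}
\quad\bigl(\text{with regularization: } w_j^{*} = -\tfrac{G_j}{H_j + \lambda}\bigr).
```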
I was wondering whether there is literature on, or whether someone could explain, how to fit a single decision tree to a gradient-boosted trees classifier in order to derive more interpretable results. This is apparently the approach that Turi uses in their explain function, which outputs something like this: [Turi's explain function output, from their page here]. I know that for random forests you can average the contribution of each feature in every tree, as done in the TreeInterpreter Python package, but …
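One simple variant, which may or may not be what Turi does internally, is to distil the boosted classifier into a single shallow surrogate tree: regress its predicted probabilities on the inputs and read the surrogate as an approximate explanation. A scikit-learn sketch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = make_classification(n_samples=5000, n_features=10, n_informative=4, random_state=0)
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)

# Surrogate ("distilled") tree: regress the GBM's predicted probability on the
# same inputs, then read the shallow tree as an approximate explanation.
target = gbm.predict_proba(X)[:, 1]
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, target)

print("surrogate fidelity R^2:", surrogate.score(X, target))
print(export_text(surrogate, feature_names=[f"f{i}" for i in range(10)]))
```

The fidelity score is worth reporting alongside the explanation: if the shallow tree cannot reproduce the GBM's scores well, its splits are not a trustworthy summary of the ensemble.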
The GBM implementation in the h2o package only allows the user to specify a loss function via the distribution argument, which defaults to multinomial for categorical response variables and gaussian for numerical ones. According to the documentation, the loss functions are implied by the distributions, but I need to know which loss functions are used, and I can't find that anywhere in the documentation. I'm guessing it's the MSE for gaussian and cross-entropy for multinomial; does anybody here …
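For reference, the textbook (Friedman-style) correspondence is sketched below; whether h2o's internals match these exact forms is precisely what would need confirming in their source, so treat this as an assumption.

```latex
\text{gaussian:}\quad L\bigl(y, f(x)\bigr) = \tfrac{1}{2}\bigl(y - f(x)\bigr)^2
\qquad\qquad
\text{multinomial:}\quad L\bigl(y, p(x)\bigr) = -\sum_{k=1}^{K}\mathbf{1}\{y = k\}\,\log p_k(x)
```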
I would like to train my datasets in scikit-learn but export the final Gradient Boosting Regressor elsewhere so that I can make predictions directly on another platform. I am aware that we can obtain the individual decision trees used by the regressor by accessing the tree_ attribute of each element of regressor.estimators_. What I would like to know is how to combine these decision trees to produce the final regression prediction.
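A small sanity-check sketch of how the pieces combine in scikit-learn (the layout of estimators_ is an implementation detail that could change between versions): the final prediction is the init estimator's output plus learning_rate times the sum of the individual trees' outputs.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=2000, n_features=8, random_state=0)
reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0).fit(X, y)

# estimators_ has shape (n_estimators, 1) for single-output regression.
trees = reg.estimators_[:, 0]
manual = reg.init_.predict(X).ravel() + reg.learning_rate * sum(t.predict(X) for t in trees)

# For the default squared-error loss the raw score is the prediction itself.
assert np.allclose(manual, reg.predict(X))
print("reconstruction matches reg.predict")
```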
According to the documentation of the scikit-learn GradientBoostingRegressor: init: estimator or ‘zero’, default=None: An estimator object that is used to compute the initial predictions. init has to provide fit and predict. If ‘zero’, the initial raw predictions are set to zero. By default a DummyEstimator is used, predicting either the average target value (for loss=’ls’), or a quantile for the other losses. So what quantile is used by the DummyRegressor if the loss function is 'huber'? Is it the …
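Rather than guessing, one way is to inspect the fitted default init estimator directly; a minimal sketch (init_ is the documented attribute that holds it after fitting):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
reg = GradientBoostingRegressor(loss="huber").fit(X, y)

# init_ holds the fitted default init estimator; its parameters show which
# strategy/quantile was actually used for this loss.
print(reg.init_)
print(reg.init_.get_params())
```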
I wonder whether algorithms such as GBM, XGBoost, CatBoost, and LightGBM ever perform more than a two-way split at a node in their decision trees. Can a node be split into 3 or more branches instead of merely two? Can more than one feature be used in deciding how to split a node? Can a feature be re-used when splitting a descendant node?
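One way to check empirically rather than from the docs: fit a model and dump its tree structure, then look at the split records. A sketch using LightGBM's trees_to_dataframe (assuming a reasonably recent version with pandas installed); the dump shows what each node's split looks like and how often a feature reappears within a tree.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

booster = lgb.train(
    {"objective": "binary", "verbose": -1},
    lgb.Dataset(X, y),
    num_boost_round=20,
)

# One row per node; leaves have no split_feature. Each internal node carries a
# single split_feature and a single threshold, and the counts below show how
# often the same feature is re-used within one tree.
tree_df = booster.trees_to_dataframe()
splits = tree_df.dropna(subset=["split_feature"])
print(splits.groupby("tree_index")["split_feature"].value_counts().head(10))
```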