I'm having an issue: after running an XGBoost model inside a HalvingGridSearchCV, I get back a certain number of estimators (50, for example), but the number of trees is always that value multiplied by 3. I don't understand why. Here is the code: model = XGBClassifier(objective='multi:softprob', subsample=0.9, colsample_bytree=0.5, num_class=3) md = [3, 6, 10, 15] lr = [0.1, 0.5, 1] g = [0, 0.25, 1] rl = [0, 1, 10] spw = [1, 3, 5] ns = [5, 10, 20] …
I am not able to understand how the first root node is selected in LightGBM, and how the splitting at further nodes happens. I have read blogs and related documents, and I understand that histogram-based splitting happens here. But it is not clear, once the bins are made, on the basis of what decision a split happens. How is the best split decided? Please elaborate on this.
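As a rough illustration of what happens after binning (a sketch of the standard second-order split-gain formula used by gradient boosting, not LightGBM's exact code): for each candidate bin boundary, the per-bin gradient and hessian sums on each side are plugged into a gain score, and the feature/threshold pair with the highest gain wins.

```python
import numpy as np

def best_split_from_histogram(grad_hist, hess_hist, lam=1.0):
    """Pick the bin boundary with the highest split gain.

    grad_hist / hess_hist: per-bin sums of gradients and hessians for one feature.
    Gain = G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam);
    LightGBM's real implementation adds further regularization terms.
    """
    G, H = grad_hist.sum(), hess_hist.sum()
    parent = G * G / (H + lam)
    best_gain, best_bin = -np.inf, None
    GL = HL = 0.0
    for b in range(len(grad_hist) - 1):       # candidate split after bin b
        GL += grad_hist[b]
        HL += hess_hist[b]
        GR, HR = G - GL, H - HL
        gain = GL * GL / (HL + lam) + GR * GR / (HR + lam) - parent
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

# toy histogram: gradients flip sign in the middle, so the best split is there
g = np.array([-4.0, -3.0, -2.0, 2.0, 3.0, 4.0])
h = np.ones(6)
print(best_split_from_histogram(g, h))   # (2, 40.5)
```

The root node simply runs this scan over every feature's histogram on the full dataset; child nodes repeat it on their subset of rows.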
I was wondering if anyone has encountered the same. The thing is, when I run a CatBoost boosting model, delete the unimportant variables (feature importance by prediction importance = 0; in fact these variables are not in the boosting trees), and rerun the model without the zero-importance variables, I see that the results change. Has anyone encountered this issue, or does anyone know why it is happening? How can I fix it? This does not happen in LightGBM or XGBoost. I …
I was going through section 4.1 of the CatBoost paper, where they present the 'Analysis of prediction shift' using an example consisting of two features that are Bernoulli random variables. I am unable to wrap my head around the experimental setup. Since there are only 2 indicator features, we can have only 4 distinct data points; everything else is duplication. They mention that for training data points the output of the first estimator of the boosting model is biased, …
I have a regression problem where I need to predict three dependent variables ($y$) based on a set of independent variables ($x$): $$ (y_1,y_2,y_3) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + u. $$ To solve this problem, I would prefer to use tree-based models (e.g. gradient boosting or random forest), since the independent variables ($x$) are correlated and the problem is non-linear with an ex-ante unknown parameterization. I know that I could use sklearn's MultiOutputRegressor() …
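For reference, a minimal sketch (synthetic data, sklearn only) of the MultiOutputRegressor route mentioned above: it simply fits one independent booster per target column, so no cross-target structure is learned.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 5))
Y = np.column_stack([X[:, 0] + X[:, 1],    # y1
                     X[:, 1] * X[:, 2],    # y2
                     X[:, 3] - X[:, 4]])   # y3

# one independent gradient-boosting model per target
model = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
model.fit(X, Y)

preds = model.predict(X[:5])
print(preds.shape)   # (5, 3): one prediction per target
```

Under the hood `model.estimators_` holds three separately fitted boosters, which is exactly the "independent models" caveat that usually motivates the question.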
I have a trained BDT, and with sklearn's predict_proba(X) I can get a probability between 0 and 1 for the predicted class. I am now wondering how this probability is calculated. Any ideas?
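A sketch of one common case (sklearn's GradientBoostingClassifier with the default log-loss, synthetic data): the raw tree scores are summed into a log-odds value, and predict_proba is just the logistic sigmoid of that sum.

```python
import numpy as np
from scipy.special import expit
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)

raw = clf.decision_function(X)        # summed raw tree scores (log-odds)
proba = clf.predict_proba(X)[:, 1]    # reported probability for class 1

# the reported probability is sigmoid(raw score)
print(np.allclose(proba, expit(raw)))   # True
```

Other BDT losses (e.g. exponential loss) use a different link function, so this equivalence is specific to the log-loss case.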
I have a dataset with 16 features and 32 class labels, which shows the following behavior: Neural network classification: high accuracy on the train set (100%), but low accuracy on the test set (3%, almost like random classification). If I make the network less flexible (reduce the number of neurons or hidden layers), then the train and test accuracy both become about 10%. Gradient boosting tree classification: exactly the same behavior. A flexible model results in 100% accuracy over train, but random accuracy on the …
If I set num_parallel_tree to 1 and max_iterations to 1 in boosted_tree_regressor of Google BigQuery ML, will it work as a decision tree regressor? Also, can such a decision tree give negative predictions even if the training targets are all 0 or greater?
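I can't speak to BigQuery ML's internals, but as an sklearn analogue for the second part: a single regression tree predicts leaf means of the training targets, so its predictions cannot fall outside the range of y it was trained on. A sketch:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((300, 3))
y = np.abs(rng.normal(size=300))        # all training targets >= 0

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
preds = tree.predict(rng.random((100, 3)))

# each leaf predicts the mean of its training targets,
# so predictions stay inside [y.min(), y.max()]
print(preds.min() >= y.min(), preds.max() <= y.max())   # True True
```

A boosted model, by contrast, adds a base score plus shrunken tree outputs across iterations, so with more than one iteration it can in principle overshoot below zero.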
I'm not completely sure about the bias/variance of boosted decision trees (LightGBM especially), so I wonder whether we would generally expect a performance boost from creating an ensemble of multiple LightGBM models, just as with Random Forest?
Can someone tell me exactly how boosting as implemented by LightGBM or XGBoost works in a real-case scenario? I know that LightGBM splits the tree leaf-wise instead of level-wise, which optimizes the global loss rather than just the loss of one branch, and helps it reach a lower error rate faster than a level-wise tree. But I cannot understand it completely until I see a real example; I have tried to look at so many articles and videos, but everywhere …
I thought the consensus was that XGBoost is largely scale-invariant and feature scaling isn't really necessary, but something's going wrong and I don't understand what. I have a range of features with different scales, and I'm trying to do a regression to target a variable in the range of 1E-7 or so (i.e. the target will be somewhere between 1E-7 and 9E-7). When I run XGBoost on this, I get warnings about "0 depth" trees, and every prediction …
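One hedged workaround sketch (generic sklearn, not XGBoost-specific): tree invariance applies to the features, not the target, and with targets around 1e-7 the default L2 regularization can dwarf the tiny gradient sums so no split looks worthwhile. Rescaling the target before fitting and inverting afterwards, e.g. via TransformedTargetRegressor, sidesteps that:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + 0.5 * X[:, 1]) * 1e-7    # target in the ~1e-7 range

# fit on y * 1e7, then map predictions back to the original scale
model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(random_state=0),
    func=lambda t: t * 1e7,
    inverse_func=lambda t: t * 1e-7,
)
model.fit(X, y)
preds = model.predict(X[:5])
print(preds.max() < 1e-6)   # predictions come back on the 1e-7 scale
```

With XGBoost itself, an equivalent manual `y * 1e7` before `fit` (and dividing predictions afterwards) should have the same effect.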
I am familiar with the shap python package and how to use it, and I have a pretty good idea about SHAP values in general, but it is still new to me. What I'm requesting are references (ideally Python custom code in blog posts) that explain how to take an array of raw SHAP values (of shape num_features X num_samples) and get: feature importance, interaction terms, and any other calculations the shap package does. My motivation for this is that I …
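As a starting point, a sketch of the most common reduction from a raw SHAP array (here assumed to be shaped (num_samples, num_features); transpose first if yours is the other way around):

```python
import numpy as np

rng = np.random.default_rng(0)
shap_values = rng.normal(size=(100, 5))   # hypothetical raw SHAP array

# global feature importance: mean absolute SHAP value per feature
# (this is what shap's default bar summary plot shows)
importance = np.abs(shap_values).mean(axis=0)
print(importance.shape)                   # (5,)

# rank features from most to least important
ranking = np.argsort(importance)[::-1]
print(ranking)
```

Interaction terms, by contrast, are not derivable from this 2-D array; they come from a separate 3-D array of shape (num_samples, num_features, num_features) produced by TreeExplainer's shap_interaction_values.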
Given new data, I trained a model with the same architecture and the same hyperparameters (for example, a random forest) as the current production model. How do I know whether the new model is better than the current production model, so that I can decide to deploy it? Given that my problem scope is time-series forecasting, the only way is to test on a timeline that neither model was trained on. For example, the current production model was …
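That comparison can be sketched like this (synthetic series, sklearn only, hypothetical "production" and "new" models): hold out the most recent segment that neither model has seen and compare errors there.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
series = np.sin(np.arange(500) / 20) + 0.1 * rng.normal(size=500)
X = series[:-1].reshape(-1, 1)          # lag-1 feature
y = series[1:]

# chronological split: both models train before the holdout, compete on it
train, test = slice(0, 400), slice(400, None)
prod_model = GradientBoostingRegressor(random_state=0).fit(X[train], y[train])
new_model = GradientBoostingRegressor(random_state=1).fit(X[train], y[train])

prod_err = mean_absolute_error(y[test], prod_model.predict(X[test]))
new_err = mean_absolute_error(y[test], new_model.predict(X[test]))
print(new_err < prod_err)   # deploy only if the new model wins on unseen time
```

In practice the holdout should be a rolling or expanding window over several recent periods, not a single split, so the decision isn't driven by one lucky segment.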
Thank you very much for your time. I would like to compare the contribution of a predictor variable (X1) across different BRT (boosted regression trees) models built at different spatial scales (radius = 250 m, 160 m, 80 m, 40 m, 10 m). But the contribution reported by BRT is a relative contribution. I want to know each variable's absolute contribution (absolute importance), so that I can compare them. I used the formula "R-squared = 1 - (residual sum of squares / total sum of squares)" to calculate R-squared (total …
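If I follow, the arithmetic can be sketched like this (pure numpy, hypothetical numbers; note that scaling relative contributions by R-squared is a modelling choice for comparability, not an established BRT formula):

```python
import numpy as np

def absolute_contributions(relative, y_true, y_pred):
    """Scale relative importances (which sum to 1) by the model's R-squared."""
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - rss / tss
    return relative * r2

# made-up numbers for one BRT model at one spatial scale
relative = np.array([0.40, 0.35, 0.25])          # relative contributions
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

print(absolute_contributions(relative, y_true, y_pred))
```

Each model's relative contributions then get shrunk by how much variance that model actually explains, putting the scales on a common footing.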
I want to understand the feature_parallel algorithm in LightGBMClassifier. The documentation describes how it is done traditionally and how LightGBM aims to improve it. The two ways are as follows (verbatim from the linked site): Traditional Feature_parallel: Feature parallel aims to parallelize the “Find Best Split” in the decision tree. The procedure of traditional feature parallel is: Partition data vertically (different machines have different feature set). Workers find local best split point {feature, threshold} on local feature set. Communicate local best splits with …
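The traditional steps quoted above can be simulated in a few lines of plain Python (a toy with made-up gains standing in for the histogram scan each worker would really do):

```python
# Each "worker" holds a disjoint subset of features (vertical partition)
# and proposes its local best split; the globally best proposal wins.
def local_best_split(features):
    """Toy scorer: returns (gain, feature, threshold) for one worker.
    A real worker would scan the histograms of its own feature subset."""
    return max((gain, f, thr) for f, thr, gain in features)

# vertical partition: worker 0 owns features 0-1, worker 1 owns features 2-3
worker0 = [(0, 0.5, 1.2), (1, 0.3, 2.8)]   # (feature, threshold, gain)
worker1 = [(2, 0.7, 3.5), (3, 0.1, 0.9)]

# step 2: each worker finds its local best split
local_bests = [local_best_split(worker0), local_best_split(worker1)]
# step 3: communicate the local bests and keep the global winner
global_best = max(local_bests)
print(global_best)   # (3.5, 2, 0.7): feature 2 at threshold 0.7 wins
```

The communication cost of step 3 and of re-partitioning rows after the split is what LightGBM's variant is designed to avoid.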
I trained multiple models for my problem, and most ensemble algorithms resulted in lengthy fit/train times and a huge model size on disk (approx. 10 GB for RandomForest), but when I tried HistGradientBoostingRegressor from sklearn, the fit/training time was just around 10 seconds, the model size was also low (approx. 1 MB), and the predictions were fairly accurate. I was trying out gradient boosting regressors when I came across this histogram-based approach. It outperforms the other algorithms in time and memory complexity. I …
All, this is a general question. I have a binary classification model which predicts whether someone is rich or not. Someone asked me: if the probability that one person is rich is 0.6 and another person is also given this probability, are the reasons WHY they are rich the same? I am using XGBoost, and my instinct is to say no. E.g., if I were to profile each population (>= 0.5, >= 0.6, ... etc.), would …
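A toy sketch of why my instinct is also "no" (pure numpy, made-up additive contributions rather than real model output): two people can reach the same score through entirely different features.

```python
import numpy as np

base = 0.1  # hypothetical base-rate contribution
# made-up per-feature contributions to P(rich) for two people
person_a = {"income": 0.45, "education": 0.05, "postcode": 0.00}
person_b = {"income": 0.00, "education": 0.10, "postcode": 0.40}

p_a = base + sum(person_a.values())
p_b = base + sum(person_b.values())

print(np.isclose(p_a, p_b))       # True: same probability...
print(person_a == person_b)       # False: ...completely different reasons
```

Per-instance attribution methods such as SHAP make exactly this decomposition for a real XGBoost model, which is how you would check whether two 0.6 predictions share the same drivers.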
I have a residuals plot. Definitions: let's call "blue_line" the line that would exist if I were to draw a straight line fitted to the blue dots (predictions). My expectation is that if there were some features that contributed to y_real, blue_line would be parallel to dotted_red_line, but there might be any amount of noise centered around blue_line, based on other features I'm not taking into account which contribute to y_real. What can I conclude about the model's ability …
This is a post from a newbie, so it might be a really poor question based on lack of knowledge. Thank you kindly! I'm using CatBoost, which seems excellent, to fit a trivial dataset. The results are terrible. If someone could point me in the right direction, I'd sure appreciate it. Here is the code in its entirety: import catboost as cb import numpy as np import pandas as pd from sklearn.metrics import r2_score from sklearn.model_selection import train_test_split from sklearn.pipeline …
I have asked this question here, but it seems no one is interested in it. Here is my understanding; please correct me if there is any misunderstanding. Tree models are used to rank feature importance by mean decrease in impurity (let's ignore permutation importance): https://blog.datadive.net/selecting-good-features-part-iii-random-forests/ But trees have a weakness with heavily correlated features: once one of them is used for a split, the other has almost no remaining uncertainty to explain, so the tree tends to select only one of two heavily correlated features (like LASSO). …
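A sketch of the effect (sklearn RandomForest on synthetic data with a near-duplicated feature): the two copies share the credit, so neither one's impurity importance reflects the signal's true strength on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.01 * rng.normal(size=500)   # near-duplicate of x1
x3 = rng.normal(size=500)               # pure noise
y = x1 + 0.1 * rng.normal(size=500)

X = np.column_stack([x1, x2, x3])
rf = RandomForestRegressor(random_state=0).fit(X, y)

imp = rf.feature_importances_
print(imp.round(2))
# the signal's credit is split between x1 and x2, while x3 stays near zero;
# summing the correlated pair's importances recovers the signal's weight
```

Unlike LASSO, which tends to zero one of the pair out entirely, the forest's random feature subsampling means both copies get picked in different trees and the importance is diluted rather than concentrated.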