Why does a LightGBM model produce different results while testing?

Using the LightGBM regressor, I trained my data and used grid search to find the best parameters, but when testing with those best parameters I get different results each time, i.e. the model produces different results for each test iteration. I ran LightGBM twice with the same parameters but got different results in validation. The only random-seed parameter I found was baggingSeed, and even after fixing baggingSeed the problem still occurred. Should I fix any …
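Beyond baggingSeed, LightGBM exposes several seed parameters, and run-to-run differences can also come from multithreading. A minimal sketch of pinning everything down (assuming lgb_train is an existing lgb.Dataset; the deterministic flag requires a reasonably recent LightGBM):

```python
import lightgbm as lgb

params = {
    "objective": "regression",
    "seed": 42,                    # master seed; used to derive the others
    "bagging_seed": 42,
    "feature_fraction_seed": 42,
    "data_random_seed": 42,
    "deterministic": True,         # trade some speed for reproducibility
    "force_row_wise": True,        # avoid the row-wise/col-wise auto-choice
}
booster = lgb.train(params, lgb_train, num_boost_round=100)
```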
Category: Data Science

Does LightGBM handle multicollinearity?

I have a dataset of around 6500 features and 10,000 data rows after feature selection, and I am using a LightGBM model. I want to know if I should check the feature set for multicollinearity. If two or more features are correlated, how does that affect tree building and classification prediction? How does LightGBM deal with multicollinearity? Does it have any adverse effects?
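Tree ensembles are generally robust to multicollinearity because each split tests one feature at a time; the main side effect is that importance gets shared across correlated columns. If you want to prune correlated features beforehand anyway, here is a minimal sketch, assuming X is a pandas DataFrame of the 6500 features and 0.95 is a hypothetical threshold:

```python
import numpy as np

# pairwise absolute correlations; keep only the upper triangle so each
# pair is inspected once
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# drop one feature from every pair correlated above the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)
```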
Category: Data Science

Optimizing MAE degrades the MAE metric

I have run a LightGBM regression model, optimizing on RMSE and measuring the performance on RMSE:

    model = LGBMRegressor(objective="regression", n_estimators=500, n_jobs=8)
    model.fit(X_train, y_train,
              eval_metric="rmse",
              eval_set=[(X_train, y_train), (X_test, y_test)],
              early_stopping_rounds=20)

The model keeps improving during the 500 iterations. Here are the performances I obtain on MAE: MAE on train: 1.080571; MAE on test: 1.258383. But the metric I'm really interested in is MAE, so I decided to optimize it directly (and choose it as the evaluation metric): model …
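For reference, a minimal sketch of the direct-MAE variant (my assumption about the intended setup, not the asker's elided code): objective="regression_l1" trains on absolute error, whose gradients are just signs, so slower apparent convergence is common.

```python
import lightgbm as lgb
from lightgbm import LGBMRegressor

# same data splits as above; "regression_l1" optimizes MAE directly
model = LGBMRegressor(objective="regression_l1", n_estimators=500, n_jobs=8)
model.fit(X_train, y_train,
          eval_metric="mae",                    # report MAE on the eval sets
          eval_set=[(X_train, y_train), (X_test, y_test)],
          callbacks=[lgb.early_stopping(20)])   # newer API for early stopping
```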
Category: Data Science

Splitting point in LightGBM?

I am not able to understand how the first root-node split is selected in LightGBM and how the subsequent splitting at nodes happens. I have read blogs and related documents, and I understand that histogram-based splitting happens here. But it is not clear, after the bins are made, on what basis the decision to split is taken. How is the best split decided? Please elaborate on this.
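For intuition, the standard GBDT split score (the same form as in the XGBoost paper; LightGBM evaluates it over histogram bins) can be sketched in a few lines. Here grad_hist and hess_hist are hypothetical arrays holding the summed gradients and hessians of the samples in each bin of one feature:

```python
import numpy as np

def best_split(grad_hist, hess_hist, lam=1.0):
    """Scan bin boundaries and return the one with the largest gain."""
    G, H = grad_hist.sum(), hess_hist.sum()
    best_gain, best_bin = -np.inf, None
    gl = hl = 0.0
    for b in range(len(grad_hist) - 1):        # candidate threshold after bin b
        gl += grad_hist[b]
        hl += hess_hist[b]
        gr, hr = G - gl, H - hl
        # improvement of splitting vs. keeping the parent node as a leaf
        gain = gl**2 / (hl + lam) + gr**2 / (hr + lam) - G**2 / (H + lam)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

The root split is simply the feature/bin pair with the highest such gain over all features; the procedure then repeats on the children, in LightGBM's case choosing leaf-wise which leaf to split next.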
Category: Data Science

Correct theoretical regularized objective function for XGB/LGBM (regression task)

I am writing an academic paper on the application of machine learning methods to time series forecasting, and I am unsure about how to write down the theoretical part about the regularized objective function for XGBoost. Below you can find the equation given by the developers of the XGBoost algorithm for the regularized objective function (equation 2). The paper is called "XGBoost: A Scalable Tree Boosting System" by Chen & Guestrin (2016). In the Python API from the xgb library …
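For reference, equation (2) of Chen & Guestrin (2016) is

```latex
\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}
```

where l is a differentiable convex loss, T is the number of leaves of tree f, and w are its leaf weights.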
Category: Data Science

Model performance on an external validation set is really low?

I am using the LGBM model for binary classification. My train and test accuracies are 87% and 82% respectively, with a cross-validation accuracy of 89% and a ROC-AUC score of 81%. But when evaluating model performance on an external validation set that the model has never seen before, it gives a ROC-AUC of 41%. Can somebody suggest what should be done?
Category: Data Science

Is my model overfitting? Training accuracy 93%, test accuracy 82%

I am using an LGBM model for binary classification. After hyper-parameter tuning I get a training accuracy of 0.9340 and a test accuracy of 0.8213. Can I say my model is overfitting, or is this acceptable in the industry? To add to this, when I increase num_leaves for the same model I am able to achieve a train accuracy of 0.8675 and a test accuracy of 0.8137. Which of these results is acceptable and can be reported?
Category: Data Science

LGBM model predicts only a single class on unseen data

I have built a LightGBM-based machine learning model on data of molecules of two classes. The distribution is as follows: class 0 has 5933 data points and class 1 has 4696. The train and test accuracies I get on this data are around 87% and 82% respectively, and the roc_auc_score is around 81.5%. But when I try to evaluate model performance on an entirely new dataset, which the model has never seen before, with class labels 0 and 1 both having 94 …
Category: Data Science

Model Dump Parser (like XGBFI) for LightGBM and CatBoost

Currently my employer has multiple GLMs in a live environment. I am interested in identifying new features and interactions to enhance the accuracy of these GLMs; for now I am limited to the GLM structure, so simply deploying a solution which automatically accounts for interactions is not possible. I have in the past used XGBoost to identify powerful feature interactions through the use of XGBFI / XGBFIR. I am now looking into using LightGBM and CatBoost to do the …
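As far as I know there is no bundled XGBFI equivalent, but LightGBM's Booster.dump_model() returns the ensemble as nested dicts, so a rough interaction counter (a sketch of the idea, not a full XGBFIR port) can be built by walking each tree and counting features that co-occur on a root-to-node path:

```python
from collections import Counter

def interaction_counts(booster):
    """Count feature pairs that appear together on a decision path."""
    pairs = Counter()

    def walk(node, path):
        if "split_feature" not in node:        # leaf: nothing to record
            return
        f = node["split_feature"]
        for g in path:                         # pair f with every ancestor
            if g != f:
                pairs[tuple(sorted((g, f)))] += 1
        walk(node["left_child"], path + [f])
        walk(node["right_child"], path + [f])

    for tree in booster.dump_model()["tree_info"]:
        walk(tree["tree_structure"], [])
    return pairs
```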
Category: Data Science

LightGBM predict_proba in thousandths place

Can someone explain why my LightGBM classification model's predict_proba() output is in the thousandths place for the positive class:

    prob_test = model.predict_proba(X_test)
    print(prob_test[:, 1])
    array([0.00219813, 0.00170795, 0.00125507, ..., 0.00248431, 0.00150855, 0.00185903])

Is this common, and how is it calculated? Should there be concern about performance testing (AUC)? FYI: the data is highly imbalanced, with a positive ratio of 0.0017 in the training set.
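With a positive rate of about 0.0017, a reasonably calibrated model's probabilities should hover near that prior, so thousandths-place values are expected rather than alarming; and since AUC depends only on the ranking of the scores, the small absolute scale does not affect it. A quick sanity check (assuming y_train holds the binary training labels):

```python
# the mean predicted probability should sit close to the training prior
print("train prior:          ", y_train.mean())
print("mean predicted p(y=1):", prob_test[:, 1].mean())
```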
Category: Data Science

Negative R2_score: bad predictions for my sales prediction problem using LightGBM

My project involves trying to predict the sales quantity for a specific item across a whole year. I've used the LightGBM package for making the predictions. The params I've set for it are as follows:

    params = {
        'nthread': 10,
        'max_depth': 5,             # DONE
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'regression_l1',
        'metric': 'mape',           # this is abs(a-e)/max(1,a)
        'num_leaves': 2,            # DONE
        'learning_rate': 0.2180,    # DONE
        'feature_fraction': 0.9,    # DONE
        'bagging_fraction': 0.990,  # DONE
        'bagging_freq': 1,          # DONE
        'lambda_l1': 3.097758978478437,   # DONE
        'lambda_l2': 2.9482537987198496,  # DONE
        'verbose': 1,
        'min_child_weight': 6.996211413900573, …
Category: Data Science

Example for Boosting

Can someone tell me exactly how boosting, as implemented by LightGBM or XGBoost, works in a real-case scenario? I know it splits the tree leaf-wise instead of level-wise: each split is chosen for its contribution to the global loss, not just the loss of one branch, which helps it reach a lower error rate faster than level-wise growth. But I cannot understand it completely until I see a real example; I have tried to look at so many articles and videos, but everywhere …
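A real example is easy to run end to end; below is a minimal sketch on synthetic data (all names are illustrative). The key leaf-wise knob in LightGBM is num_leaves: each new split goes to whichever leaf currently offers the largest loss reduction, so trees can grow deep and lopsided:

```python
import lightgbm as lgb
from sklearn.datasets import make_regression

# toy regression problem
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
train = lgb.Dataset(X, label=y)

# leaf-wise growth: num_leaves caps leaves per tree, not depth
booster = lgb.train(
    {"objective": "regression", "num_leaves": 8, "verbosity": -1},
    train,
    num_boost_round=50,
)
print(booster.predict(X[:5]))   # predictions from the boosted ensemble
```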
Category: Data Science

How to specify scale_pos_weight value at runtime in Hyperopt?

I want to use LGBMClassifier for binary classification, and for hyper-parameter tuning I want to use Hyperopt. The dataset is imbalanced. I am using sklearn's class_weight.compute_class_weight as shown below:

    clas_wts_arr = class_weight.compute_class_weight('balanced', np.unique(y_trn), y_trn)
    self.scale_pos_wt = clas_wts_arr[0] / clas_wts_arr[1]

The following is the space parameter that I am passing to the objective function:

    space = {
        'objective': hp.choice('objective', objective_list),
        'boosting': hp.choice('boosting', boosting_list),
        'metric': hp.choice('metric', metric_list),
        'max_depth': hp.quniform('max_depth', 1, 15, 2),
        'min_data_in_leaf': hp.quniform('min_data_in_leaf', 1, 256, 1),
        'num_leaves': hp.quniform('num_leaves', 7, 150, 1),
        'feature_fraction': …
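One way to set scale_pos_weight at runtime (a sketch, assuming y_trn is available when the space is built) is to compute the class ratio first and define the search band around it rather than hard-coding a value:

```python
import numpy as np
from hyperopt import hp

# a common choice of ratio: n_negative / n_positive
n_neg = np.sum(y_trn == 0)
n_pos = np.sum(y_trn == 1)
ratio = n_neg / n_pos

space = {
    "num_leaves": hp.quniform("num_leaves", 7, 150, 1),
    # let Hyperopt explore a band around the computed ratio
    "scale_pos_weight": hp.uniform("scale_pos_weight",
                                   0.5 * ratio, 2.0 * ratio),
}
```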
Topic: lightgbm
Category: Data Science

Proof of the GOSS algorithm in the LightGBM paper

In the LightGBM paper, the authors make use of a newly developed sampling method, GOSS, to reduce the number of data instances needed for finding the best split of a given feature in a tree node. They give an estimation of the error incurred by sampling instead of using the entire data (Theorem 3.2 in https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/lightgbm.pdf). I am interested in the proof of this theorem, for which the paper refers to "supplementary materials". Where can I find those?
Category: Data Science

Sliding window approach using SVR & LightGBM

I'm working on a multivariate time-series forecast using a couple of ML algorithms (neural networks, support vector machines and gradient-boosting algorithms), and I need to measure the performance of each model. I've implemented the first model using TensorFlow 2.0; the training and testing data were created using the tf.data.Dataset API. The data format is (window_data, forecast), where window_data represents a set of 24 timesteps and forecast represents the next timestep. Now I need to train the 2nd and 3rd model using SVR …
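Since SVR and LightGBM expect a flat 2-D feature matrix rather than a tf.data pipeline, the same (window_data, forecast) layout can be rebuilt with NumPy. A minimal sketch with a hypothetical make_windows helper, assuming series is an (n_samples, n_features) array and the target is its first column:

```python
import numpy as np

def make_windows(series, window=24):
    """Flatten 24-step windows into rows; the target is the next timestep."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window].ravel())   # 2-D input for SVR/LightGBM
        y.append(series[i + window, 0])          # forecast the next step
    return np.array(X), np.array(y)

# X_win, y_win = make_windows(series)  # then fit SVR / LGBMRegressor on these
```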
Category: Data Science

Incorporating data over time into LightGBM

So I'm in the situation where I know what I'm trying to find but not the terminology for it, and I think that's why a lot of my Google searches are directing me in the wrong direction, so apologies if some of this explanation ends up redundant. Essentially, I want to be able to incorporate historical trends into the LightGBM model I've been using. Basically I have a bunch of categorical health data currently, but by default, currently …
Category: Data Science

How to make LightGBM suppress output?

I have tried for a while to figure out how to "shut up" LightGBM. In particular, I would like to suppress the output of LightGBM during training (i.e. the feedback on the boosting steps). My model:

    params = {
        'objective': 'regression',
        'learning_rate': 0.9,
        'max_depth': 1,
        'metric': 'mean_squared_error',
        'seed': 7,
        'boosting_type': 'gbdt'
    }
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=100000,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=100)

I tried to add verbose=0 as suggested in the docs (https://github.com/microsoft/LightGBM/blob/master/docs/Parameters.rst), but this does not work. Does anyone know how to …
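What usually silences LightGBM (depending on the installed version) is verbosity=-1 in the params plus the log_evaluation callback, rather than verbose=0; on older versions, verbose_eval=False in lgb.train plays the same role. A sketch with the question's setup:

```python
import lightgbm as lgb

params = {
    "objective": "regression",
    "learning_rate": 0.9,
    "max_depth": 1,
    "metric": "mean_squared_error",
    "seed": 7,
    "boosting_type": "gbdt",
    "verbosity": -1,                  # silence LightGBM's own logging
}
gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=100000,
    valid_sets=[lgb_eval],
    callbacks=[
        lgb.early_stopping(100),      # replaces early_stopping_rounds
        lgb.log_evaluation(period=0), # no per-iteration eval feedback
    ],
)
```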
Category: Data Science

LightGBM eval_set - what to do when I fit the final model (there's no test data left)

I'm using LightGBM's eval_set feature when fitting my model. This enables early stopping on the number of estimators used:

    callbacks = [lgb.early_stopping(80, verbose=0), lgb.log_evaluation(period=0)]
    fit_params = {"callbacks": callbacks,
                  "eval_metric": "auc",
                  "eval_set": [(x_train, y_train), (x_test, y_test)],
                  "eval_names": ['train', 'valid']}
    lg = LGBMClassifier(n_estimators=5000, verbose=-1, objective="binary",
                        **{"scale_pos_weight": train_weight, "metric": "auc"})  # or "binary_logloss"

This works great when doing cross-validation and early stopping is triggered. But when I have finally selected a model and want to train it on the full data set, I have no test data left …
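One common pattern (an assumption about the workflow, not the only answer) is to keep the estimator count that early stopping found and refit on everything; x_all and y_all below are hypothetical names for the combined data:

```python
from lightgbm import LGBMClassifier

# best_iteration_ is populated by early stopping during the eval-set fit
best_n = lg.best_iteration_

final = LGBMClassifier(n_estimators=best_n, objective="binary",
                       scale_pos_weight=train_weight, metric="auc")
final.fit(x_all, y_all)   # full data set, no eval_set needed this time
```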
Topic: lightgbm
Category: Data Science

Understanding feature_parallel distributed learning algorithm in LightGBMClassifier

I want to understand the feature_parallel algorithm in LightGBMClassifier. The documentation describes how it is done traditionally and how LightGBM aims to improve it. The two ways are as follows (verbatim from the linked site). Traditional feature parallel aims to parallelize the "Find Best Split" step in the decision tree; the procedure is:

1. Partition data vertically (different machines have different feature sets).
2. Workers find the local best split point {feature, threshold} on their local feature set.
3. Communicate local best splits with …
Category: Data Science
