CatBoost not working properly when I remove non-important variables (source of randomness?)

I was wondering if anyone has encountered the same. The thing is, when I run a CatBoost boosting model, delete the non-important variables (feature importance by prediction importance = 0; in fact these variables do not appear in the boosting trees), and rerun the model without those zero-importance variables, the results change. Has anyone encountered this issue, or does anyone know why it is happening and how to fix it? This does not happen in LightGBM or XGBoost. I …
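Below is a minimal sketch of the workflow being described, just to make the comparison concrete; df, the target column and the use of a regressor are placeholders, and the point is that random_seed is pinned in both fits so any remaining difference comes from the changed feature set itself.

# Sketch: fit, drop zero-importance features, refit with the same seed, compare.
import pandas as pd
from catboost import CatBoostRegressor

X, y = df.drop(columns=["target"]), df["target"]

model = CatBoostRegressor(random_seed=42, verbose=0)
model.fit(X, y)

importances = pd.Series(model.get_feature_importance(), index=X.columns)
kept = importances[importances > 0].index.tolist()

model2 = CatBoostRegressor(random_seed=42, verbose=0)
model2.fit(X[kept], y)

print(model.score(X, y), model2.score(X[kept], y))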
Category: Data Science

Analysis of prediction shift problem in gradient boosting

I was going through section 4.1 of the CatBoost paper, where the authors analyse prediction shift using an example consisting of 2 features that are Bernoulli random variables. I am unable to wrap my head around the experimental setup. Since there are only 2 indicator features, we can have only 4 distinct data points; everything else will be duplication. They mention that for training data points the output of the first estimator of the boosting model is biased, …
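This is not the paper's derivation, only a small Monte Carlo sketch of the prediction-shift idea it analyses: a first tree fitted on the training sample leaves systematically smaller residuals (i.e. shifted gradient estimates) on that same sample than on fresh data from the identical distribution. The constants and sample sizes below are arbitrary.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample(n):
    x = rng.integers(0, 2, size=(n, 2))              # two Bernoulli(1/2) features
    y = 3.0 * x[:, 0] + 1.0 * x[:, 1] + rng.normal(0.0, 1.0, n)
    return x, y

train_mse, fresh_mse = [], []
for _ in range(2000):
    X_tr, y_tr = sample(20)
    X_new, y_new = sample(20)
    stump = DecisionTreeRegressor(max_depth=1).fit(X_tr, y_tr)
    train_mse.append(np.mean((y_tr - stump.predict(X_tr)) ** 2))
    fresh_mse.append(np.mean((y_new - stump.predict(X_new)) ** 2))

# Residuals of the first estimator are smaller on the data it was fitted on,
# so the gradients fed to the next tree are biased.
print("mean train residual MSE:", np.mean(train_mse))
print("mean fresh residual MSE:", np.mean(fresh_mse))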
Category: Data Science

Model Dump Parser (like XGBFI) for LightGBM and CatBoost

Currently my employer has multiple GLMs in a live environment. I am interested in identifying new features and interactions to enhance the accuracy of these GLMs; for now I am limited to the GLM structure, so simply deploying a solution which automatically accounts for interactions is not possible. I have in the past used XGBoost to identify powerful feature interactions through the use of XGBFI / XGBFIR. I am now looking into using LightGBM and CatBoost to do the …
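For LightGBM, one way to get a rough XGBFI-style view without a dedicated parser is to walk the dictionary returned by Booster.dump_model() and count which features co-occur on a root-to-leaf path. This is a sketch, not XGBFI's exact scoring, and it assumes a trained lightgbm Booster named bst (for the sklearn wrapper, use model.booster_). For the CatBoost side, the library itself can report pairwise interaction strength via get_feature_importance(type='Interaction'), if I recall the API correctly.

from collections import Counter
from itertools import combinations

pair_counts = Counter()

def walk(node, path):
    if "split_feature" not in node:                    # leaf node
        for pair in combinations(sorted(set(path)), 2):
            pair_counts[pair] += 1
        return
    path = path + [node["split_feature"]]
    walk(node["left_child"], path)
    walk(node["right_child"], path)

for tree in bst.dump_model()["tree_info"]:
    walk(tree["tree_structure"], [])

print(pair_counts.most_common(10))                     # candidate 2-way interactions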
Category: Data Science

How to tell CatBoost which feature is categorical?

I am excited to learn that CatBoost can handle categorical features by itself. One of my features, Department ID, is categorical, but it looks numeric, since the values are like 1001, 1002, ..., 1218. Those numbers are just IDs of the departments and carry no numeric or ordinal meaning. How do I tell CatBoost to treat this feature as categorical (nominal), not numeric? Thanks.
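A minimal sketch of the usual way to do this: list the column in cat_features (on the Pool or on fit), and optionally cast the IDs to strings so nothing downstream treats them as numbers. X, y and the column name "department_id" are placeholders.

from catboost import CatBoostClassifier, Pool

X["department_id"] = X["department_id"].astype(str)      # IDs as labels, not numbers

train_pool = Pool(X, y, cat_features=["department_id"])
model = CatBoostClassifier(iterations=500, verbose=0)
model.fit(train_pool)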
Category: Data Science

How to tune a Catboost Regressor

I have been trying to study hyperparameter tuning for the CatBoost regressor for my regression problem. The only issue is that I can't figure out which parameters I should tune for my use case out of the sea of parameters available for CatBoost. I am unable to find any helpful sources that would guide me through the selection of parameters to be tuned, so I would appreciate hearing how people usually choose the parameters they want to …
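One common starting point, shown only as a sketch: randomised search over the handful of parameters that tend to matter first (tree depth, learning rate, L2 leaf regularisation, number of iterations). The ranges below are illustrative, and X, y stand in for your data.

from catboost import CatBoostRegressor
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "depth": randint(4, 11),
    "learning_rate": uniform(0.01, 0.29),
    "l2_leaf_reg": uniform(1, 9),
    "iterations": randint(300, 1500),
}

search = RandomizedSearchCV(
    CatBoostRegressor(verbose=0, random_seed=42),
    param_distributions,
    n_iter=30,
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)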
Category: Data Science

Confused about CatBoostRegressor feature importance vs SHAP

I'm confused about the results from a CatBoostRegressor model. I followed this article: https://towardsdatascience.com/catboost-regression-in-6-minutes-3487f3e5b329 My confusion is about the difference in the ordering of the variables between the "CatBoost Feature Importance" figure and the SHapley Additive exPlanations (SHAP) plot. When I experiment with other datasets I get even bigger differences between these two plots. Why can they differ? And what does it say about feature importance when one variable scores high on one and low on the other? My own result.
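For reference, a sketch of how the two rankings can be pulled from the same fitted model (model, X and y are placeholders): the article's feature-importance bar chart is presumably CatBoost's default PredictionValuesChange, while a SHAP summary plot ranks features by mean absolute per-row SHAP value, so the two orderings measure different things and need not agree.

import numpy as np
import pandas as pd
from catboost import Pool

pool = Pool(X, y)

pvc = pd.Series(
    model.get_feature_importance(type="PredictionValuesChange"),
    index=X.columns,
).sort_values(ascending=False)

shap = model.get_feature_importance(data=pool, type="ShapValues")   # shape (n_rows, n_features + 1)
mean_abs_shap = pd.Series(
    np.abs(shap[:, :-1]).mean(axis=0),                              # drop the expected-value column
    index=X.columns,
).sort_values(ascending=False)

print(pvc.head())
print(mean_abs_shap.head())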
Category: Data Science

Unable to tune hyperparameters for CatBoostRegressor

I am trying to fit a CatBoostRegressor to my data. When I perform k-fold CV for the baseline model everything works fine, but when I use Optuna for hyperparameter tuning it does something really weird: it runs the first trial and then throws the following error:
[I 2021-08-26 08:00:56,865] Trial 0 finished with value: 0.7219653113910736 and parameters: {'model__depth': 2, 'model__iterations': 1715, 'model__subsample': 0.5627211605250965, 'model__learning_rate': 0.15601805222619286}. Best is trial 0 with value: 0.7219653113910736.
[W 2021-08-26 08:00:56,869] Trial 1 failed because …
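The error itself is cut off above, so the following is only a sketch of the kind of setup being described (the model__ prefixes suggest the CatBoostRegressor sits inside a sklearn Pipeline); pipeline, X and y are placeholders.

import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "model__depth": trial.suggest_int("model__depth", 2, 10),
        "model__iterations": trial.suggest_int("model__iterations", 500, 2000),
        "model__subsample": trial.suggest_float("model__subsample", 0.5, 1.0),
        "model__learning_rate": trial.suggest_float("model__learning_rate", 0.01, 0.3, log=True),
    }
    pipeline.set_params(**params)
    return cross_val_score(pipeline, X, y, cv=5, scoring="r2").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)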
Category: Data Science

Are linear models better when dealing with too many features? If so, why?

I had to build a classification model in order to predict what a user's rating would be from his/her review. (I was dealing with this dataset: Trip Advisor Hotel Reviews.) After some preprocessing, I compared the results of a Logistic Regression with a CatBoost Classifier, both of them with the default hyperparameters. The Logistic Regression gave me a better AUC and F1-score. I've heard some colleagues saying that this happened because linear models are better when dealing with …
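The original preprocessing isn't shown, so this is only a sketch of the kind of comparison being described, assuming TF-IDF features (which produce the very wide, sparse design matrix where linear models are usually strong); reviews and ratings are placeholders.

from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X = TfidfVectorizer(max_features=20000).fit_transform(reviews)
X_tr, X_te, y_tr, y_te = train_test_split(X, ratings, random_state=42)

logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
cat = CatBoostClassifier(verbose=0).fit(X_tr, y_tr)

print("LogReg   F1:", f1_score(y_te, logreg.predict(X_te), average="macro"))
print("CatBoost F1:", f1_score(y_te, cat.predict(X_te).ravel(), average="macro"))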
Category: Data Science

Why feature engineering and filling NaN's reduce score?

I used CatBoost for an InClass Kaggle competition. I have tried various strategies for filling NaN values, converting float binary variables to categorical, and adding new categorical features (from age, for example). I have tried generating new features from existing ones, and also tried removing irrelevant features (by correlation, feature importance, SHAP). But it all only makes it worse! Why? The best score came out without any preprocessing, only with hyperparameters found via random search.
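One thing worth checking before blaming the engineered features themselves: CatBoost already handles numeric NaNs natively (the nan_mode setting, 'Min' by default), so imputing them can erase a useful "missingness" signal. A sketch of a direct comparison, assuming a binary target; X (containing NaNs) and y are placeholders.

from catboost import Pool, cv

raw = Pool(X, y)
imputed = Pool(X.fillna(X.median(numeric_only=True)), y)

params = {"loss_function": "Logloss", "iterations": 500, "verbose": False}
print(cv(raw, params, fold_count=5)["test-Logloss-mean"].min())
print(cv(imputed, params, fold_count=5)["test-Logloss-mean"].min())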
Category: Data Science

Catboost not able to handle a very simple dataset?

This is a post from a newbie and so might be a really poor question based on lack of knowledge. Thank you kindly! I'm using CatBoost, which seems excellent, to fit a trivial dataset, and the results are terrible. If someone could point me in the right direction I'd sure appreciate it. Here is the code in its entirety:
import catboost as cb
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline …
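Since the code above is cut off, here is a self-contained minimal example on a comparably trivial dataset; with default settings CatBoost should recover a simple linear relationship almost exactly, which can serve as a baseline for spotting what differs in the original script.

import numpy as np
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = CatBoostRegressor(verbose=0).fit(X_tr, y_tr)
print(r2_score(y_te, model.predict(X_te)))        # should be close to 1.0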
Category: Data Science

Why does Catboost outperform other boosting algorithms?

I have noticed while working with multiple datasets that CatBoost with its default parameters tends to outperform LightGBM or XGBoost with their default parameters, even on tabular datasets with no categorical features. I am assuming this has something to do with the way CatBoost constructs its decision trees, but I just wanted to confirm this theory. If anyone could elaborate on why it performs better on non-categorical data, that would be great! Thanks in advance!
Category: Data Science

How to use the eval set in catboost appropriately?

Let's say you have a dataset, and you split it into 80% training and 20% testing. Naturally, you want to find the optimal hyperparameters for your model, so with the training set, you plan to do cross validation and search parameter space. CatBoost has something called the eval set which is used to help avoid overfitting, but I have a fundamental question on how to use it appropriately. Say you do CV10. So now we have 10 iterations where 90% …
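One common pattern, shown here only as a sketch rather than the single valid answer: inside each of the 10 folds, carve a small validation slice out of that fold's 90% training portion and pass it as eval_set so early stopping picks the number of trees, leaving the fold's held-out 10% untouched except for scoring. X and y are assumed to be numpy arrays.

import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import KFold, train_test_split

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    X_fit, X_val, y_fit, y_val = train_test_split(
        X[train_idx], y[train_idx], test_size=0.15, random_state=0
    )
    model = CatBoostRegressor(iterations=2000, verbose=0)
    model.fit(X_fit, y_fit, eval_set=(X_val, y_val), early_stopping_rounds=100)
    scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(scores))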
Category: Data Science

How to achieve SHAP values for a CatBoost model in R?

I'm asked to create a SHAP analysis in R but I cannot find out how to obtain it for a CatBoost model. I can get the SHAP values of an XGBoost model with shap_values <- shap.values(xgb_model = model, X_train = train_X) but not for CatBoost. Here is the reproducible code for my CatBoost model:
library(data.table)
library(catboost)
train_example <- data.table(categorical_feature = c("a", "b", "a", "a", "b"),
                            payment = c(244, 52352, 4235, 3422, 535),
                            age = c(34, 27, 19, 40, 92),
                            target …
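I can't vouch for the exact R call from memory (the catboost R package exposes catboost.get_feature_importance, which should accept a ShapValues type, but treat that as an assumption to verify against the docs). For reference, this is what the equivalent looks like through the Python API, including the shape of what comes back; model, X, y and cat_cols are placeholders.

from catboost import Pool

pool = Pool(X, y, cat_features=cat_cols)
shap_values = model.get_feature_importance(data=pool, type="ShapValues")
print(shap_values.shape)      # (n_rows, n_features + 1); the last column is the expected (base) value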
Category: Data Science

Feature Selection before modeling with Boosting Trees

I have read in some papers that the subset of features chosen for a boosting-tree algorithm can make a big difference to performance, so I've been trying RFE, Boruta, variable clustering, correlation, WOE & IV, and chi-square. Let's say I have a classification problem with over 40 variables. Best results after a long, long time of testing: all variables for LightGBM (except one variable with high collinearity); for XGBoost I removed correlated variables (around 8 correlated ones); I …
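Alongside RFE and Boruta, a cheap baseline worth keeping in the comparison is to let the booster's own importances do the screening via sklearn's SelectFromModel; a sketch, with X, y as placeholders and the median threshold purely illustrative.

from catboost import CatBoostClassifier
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(
    CatBoostClassifier(iterations=500, verbose=0),
    threshold="median",                      # keep features above the median importance
)
X_reduced = selector.fit_transform(X, y)
print(X.shape[1], "->", X_reduced.shape[1], "features kept")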
Category: Data Science

How do Classification Algorithms such as Catboost and Random Forest parse test data?

I would like to know how classification works with the algorithms listed above. My specific question is this: say I have a high-signal continuous feature which has a certain distribution, and I train a model on some training data and it finds the best split for that feature. When I use the model on test data, would it split according to a specific number or by distribution? i.e. if the number '10' provides the best split for the …
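A tiny illustration of the point being asked about, using a plain decision stump: the threshold is learned as a fixed number at training time and applied verbatim to any test data, no matter how the test distribution looks.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(10, 3, size=(500, 1))
y_train = (X_train[:, 0] > 10).astype(int)             # true boundary at 10

stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
print("learned threshold:", stump.tree_.threshold[0])   # close to 10

X_test = rng.normal(25, 3, size=(5, 1))                 # shifted test distribution
print(stump.predict(X_test))                            # every row routed by the same fixed number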
Category: Data Science

How to understand the definition of Greedy Target-based Statistics in the CatBoost paper

There is a method named target statistics for dealing with categorical features in the CatBoost paper. I still have some confusion about the mathematical form. Could someone explain how to compute it? $$ \hat{x}^i_k = \frac{\sum^{p-1}_{j=1}[x_{\sigma_{j},k}=x_{\sigma_p,k}]Y_{\sigma_j}+a\cdot P}{\sum^{p-1}_{j=1}[x_{\sigma_{j},k}=x_{\sigma_p,k}]+a}$$
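Reading the formula literally, a small worked computation may help: for the example at position p in the permutation sigma, average the targets of the earlier examples that share its category value, smoothed towards a prior P with weight a. The numbers below are made up.

categories = ["red", "blue", "red", "red", "blue"]   # one categorical feature, already in permutation order
y          = [1,      0,      1,     0,     1]
a, P = 1.0, 0.5                                       # smoothing weight and prior (e.g. the mean target)

encoded = []
for p in range(len(categories)):
    same = [j for j in range(p) if categories[j] == categories[p]]   # earlier rows with the same category
    num = sum(y[j] for j in same) + a * P
    den = len(same) + a
    encoded.append(num / den)

print(encoded)   # [0.5, 0.5, 0.75, 0.8333..., 0.25]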
Topic: catboost
Category: Data Science

Does Gradient Boosting perform n-ary splits where n > 2?

I wonder whether algorithms such as GBM, XGBoost, CatBoost, and LightGBM perform more than two splits at a node in the decision trees? Can a node be split into 3 or more branches instead of merely binary splits? Can more than one feature be used in deciding how to split a node? Can a feature be re-used in splitting a descendant node?
Category: Data Science

Catboost multiclassification evaluation metric: Kappa & WKappa

I am working on an unbalanced classification problem and I want to use Kappa as my evaluation metric. Considering the classifier accepts class weights (which I have given it), should I still be using weighted kappa or just the standard kappa? I am not entirely sure of the difference, to be honest.
model = CatBoostClassifier(
    iterations = 1000,
    learning_rate = 0.1,
    random_seed = 400,
    one_hot_max_size = 15,
    loss_function = 'MultiClass',
    eval_metric = 'WKappa',   # weighted kappa
    # ignored_features = ignore_list,
    class_weights= …
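Since the difference between the two metrics is the sticking point: class weights re-weight the training loss to handle imbalance, while weighted kappa weights disagreements by how far apart the predicted and true classes are, which mainly matters for ordinal labels. A quick illustration with sklearn on made-up labels (it doesn't answer which metric to pick, it only shows what the weighting changes):

from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 1, 1, 0, 0, 1]

print("Kappa:         ", cohen_kappa_score(y_true, y_pred))
print("Weighted Kappa:", cohen_kappa_score(y_true, y_pred, weights="quadratic"))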
Category: Data Science
