CatBoost not working properly when I remove non-important variables (source of randomness?)

I was wondering if anyone has encountered the same. The thing is, when I run a CatBoost boosting model, delete the non-important variables (feature importance by prediction importance = 0; in fact these variables do not appear in the boosting trees), and rerun the model without those zero-importance variables, the results change. Has anyone encountered this issue, or does anyone know why it is happening and how to fix it? This does not happen in LightGBM or XGBoost. I …
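Below is a minimal sketch of the workflow being described, just to make the comparison concrete; df, the target column and the use of a regressor are placeholders, and the point is that random_seed is pinned in both fits so any remaining difference comes from the changed feature set itself.

# Sketch: fit, drop zero-importance features, refit with the same seed, compare.
import pandas as pd
from catboost import CatBoostRegressor

X, y = df.drop(columns=["target"]), df["target"]

model = CatBoostRegressor(random_seed=42, verbose=0)
model.fit(X, y)

importances = pd.Series(model.get_feature_importance(), index=X.columns)
kept = importances[importances > 0].index.tolist()

model2 = CatBoostRegressor(random_seed=42, verbose=0)
model2.fit(X[kept], y)

print(model.score(X, y), model2.score(X[kept], y))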
Category: Data Science

Analysis of prediction shift problem in gradient boosting

I was going through section 4.1 of the CatBoost paper, where the authors analyse prediction shift using an example consisting of 2 features that are Bernoulli random variables. I am unable to wrap my head around the experimental setup. Since there are only 2 indicator features, we can have only 4 distinct data points; everything else will be duplication. They mention that for training data points the output of the first estimator of the boosting model is biased, …
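This is not the paper's derivation, only a small Monte Carlo sketch of the prediction-shift idea it analyses: a first tree fitted on the training sample leaves systematically smaller residuals (i.e. shifted gradient estimates) on that same sample than on fresh data from the identical distribution. The constants and sample sizes below are arbitrary.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample(n):
    x = rng.integers(0, 2, size=(n, 2))              # two Bernoulli(1/2) features
    y = 3.0 * x[:, 0] + 1.0 * x[:, 1] + rng.normal(0.0, 1.0, n)
    return x, y

train_mse, fresh_mse = [], []
for _ in range(2000):
    X_tr, y_tr = sample(20)
    X_new, y_new = sample(20)
    stump = DecisionTreeRegressor(max_depth=1).fit(X_tr, y_tr)
    train_mse.append(np.mean((y_tr - stump.predict(X_tr)) ** 2))
    fresh_mse.append(np.mean((y_new - stump.predict(X_new)) ** 2))

# Residuals of the first estimator are smaller on the data it was fitted on,
# so the gradients fed to the next tree are biased.
print("mean train residual MSE:", np.mean(train_mse))
print("mean fresh residual MSE:", np.mean(fresh_mse))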
Category: Data Science

Model Dump Parser (like XGBFI) for LightGBM and CatBoost

Currently my employer has multiple GLMs in a live environment. I am interested in identifying new features and interactions to enhance the accuracy of these GLMs; for now I am limited to the GLM structure, so simply deploying a solution which automatically accounts for interactions is not possible. I have in the past used XGBoost to identify powerful feature interactions through the use of XGBFI / XGBFIR. I am now looking into using LightGBM and CatBoost to do the …
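For LightGBM, one way to get a rough XGBFI-style view without a dedicated parser is to walk the dictionary returned by Booster.dump_model() and count which features co-occur on a root-to-leaf path. This is a sketch, not XGBFI's exact scoring, and it assumes a trained lightgbm Booster named bst (for the sklearn wrapper, use model.booster_). For the CatBoost side, the library itself can report pairwise interaction strength via get_feature_importance(type='Interaction'), if I recall the API correctly.

from collections import Counter
from itertools import combinations

pair_counts = Counter()

def walk(node, path):
    if "split_feature" not in node:                    # leaf node
        for pair in combinations(sorted(set(path)), 2):
            pair_counts[pair] += 1
        return
    path = path + [node["split_feature"]]
    walk(node["left_child"], path)
    walk(node["right_child"], path)

for tree in bst.dump_model()["tree_info"]:
    walk(tree["tree_structure"], [])

print(pair_counts.most_common(10))                     # candidate 2-way interactions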
Category: Data Science

How to tell CatBoost which feature is categorical?

I am excited to learn that CatBoost can handle categorical features by itself. One of my features, Department ID, is categorical, but it looks numeric, since the values are like 1001, 1002, ..., 1218. Those numbers are just IDs of the departments and carry no numeric or ordinal meaning. How do I tell CatBoost to treat this feature as categorical (nominal), not numeric? Thanks.
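A minimal sketch of the usual way to do this: list the column in cat_features (on the Pool or on fit), and optionally cast the IDs to strings so nothing downstream treats them as numbers. X, y and the column name "department_id" are placeholders.

from catboost import CatBoostClassifier, Pool

X["department_id"] = X["department_id"].astype(str)      # IDs as labels, not numbers

train_pool = Pool(X, y, cat_features=["department_id"])
model = CatBoostClassifier(iterations=500, verbose=0)
model.fit(train_pool)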
Category: Data Science

How to tune a Catboost Regressor

I have been trying to study hyperparameter tuning for the CatBoost regressor for my regression problem. The only issue is that I can't figure out which parameters I should tune for my use case out of the sea of parameters available for CatBoost. I am unable to find any helpful sources that would guide me through the selection of parameters to be tuned, so I would appreciate hearing how people usually choose the parameters they want to …
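One common starting point, shown only as a sketch: randomised search over the handful of parameters that tend to matter first (tree depth, learning rate, L2 leaf regularisation, number of iterations). The ranges below are illustrative, and X, y stand in for your data.

from catboost import CatBoostRegressor
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "depth": randint(4, 11),
    "learning_rate": uniform(0.01, 0.29),
    "l2_leaf_reg": uniform(1, 9),
    "iterations": randint(300, 1500),
}

search = RandomizedSearchCV(
    CatBoostRegressor(verbose=0, random_seed=42),
    param_distributions,
    n_iter=30,
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)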
Category: Data Science

Confused about CatBoostRegressor feature importance vs SHAP

I'm confused about the results from a CatBoostRegressor model. I followed this article: https://towardsdatascience.com/catboost-regression-in-6-minutes-3487f3e5b329 My confusion is about the difference in the ordering of the variables between the "CatBoost Feature Importance" figure and the SHapley Additive exPlanations (SHAP) plot. When I experiment with other datasets I get even bigger differences between these two plots. Why can they differ? And what does it say about feature importance when one variable scores high on one and low on the other? My own result.
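For reference, a sketch of how the two rankings can be pulled from the same fitted model (model, X and y are placeholders): the article's feature-importance bar chart is presumably CatBoost's default PredictionValuesChange, while a SHAP summary plot ranks features by mean absolute per-row SHAP value, so the two orderings measure different things and need not agree.

import numpy as np
import pandas as pd
from catboost import Pool

pool = Pool(X, y)

pvc = pd.Series(
    model.get_feature_importance(type="PredictionValuesChange"),
    index=X.columns,
).sort_values(ascending=False)

shap = model.get_feature_importance(data=pool, type="ShapValues")   # shape (n_rows, n_features + 1)
mean_abs_shap = pd.Series(
    np.abs(shap[:, :-1]).mean(axis=0),                              # drop the expected-value column
    index=X.columns,
).sort_values(ascending=False)

print(pvc.head())
print(mean_abs_shap.head())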
Category: Data Science

Unable to tune hyperparameters for CatBoostRegressor

I am trying to fit a CatBoostRegressor to my data. When I perform k-fold CV for the baseline model everything works fine, but when I use Optuna for hyperparameter tuning it does something really weird: it runs the first trial and then throws the following error:
[I 2021-08-26 08:00:56,865] Trial 0 finished with value: 0.7219653113910736 and parameters: {'model__depth': 2, 'model__iterations': 1715, 'model__subsample': 0.5627211605250965, 'model__learning_rate': 0.15601805222619286}. Best is trial 0 with value: 0.7219653113910736.
[W 2021-08-26 08:00:56,869] Trial 1 failed because …
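The error itself is cut off above, so the following is only a sketch of the kind of setup being described (the model__ prefixes suggest the CatBoostRegressor sits inside a sklearn Pipeline); pipeline, X and y are placeholders.

import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "model__depth": trial.suggest_int("model__depth", 2, 10),
        "model__iterations": trial.suggest_int("model__iterations", 500, 2000),
        "model__subsample": trial.suggest_float("model__subsample", 0.5, 1.0),
        "model__learning_rate": trial.suggest_float("model__learning_rate", 0.01, 0.3, log=True),
    }
    pipeline.set_params(**params)
    return cross_val_score(pipeline, X, y, cv=5, scoring="r2").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)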
Category: Data Science

Are linear models better when dealing with too many features? If so, why?

I had to build a classification model in order to predict what a user's rating would be from his/her review. (I was dealing with this dataset: Trip Advisor Hotel Reviews.) After some preprocessing, I compared the results of a Logistic Regression with a CatBoost Classifier, both of them with the default hyperparameters. The Logistic Regression gave me a better AUC and F1-score. I've heard some colleagues saying that this happened because linear models are better when dealing with …
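The original preprocessing isn't shown, so this is only a sketch of the kind of comparison being described, assuming TF-IDF features (which produce the very wide, sparse design matrix where linear models are usually strong); reviews and ratings are placeholders.

from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X = TfidfVectorizer(max_features=20000).fit_transform(reviews)
X_tr, X_te, y_tr, y_te = train_test_split(X, ratings, random_state=42)

logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
cat = CatBoostClassifier(verbose=0).fit(X_tr, y_tr)

print("LogReg   F1:", f1_score(y_te, logreg.predict(X_te), average="macro"))
print("CatBoost F1:", f1_score(y_te, cat.predict(X_te).ravel(), average="macro"))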
Category: Data Science

Why feature engineering and filling NaN's reduce score?

I used CatBoost for an InClass Kaggle competition. I have tried various strategies for filling NaN values, converting float binary variables to categorical, and adding new categorical features (from age, for example). I have tried generating new features from existing ones, and also tried removing irrelevant features (by correlation, feature importance, SHAP). But it all only makes it worse! Why? The best score came out without any preprocessing, only with hyperparameters found via random search.
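One thing worth checking before blaming the engineered features themselves: CatBoost already handles numeric NaNs natively (the nan_mode setting, 'Min' by default), so imputing them can erase a useful "missingness" signal. A sketch of a direct comparison, assuming a binary target; X (containing NaNs) and y are placeholders.

from catboost import Pool, cv

raw = Pool(X, y)
imputed = Pool(X.fillna(X.median(numeric_only=True)), y)

params = {"loss_function": "Logloss", "iterations": 500, "verbose": False}
print(cv(raw, params, fold_count=5)["test-Logloss-mean"].min())
print(cv(imputed, params, fold_count=5)["test-Logloss-mean"].min())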
Category: Data Science

Catboost not able to handle a very simple dataset?

This is a post from a newbie and so might be a really poor question based on lack of knowledge. Thank you kindly! I'm using CatBoost, which seems excellent, to fit a trivial dataset, and the results are terrible. If someone could point me in the right direction I'd sure appreciate it. Here is the code in its entirety:
import catboost as cb
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline …
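Since the code above is cut off, here is a self-contained minimal example on a comparably trivial dataset; with default settings CatBoost should recover a simple linear relationship almost exactly, which can serve as a baseline for spotting what differs in the original script.

import numpy as np
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = CatBoostRegressor(verbose=0).fit(X_tr, y_tr)
print(r2_score(y_te, model.predict(X_te)))        # should be close to 1.0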
Category: Data Science

Why does Catboost outperform other boosting algorithms?

I have noticed while working with multiple datasets that CatBoost with its default parameters tends to outperform LightGBM or XGBoost with their default parameters, even on tabular datasets with no categorical features. I am assuming this has something to do with the way CatBoost constructs its decision trees, but I just wanted to confirm this theory. If anyone could elaborate on why it performs better on non-categorical data, that would be great! Thanks in advance!
Category: Data Science

How to use the eval set in catboost appropriately?

Let's say you have a dataset, and you split it into 80% training and 20% testing. Naturally, you want to find the optimal hyperparameters for your model, so with the training set, you plan to do cross validation and search parameter space. CatBoost has something called the eval set which is used to help avoid overfitting, but I have a fundamental question on how to use it appropriately. Say you do CV10. So now we have 10 iterations where 90% …
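One common pattern, shown here only as a sketch rather than the single valid answer: inside each of the 10 folds, carve a small validation slice out of that fold's 90% training portion and pass it as eval_set so early stopping picks the number of trees, leaving the fold's held-out 10% untouched except for scoring. X and y are assumed to be numpy arrays.

import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import KFold, train_test_split

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    X_fit, X_val, y_fit, y_val = train_test_split(
        X[train_idx], y[train_idx], test_size=0.15, random_state=0
    )
    model = CatBoostRegressor(iterations=2000, verbose=0)
    model.fit(X_fit, y_fit, eval_set=(X_val, y_val), early_stopping_rounds=100)
    scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(scores))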
Category: Data Science

How to achieve SHAP values for a CatBoost model in R?

I'm asked to create a SHAP analysis in R but I cannot find out how to obtain it for a CatBoost model. I can get the SHAP values of an XGBoost model with shap_values <- shap.values(xgb_model = model, X_train = train_X) but not for CatBoost. Here is the reproducible code for my CatBoost model:
library(data.table)
library(catboost)
train_example <- data.table(categorical_feature = c("a", "b", "a", "a", "b"),
                            payment = c(244, 52352, 4235, 3422, 535),
                            age = c(34, 27, 19, 40, 92),
                            target …
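I can't vouch for the exact R call from memory (the catboost R package exposes catboost.get_feature_importance, which should accept a ShapValues type, but treat that as an assumption to verify against the docs). For reference, this is what the equivalent looks like through the Python API, including the shape of what comes back; model, X, y and cat_cols are placeholders.

from catboost import Pool

pool = Pool(X, y, cat_features=cat_cols)
shap_values = model.get_feature_importance(data=pool, type="ShapValues")
print(shap_values.shape)      # (n_rows, n_features + 1); the last column is the expected (base) value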
Category: Data Science

Feature Selection before modeling with Boosting Trees

I have read in some papers that the subset of features chosen for a boosting-tree algorithm can make a big difference to performance, so I've been trying RFE, Boruta, variable clustering, correlation, WOE & IV, and chi-square. Let's say I have a classification problem with over 40 variables. Best results after a long, long time of testing: all variables for LightGBM (except one variable with high collinearity); for XGBoost I removed correlated variables (around 8 correlated ones); I …
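Alongside RFE and Boruta, a cheap baseline worth keeping in the comparison is to let the booster's own importances do the screening via sklearn's SelectFromModel; a sketch, with X, y as placeholders and the median threshold purely illustrative.

from catboost import CatBoostClassifier
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(
    CatBoostClassifier(iterations=500, verbose=0),
    threshold="median",                      # keep features above the median importance
)
X_reduced = selector.fit_transform(X, y)
print(X.shape[1], "->", X_reduced.shape[1], "features kept")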
Category: Data Science

How do Classification Algorithms such as Catboost and Random Forest parse test data?

I would like to know how classification works with the algorithms listed above. My specific question is this: say I have a high-signal continuous feature which has a certain distribution, and I train a model on some training data and it finds the best split for that feature. When I use the model on test data, would it split according to a specific number or by distribution? i.e. if the number '10' provides the best split for the …
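A tiny illustration of the point being asked about, using a plain decision stump: the threshold is learned as a fixed number at training time and applied verbatim to any test data, no matter how the test distribution looks.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(10, 3, size=(500, 1))
y_train = (X_train[:, 0] > 10).astype(int)             # true boundary at 10

stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
print("learned threshold:", stump.tree_.threshold[0])   # close to 10

X_test = rng.normal(25, 3, size=(5, 1))                 # shifted test distribution
print(stump.predict(X_test))                            # every row routed by the same fixed number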
Category: Data Science

How to understand the definition of Greedy Target-based Statistics in the CatBoost paper

There is a method named target statistics for dealing with categorical features in the CatBoost paper. I still have some confusion about the mathematical form. Could someone explain how to compute it? $$ \hat{x}^i_k = \frac{\sum^{p-1}_{j=1}[x_{\sigma_{j},k}=x_{\sigma_p,k}]Y_{\sigma_j}+a\cdot P}{\sum^{p-1}_{j=1}[x_{\sigma_{j},k}=x_{\sigma_p,k}]+a}$$
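Reading the formula literally, a small worked computation may help: for the example at position p in the permutation sigma, average the targets of the earlier examples that share its category value, smoothed towards a prior P with weight a. The numbers below are made up.

categories = ["red", "blue", "red", "red", "blue"]   # one categorical feature, already in permutation order
y          = [1,      0,      1,     0,     1]
a, P = 1.0, 0.5                                       # smoothing weight and prior (e.g. the mean target)

encoded = []
for p in range(len(categories)):
    same = [j for j in range(p) if categories[j] == categories[p]]   # earlier rows with the same category
    num = sum(y[j] for j in same) + a * P
    den = len(same) + a
    encoded.append(num / den)

print(encoded)   # [0.5, 0.5, 0.75, 0.8333..., 0.25]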
Topic: catboost
Category: Data Science

Does Gradient Boosting perform n-ary splits where n > 2?

I wonder whether algorithms such as GBM, XGBoost, CatBoost, and LightGBM perform more than two splits at a node in the decision trees? Can a node be split into 3 or more branches instead of merely binary splits? Can more than one feature be used in deciding how to split a node? Can a feature be re-used in splitting a descendant node?
Category: Data Science

Catboost multiclassification evaluation metric: Kappa & WKappa

I am working on an unbalanced classification problem and I want to use Kappa as my evaluation metric. Considering the classifier accepts class weights (which I have given it), should I still be using weighted kappa or just the standard kappa? I am not entirely sure of the difference, to be honest.
model = CatBoostClassifier(
    iterations = 1000,
    learning_rate = 0.1,
    random_seed = 400,
    one_hot_max_size = 15,
    loss_function = 'MultiClass',
    eval_metric = 'WKappa',   # weighted kappa
    # ignored_features = ignore_list,
    class_weights= …
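Since the difference between the two metrics is the sticking point: class weights re-weight the training loss to handle imbalance, while weighted kappa weights disagreements by how far apart the predicted and true classes are, which mainly matters for ordinal labels. A quick illustration with sklearn on made-up labels (it doesn't answer which metric to pick, it only shows what the weighting changes):

from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 1, 1, 0, 0, 1]

print("Kappa:         ", cohen_kappa_score(y_true, y_pred))
print("Weighted Kappa:", cohen_kappa_score(y_true, y_pred, weights="quadratic"))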
Category: Data Science
