CatBoost not working properly when I remove non-important variables (source of randomness?)

Question

Tom

May 20, 2022, 17:52

I was wondering if anyone has encountered the same thing. When I fit a CatBoost model, drop the non-important variables (feature importance by prediction importance = 0; in fact these variables are not in the boosting trees at all), and refit the model without the zero-importance variables, the results change. Has anyone run into this issue, or does anyone know why it happens and how to fix it? This does not happen in LightGBM or XGBoost. I know CatBoost has more sources of randomness than the traditional boosting models, but I would like to know where this one comes from and how to fix it.

Here is a reproducible example (in Python). Copy and paste the following code to fit the first model, which still includes the zero-importance features.

import numpy as np
import catboost
from catboost import CatBoostClassifier, Pool
from catboost.datasets import titanic
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import pandas as pd

titanic_df = titanic()
X = titanic_df[0].drop('Survived',axis=1)
y = titanic_df[0].Survived

BIRTHDAY_SEED = 1995
X_train, X_val, y_train, y_val = train_test_split(X,y, random_state=BIRTHDAY_SEED)        

is_cat = (X_train.dtypes != float)
for feature, feat_is_cat in is_cat.to_dict().items():
    if feat_is_cat:
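        # Replace missing categorical values with the string 'NAN' (assignment avoids chained inplace edits)
        X_train[feature] = X_train[feature].fillna('NAN')
        X_val[feature] = X_val[feature].fillna('NAN')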

cat_features_index = np.where(is_cat)[0]
pool = Pool(X_train, y_train, cat_features=cat_features_index, feature_names=list(X.columns))

# define model parameters
best_model_params = {'learning_rate': 0.21000000000000002, 'iterations': 230, 'depth': 2, 'min_data_in_leaf': 150, 'l2_leaf_reg': 13.5, 'early_stopping_rounds': 20, 'max_leaves': 31}
fit_params = {'eval_set': [(X_val, y_val)]}
model = CatBoostClassifier(eval_metric='AUC', **best_model_params, random_state=BIRTHDAY_SEED)
model.fit(pool,**fit_params)


df_importances = pd.DataFrame({'feature_name': model.feature_names_, 'importance': model.feature_importances_}).sort_values('importance')
# filter just important variables for the second model
df_importances_positive = df_importances[df_importances['importance'] > 0]
important_variables = df_importances_positive['feature_name'].tolist()

probs_val = model.predict_proba(X_val)[:,1]
model_auc = roc_auc_score(y_val, probs_val)


print(f"[INFO] First model tree counts: {model.tree_count_}")
print(f"[INFO] First model feature importances: {df_importances.tail()}")
print(f"[INFO] First model roc_auc: {model_auc}")

In the next step, I eliminate the variables with zero feature importance, and suddenly the results change. Note that the following chunk is just a copy-paste of the previous one, except that I remove the zero-importance variables with the line X = X[important_variables]. In this reproducible example the difference is not that large, but I have seen worse.

titanic_df = titanic()
X = titanic_df[0].drop('Survived',axis=1)

# THIS IS THE ONLY ONE THING THAT I AM CHANGING
X = X[important_variables]
y = titanic_df[0].Survived

BIRTHDAY_SEED = 1995
X_train, X_val, y_train, y_val = train_test_split(X,y, random_state=BIRTHDAY_SEED)        

is_cat = (X_train.dtypes != float)
for feature, feat_is_cat in is_cat.to_dict().items():
    if feat_is_cat:
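        # Replace missing categorical values with the string 'NAN' (assignment avoids chained inplace edits)
        X_train[feature] = X_train[feature].fillna('NAN')
        X_val[feature] = X_val[feature].fillna('NAN')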

cat_features_index = np.where(is_cat)[0]
pool = Pool(X_train, y_train, cat_features=cat_features_index, feature_names=list(X.columns))

# define model parameters
best_model_params = {'learning_rate': 0.21000000000000002, 'iterations': 230, 'depth': 2, 'min_data_in_leaf': 150, 'l2_leaf_reg': 13.5, 'early_stopping_rounds': 20, 'max_leaves': 31}
fit_params = {'eval_set': [(X_val, y_val)]}
model = CatBoostClassifier(eval_metric='AUC', **best_model_params, random_state=BIRTHDAY_SEED)
model.fit(pool,**fit_params)

df_importances = pd.DataFrame({'feature_name': model.feature_names_, 'importance': model.feature_importances_}).sort_values('importance')
# filter just important variables for the second model
df_importances_positive = df_importances[df_importances['importance'] > 0]
important_variables = df_importances_positive['feature_name'].tolist()

probs_val = model.predict_proba(X_val)[:,1]
model_auc = roc_auc_score(y_val, probs_val)

print(f"[INFO] Second model tree counts: {model.tree_count_}")
print(f"[INFO] Second model feature importances: {df_importances.tail()}")
print(f"[INFO] Second model roc_auc: {model_auc}")
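
One way to quantify the change beyond a single AUC number is to compare the validation probabilities of the two runs row by row. The sketch below is a minimal pure-Python illustration of that idea. The helper name compare_runs, its tol parameter, and the sample numbers are hypothetical and are not part of the code above.

```python
def compare_runs(probs_a, probs_b, tol=1e-12):
    """Summarize how far two prediction vectors differ, element by element."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both runs must score the same validation rows")
    # Absolute difference for each validation row
    diffs = [abs(a - b) for a, b in zip(probs_a, probs_b)]
    return {
        "max_abs_diff": max(diffs, default=0.0),
        "n_changed": sum(d > tol for d in diffs),
        "identical": all(d <= tol for d in diffs),
    }


# Made-up probabilities, for illustration only
first_run = [0.10, 0.80, 0.35]
second_run = [0.10, 0.75, 0.35]
report = compare_runs(first_run, second_run)
print(report)
```

To use this with the two chunks above, the first run's probs_val would need to be stored under a different name before the second chunk overwrites it.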

So my question is: if I am removing variables that have zero importance, why do the results of the model change so much? Which source of randomness am I missing?

Thanks in advance.

Topic gradient-boosting-decision-trees catboost python

Category Data Science
