CatBoost not able to handle a very simple dataset?

This is a post from a newbie and so might be a really poor question based on lack of knowledge. Thank you kindly!

I'm using CatBoost, which seems excellent, to fit a trivial dataset. The results are terrible. If someone could point me in the right direction I'd sure appreciate it. Here is the code in its entirety:

import catboost as cb
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Some number of samples, not super important
samples = 26

# Our target is a simple linear progression (!)
yvals = range(samples)
y = pd.DataFrame({'y': yvals})

# Our feature is an exact COPY of the target (!)
X = pd.DataFrame.from_dict({
        'x0': np.array(yvals)
})

# I want to use shuffle = False for reasons beyond the scope of this question
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle=False)

# Two stages to the pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', cb.CatBoostRegressor(loss_function='RMSE', verbose=False))
])

# Here we go
pipe.fit(X_train, y_train)

# Print results
y_hat = pipe.predict(X_test)
r2 = r2_score(y_test, y_hat)
print('r2:', r2)

The output is:

r2: -4.256672011036048

I would have expected a perfect fit, i.e. an r2 of 1.0. Am I misusing CatBoost perhaps? Thanks again for any help!!!

Topic gradient-boosting-decision-trees catboost decision-trees random-forest machine-learning

Category Data Science


"Traditional" tree models cannot extrapolate well outside the training data's range, so "I want to use shuffle = False for reasons beyond the scope of this question" actually can't be ignored. If you expect testing/production data to have significantly different values, use a different kind of model.

There are tree models that fit regression models in their leaves, sometimes called "model-based recursive partitioning", but those are not typically used as base learners in GBMs.

(A GBM like CatBoost can technically predict slightly outside the training range, because its output is a sum over many trees, but not by much and not reliably.)
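You can watch the clamping happen directly; continuing from the split above (again a sketch of my own, with the model parameters taken from the question):

import catboost as cb

# Train on targets 0..19, then predict the held-out rows whose targets are 20..25
model = cb.CatBoostRegressor(loss_function='RMSE', verbose=False)
model.fit(X_train, y_train)

# Each prediction is built from averages of training targets in the leaves,
# so every value lands near the training maximum of 19 rather than near 20..25
print(model.predict(X_test))

That clamping is exactly what drives the r2 below zero: the test targets keep growing while the predictions stay essentially flat.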
