CatBoost not able to handle a very simple dataset?
This is a post from a newbie, so it might be a really poor question stemming from a lack of knowledge. Thank you kindly!
I'm using CatBoost, which seems excellent, to fit a trivial dataset. The results are terrible. If someone could point me in the right direction, I'd sure appreciate it. Here is the code in its entirety:
import catboost as cb
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Some number of samples, not super important
samples = 26
# Our target is a simple linear progression (!)
yvals = range(samples)
y = pd.DataFrame({'y': yvals})
# Our feature is an exact COPY of the target (!)
X = pd.DataFrame.from_dict({
    'x0': np.array(yvals)
})
# I want to use shuffle = False for reasons beyond the scope of this question
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
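# (Side note, added for clarity: with shuffle=False, train_test_split just
# slices the rows in order, so the training set gets the first ~80% of rows
# and the test set gets the last ~20%, i.e. the largest x0 and y values.)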
# Two stages to the pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', cb.CatBoostRegressor(loss_function='RMSE', verbose=False))
])
# Here we go
pipe.fit(X_train, y_train)
# Print results
y_hat = pipe.predict(X_test)
r2 = r2_score(y_test, y_hat)
print('r2:', r2)
The output is:
r2: -4.256672011036048
I would have expected a perfect fit, or an r2 of 1.0. Am I misusing CatBoost, perhaps? Thanks again for any help!!!
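For reference, here is what I mean by a perfect fit (a minimal check using only sklearn, independent of CatBoost; the toy values below are made up purely for illustration):
import numpy as np
from sklearn.metrics import r2_score
# If the predictions matched the test targets exactly, r2 would be 1.0
y_true = np.arange(5)
y_pred = y_true.copy()
print(r2_score(y_true, y_pred))  # prints 1.0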