Does sklearn.pipeline have a single mechanism for cross-validation regardless of model API?

Question

user5406764

January 12, 2022, 16:39

With a single standard interface (sklearn.pipeline) on top of different regressors, how do I use cross-validation?

The example below uses two regressors with different internal cross-validation mechanisms, and I'm trying to figure out the correct way to do this without resorting to calling each differently.

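import catboost as cb
import numpy as np
from scikeras.wrappers import KerasRegressor
import sklearn.pipeline
import sklearn.preprocessing
# Sequential and Dense are used in create_model() below
# (assuming the Keras bundled with TensorFlow)
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential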

# *************************************
# First, the common code
# *************************************

# Build x and y
# y is a simple linear function (with noise added)
x = np.arange(100)
y = x * 10 + 4 + np.random.random(len(x)) * np.random.binomial(len(x), 0.5)

# We only have one feature so this is needed to make the regressors happy
x = x.reshape(-1, 1)

# Split into train/test in preparation for cross-validation
x_train = x[:80]
x_test = x[80:]
y_train = y[:80]
y_test = y[80:]

# *************************************
# Exhibit A
# Cross-validation using e.g., CatBoost
# (Yes I know gbm isn't a great choice
# for this dataset, just ignore that
# for now)
# *************************************

pipeline = sklearn.pipeline.Pipeline([
    ('scaler', sklearn.preprocessing.StandardScaler()),
    ('model', cb.CatBoostRegressor())
])

pipeline.fit(
    x_train,
    y_train,

    # Tell CatBoostRegressor to use cross-validation
    # Pipeline lets me funnel params to the 'model' component
    # which is in this case a CatBoostRegressor
    model__eval_set=(x_test, y_test)
)

# *************************************
# Exhibit B
# Cross-validation using e.g., Keras
# *************************************

# Helper function for pipeline to create the model
def create_model():
    model = Sequential()
    model.add(Dense(units=1, input_dim=x_train.shape[1]))
    model.add(Dense(units=4))
    model.add(Dense(units=1))
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

pipeline2 = sklearn.pipeline.Pipeline([
    ('scaler', sklearn.preprocessing.StandardScaler()),
    ('model', KerasRegressor(model=create_model))
])

pipeline2.fit(
    x_train,
    y_train,

    # Keras needs a completely different parameter
    # funneled through to turn on CV
    model__validation_data=(x_test, y_test)
)

So my question is:

Can I use the Pipeline API to activate cross-validation for these two (or any other) model types using the same call to fit(), so that

Exhibit A = pipeline.fit(x_train, y_train, Pipeline way to activate cv)
Exhibit B = pipeline2.fit(x_train, y_train, Pipeline way to activate cv)

are identical?
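
For comparison, the model-agnostic route that scikit-learn itself offers is k-fold cross-validation via sklearn.model_selection.cross_val_score. This is a different thing from the held-out validation set that eval_set and validation_data hand to the model during fit(). Below is a minimal sketch of that route, not a confirmed answer to the question. It assumes LinearRegression as a stand-in for the 'model' step so the snippet runs without CatBoost or Keras, and the fold count and scoring metric are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same kind of synthetic data as in the question: one feature, noisy linear target
rng = np.random.default_rng(0)
x = np.arange(100).reshape(-1, 1)
y = x.ravel() * 10 + 4 + rng.random(100) * rng.binomial(100, 0.5)

# LinearRegression is a stand-in (an assumption, not from the question);
# any scikit-learn-compatible regressor could sit in the 'model' slot
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])

# The splitting happens outside the estimator, so this call does not rely on
# model-specific fit() parameters such as eval_set or validation_data
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, x, y, cv=cv, scoring='neg_mean_squared_error')

# One (negated) mean squared error per fold
print(scores)
print(scores.mean())
```

Because the scaler is inside the pipeline, it is refit on the training part of each fold, so the held-out fold does not leak into the scaling statistics.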

Topic pipelines cross-validation scikit-learn

Category Data Science
