Custom preprocessing using pipelines

Question

spectre

February 8, 2022 05:03

I have searched a lot for this issue but came up with nothing. Usually in an ML model, during preprocessing, we use Pipeline and ColumnTransformer to group the preprocessing steps and the algorithm together. The problem with a Pipeline is that it applies each specified preprocessing step to all the columns. For example, if I specify:

pipeline = Pipeline(steps = [('scale', StandardScaler()), ('encode', OneHotEncoder())])

The above pipeline will apply StandardScaler to all the columns of the dataset, and the encoder will apply to all the categorical columns. I don't want that. I want different preprocessing techniques for different columns. For example, out of 5 numerical columns, I want standard scaling for only 3 columns and robust scaling for the other 2. Similarly, out of 4 categorical columns, I want OneHotEncoder for 2 columns and LabelEncoder for the other 2.

How can I implement this using Pipeline or make_pipeline? Or can it even be implemented?

Edit based on comment:

I tried the ColumnTransformer method, but it gives me an error on fitting:

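import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder

data = pd.read_csv('cars_sampled.csv')
data

data2 = data.copy(deep = True)
data2

y = np.log(data2['price'])
data2.drop(['price'], axis = 1, inplace = True)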

num1 = ['powerPS']
num2 = ['kilometer', 'age']
cat_impute1 = ['fuelType', 'gearbox', 'model']
cat_impute2 = ['vehicleType', 'notRepairedDamage']
cat_encode1 = ['model', 'fuelType', 'gearbox', 'notRepairedDamage', 'vehicleType', 'brand']


preproc = ColumnTransformer(transformers = [
    ('num1', StandardScaler(), num1),
    ('num2', RobustScaler(), num2),
    ('cat_impute1', SimpleImputer(strategy = 'most_frequent'), cat_impute1),
    ('cat_impute2', SimpleImputer(strategy = 'constant', fill_value = 'Missing'), cat_impute2),
    ('cat_encode1', OneHotEncoder(), cat_encode1)
])

pipeline = Pipeline(
    steps=[
        ('preprocessor', preproc),
        ('model', LinearRegression())
    ]
)

train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size = 0.25, random_state = 69)

pipeline.fit(train_x, train_y)

The error is in the last line pipeline.fit(train_x, train_y) and it is as follows:

ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.

Edit 2:

data = pd.read_csv('cars_sampled.csv')
data

data2 = data.copy(deep = True)
data2

y = np.log(data2['price'])
data2.drop(['price'], axis = 1, inplace = True)

num1 = ['powerPS']
num2 = ['kilometer', 'age']
cat_impute1 = ['fuelType', 'gearbox', 'model']
cat_impute2 = ['vehicleType', 'notRepairedDamage']
cat_encode1 = ['model', 'fuelType', 'gearbox', 'notRepairedDamage', 'vehicleType', 'brand']

cat_pipe = Pipeline(steps = [
    ('impute', SimpleImputer(strategy = 'most_frequent')),
    ('encode', OneHotEncoder())
])

preproc = ColumnTransformer(transformers = [
    ('num1', StandardScaler(), num1),
    ('num2', RobustScaler(), num2),
    ('cat', cat_pipe, cat_encode1)
])

pipeline = Pipeline(
    steps=[
        ('preprocessor', preproc),
        ('model', LinearRegression())
    ]
)

train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size = 0.25, random_state = 69)

pipeline.fit(train_x, train_y)

Topic pipelines preprocessing python machine-learning

Category Data Science

Oxbowerce · Accepted Answer · November 30, 2021 08:49

You mentioned the ColumnTransformer, which you should be able to use to achieve this (see also the ColumnTransformer page in the scikit-learn documentation):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder, OrdinalEncoder
from sklearn.linear_model import LogisticRegression

preprocessor = ColumnTransformer(
    transformers=[
        ("num1", StandardScaler(), ["col1", "col2", "col3"]),
        ("num2", RobustScaler(), ["col4", "col5"]),
        ("cat1", OneHotEncoder(), ["col6", "col7"]),
        # OrdinalEncoder instead of LabelEncoder: LabelEncoder is meant for the
        # target y and does not accept the (X, y) call a ColumnTransformer makes.
        ("cat2", OrdinalEncoder(), ["col8", "col9"]),
    ]
)

pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression())
    ]
)
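
As a rough illustration of how this applies to the code in the question's edits, here is a minimal sketch on a small made-up DataFrame. The values below are assumptions for illustration, not the asker's cars_sampled.csv. A ColumnTransformer hands each transformer the original columns and places the results side by side, so imputing and then encoding the same columns has to happen as sequential steps inside one nested Pipeline rather than as separate ColumnTransformer entries.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler

# Toy data standing in for the real dataset (all values are made up).
X = pd.DataFrame({
    "powerPS": [75.0, 110.0, 150.0, 90.0, 60.0, 200.0],
    "kilometer": [150000.0, 90000.0, 30000.0, 125000.0, 150000.0, 5000.0],
    "fuelType": pd.Series(
        ["petrol", "diesel", np.nan, "petrol", "diesel", "petrol"], dtype=object
    ),
    "gearbox": pd.Series(
        ["manual", np.nan, "automatic", "manual", "manual", "automatic"], dtype=object
    ),
})
y = np.log([1500.0, 5200.0, 18000.0, 3900.0, 1200.0, 35000.0])

# Impute first, then encode: two sequential steps on the same columns.
cat_pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Each entry only sees the columns listed for it.
preproc = ColumnTransformer(transformers=[
    ("num1", StandardScaler(), ["powerPS"]),
    ("num2", RobustScaler(), ["kilometer"]),
    ("cat", cat_pipe, ["fuelType", "gearbox"]),
])

pipeline = Pipeline(steps=[
    ("preprocessor", preproc),
    ("model", LinearRegression()),
])

pipeline.fit(X, y)
predictions = pipeline.predict(X)

# 2 scaled numeric columns + 2 fuelType categories + 2 gearbox categories.
n_features = pipeline.named_steps["preprocessor"].transform(X).shape[1]
print(n_features, predictions.shape)
```

Because the imputer and the encoder live in the same nested Pipeline, the encoder only ever sees the already-imputed values, and the ColumnTransformer stacks purely numeric outputs.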
