Custom preprocessing using pipelines
I have searched a lot for this issue but unfortunately came up with nothing.
Usually in an ML model, during preprocessing, we use Pipeline and ColumnTransformer to group together the preprocessing steps and the algorithm. The problem with a Pipeline is that it performs the specified preprocessing on all the columns. For example, if I specify:
pipeline = Pipeline(steps = [('scale', StandardScaler()), ('encode', OneHotEncoder())])
The above pipeline will apply the standard scaler to all the columns of the dataset, and the encoder to all the categorical columns. I don't want that. I want different preprocessing techniques for different columns. For example, out of 5 numerical columns, I want standard scaling for only 3 columns and robust scaling for the other 2. Similarly, out of 4 categorical columns, I want OneHotEncoder for 2 columns and LabelEncoder for the other 2.
How can I implement this using Pipeline or make_pipeline? Can it even be implemented?
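To make the requirement concrete, here is a minimal sketch of per-column preprocessing with ColumnTransformer. The toy DataFrame and its column names (n1, n2, c1, c2) are hypothetical placeholders, not the real dataset. One assumption to flag: LabelEncoder is documented as an encoder for target labels (its fit takes a single 1-D argument), so the sketch swaps in OrdinalEncoder as the feature-side equivalent.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    RobustScaler,
    StandardScaler,
)

# Hypothetical toy data; column names are placeholders for illustration only.
df = pd.DataFrame({
    "n1": [1.0, 2.0, 3.0, 4.0],
    "n2": [10.0, 20.0, 30.0, 400.0],
    "c1": ["a", "b", "a", "c"],
    "c2": ["x", "y", "x", "y"],
})

# Each entry is (name, transformer, columns): the transformer only sees
# the columns listed for it.
preproc = ColumnTransformer(
    transformers=[
        ("std", StandardScaler(), ["n1"]),
        ("robust", RobustScaler(), ["n2"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["c1"]),
        # OrdinalEncoder stands in for LabelEncoder (an assumption, see above).
        ("ordinal", OrdinalEncoder(), ["c2"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

out = preproc.fit_transform(df)
print(out.shape)  # 1 scaled + 1 scaled + 3 one-hot + 1 ordinal columns
```

The resulting preproc object can then be used as the first step of a Pipeline, with the estimator as the second step.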
Edit based on comment:
I tried the ColumnTransformer method, but it gives me an error on fitting:
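import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler
data = pd.read_csv('cars_sampled.csv')
data2 = data.copy(deep = True)  # work on a copy so the raw data stays untouched
y = np.log(data2['price'])  # log-transformed target
data2.drop(['price'], axis = 1, inplace = True)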
num1 = ['powerPS']
num2 = ['kilometer', 'age']
cat_impute1 = ['fuelType', 'gearbox', 'model']
cat_impute2 = ['vehicleType', 'notRepairedDamage']
cat_encode1 = ['model', 'fuelType', 'gearbox', 'notRepairedDamage', 'vehicleType', 'brand']
preproc = ColumnTransformer(transformers = [
    ('num1', StandardScaler(), num1),
    ('num2', RobustScaler(), num2),
    ('cat_impute1', SimpleImputer(strategy = 'most_frequent'), cat_impute1),
    ('cat_impute2', SimpleImputer(strategy = 'constant', fill_value = 'Missing'), cat_impute2),
    ('cat_encode1', OneHotEncoder(), cat_encode1)
])
# Step names must be strings
pipeline = Pipeline(
    steps=[
        ('preprocessor', preproc),
        ('model', LinearRegression())
    ]
)
train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size = 0.25, random_state = 69)
pipeline.fit(train_x, train_y)
The error is raised by the last line, pipeline.fit(train_x, train_y), and it is as follows:
ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.
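A possible reading of this error, sketched on hypothetical toy data (the values below are made up): each transformer in a ColumnTransformer receives the original columns, and the outputs are concatenated side by side. The imputers therefore do not feed the encoder. Their output is a block of strings, which cannot be stacked next to the sparse numeric output of OneHotEncoder. The sketch runs the two transformers separately to show what each one returns.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one missing category.
toy = pd.DataFrame(
    {
        "fuelType": ["petrol", np.nan, "diesel", "petrol"],
        "brand": ["audi", "bmw", "audi", "vw"],
    },
    dtype=object,
)

# The imputer fills the gap but still returns strings (object dtype).
imputed = SimpleImputer(strategy="most_frequent").fit_transform(toy[["fuelType"]])

# The encoder returns a sparse numeric matrix by default.
encoded = OneHotEncoder().fit_transform(toy[["brand"]])

print(imputed.dtype)             # dtype of the imputed block
print(sparse.issparse(encoded))  # whether the encoded block is sparse
```

If that reading is right, chaining the imputer and the encoder inside one nested Pipeline for the categorical columns, so that imputation happens before encoding, should avoid the mixed string and sparse outputs.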
Edit 2:
data = pd.read_csv('cars_sampled.csv')
data2 = data.copy(deep = True)  # work on a copy so the raw data stays untouched
y = np.log(data2['price'])  # log-transformed target
data2.drop(['price'], axis = 1, inplace = True)
num1 = ['powerPS']
num2 = ['kilometer', 'age']
cat_impute1 = ['fuelType', 'gearbox', 'model']
cat_impute2 = ['vehicleType', 'notRepairedDamage']
cat_encode1 = ['model', 'fuelType', 'gearbox', 'notRepairedDamage', 'vehicleType', 'brand']
# Impute first, then encode, within a single branch for the categorical columns
cat_pipe = Pipeline(steps = [
    ('impute', SimpleImputer(strategy = 'most_frequent')),
    ('encode', OneHotEncoder())
])
preproc = ColumnTransformer(transformers = [
    ('num1', StandardScaler(), num1),
    ('num2', RobustScaler(), num2),
    ('cat', cat_pipe, cat_encode1)
])
# Step names must be strings
pipeline = Pipeline(
    steps=[
        ('preprocessor', preproc),
        ('model', LinearRegression())
    ]
)
train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size = 0.25, random_state = 69)
pipeline.fit(train_x, train_y)