Custom preprocessing using pipelines
I have searched a lot for this issue but unfortunately came up with nothing.
Usually in an ML model, during preprocessing, we use Pipeline and ColumnTransformer to group together the preprocessing steps and the algorithm. The problem with a Pipeline is that it applies each preprocessing step to all the columns. For example, if I specify:
pipeline = Pipeline(steps = [('scale', StandardScaler()), ('encode', OneHotEncoder())])
The pipeline above applies StandardScaler to every column of the dataset, and OneHotEncoder then runs on every column that comes out of the scaler. I don't want that. I want different preprocessing techniques for different columns. For example, out of 5 numerical columns, I want standard scaling for only 3 columns and robust scaling for the other 2. Similarly, out of 4 categorical columns, I want OneHotEncoder for 2 columns and LabelEncoder for the other 2.
How can I implement this using Pipeline or make_pipeline? Can it be implemented at all?
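To make the goal concrete, here is a minimal sketch of the kind of per-column preprocessing described above, assuming a ColumnTransformer is an acceptable way to express it. The toy data and the column names (n1..n5, c1..c4) are hypothetical placeholders. OrdinalEncoder is swapped in for LabelEncoder, because LabelEncoder is meant for target labels and its fit(y) signature does not match how a ColumnTransformer calls its transformers.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    RobustScaler,
    StandardScaler,
)

# Hypothetical toy data: 5 numerical and 4 categorical columns.
df = pd.DataFrame({
    "n1": [1.0, 2.0, 3.0, 4.0],
    "n2": [10.0, 20.0, 30.0, 40.0],
    "n3": [0.5, 0.1, 0.9, 0.3],
    "n4": [100.0, 150.0, 120.0, 900.0],
    "n5": [7.0, 8.0, 9.0, 50.0],
    "c1": ["a", "b", "a", "c"],
    "c2": ["x", "y", "x", "y"],
    "c3": ["low", "high", "mid", "low"],
    "c4": ["s", "m", "l", "m"],
})

# Each entry is (name, transformer, columns); every transformer only
# sees the columns listed for it.
preproc = ColumnTransformer(transformers=[
    ("standard", StandardScaler(), ["n1", "n2", "n3"]),
    ("robust", RobustScaler(), ["n4", "n5"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["c1", "c2"]),
    # OrdinalEncoder stands in for LabelEncoder (an assumption, see above).
    ("ordinal", OrdinalEncoder(), ["c3", "c4"]),
])

X = preproc.fit_transform(df)
print(X.shape)
```

The output columns are the concatenation of what each transformer produces, in the order the transformers are listed.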
Edit based on comment:
I tried the ColumnTransformer approach, but it gives me an error on fitting:
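import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler

data = pd.read_csv('cars_sampled.csv')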
data
# Work on a copy so the original frame stays untouched
data2 = data.copy(deep = True)
data2
y = np.log(data2['price'])
data2.drop(['price'], axis = 1, inplace = True)
num1 = ['powerPS']
num2 = ['kilometer', 'age']
cat_impute1 = ['fuelType', 'gearbox', 'model']
cat_impute2 = ['vehicleType', 'notRepairedDamage']
cat_encode1 = ['model', 'fuelType', 'gearbox', 'notRepairedDamage', 'vehicleType', 'brand']
preproc = ColumnTransformer(transformers = [
('num1', StandardScaler(), num1),
('num2', RobustScaler(), num2),
('cat_impute1', SimpleImputer(strategy = 'most_frequent'), cat_impute1),
('cat_impute2', SimpleImputer(strategy = 'constant', fill_value = 'Missing'), cat_impute2),
('cat_encode1', OneHotEncoder(), cat_encode1)
])
pipeline = Pipeline(
steps=[
# Step names must be strings
('preprocessor', preproc),
('model', LinearRegression())
]
)
train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size = 0.25, random_state = 69)
pipeline.fit(train_x, train_y)
The error is raised by the last line, pipeline.fit(train_x, train_y), and reads:
ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.
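A likely explanation, offered as an assumption rather than a confirmed diagnosis: a ColumnTransformer runs its transformers side by side on the original columns instead of chaining them. The imputers then return string columns, which get stacked next to the sparse one-hot output and cannot be converted to numbers. The sketch below, on a hypothetical toy frame, only illustrates the side-by-side behaviour: a second transformer listed for the same column still receives the un-imputed values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; only the column name is borrowed from the question.
toy = pd.DataFrame(
    {"fuelType": ["petrol", np.nan, "diesel", "petrol"]}, dtype=object
)

side_by_side = ColumnTransformer(transformers=[
    # Fills the missing value with the most frequent category.
    ("impute", SimpleImputer(strategy="most_frequent"), ["fuelType"]),
    # Receives the ORIGINAL column, not the imputed one.
    ("raw", "passthrough", ["fuelType"]),
])

out = side_by_side.fit_transform(toy)
imputed_value = out[1, 0]  # row 1 as seen after the imputer
raw_value = out[1, 1]      # row 1 as seen by the second transformer
print(imputed_value, raw_value)
```

If that reading is right, imputation and encoding for the same columns need to be chained inside a single Pipeline that is then passed to the ColumnTransformer as one transformer.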
Edit 2:
data = pd.read_csv('cars_sampled.csv')
data
# Work on a copy so the original frame stays untouched
data2 = data.copy(deep = True)
data2
y = np.log(data2['price'])
data2.drop(['price'], axis = 1, inplace = True)
num1 = ['powerPS']
num2 = ['kilometer', 'age']
cat_impute1 = ['fuelType', 'gearbox', 'model']
cat_impute2 = ['vehicleType', 'notRepairedDamage']
cat_encode1 = ['model', 'fuelType', 'gearbox', 'notRepairedDamage', 'vehicleType', 'brand']
cat_pipe = Pipeline(steps = [('impute', SimpleImputer(strategy = 'most_frequent')),
('encode', OneHotEncoder())])
preproc = ColumnTransformer(transformers = [
('num1', StandardScaler(), num1),
('num2', RobustScaler(), num2),
('cat', cat_pipe, cat_encode1)
])
pipeline = Pipeline(
steps=[
# Step names must be strings
('preprocessor', preproc),
('model', LinearRegression())
]
)
train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size = 0.25, random_state = 69)
pipeline.fit(train_x, train_y)
Topic pipelines preprocessing python machine-learning
Category Data Science