Pipelines with categorical and NaN values
I am trying to fit a regression model on a dataset that has categorical and numerical variables along with NaN values. I want to use Pipelines for imputation and encoding. The model must satisfy the following conditions:
1.) Pipelines must be used for imputation and encoding (one-hot encoding).
2.) Imputation must be done AFTER the train/test split.
3.) Feature selection (also done AFTER the train/test split) must use mutual info regression and RFECV.
This is what I have tried so far:
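# IMPORTS
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFECV, mutual_info_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
# X AND Y FEATURES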
y = np.log(data2['price'])
data2.drop(['price'], axis = 1, inplace = True)
# CATEGORICAL VARIABLES AND NUMERICAL VARIABLES
num_cols = [cname for cname in data2.columns if data2[cname].dtype in ['int64', 'float64']]
cat_cols = [cname for cname in data2.columns if data2[cname].dtype == 'object']
# IMPUTATION/ENCODING TO BE DONE
num_trans = SimpleImputer(strategy = 'mean')
cat_trans = Pipeline(steps = [('impute', SimpleImputer(strategy = 'most_frequent')),
                              ('onehotencode', OneHotEncoder(handle_unknown = 'ignore', sparse=False))])
# PREPROCESSING USING COLUMNS TRANSFORMER
preproc = ColumnTransformer(transformers = [('cat', cat_trans, cat_cols),
                                            ('num', num_trans, num_cols)])
# MODEL INSTANCE
lire_model = LinearRegression(n_jobs = -1)
#FINAL PIPELINE WHICH IMPUTES, ENCODES AND THEN FITS MODEL WHEN CALLED
lire_pipe = Pipeline(steps = [('preproc', preproc), ('model', lire_model)])
train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size = 0.2,
                                                    random_state = 69)
# FEATURE SELECTION SECTION
# MUTUAL INFO FOR ALL VARIABLES
mi = mutual_info_regression(train_x, train_y)
mi
mi = pd.Series(mi)
mi.index = train_x.columns
mi.sort_values(ascending = False)
# RFE USING CV
rfecv = RFECV(estimator = LinearRegression(n_jobs = -1), step = 1,
              cv = 5, scoring = 'neg_mean_absolute_error', n_jobs = -1)
rfecv.fit(train_x, train_y)
print('optimal no of features are:- ', rfecv.n_features_)
train_x.columns[rfecv.get_support()]
# BASELINE MODEL
cross_lire_score = -1 * cross_val_score(lire_pipe, train_x, train_y, cv = 5,
                                        n_jobs = -1, scoring = 'neg_mean_absolute_error')
base_lire_score = cross_lire_score.mean()
The problem is that up to the train_test_split step the pipeline has not been invoked, so no NaN values have been imputed and nothing has been encoded. Anything run right after train_test_split (i.e. the feature selection part) therefore raises an error, because feature selection is being performed on data that still contains NaN values and categorical variables. The pipeline is not called until the baseline model CV; only at that point do imputation and encoding happen, not before!
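To make the failure concrete, here is a minimal sketch on a made-up toy frame (the `toy_x`/`toy_y` values are invented; only the column names echo my dataset). It assumes a reasonably recent scikit-learn and shows two things: calling `mutual_info_regression` on the raw split raises, while fitting the `preproc` ColumnTransformer on the training rows alone yields a numeric matrix the scorer accepts. `sparse_threshold=0` is an addition to force dense output and is not in the code above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import mutual_info_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for train_x / train_y (values are invented).
toy_x = pd.DataFrame({
    "gearbox": ["manual", "automatic", np.nan, "manual",
                "manual", "automatic", "manual", "automatic"],
    "powerPS": [75.0, 120.0, 90.0, np.nan, 60.0, 150.0, 80.0, 110.0],
})
toy_x["gearbox"] = toy_x["gearbox"].astype(object)
toy_y = pd.Series([7.1, 8.3, 7.6, 7.9, 6.8, 8.9, 7.2, 8.1])

# Same preprocessing idea as in the question, with explicit column lists.
cat_trans = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehotencode", OneHotEncoder(handle_unknown="ignore")),
])
preproc = ColumnTransformer(
    transformers=[
        ("cat", cat_trans, ["gearbox"]),
        ("num", SimpleImputer(strategy="mean"), ["powerPS"]),
    ],
    sparse_threshold=0,  # assumption: force a dense array as output
)

# 1) Scoring the raw split fails: strings and NaN cannot be validated as numeric.
try:
    mutual_info_regression(toy_x, toy_y)
    raw_call_failed = False
except (ValueError, TypeError):
    raw_call_failed = True

# 2) Fitting the preprocessor on the training rows only gives a numeric matrix.
toy_x_prepared = preproc.fit_transform(toy_x)
mi = mutual_info_regression(toy_x_prepared, toy_y, random_state=0)
```

This still runs the preprocessing outside the final pipeline, so on its own it probably does not satisfy condition 1.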
I tried something like the following as a workaround (everything is the same up to train_test_split):
train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size = 0.2,
                                                    random_state = 69)
# MANUALLY IMPUTING NAN VALUES
train_x['vehicleType'].fillna(train_x['vehicleType'].value_counts().index[0], inplace = True)
train_x['gearbox'].fillna(train_x['gearbox'].value_counts().index[0], inplace = True)
train_x['model'].fillna(train_x['model'].value_counts().index[0], inplace = True)
train_x['fuelType'].fillna(train_x['fuelType'].value_counts().index[0], inplace = True)
train_x['notRepairedDamage'].fillna(train_x['notRepairedDamage'].value_counts().index[0],
                                    inplace = True)
test_x['vehicleType'].fillna(train_x['vehicleType'].value_counts().index[0], inplace = True)
test_x['gearbox'].fillna(train_x['gearbox'].value_counts().index[0], inplace = True)
test_x['model'].fillna(train_x['model'].value_counts().index[0], inplace = True)
test_x['fuelType'].fillna(train_x['fuelType'].value_counts().index[0], inplace = True)
test_x['notRepairedDamage'].fillna(train_x['notRepairedDamage'].value_counts().index[0],
                                   inplace = True)
# MANUAL ENCODING
ohe = OneHotEncoder(handle_unknown = 'ignore', sparse = False)
train_x_encoded = pd.DataFrame(ohe.fit_transform(train_x[['vehicleType', 'carname',
                                                          'fuelType']]))
train_x_encoded.columns = ohe.get_feature_names(['vehicleType', 'carname', 'fuelType'])
train_x.drop(['vehicleType', 'carname', 'fuelType'], axis = 1, inplace = True)
train_x = train_x.reset_index(drop = True)
train_x_encoded = train_x_encoded.reset_index(drop = True)
train_x1 = pd.concat([train_x, train_x_encoded], axis = 1)
test_x_encoded = pd.DataFrame(ohe.transform(test_x[['vehicleType', 'carname', 'fuelType']]))
test_x_encoded.columns = ohe.get_feature_names(['vehicleType', 'carname', 'fuelType'])
test_x.drop(['vehicleType', 'carname', 'fuelType'], axis = 1, inplace = True)
test_x = test_x.reset_index(drop = True)
test_x_encoded = test_x_encoded.reset_index(drop = True)
test_x1 = pd.concat([test_x, test_x_encoded], axis = 1)
# FEATURE SELECTION SECTION
mi = mutual_info_regression(train_x1, train_y)
mi
mi = pd.Series(mi)
mi.index = train_x1.columns
mi.sort_values(ascending = False)
# RFE USING CV
rfecv = RFECV(estimator = LinearRegression(n_jobs = -1), step = 1,
              cv = 5, scoring = 'neg_mean_absolute_error', n_jobs = -1)
rfecv.fit(train_x1, train_y)
print('optimal no of features are:- ', rfecv.n_features_)
train_x1.columns[rfecv.get_support()]
# BASELINE MODEL
cross_lire_score = -1 * cross_val_score(lire_pipe, train_x1, train_y, cv = 5,
                                        n_jobs = -1, scoring = 'neg_mean_absolute_error')
base_lire_score = cross_lire_score.mean()
But now there is no point in declaring a pipeline, since I am doing the work manually, which completely defeats the purpose of a Pipeline! It is mandatory that I use a Pipeline and meet all the conditions listed above.
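For reference, one arrangement that might satisfy all three conditions is to make the selectors steps of the pipeline itself, so that imputation, encoding and selection are all refit on the training folds only. The following is a sketch under assumptions, not a verified solution for my data: the frame is synthetic, `k=5` is arbitrary, the step names `mi_select` and `rfecv` are made up, and `sparse_threshold=0` is used to get dense output.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFECV, SelectKBest, mutual_info_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for data2 / y (column names borrowed, values invented).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "gearbox": rng.choice(["manual", "automatic"], size=n).astype(object),
    "fuelType": rng.choice(["petrol", "diesel", "lpg"], size=n).astype(object),
    "powerPS": rng.normal(100.0, 25.0, size=n),
    "kilometer": rng.normal(120.0, 30.0, size=n),
})
demo_y = 0.02 * demo["powerPS"] - 0.01 * demo["kilometer"] + rng.normal(0.0, 0.1, size=n)
# Knock out some values afterwards to mimic the NaNs in the real data.
demo.loc[rng.choice(n, size=20, replace=False), "gearbox"] = np.nan
demo.loc[rng.choice(n, size=20, replace=False), "powerPS"] = np.nan

cat_cols = ["gearbox", "fuelType"]
num_cols = ["powerPS", "kilometer"]

cat_trans = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehotencode", OneHotEncoder(handle_unknown="ignore")),
])
preproc = ColumnTransformer(
    transformers=[("cat", cat_trans, cat_cols),
                  ("num", SimpleImputer(strategy="mean"), num_cols)],
    sparse_threshold=0,  # assumption: always return a dense array
)

# Impute/encode -> mutual-info filter -> RFECV -> final model, all in one pipeline.
full_pipe = Pipeline(steps=[
    ("preproc", preproc),
    ("mi_select", SelectKBest(
        score_func=partial(mutual_info_regression, random_state=0), k=5)),
    ("rfecv", RFECV(estimator=LinearRegression(), step=1, cv=5,
                    scoring="neg_mean_absolute_error")),
    ("model", LinearRegression()),
])

train_x, test_x, train_y, test_y = train_test_split(
    demo, demo_y, test_size=0.2, random_state=69)

# Every step is refit inside each CV fold, so nothing is learned from held-out rows.
scores = -1 * cross_val_score(full_pipe, train_x, train_y, cv=5,
                              scoring="neg_mean_absolute_error")
base_score = scores.mean()

# Fit once on the full training split to inspect what RFECV kept.
full_pipe.fit(train_x, train_y)
n_kept = full_pipe.named_steps["rfecv"].n_features_
test_pred = full_pipe.predict(test_x)
```

With this layout the raw `train_x` goes straight into `cross_val_score`, and the number of features RFECV retained can be read from the fitted `rfecv` step. Whether a filter followed by RFECV is the right combination for the real dataset is a separate question.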
Any help would be appreciated, as I have spent the last 3 weeks, 4 days and a good part of my non-existent social life trying to find a solution!
EDIT: I have uploaded the dataset along with the code into colab. Link is provided in the comments.