How to get column names after One Hot Encoding when using Pipelines?
I am using Pipeline and ColumnTransformer to preprocess the data. Basically I am using them to impute null values, scale the numerical data and finally perform one-hot encoding with OneHotEncoder. When I fit the ColumnTransformer object on my train data and transform the train and test data, the output I get is an array with no column names; the columns are just numbered 1, 2, 3, 4, 5 and so on. Below is my code:
cat_cols = [cname for cname in train_data1.columns if train_data1[cname].dtype == 'object']
num_cols = [cname for cname in train_data1.columns if train_data1[cname].dtype in ['int64', 'float64']]
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_trans = Pipeline(steps=[('impute', SimpleImputer(strategy='mean')),
                            ('scale', StandardScaler())])
cat_trans = Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
                            ('encode', OneHotEncoder(handle_unknown='ignore'))])
from sklearn.compose import ColumnTransformer
preproc = ColumnTransformer(transformers=[('cat', cat_trans, cat_cols),
                                          ('num', num_trans, num_cols)])
X = preproc.fit_transform(train_data1)
X_final = preproc.transform(test_data1)
Here both X and X_final are arrays with no column names; the columns are just numbered 1, 2, 3 and so on.
What I want is a DataFrame with the column names present. I know I can convert an array to a DataFrame using pd.DataFrame, but how do I get the column names? I tried the following, but it doesn't work:
X_df = pd.DataFrame(X)
X_df.columns = preproc.get_feature_names()