How to get column names after One Hot Encoding when using Pipelines?

Question

spectre

December 6, 2021, 18:02

I am using Pipeline and ColumnTransformer to preprocess my data: imputing null values, scaling the numerical columns, and finally one-hot encoding the categorical columns. When I fit the ColumnTransformer on my training data and use it to transform my test data, the output is an array whose columns are labeled only 1, 2, 3, 4, 5 and so on. My code is below:

cat_cols = [cname for cname in train_data1.columns if train_data1[cname].dtype == 'object']
num_cols = [cname for cname in train_data1.columns if train_data1[cname].dtype in ['int64', 'float64']]

from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_trans = Pipeline(steps = [('impute', SimpleImputer(strategy = 'mean')), 
                          ('scale', StandardScaler())])
cat_trans = Pipeline(steps = [('impute', SimpleImputer(strategy = 'most_frequent')), 
                          ('encode', OneHotEncoder(handle_unknown = 'ignore'))])

from sklearn.compose import ColumnTransformer

preproc = ColumnTransformer(transformers = [('cat', cat_trans, cat_cols),
                                            ('num', num_trans, num_cols)])

X = preproc.fit_transform(train_data1)
X_final = preproc.transform(test_data1)

Here both X and X_final are arrays whose columns are labeled 1, 2, 3 and so on. What I want is a DataFrame with meaningful column names. I know I can convert an array to a DataFrame using pd.DataFrame, but how do I get the column names? I tried the following, but it doesn't work:

X_df = pd.DataFrame(X)
X_df.columns = preproc.get_feature_names()

Topic pipelines one-hot-encoding pandas

Category Data Science

Shrinidhi M · Accepted Answer · August 28, 2021, 17:13

Try this function to get the feature names.

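def get_feature_names(column_transformer):
    # Collect output column names from a fitted ColumnTransformer.
    # Call it as get_feature_names(preproc) after preproc has been fitted.
    new_feature_names = []

    for transformer_name, transformer, orig_feature_names in column_transformer.transformers_:

        # Columns that the ColumnTransformer drops produce no output columns.
        if isinstance(transformer, str) and transformer == 'drop':
            continue

        orig_feature_names = list(orig_feature_names)

        if isinstance(transformer, Pipeline):
            # if pipeline, get the last transformer in the Pipeline
            transformer = transformer.steps[-1][1]

        if hasattr(transformer, 'get_feature_names_out'):
            # Newer scikit-learn versions expose get_feature_names_out.
            names = list(transformer.get_feature_names_out(orig_feature_names))
        elif hasattr(transformer, 'get_feature_names'):
            # Older versions: only some transformers accept input_features.
            if 'input_features' in transformer.get_feature_names.__code__.co_varnames:
                names = list(transformer.get_feature_names(orig_feature_names))
            else:
                names = list(transformer.get_feature_names())
        else:
            # Transformers such as StandardScaler keep one output column per
            # input column, so reuse the original names.
            names = orig_feature_names

        new_feature_names.extend(names)

    return new_feature_names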
