Trouble using OneHotEncoder with multiple categorical columns

Question

83demon

April 15, 2022 01:00

I'm trying to build the final preprocessing pipeline for the Titanic dataset (the example is taken from the 'Hands-on ML' book).

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# DataFrameSelector is the custom column-selecting transformer from the book.
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(list(df_num))),
    # fill_value is only used with strategy='constant', so it is omitted here
    ('imputer', SimpleImputer(strategy='median', missing_values=np.nan)),
    ('std_scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(list(df_cat))),
    ('imputer', SimpleImputer(strategy='most_frequent', missing_values=np.nan)),
    # scikit-learn >= 1.2 renames this argument to sparse_output
    ('cat_encoder', OneHotEncoder(sparse=False)),
])

from sklearn.pipeline import FeatureUnion

# Each entry is a (name, transformer) pair; the name must be a string.
full_pipeline = FeatureUnion(transformer_list=[
        ('num_pipeline', num_pipeline),
        ('cat_pipeline', cat_pipeline),
    ])

df_prepared = full_pipeline.fit_transform(df)
df_prepared.shape
df_total = pd.DataFrame(df_prepared, columns=df.columns)
df_total

Where

df_num = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
df_cat = ['Sex', 'Embarked']

The problem is that I get a ValueError:

ValueError: Shape of passed values is (668, 10), indices imply (668, 7)

I tried dropping 'Embarked' from df_cat, and the DataFrame was created (although it was inaccurate). I assume it failed before because 'Embarked' has 3 categories (10 - 3 = 7, which is the number of columns I wanted).

If I drop 'Sex' instead, I get ValueError: Shape of passed values is (668, 8), indices imply (668, 7) (10 - 2 = 8).

How can I encode both categorical columns using Pipeline(), and what did I do wrong?

Topic pipelines one-hot-encoding scikit-learn

Category Data Science

Fnguyen · Accepted Answer · July 14, 2020 14:48

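df.columns is incorrect here: after one-hot encoding, the output has more columns, with different names, than the original DataFrame. In your example, Sex is replaced by two columns (one per value) and Embarked by three. The 5 numeric columns plus 2 plus 3 give the 10 columns reported in the error, while df.columns only holds 7 names.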
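
To make the column arithmetic concrete, here is a minimal sketch that counts how many columns the encoded output should have. The small DataFrame is made-up sample data (an assumption, not the real Titanic set); only the column names and the df_num / df_cat lists come from the question.

```python
import pandas as pd

# Made-up sample rows standing in for the Titanic data (assumption).
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 1],
    "Fare": [7.25, 71.28, 7.92, 26.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})
df_num = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
df_cat = ["Sex", "Embarked"]

# One output column per numeric feature, plus one per distinct category value.
n_original = len(df.columns)
n_encoded = len(df_num) + sum(df[col].nunique() for col in df_cat)

print(n_original, n_encoded)
```

With two values for Sex and three for Embarked, the encoded width is 5 + 2 + 3, which is why the 7 original names cannot label the result.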

The whole point of the pipeline is to create those additional columns, so why would you want to drop them? I do not know where df.columns is defined, but the error appears to come from this line:

df_total = pd.DataFrame(df_prepared, columns=df.columns)
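
One possible way to fix that line is to build the column names from the fitted encoder instead of reusing df.columns. The sketch below is an illustration under assumptions, not the book's code: DataFrameSelector is a stand-in written here for the book's helper, the DataFrame is made-up sample data, and the "Sex_female"-style naming is just one reasonable choice. It uses the encoder's default sparse output and densifies afterwards, which sidesteps the sparse / sparse_output argument rename between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Stand-in for the book's helper (assumption): select columns as an array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Made-up sample rows, including missing values for the imputers to fill.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2, 3],
    "Age": [22.0, 38.0, np.nan, 35.0, 28.0],
    "SibSp": [1, 1, 0, 0, 2],
    "Parch": [0, 0, 0, 1, 0],
    "Fare": [7.25, 71.28, 7.92, 26.0, 8.05],
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "Q", np.nan, "S"],
})
df_num = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
df_cat = ["Sex", "Embarked"]

num_pipeline = Pipeline([
    ("selector", DataFrameSelector(df_num)),
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(df_cat)),
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("cat_encoder", OneHotEncoder()),  # default sparse output, densified below
])
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

prepared = full_pipeline.fit_transform(df)
if hasattr(prepared, "toarray"):  # sparse result when any branch is sparse
    prepared = prepared.toarray()

# Read the learned category values from the fitted encoder.
fitted_cat = dict(full_pipeline.transformer_list)["cat_pipeline"]
encoder = fitted_cat.named_steps["cat_encoder"]
cat_columns = [
    f"{col}_{value}"
    for col, values in zip(df_cat, encoder.categories_)
    for value in values
]

# Numeric names first, then encoded names, matching the FeatureUnion order.
df_total = pd.DataFrame(prepared, columns=df_num + cat_columns, index=df.index)
print(df_total.shape)
print(list(df_total.columns))
```

The order of the names has to follow the order of the transformers in the FeatureUnion, since it concatenates their outputs side by side.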
