Trouble using OneHotEncoder on multiple categorical columns
I'm trying to build the final preprocessing pipeline for the Titanic dataset (the example is taken from the 'Hands-on ML' book).
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer
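# DataFrameSelector is the custom column-selecting transformer from the book, not a scikit-learn class
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(list(df_num))),
    # fill_value is only used with strategy='constant', so it has no effect here
    ('imputer', SimpleImputer(strategy='median', fill_value='num', missing_values=np.nan)),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(list(df_cat))),
    ('imputer', SimpleImputer(strategy='most_frequent', fill_value='categorical', missing_values=np.nan)),
    # sparse was renamed to sparse_output in scikit-learn 1.2
    ('cat_encoder', OneHotEncoder(sparse=False)),
])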
from sklearn.pipeline import FeatureUnion
# Each entry is a (name, transformer) pair; the name must be a string
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])
df_prepared = full_pipeline.fit_transform(df)
df_prepared.shape
# This is the line that raises the ValueError
df_total = pd.DataFrame(df_prepared, columns=df.columns)
df_total
Where
df_num = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
df_cat = ['Sex', 'Embarked']
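For context, DataFrameSelector is not part of scikit-learn. A minimal sketch of such a transformer is below, assuming it only selects the listed columns and returns them as a NumPy array. The toy frame and its values are made up for illustration and are not the real data.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        # Nothing to learn; the column list is fixed at construction time.
        return self

    def transform(self, X):
        # Keep only the requested columns and drop the DataFrame wrapper.
        return X[self.attribute_names].values


# Toy frame with made-up values, only to show the output shape.
toy = pd.DataFrame({
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.28],
    "Sex": ["male", "female"],
})
selected = DataFrameSelector(["Age", "Fare"]).fit_transform(toy)
print(selected.shape)  # (2, 2)
```

Because the selector returns a plain array, the column names are lost from that point on in the pipeline.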
The problem is that I get a ValueError:
ValueError: Shape of passed values is (668, 10), indices imply (668, 7)
I tried dropping 'Embarked' from df_cat, and the DataFrame was created (though its contents were inaccurate). I assume the original version failed because 'Embarked' has 3 categories (10 - 3 = 7, which is the number of columns I wanted).
If I drop 'Sex' instead, I get ValueError: Shape of passed values is (668, 8), indices imply (668, 7) (10 - 2 = 8).
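The column arithmetic above can be checked in isolation. The sketch below uses a made-up toy array, not the real data, and assumes 'Sex' has 2 distinct values and 'Embarked' has 3. It shows that OneHotEncoder emits one output column per category, so the two categorical columns become 2 + 3 = 5 columns, and together with the 5 numeric columns that gives 10.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical data with made-up rows:
# column 0 plays the role of 'Sex', column 1 the role of 'Embarked'.
X_cat = np.array([
    ["male", "S"],
    ["female", "C"],
    ["female", "Q"],
    ["male", "S"],
], dtype=object)

encoder = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array.
encoded = encoder.fit_transform(X_cat).toarray()

# One list of categories per input column.
n_per_feature = [len(c) for c in encoder.categories_]
print(n_per_feature)  # [2, 3]
print(encoded.shape)  # (4, 5)

# 5 numeric columns plus the one-hot columns.
n_num = 5
print(n_num + sum(n_per_feature))  # 10
```

So the transformed array has more columns than the original DataFrame has names, which matches the 10 versus 7 in the error message.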
How do I encode both categorical columns using Pipeline(), and which steps did I get wrong?
Topic pipelines one-hot-encoding scikit-learn
Category Data Science