Retrieving the ordinal encoding of a variable after it's placed in a pipeline/columntransformer

Question


lostwanderer

January 14, 2022, 11:49

I am applying ordinal encoding to a dataset through a column transformer - how can I retrieve the ordinal encoding of a feature (e.g. Area)?

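```python
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

df = fetch_openml(data_id=41214, as_frame=True).frame

# Assumption: the post does not show how ClaimFreq was computed;
# claims per unit of exposure is used here so the snippet runs.
df["ClaimFreq"] = df["ClaimNb"] / df["Exposure"]

df_train, df_test = train_test_split(df, test_size=0.33, random_state=0)

# Ordinal-encode the categorical columns and pass the numeric ones through.
dt_preprocessor = ColumnTransformer(
    [
        (
            "categorical",
            OrdinalEncoder(),
            ["VehBrand", "VehPower", "VehGas", "Area", "Region"],
        ),
        ("numeric", "passthrough", ["VehAge", "DrivAge", "BonusMalus", "Density"]),
    ],
    remainder="drop",
)
# Same order as the transformer output: categorical columns first, then numeric.
f_names = ["VehBrand", "VehPower", "VehGas", "Area", "Region", "VehAge", "DrivAge", "BonusMalus", "Density"]

dt = Pipeline(
    [
        ("preprocessor", dt_preprocessor),
        (
            "regressor",
            DecisionTreeRegressor(criterion="squared_error", max_depth=3, ccp_alpha=1e-5, min_samples_leaf=2000),
        ),
    ]
)
# Weight each policy by its exposure when fitting the tree.
dt.fit(
    df_train, df_train["ClaimFreq"], regressor__sample_weight=df_train["Exposure"]
)

fig, ax = plt.subplots(figsize=(75, 50))
tree.plot_tree(dt["regressor"], feature_names=f_names, ax=ax, fontsize=30)
plt.show()
```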

Topics: pipelines, encoding, python

Category: Data Science

Oxbowerce · Accepted Answer · January 14, 2022, 11:49

You can access the steps within a pipeline by name using the named_steps attribute. After getting the preprocessing step, you can use its transformers_ attribute, a list of (name, fitted transformer, columns) tuples, with standard Python indexing to get to the fitted OrdinalEncoder. The encoder's categories_ attribute then gives you one array of categories per encoded column, in the order the columns were passed to the encoder. Since the index of each value within its array is also its encoded value, this gives you the full mapping as well.

dt.named_steps["preprocessor"].transformers_[0][1].categories_

# [array(['B1', 'B10', 'B11', 'B12', 'B13', 'B14', 'B2', 'B3', 'B4', 'B5',
#         'B6'], dtype=object),
#  array([ 4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15.]),
#  array(['Diesel', 'Regular'], dtype=object),
#  array(['A', 'B', 'C', 'D', 'E', 'F'], dtype=object),
#  array(['R11', 'R21', 'R22', 'R23', 'R24', 'R25', 'R26', 'R31', 'R41',
#         'R42', 'R43', 'R52', 'R53', 'R54', 'R72', 'R73', 'R74', 'R82',
#         'R83', 'R91', 'R93', 'R94'], dtype=object)]
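
To pull out a single feature such as Area, one option is to look up the fitted encoder by name and pick the array at that column's position. The following is a minimal sketch under stated assumptions, not the code from the question: `toy` is a small made-up DataFrame standing in for the real dataset so the snippet runs without a download, and `cat_cols` is a helper name introduced here for illustration. It uses ColumnTransformer's named_transformers_ attribute instead of positional indexing into transformers_.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame(
    {
        "Area": ["C", "A", "B", "A"],
        "VehGas": ["Regular", "Diesel", "Diesel", "Regular"],
        "DrivAge": [30, 45, 52, 23],
    }
)

# Columns handed to the OrdinalEncoder, in this order.
cat_cols = ["VehGas", "Area"]

ct = ColumnTransformer(
    [
        ("categorical", OrdinalEncoder(), cat_cols),
        ("numeric", "passthrough", ["DrivAge"]),
    ],
    remainder="drop",
)
ct.fit(toy)

# named_transformers_ exposes each fitted transformer under its name.
encoder = ct.named_transformers_["categorical"]

# categories_ is ordered like the columns passed to the encoder,
# so the position of "Area" in cat_cols selects its categories.
area_categories = encoder.categories_[cat_cols.index("Area")]

# The index of each category in the array is its encoded value.
area_mapping = {category: code for code, category in enumerate(area_categories)}
print(area_mapping)
```

In the pipeline from the question, the equivalent lookup would presumably start from dt.named_steps["preprocessor"].named_transformers_["categorical"], with Area at index 3 of the categorical column list.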

If you want the mapping in dictionary format, you can nest a dictionary comprehension inside a list comprehension, which yields one dictionary per encoded column:

categories = dt.named_steps["preprocessor"].transformers_[0][1].categories_
[
    {
        value: encoding
        for value, encoding in zip(col_values, range(len(col_values)))
    }
    for col_values in categories
]

# [
#     {
#         "B1": 0,
#         "B10": 1,
#         "B11": 2,
#         "B12": 3,
#         "B13": 4,
#         "B14": 5,
#         "B2": 6,
#         "B3": 7,
#         "B4": 8,
#         "B5": 9,
#         "B6": 10
#     },
#     ...
# ]
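
A list of dictionaries still has to be indexed by position, so it can be convenient to key the mappings by column name instead. The sketch below is one possible way to do that, again on a small made-up DataFrame (`toy`) rather than the dataset from the question. It relies on each entry of transformers_ being a (name, fitted transformer, columns) tuple, and it spot-checks the result against the transformed output.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame(
    {
        "Area": ["C", "A", "B", "A"],
        "VehGas": ["Regular", "Diesel", "Diesel", "Regular"],
        "DrivAge": [30, 45, 52, 23],
    }
)

ct = ColumnTransformer(
    [
        ("categorical", OrdinalEncoder(), ["VehGas", "Area"]),
        ("numeric", "passthrough", ["DrivAge"]),
    ],
    remainder="drop",
)
ct.fit(toy)

# Each entry of transformers_ is (name, fitted transformer, columns).
name, encoder, columns = ct.transformers_[0]

# Pair every column name with its categories and build one
# {category: code} dictionary per column.
mappings = {
    column: {category: code for code, category in enumerate(categories)}
    for column, categories in zip(columns, encoder.categories_)
}
print(mappings["Area"])

# Spot check: the first row has Area == "C", and Area is the second
# output column because the categorical block comes first.
encoded = ct.transform(toy)
assert encoded[0, 1] == mappings["Area"]["C"]
```

Keying by name avoids having to remember the position of each column when reading the splits of the fitted tree.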
