AutoML for categorical feature encoding

Question

The Great

January 15, 2022, 17:34

I have an input dataset with more than 100 variables where around 80% of the variables are categorical in nature.

Some variables, such as gender and country, can be one-hot encoded, but I also have a few variables whose values have an inherent order, such as a rating: very good, good, bad, etc.

Is there any auto-ML approach which we can use to do this encoding based on the variable type?

For example, I would like to pass the two lists below as arguments to the AutoML tool:

one_hot_list = ['Gender', 'Country']  # one-hot encoding
ordinal_list = ['Feedback', 'Level_of_interest']  # ordinal encoding

Is there any auto-ML package that can do this for us?

Or is there another efficient way to do this, given that I have 80 categorical columns?

Topic automl h2o deep-learning neural-network machine-learning

Category Data Science

Anvar Kurmukov · Accepted Answer · January 15, 2022, 17:34

Sure, there are plenty of them. With scikit-learn it looks as follows:

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

ohe = OneHotEncoder()
ordin = OrdinalEncoder()

oh_col_names = [...]
ordin_col_names = [...]

# supposing X is your pandas.DataFrame
encoded_oh = ohe.fit_transform(X[oh_col_names])
encoded_ordin = ordin.fit_transform(X[ordin_col_names])
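One caveat about the snippet above: by default, OrdinalEncoder assigns integer codes in sorted (alphabetical) order of the category values, which will usually not match the order you have in mind for something like a rating. A minimal sketch of passing the order explicitly through the `categories` parameter follows. The toy DataFrame and the rating values are assumptions made for illustration; only the column name comes from the question.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: the values are assumed, not taken from the question.
X = pd.DataFrame({"Feedback": ["good", "bad", "very good", "good"]})

# One list per ordinal column, ordered from lowest to highest.
feedback_order = ["bad", "good", "very good"]
ordin = OrdinalEncoder(categories=[feedback_order])

# Codes follow the position in feedback_order: bad=0, good=1, very good=2.
encoded = ordin.fit_transform(X[["Feedback"]])
codes = encoded.ravel().tolist()
print(codes)  # [1.0, 0.0, 2.0, 1.0]
```

With several ordinal columns, `categories` takes one ordered list per column, in the same order as the columns passed to `fit_transform`.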

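You can also use the get_feature_names_out method (get_feature_names in older scikit-learn releases; it was removed in 1.2) to give the encoded features appropriate names:

# drop returns a new DataFrame, so the result has to be reassigned
X = X.drop(oh_col_names, axis=1)
# OneHotEncoder returns a sparse matrix by default, so convert it to a dense array
X[ohe.get_feature_names_out()] = encoded_oh.toarray()

X = X.drop(ordin_col_names, axis=1)
X[ordin.get_feature_names_out()] = encoded_ordin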
