How to obtain original feature names after using one-hot encoding

This question is on an implementation aspect of scikit-learn's DecisionTreeClassifier().

How do I get the feature names ranked in descending order, from the feature_importances_ returned by the scikit-learn DecisionTreeClassifier()?

The problem is that the input features to the classifier are not the original ones: they are the numerically encoded columns produced by pandas get_dummies.

For example, take the mushroom dataset from the UCI repository. Features in the dataset include cap_shape, cap_surface, cap_color, odor, etc.

pandas get_dummies encodes each of these into multiple columns, one per value of the original feature. Say cap_shape has the values b, c, f, k, ...; after encoding, the new columns are cap_shape_b, cap_shape_c, cap_shape_f, and so on. The other features are transformed in the same way.

After training, the classifier tells me that the top features are: cap_shape_b, cap_shape_c, cap_shape_f, odor_a, odor_c, odor_f, odor_l. From this result returned by the classifier, I would like my function to return the original features, that is, cap_shape and odor.



As shown in the scikit-learn decision tree docs, in the "Classification" section, you can export your tree using graphviz (the docs state that you have to install the graphviz package, too). This lets you visualize the tree built by the algorithm. As for the input features being transformed from the original ones, the algorithm can't help you with that, but since you made the transformations yourself, you should be able to map the encoded columns back yourself.
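As a minimal sketch of the export step, assuming pandas and scikit-learn are installed: the tiny data frame and labels below are made up for illustration and only borrow the mushroom column names. Passing the dummy column names as feature_names makes each split in the exported tree readable.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Made-up sample with mushroom-style columns (illustrative values only).
raw = pd.DataFrame({
    "cap_shape": ["b", "c", "f", "b", "f", "c"],
    "odor": ["a", "f", "l", "a", "f", "l"],
})
y = ["e", "p", "e", "e", "p", "e"]  # hypothetical edible/poisonous labels

# One-hot encode; columns become cap_shape_b, cap_shape_c, ..., odor_l.
X = pd.get_dummies(raw)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_source = export_graphviz(
    clf,
    out_file=None,
    feature_names=list(X.columns),
    class_names=list(clf.classes_),
    filled=True,
)
print(dot_source)
```

The DOT text can then be rendered with the graphviz tools. Each node is labelled with an encoded column name such as odor_f, so mapping back to odor is still up to you.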

If anything is still unclear, leave a comment.


If you just need the names of the original features, you can use a regex to parse them out. You can easily decide on a naming convention for the transformed features (using the prefix parameter in get_dummies). After getting the scores, traverse the list of features sorted by score (ascending or descending, as needed), parse each column name with the regex, and store the results in an ordered dict.
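The regex approach could look roughly like the sketch below. It assumes the default get_dummies naming, "<original>_<value>", with no underscore inside the value itself. The helper name original_features_ranked and the importance numbers are hypothetical, standing in for X.columns and clf.feature_importances_. Summing the scores per original feature is an extra step beyond what the answer describes.

```python
import re
from collections import OrderedDict

# Hypothetical inputs: in practice these would come from X.columns and
# clf.feature_importances_. The numbers here are made up.
feature_names = ["cap_shape_b", "cap_shape_c", "cap_shape_f",
                 "odor_a", "odor_c", "odor_f", "odor_l"]
importances = [0.05, 0.10, 0.05, 0.20, 0.10, 0.40, 0.10]

# Assumed convention: everything before the last underscore is the
# original feature name, everything after it is the category value.
DUMMY_PATTERN = re.compile(r"^(?P<original>.+)_(?P<value>[^_]+)$")


def original_features_ranked(names, scores):
    """Map dummy columns back to original features, best-scoring first.

    Keys are ordered by the highest-scoring dummy column of each original
    feature; values are the summed importances of its dummy columns.
    """
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    result = OrderedDict()
    for name, score in ranked:
        match = DUMMY_PATTERN.match(name)
        # Columns that do not follow the convention are kept unchanged.
        original = match.group("original") if match else name
        result[original] = result.get(original, 0.0) + score
    return result


ranking = original_features_ranked(feature_names, importances)
print(list(ranking))
```

If category values may contain underscores, a more distinctive separator (for example prefix_sep="__" in get_dummies) with a matching regex is more robust.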

If you need the whole dataset transformed back, then go with the inverse_transform method mentioned in other answers.


Consider using the one-hot encoder from the category_encoders module for your encoding. It has an inverse_transform method, which I believe will transform your one-hot encoded data back to its original form.
