Handling hierarchical categorical independent variables

I have data with categorical attributes that have a very large number of levels.

For example, main_column, sub_column1, and sub_column2 are 3 hierarchical attributes. If I create dummy variables for these columns, the column count increases to 1000.

How should I handle hierarchical attributes like these in a classification problem?

Thanks !!

Topic dummy-variables hierarchical-data-format classification pandas

Category Data Science


I'd suggest the following:

  • 3 features, one for each level: main_column, sub_column1, sub_column2
  • 2 additional features representing the hierarchical relation:
    • main_column/sub_column1
    • main_column/sub_column1/sub_column2
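As a minimal sketch of the five features listed above, assuming the data sits in a pandas DataFrame: the toy rows and the new column names main_sub1 and main_sub1_sub2 are made up for illustration, and only main_column, sub_column1 and sub_column2 come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the three column names come from the question.
df = pd.DataFrame({
    "main_column": ["A", "A", "B", "B"],
    "sub_column1": ["x", "y", "x", "x"],
    "sub_column2": ["p", "p", "p", "q"],
})

# Path features: join each level with its ancestors, so that "x" under "A"
# and "x" under "B" become two distinct categories.
df["main_sub1"] = df["main_column"] + "/" + df["sub_column1"]
df["main_sub1_sub2"] = df["main_sub1"] + "/" + df["sub_column2"]

# Store all five features with the pandas "category" dtype.
feature_cols = [
    "main_column", "sub_column1", "sub_column2",
    "main_sub1", "main_sub1_sub2",
]
for col in feature_cols:
    df[col] = df[col].astype("category")
```

Whether these columns are then one-hot encoded or passed to a model that accepts categorical input directly depends on the algorithm.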

This way, training can pick whichever level of detail is most informative: main_column, main_column/sub_column1, or main_column/sub_column1/sub_column2. Depending on the data and the algorithm used, it might also make sense to merge rare subcategories into a single "misc" category instead of keeping each one separately.
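One possible way to fold rare subcategories into "misc" before creating dummy variables is sketched below. The helper lump_rare, the min_count threshold, and the sample values are hypothetical; a suitable threshold would have to be chosen from the actual data.

```python
import pandas as pd

def lump_rare(series: pd.Series, min_count: int, other: str = "misc") -> pd.Series:
    """Replace values that occur fewer than min_count times with `other`."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    # Keep frequent values as they are; everything else becomes `other`.
    return series.where(series.isin(keep), other)

# Hypothetical main_column/sub_column1 paths.
paths = pd.Series(["A/x", "A/x", "A/x", "A/y", "B/x", "B/x", "B/z"])

lumped = lump_rare(paths, min_count=2)

# One dummy column per remaining category, plus one for "misc".
dummies = pd.get_dummies(lumped)
```

In this toy case the four distinct paths shrink to three dummy columns; on real data the reduction depends on how many categories fall below the threshold.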
