Aggregating multiple encoded categorical values

I am trying to find commonly used techniques for dealing with high-cardinality, multi-valued categorical variables.

I am currently using a dataset with a feature CATEGORY which has a cardinality of ~20,000. One-hot encoding does not make sense, as it would increase the feature space by too much.

Each observation in my dataset can take multiple values for the CATEGORY feature. For instance, row 1 could have the value a, while row 2 could have the values a, b, c, d.

I have managed to encode each individual value in the feature but am unsure how to aggregate these values for each row.

How should these encoded values be combined?

Topic feature-engineering encoding categorical-data machine-learning

Category Data Science


If individual categories are important in your analysis, you could split the category column into its individual values and then reshape your data set into long format, so that each record has one row per category value.

Visually

recordid | category | value
-------- | -------- | -----
   1     |   a      |  5
   2     |   a,b,c  |  10

would become

recordid | category |  value
-------- | -------- | ------
   1     |   a      |   5
   2     |   a      |   10
   2     |   b      |   10
   2     |   c      |   10

You could then make further aggregations or transformations as necessary.
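
The reshaping shown above can be sketched in pandas roughly as follows. This is only a minimal illustration: it assumes the categories are stored as comma-separated strings, and the column names (recordid, category, value) are taken from the example tables rather than from your actual data. The per-category mean at the end is just one possible follow-up aggregation.

```python
import pandas as pd

# Toy data mirroring the example table (assumed layout, not the real dataset)
df = pd.DataFrame({
    "recordid": [1, 2],
    "category": ["a", "a,b,c"],
    "value": [5, 10],
})

# Split the comma-separated string into a list, then emit one row per list element
long_df = (
    df.assign(category=df["category"].str.split(","))
      .explode("category")
      .reset_index(drop=True)
)

# One possible further aggregation: mean of value per individual category
mean_by_category = long_df.groupby("category")["value"].mean()

print(long_df)
print(mean_by_category)
```

Per-category statistics computed this way could then be merged back onto long_df and aggregated per recordid (for example with mean or max) to get back to one row per record.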

Otherwise, you could treat each distinct combination of values as a new category in its own right, so that "a", "abc" and "abcd" are three separate categories, i.e. an item with categories "abc" is different from an item with just the category "a".
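
A rough sketch of this combination-as-category idea, again in pandas. The helper combo_label and the column names category_combo and category_combo_code are hypothetical names introduced for illustration. It also assumes that the order of values within a row does not matter, so "a,b,c" and "c,a,b" are mapped to the same label; drop the sorting if order is meaningful in your data.

```python
import pandas as pd

# Toy data; the comma-separated layout is an assumption
df = pd.DataFrame({
    "recordid": [1, 2, 3],
    "category": ["a", "a,b,c", "c,a,b"],
})

def combo_label(raw: str) -> str:
    """Build an order-independent label for a set of category values."""
    parts = sorted({p.strip() for p in raw.split(",") if p.strip()})
    return "|".join(parts)

# Each distinct combination becomes a single categorical value
df["category_combo"] = df["category"].map(combo_label)

# Integer code per combination, usable as input to a further encoder
df["category_combo_code"] = df["category_combo"].astype("category").cat.codes

print(df)
```

Keep in mind that with ~20,000 base values the number of distinct combinations may itself be very large, so it is worth checking df["category_combo"].nunique() before relying on this approach.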
