Aggregating multiple encoded categorical values

Question

Vishwa Kalyanaraman

May 20, 2022 05:05

I am trying to find commonly used techniques for dealing with high-cardinality, multi-valued categorical variables.

I am currently using a dataset with a feature CATEGORY which has a cardinality of ~20,000. One-hot encoding does not make sense, as it would increase the feature space by too much.

Each observation in my dataset can take multiple values for the CATEGORY feature. For instance, row 1 could have the value a, but row 2 could have the values a, b, c, d.

I have managed to encode each individual value in the feature but am unsure how to aggregate these values for each row.

How should these encoded values be combined?

Topic feature-engineering encoding categorical-data machine-learning

Category Data Science

ggordon · Accepted Answer · March 27, 2020 09:52

If individual categories are important in your analysis, you could split the category column into its separate category values, then pivot your dataset so that each record has one row per category it belongs to.

Visually

recordid | category | value
-------- | -------- | -----
   1     |   a      |  5
   2     |   a,b,c  |  10

would become

recordid | category |  value
-------- | -------- | ------
   1     |   a      |   5
   2     |   a      |   10
   2     |   b      |   10
   2     |   c      |   10

You could then make further aggregations or transformations as necessary.
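
The reshaping shown in the tables above can be sketched with pandas. This is a minimal sketch, assuming the categories are stored as a comma-separated string; the column names come from the example, and the final group-by is just one possible follow-up aggregation.

```python
import pandas as pd

# Toy data mirroring the example tables; the comma-separated storage is an assumption.
df = pd.DataFrame({
    "recordid": [1, 2],
    "category": ["a", "a,b,c"],
    "value": [5, 10],
})

# Split each string into a list, then emit one row per list element.
long_df = (
    df.assign(category=df["category"].str.split(","))
      .explode("category")
      .reset_index(drop=True)
)

# One possible follow-up aggregation: total value per individual category.
per_category = long_df.groupby("category")["value"].sum()
```

Note that `value` is repeated on every row of a record, so sums over the long table count a record once per category it belongs to.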

Otherwise, you could treat each distinct combination, such as "a", "abc" and "abcd", as a new category of its own, i.e. an item with categories "abc" is different from an item with just the category "a".
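
A rough sketch of the combination-as-category idea, using only the Python standard library. The helper name `combination_key` is hypothetical, and the sketch assumes that order does not matter (so "c,b,a" and "a,b,c" are the same combination) and that values are comma-separated.

```python
def combination_key(raw, sep=","):
    """Canonical label for a set of categories: de-duplicated, sorted, re-joined."""
    parts = {p.strip() for p in raw.split(sep) if p.strip()}
    return sep.join(sorted(parts))

# Hypothetical rows; "c,b,a" is included to show order-insensitivity.
rows = ["a", "a,b,c", "c,b,a", "a,b,c,d"]

keys = [combination_key(r) for r in rows]

# Assign an integer id to each distinct combination, in order of first appearance.
ids = {}
encoded = [ids.setdefault(k, len(ids)) for k in keys]
```

With ~20,000 base categories the number of distinct combinations may itself be large, so it is worth counting them before committing to this approach.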
