Memory efficient encoding logic for group categories

Question


redguy

February 15, 2020, 08:07

I have a huge dataset of categorical data. It consists of alerts, each with multiple properties. Each alert belongs to a group, and some belong to multiple groups. It looks somewhat like this:

     GroupID           System        State       TimeStamp        etc...
0    [1, 2, 3, 4]         A           REC           ...
1    [1, 2, 3, 4]         A           SNT           ...
2    [2, 4]               B           REC 
3    [2, 4]               B           PND
4    [2, 4]               B           COM
5    [2, 4]               B           SNT
6    [2]                  C           RCV
7    [2]                  C           ACC
...

There are more than 100,000 different group IDs across over 3 million alerts.

Creating a column with a single Group ID value (not a list) means some alerts would appear more than once, which is not good given the already huge dataset.
Creating a separate column for each group (binary encoding) would make my data far too wide.

What is a memory-efficient way of encoding the groups?

Topic categorical-encoding data-science-model encoding efficiency machine-learning

Category Data Science

Carlos Mougan · Accepted Answer · February 15, 2020, 08:07

There are several techniques that could work for you:

Target Encoding: works well when a categorical feature has high cardinality.
Ordinal/Label Encoding: traditional label encoding, which maps each category to an integer.
Weight of Evidence: measures the predictive power of an independent variable in relation to the dependent variable.
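
As a rough illustration of the first two options, here is a minimal sketch in plain Python. It assumes each distinct combination of groups (for example `[2, 4]`) is treated as one category, which is only one possible reading of the data. The variable names and the binary `targets` values are made up for the example and are not from the question.

```python
from array import array
from collections import defaultdict

# Toy rows mirroring the layout in the question; the binary targets are hypothetical.
group_lists = [[1, 2, 3, 4], [1, 2, 3, 4], [2, 4], [2, 4], [2, 4], [2, 4], [2], [2]]
targets = [1, 0, 1, 1, 0, 1, 0, 0]

# Ordinal/label encoding: one integer per distinct group combination.
# Lists are unhashable, so each one is converted to a tuple before lookup.
code_of = {}
group_codes = array(
    "i", (code_of.setdefault(tuple(g), len(code_of)) for g in group_lists)
)

# Target (mean) encoding: replace each combination with the mean target of its rows.
# On real data, compute the means on training folds only to avoid target leakage.
sums = defaultdict(float)
counts = defaultdict(int)
for code, y in zip(group_codes, targets):
    sums[code] += y
    counts[code] += 1
target_enc = array("d", (sums[c] / counts[c] for c in group_codes))

print(list(group_codes))
print(list(target_enc))
```

Either way, each alert keeps a single row and a single numeric column, so the data grows neither vertically nor horizontally. The mapping in `code_of` has to be stored so that the same codes can be applied to new alerts.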
