Memory-efficient encoding logic for group categories

I have a huge dataset of categorical data. It consists of alerts, each with multiple properties. Each alert belongs to a group, and some belong to multiple groups. It looks somewhat like this:

     GroupID           System        State       TimeStamp        etc...
0    [1, 2, 3, 4]         A           REC           ...
1    [1, 2, 3, 4]         A           SNT           ...
2    [2, 4]               B           REC 
3    [2, 4]               B           PND
4    [2, 4]               B           COM
5    [2, 4]               B           SNT
6    [2]                  C           RCV
7    [2]                  C           ACC
...

There are more than 100,000 different group IDs across over 3 million alerts. The two encodings I have considered both have drawbacks:

  1. Creating a column with a single Group ID value (not a list) means some alerts will appear more than once, which is not good given the already huge dataset.
  2. Creating a separate column for each group (binary encoding) would expand my data too much horizontally.

What is a memory-efficient way to encode the groups?

Topic categorical-encoding data-science-model encoding efficiency machine-learning

Category Data Science


There are several techniques that could work for you:
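One possible option is a sparse, CSR-style layout: store all group IDs in one flat array and keep a second array of offsets marking where each alert's groups start and end. Memory then grows with the total number of alert-to-group memberships rather than with alerts × groups, and no alert row is duplicated. The following is only a minimal sketch using the standard library; the function names (`encode_groups`, `groups_of`, `alerts_in`) are hypothetical, and the sample rows are copied from the table in the question.

```python
from array import array


def encode_groups(group_lists):
    """Pack per-alert group lists into two flat typed arrays (CSR-style layout)."""
    indptr = array("q", [0])  # indptr[i]:indptr[i + 1] delimits alert i's groups
    indices = array("i")      # all group IDs, concatenated in alert order
    for groups in group_lists:
        indices.extend(sorted(set(groups)))  # de-duplicate and sort within an alert
        indptr.append(len(indices))
    return indptr, indices


def groups_of(indptr, indices, alert):
    """Return the group IDs of one alert as a plain list."""
    return list(indices[indptr[alert]:indptr[alert + 1]])


def alerts_in(indptr, indices, group_id):
    """Return the row numbers of all alerts in group_id (simple linear scan)."""
    return [
        row
        for row in range(len(indptr) - 1)
        if group_id in indices[indptr[row]:indptr[row + 1]]
    ]


# Sample rows taken from the table in the question (assumed representative).
sample = [[1, 2, 3, 4], [1, 2, 3, 4], [2, 4], [2, 4], [2, 4], [2, 4], [2], [2]]
indptr, indices = encode_groups(sample)

print(groups_of(indptr, indices, 2))
print(alerts_in(indptr, indices, 1))
```

If SciPy is available, `scipy.sparse.csr_matrix` uses the same offsets-plus-indices layout (plus a data array), so these two arrays could be handed to it to obtain a sparse multi-hot matrix for modelling. Whether that fits depends on the downstream model accepting sparse input.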
