How to deal with a potentially multi-valued categorical variable

I'm building a model that has some categorical variables as inputs. I have dealt with this sort of data before and applied different techniques, such as creating dummy variables and factor scoring. However, I now have a different type of problem for which I cannot see an obvious best answer.

For each individual we can have multiple instances of this categorical variable $X$. When this happens with numerical variables, I usually take the max/mean/min, depending on context. Of course, one can use that context to build something similar here. However, I'm curious about a general approach.

Assume that for each object (row in our input matrix) we can have multiple entries of a categorical variable. Furthermore, assume that this variable can take many different values, and that the combinations of values per row can be relevant in context.

What would be a general approach to this variable?

Topic dummy-variables feature-engineering aggregation categorical-data

Category Data Science


One option is to one-hot encode the categorical feature, then increment the value of each resulting column so that it acts as a counter for the number of occurrences.

For example, if you are modeling a shopping cart, one feature would be "apples", and its value would be the number of apples in the cart.
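As a rough illustration of the counting idea, here is a minimal sketch in plain Python. The data layout (one `(cart_id, item)` pair per occurrence), the sample rows, and the helper name `count_encode` are assumptions made up for this example; they are not from the question.

```python
from collections import Counter

# Hypothetical long-format data: one (cart_id, item) pair per occurrence.
rows = [
    ("cart_1", "apples"),
    ("cart_1", "apples"),
    ("cart_1", "bread"),
    ("cart_2", "milk"),
    ("cart_2", "apples"),
]

def count_encode(pairs):
    """Turn (id, category) pairs into one row of per-category counts per id."""
    pairs = list(pairs)  # allow generators, since we iterate twice
    counts = {}
    for key, category in pairs:
        # One Counter per id; missing categories read back as 0.
        counts.setdefault(key, Counter())[category] += 1
    ids = sorted(counts)
    categories = sorted({category for _, category in pairs})
    matrix = [[counts[i][c] for c in categories] for i in ids]
    return ids, categories, matrix

ids, categories, matrix = count_encode(rows)
for row_id, row in zip(ids, matrix):
    print(row_id, dict(zip(categories, row)))
```

Each row of `matrix` can then be fed to the model like any other numeric feature vector. If only presence matters rather than quantity, clipping each count to 1 gives a multi-hot (binary) encoding instead. With pandas, `pd.crosstab` over the id and category columns should produce the same kind of table.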
