Dealing with observations with an arbitrary number of categories, each with an arbitrary number of values

Question

King Powa

May 26, 2022 17:21

Suppose we have a set of elements $X = \{x_1, x_2, ..., x_n\}$. Each element is characterised by a set of features. Each feature of a particular element $x_i$ belongs to one of $q$ different categories. Each category $f_q$ takes a value $v_{q_i}$ from its own set of possible values $V_q = \{v_{q_1}, v_{q_2}, ...\}$. So an element $x_i$ may be described as $x_i = \{f_{q_1} = v_{{q_1}_i}, f_{q_1} = v_{{q_1}_j}, ..., f_{q_i} = v_{{q_i}_i}\}$. To make this clearer, I will state some properties of the element $x_i$, along with an example.

For an element $x_i$:

A particular feature category $f_{q_i}$ may appear more than once, each time with a different value. For example, $x_i = \{..., f_{q_i} = 1, f_{q_i} = 4, ...\}$ (1 and 4 are example values).
Both $f_{q_i}$ (the id used to describe the category) and the values in $V_{q_i}$ are categorical variables.
An element $x_i$ may have an arbitrary number of feature categories describing it.
The number of unique pairs $(f_{q_i}, v_{{q_i}_k})$ appearing in the dataset is around $903$.

The elements of $X$ appear in a set of observations $O = \{o_1, o_2, ..., o_m\}$. A generic observation $o_j$ groups multiple $x_i \in X$ as a sequence, followed by a final element $x_j$. The goal is to infer, for some given observations $O_{-j}$, the last element $x_j$ given the sequence.

How would you convert the categorical features in a meaningful way? I want to stress that my question concerns only how to convert these features in this particular setting, as the approaches I have tried so far cannot handle an element having multiple arbitrary feature categories. I stated the full problem only to make clear that target encoding is not a viable option, or is possible only in terms of how many times a particular pair $(f_{q_i}, v_{{q_i}_k})$ appears in an object in the set of final elements $X_J = \{x_{j_1}, x_{j_2}, ..., x_{j_m}\}$. This was, in fact, my initial approach. It is also possible that I have not fully understood target encoding at this point.

Topic target-encoding categorical-encoding machine-learning

Category Data Science

Nikos M. · Accepted Answer · May 26, 2022 17:21

One-hot encoding is an option. In this case more than one bit can be "hot" for the same element, but the encoding can still be used.
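As a rough illustration of the idea above, here is a minimal sketch in plain Python. It assumes each element is stored as a set of (category, value) pairs and gives every distinct pair its own column. The names `build_vocabulary` and `multi_hot` and the toy data are hypothetical, not taken from the question.

```python
from typing import Dict, FrozenSet, List, Tuple

# A feature is a (category id, value) pair; both parts are treated as categorical.
Pair = Tuple[str, int]


def build_vocabulary(elements: List[FrozenSet[Pair]]) -> Dict[Pair, int]:
    """Assign one column index to every distinct (category, value) pair."""
    pairs = sorted({pair for element in elements for pair in element})
    return {pair: index for index, pair in enumerate(pairs)}


def multi_hot(element: FrozenSet[Pair], vocabulary: Dict[Pair, int]) -> List[int]:
    """Return a 0/1 vector with a 1 for every pair present in the element."""
    vector = [0] * len(vocabulary)
    for pair in element:
        index = vocabulary.get(pair)
        if index is not None:  # pairs not seen when building the vocabulary are skipped
            vector[index] = 1
    return vector


# Hypothetical toy data: the first element has category "f1" twice, with values 1 and 4.
elements = [
    frozenset({("f1", 1), ("f1", 4), ("f2", 7)}),
    frozenset({("f2", 7)}),
    frozenset({("f1", 4), ("f3", 2)}),
]

vocabulary = build_vocabulary(elements)
encoded = [multi_hot(element, vocabulary) for element in elements]

for row in encoded:
    print(row)
```

With around 903 distinct pairs, as in the question, each element would become a vector of about 903 entries, most of them zero.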

Many numerical encodings that allow combinations of values are possible. The only real restriction is high dimensionality, which I think you cannot avoid.

E.g. a variation of numerical encoding is possible: if an element has more than one value for the same category, the sum of the individual numeric codes can be used, provided the sum does not coincide with the numeric code of another value (such a coincidence may or may not be desirable, e.g. if the categories are correlated).

And so on.
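One way to make the sums collision-free is to give each value within a category a distinct power of two, so every combination of values adds up to a different number. The power-of-two choice is an assumption made for this sketch and is not something prescribed above. The helper names `build_codes` and `sum_encode` and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Pair = Tuple[str, int]  # (category id, value)


def build_codes(elements: List[FrozenSet[Pair]]) -> Dict[str, Dict[int, int]]:
    """Per category, map each distinct value to its own power-of-two code."""
    values_by_category = defaultdict(set)
    for element in elements:
        for category, value in element:
            values_by_category[category].add(value)
    return {
        category: {value: 1 << position for position, value in enumerate(sorted(values))}
        for category, values in values_by_category.items()
    }


def sum_encode(element: FrozenSet[Pair], codes: Dict[str, Dict[int, int]]) -> List[int]:
    """One number per category: the sum of the codes of the values present (0 if absent)."""
    categories = sorted(codes)
    totals = dict.fromkeys(categories, 0)
    for category, value in element:
        totals[category] += codes[category][value]
    return [totals[category] for category in categories]


# Hypothetical toy data: the first element has category "f1" twice, with values 1 and 4.
elements = [
    frozenset({("f1", 1), ("f1", 4), ("f2", 7)}),
    frozenset({("f2", 7)}),
    frozenset({("f1", 4), ("f3", 2)}),
]

codes = build_codes(elements)
rows = [sum_encode(element, codes) for element in elements]

for row in rows:
    print(row)
```

This yields one column per category rather than one per pair, at the cost of codes that grow exponentially with the number of values in a category. The resulting numbers are still labels rather than magnitudes, so whether a given model can use them meaningfully is a separate question.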
