How do you aggregate features of lists (pooling alternatives)?
Is it possible to reduce a variable number of non-correlated, multi-dimensional feature vectors to a single 1D vector?
A working option is pooling (mean/min/max) over the embedding matrix (n samples of m-dimensional embeddings): e.g. mean pooling converts many embeddings (n × m) into a single vector of means (1 × m).
However, all of these lose a lot of information (especially the relationships between features within a single embedding).
The aggregation doesn't have to be a reduction, i.e. the resulting 1D vector may be larger than m (see the sketch below).
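For illustration, a minimal pooling sketch in NumPy (ABC's embedding is taken from the example further down; all other values are made up):

```python
import numpy as np

# One row of n=5 texts with m=3-dimensional embeddings.
embeddings = np.array([
    [0.3, 0.5, 0.1],   # ABC
    [0.2, 0.1, 0.9],   # DEF
    [0.7, 0.4, 0.3],   # GH
    [0.9, 0.9, 0.8],   # SPAM
    [0.3, 0.4, 0.2],   # ACB
])                     # shape (5, 3)

mean_pool = embeddings.mean(axis=0)  # shape (3,)
min_pool = embeddings.min(axis=0)    # shape (3,)
max_pool = embeddings.max(axis=0)    # shape (3,)

# Concatenating several pooled vectors gives a fixed-size vector
# that is larger than m (here 3*m = 9).
features = np.concatenate([mean_pool, min_pool, max_pool])
```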
If the setting is unclear, here is an example:
Assume a text classification problem: we have a list of lists of texts, and each inner list carries a label 1 or 0 (contains-spam vs. no-spam).
E.g. a simplified version of our data:

- A: [ABC, DEF, GH, SPAM, ACB] ; 1
- B: [HSU, OFM, FL] ; 0
- C: [JK, SPAM, SUPERSPAM, GA, IJK] ; 1
- ...
The inner lists can be of different lengths, and we don't know where within a list the spam is located.
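In code, the data could be represented like this (a minimal, hypothetical representation using the placeholder texts from above):

```python
# Each inner list is one sample; labels is parallel to data.
data = [
    ["ABC", "DEF", "GH", "SPAM", "ACB"],       # A
    ["HSU", "OFM", "FL"],                      # B
    ["JK", "SPAM", "SUPERSPAM", "GA", "IJK"],  # C
]
labels = [1, 0, 1]
```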
If we were using TF-IDF, we could simply concatenate the texts within each list and get decent results, since the vectorizer output has a fixed size regardless of list length. However, we want to use embeddings (or any other per-text feature vector), which we can't concatenate: the inner lists differ in length, so the concatenated vectors would differ in size.
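For contrast, the TF-IDF route would be something like this (a sketch assuming scikit-learn):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

data = [["ABC", "DEF", "GH", "SPAM", "ACB"],
        ["HSU", "OFM", "FL"]]
docs = [" ".join(row) for row in data]     # one document per inner list
X = TfidfVectorizer().fit_transform(docs)  # same width for every row
```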
Assume we have 3-dimensional embeddings (E) for our texts, e.g. ABC → (0.3; 0.5; 0.1). We can trivially take the mean over the texts in row A: its 5 × 3 data (5 texts, 3-dimensional embeddings) becomes 1 × 3. Now we can reduce all rows to the same vector length (1 × 3). However, a lot of information gets lost, for example how high E[0] is in exactly those texts where E[2] is low (e.g. E[2] = 0.1 in ABC).
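To see the loss concretely, two rows with opposite E[0]/E[2] relationships can pool to the identical mean vector (values made up):

```python
import numpy as np

row_1 = np.array([[0.5, 0.2, 0.1],
                  [0.3, 0.2, 0.5]])  # high E[0] co-occurs with low E[2]
row_2 = np.array([[0.5, 0.2, 0.5],
                  [0.3, 0.2, 0.1]])  # high E[0] co-occurs with high E[2]

# Mean pooling maps both rows to the same vector, so the
# within-embedding relationship between E[0] and E[2] is lost.
print(row_1.mean(axis=0))  # [0.4 0.2 0.3]
print(row_2.mean(axis=0))  # [0.4 0.2 0.3]
```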
What alternatives to simple pooling exist?
I have read about using discrete cosine transforms (DCT) for this. But the DCT requires a meaningful ordering of the texts within each list (e.g. spam always at the front), which we don't have.
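For reference, DCT pooling would look roughly like this (a sketch assuming SciPy; dct_pool and k are hypothetical names of mine). The low-order coefficients give a fixed-size summary, but they only carry information if the order of the texts does:

```python
import numpy as np
from scipy.fft import dct

def dct_pool(embeddings: np.ndarray, k: int = 2) -> np.ndarray:
    """DCT along the text axis, keeping the first k coefficients per
    embedding dimension -> fixed-size vector of length k*m.
    Assumes each row contains at least k texts."""
    coeffs = dct(embeddings, axis=0, norm="ortho")  # shape (n, m)
    return coeffs[:k].ravel()                       # shape (k*m,)
```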