Self Organising Map with variable-length ordered sets of N-grams
I want to preface my question by saying that the situation I describe may not be applicable to Kohonen self-organising maps (SOMs) due to a lack of understanding on my part, so I do apologise if that is the case. If so, I would greatly appreciate any suggestions on alternative methods for comparing the similarity of my given input data.
I am trying to create a self-organising map to compare the similarity between the N-gram ordered sets of tokens that can be extracted from a given input sequence. For example, if the tokens are:
[A, B, C]
And my input is:
A, B, B, C, A
Then the resulting 1-gram ordered set would be:
A, B, C
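To make the terminology concrete, here is roughly what I mean as a small Python sketch (the variable names are just for illustration):

```python
# 1-gram ordered set: the unique tokens of a sequence, sorted alphabetically
seq = ["A", "B", "B", "C", "A"]
ordered_set = sorted(set(seq))
print(ordered_set)  # ['A', 'B', 'C']
```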
I will try to highlight my question using a simple example followed by some more realistic cases I am dealing with.
If I have the collection of tokens:
[A, B, C, D, E, F]
The input data contains K variable-length sequences made from these tokens:
A, B, A, B, C, D
A, A, A, B, B, E, F
A, B, C
E, F, E, E, F, D, C, A, A, A, A
I then generate the unique ordered set of tokens found in each of these sequences, sorted alphabetically:
A, B, C, D
A, B, E, F
A, B, C
A, C, D, E, F
The above are the 1-gram ordered sets for the sequences. However, I also need to be able to represent the N-gram ordered sets.
E.g. for 2-grams the ordered set of tokens for each sequence would be:
AB, BA, BC, CD
AA, AB, BB, BE, EF
AB, BC
AA, CA, DC, EE, EF, FD, FE
And for 3-grams:
ABA, ABC, BAB, BCD
AAA, AAB, ABB, BBE, BEF
ABC
AAA, CAA, DCA, EEF, EFD, EFE, FDC, FEE
The 4-grams would only produce 3 non-empty ordered sets rather than 4, because the sequence A, B, C only has 3 tokens.
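For reference, this is roughly how I am extracting these ordered sets (a Python sketch; `ngram_ordered_set` is just a name I am using here, not an existing library function):

```python
def ngram_ordered_set(seq, n):
    """Return the unique n-grams of a token sequence, sorted alphabetically."""
    grams = {"".join(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    return sorted(grams)

sequences = [
    ["A", "B", "A", "B", "C", "D"],
    ["A", "A", "A", "B", "B", "E", "F"],
    ["A", "B", "C"],
    ["E", "F", "E", "E", "F", "D", "C", "A", "A", "A", "A"],
]

for n in (1, 2, 3, 4):
    print(n, [ngram_ordered_set(s, n) for s in sequences])
# n = 4 gives an empty set for ["A", "B", "C"], as noted above
```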
My initial thought was to have the set of attributes be every possible N-gram that can be formed from the token collection.
For 1-grams this would be:
A B C D E F
And for 2-grams this would be:
AA AB AC AD AE AF BA BB BC BD BE BF CA CB CC CD CE CF DA DB DC DD DE DF EA EB EC ED EE EF FA FB FC FD FE FF
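In code, each sequence would then become a fixed-length binary vector over every possible N-gram, something like the following sketch, which already shows how sparse the representation is:

```python
from itertools import product

tokens = ["A", "B", "C", "D", "E", "F"]

def attribute_vector(seq, n, tokens):
    """Binary presence vector over all len(tokens)**n possible n-grams."""
    attributes = ["".join(p) for p in product(tokens, repeat=n)]  # T**n columns
    present = {"".join(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    return [1 if a in present else 0 for a in attributes]

vec = attribute_vector(["A", "B", "A", "B", "C", "D"], 2, tokens)
print(sum(vec), "of", len(vec), "attributes are non-zero")  # 4 of 36
```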
Unfortunately, this very quickly gets out of hand, even when just going from 1-grams to 2-grams, since the number of attributes grows as T^N (where T is the number of tokens). This seems completely impractical, especially as the resulting matrix consists mostly of 0s for attributes that do not occur in a given sequence.
The problem is only made worse because my collection of tokens is closer to 150 than the 6 used in this example: that is 22,500 possible 2-grams and 3,375,000 possible 3-grams.
Each ordered set is also highly variable in length, ranging from 1 up to the total number of possible N-grams for the given token set.
The other issue is that even if I were to restrict the attributes to only the N-grams that actually occur in the training sequences, there is no guarantee that test data would conform to this set of attributes.
I am guessing that I will need to somehow convert these ordered sets into a more manageable set of attributes, but I am unsure what a viable solution would be, as each token set seems to carry equal importance.
I am also assuming that this will be difficult to generalise, and that each N would require a separately trained model (one for 1-grams, another for 2-grams, and so on, up to the maximum N chosen)?
Topic: features, ngrams, feature-selection
Category: Data Science