Self Organising Map with variable-length ordered sets of N-grams
I want to preface my question by saying that the situation I describe may not be applicable to Kohonen self-organising maps (SOMs) due to a lack of understanding on my part, so I do apologise if that is the case. If so, I would greatly appreciate any suggestions on alternative methods for comparing the similarity of my given input data.
I am trying to create a self-organising map to compare the similarity between the N-gram ordered sets of tokens that can be extracted from a given input sequence. For example, if the tokens are:
[A, B, C]
And my input is:
A, B, B, C, A
Then the resulting 1-gram ordered set would be:
A, B, C
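To make the terminology concrete, here is roughly what I mean as a small Python sketch (the variable names are just for illustration):

```python
# 1-gram ordered set: the unique tokens of a sequence, sorted alphabetically
seq = ["A", "B", "B", "C", "A"]
ordered_set = sorted(set(seq))
print(ordered_set)  # ['A', 'B', 'C']
```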
I will try to highlight my question using a simple example followed by some more realistic cases I am dealing with.
If I have the collection of tokens:
[A, B, C, D, E, F]
The input data contains K variable-length sequences made from these tokens:
A, B, A, B, C, D
A, A, A, B, B, E, F
A, B, C
E, F, E, E, F, D, C, A, A, A, A
I then generate the unique ordered set of tokens found in each of these sequences, sorted alphabetically:
A, B, C, D
A, B, E, F
A, B, C
A, C, D, E, F
The above are the 1-gram ordered sets for the sequences. However, I also need to be able to represent the N-gram ordered sets.
E.g. for 2-grams the ordered set of tokens for each sequence would be:
AB, BA, BC, CD
AA, AB, BB, BE, EF
AB, BC
AA, CA, DC, EE, EF, FD, FE
And for 3-grams:
ABA, ABC, BAB, BCD
AAA, AAB, ABB, BBE, BEF
ABC
AAA, CAA, DCA, EEF, EFD, EFE, FDC, FEE
The 4-grams would only produce 3 non-empty ordered sets rather than 4, because the sequence A, B, C only has 3 tokens.
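For reference, this is roughly how I am extracting these ordered sets (a Python sketch; `ngram_ordered_set` is just a name I am using here, not an existing library function):

```python
def ngram_ordered_set(seq, n):
    """Return the unique n-grams of a token sequence, sorted alphabetically."""
    grams = {"".join(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    return sorted(grams)

sequences = [
    ["A", "B", "A", "B", "C", "D"],
    ["A", "A", "A", "B", "B", "E", "F"],
    ["A", "B", "C"],
    ["E", "F", "E", "E", "F", "D", "C", "A", "A", "A", "A"],
]

for n in (1, 2, 3, 4):
    print(n, [ngram_ordered_set(s, n) for s in sequences])
# n = 4 gives an empty set for ["A", "B", "C"], as noted above
```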
My initial thought was to have the set of attributes be every possible N-gram that can be formed from the token collection.
For 1-grams this would be:
A B C D E F
And for 2-grams this would be:
AA AB AC AD AE AF BA BB BC BD BE BF CA CB CC CD CE CF DA DB DC DD DE DF EA EB EC ED EE EF FA FB FC FD FE FF
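In code, each sequence would then become a fixed-length binary vector over every possible N-gram, something like the following sketch, which already shows how sparse the representation is:

```python
from itertools import product

tokens = ["A", "B", "C", "D", "E", "F"]

def attribute_vector(seq, n, tokens):
    """Binary presence vector over all len(tokens)**n possible n-grams."""
    attributes = ["".join(p) for p in product(tokens, repeat=n)]  # T**n columns
    present = {"".join(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    return [1 if a in present else 0 for a in attributes]

vec = attribute_vector(["A", "B", "A", "B", "C", "D"], 2, tokens)
print(sum(vec), "of", len(vec), "attributes are non-zero")  # 4 of 36
```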
Unfortunately, this very quickly gets out of hand, even when just going from 1-grams to 2-grams, since the number of attributes grows as T^N (where T is the number of tokens). This seems completely impractical, especially as the resulting matrix consists mostly of 0s for attributes that do not occur in a given sequence.
The problem is only made worse because my collection of tokens is closer to 150 than the 6 used in this example: that is 22,500 possible 2-grams and 3,375,000 possible 3-grams.
Each ordered set is also highly variable in length, ranging from 1 up to the total number of possible N-grams for the given token set.
The other issue is that even if I were to restrict the attributes to only the N-grams that actually occur in the training sequences, there is no guarantee that test data would conform to this set of attributes.
I am guessing that I will need to somehow convert these ordered sets into a more manageable set of attributes, but I am unsure what a viable solution would be, as each token set seems to carry equal importance.
I am also assuming that this will be difficult to generalise, and that each N would require a separately trained model (one for 1-grams, another for 2-grams, and so on, up to the maximum N chosen)?
Topic: features, ngrams, feature-selection
Category: Data Science