Handling variable-length string features and predicting them from other features
I am currently working on a problem where the dataset contains 200+ features (let's call them the code features, e.g. no.of.loops, memoryInst, loadInst, etc.) along with the flags that were used to compile the code that has these characteristics.
The flags are represented as strings:
This is just dummy data.
  snippet      FlagsUsed                        no.of.loops  loadInst  memoryInst
1 Mergesort    " -a -b -c -d=10 -e -f =19 -c "  1            0         10
2 Bubblesort   " -a -c -f=230 "                 2            5         3
3 MatrixMulti  " -f=20 -z -f12 -f2f "           0            10        4
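For reference, here is a minimal Python sketch of how such a string could be split into (flag, value) pairs. The helper name parse_flags and the regular expression are hypothetical, and the sketch assumes that a value always follows an "=" sign and that whitespace around "=" (as in "-f =19") is just noise:

```python
import re

# Assumption: a flag is "-" followed by word characters, optionally
# followed by "=" and a value; spaces around "=" are tolerated.
FLAG_RE = re.compile(r"(-\w+)(?:\s*=\s*(\w+))?")

def parse_flags(raw):
    """Split a raw flag string into a list of (name, value) pairs.

    Toggle flags get value None; order and repeats are preserved.
    """
    return [(name, value or None) for name, value in FLAG_RE.findall(raw)]

# Dummy data from the table above.
mergesort = parse_flags(" -a -b -c -d=10 -e -f =19 -c ")
matrix = parse_flags(" -f=20 -z -f12 -f2f ")
```

With this parsing, "-f12" and "-f2f" are treated as flag names of their own rather than as "-f" with a value, which may or may not match the real compiler's semantics.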
I need help deciding how these flags should be represented in the data. I have tried one-hot encoding and the dummy-variable method, but both have disadvantages:
1) One-hot encoding: with this method the number of features becomes huge. There are 150+ flags, and creating a one-hot vector for each of them would result in 22500 features, which is not feasible.
  flag1  flag2  flag3  flag4
1 -a     -b     -c     NA
2 -c     -a     -b     -z
I would need to create a long vector for each of the above features flag1, flag2, ..., flag4.
2) Dummy-variable method: there are 200+ different types/levels/factors of these flags, and the dummy-variable method creates a feature for each level/factor, which is not feasible.
Also, a flag can repeat within a single string (in the 1st snippet, -c repeats).
I am thinking of some clever hashing that would maintain information about the sequence of the flags and their values (toggle flag = 0/1, threshold flag = {lower, upper}). The problem with hashing is that, in the future, I have to predict these flags from the other features (the code features), and if I hash the flags I won't be able to reverse the hash.
I am thinking of some fixed-size vector representation that could be reversed, so that I can recover the flag from a numeric or hex number.
Can anyone please guide me or point me in the right direction? I would be thankful!
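To make the reversibility concern concrete, here is a small Python sketch of the hashing trick on toy data. The bucket function, the flag list and the bucket count are all made up for illustration. Whenever there are more distinct flags than buckets, at least two flags must share a bucket, so a predicted bucket cannot be mapped back to a single flag:

```python
import hashlib

def bucket(flag, n_buckets):
    """Map a flag to a bucket index with a stable hash.

    md5 is used only because Python's built-in hash() is salted
    per process; any stable hash would do for this illustration.
    """
    digest = hashlib.md5(flag.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

flags = ["-a", "-b", "-c", "-d", "-e"]  # toy vocabulary (assumption)
n_buckets = 4                           # deliberately smaller than len(flags)

# Group flags by the bucket they land in.
buckets = {}
for f in flags:
    buckets.setdefault(bucket(f, n_buckets), []).append(f)

# Buckets holding more than one flag cannot be decoded unambiguously.
collisions = {b: fs for b, fs in buckets.items() if len(fs) > 1}
```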
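As a rough illustration of what I mean by "reversible", one possible shape is a count vector over a fixed flag vocabulary, sketched below in Python. The function names build_vocab, encode and decode are hypothetical, and this sketch makes two simplifying assumptions: it drops the flag values (the part after "=") and it loses the order of the flags, so it is only a partial answer to what I need:

```python
import re

# Assumption: a flag is "-" plus word characters, optionally "=value".
FLAG_RE = re.compile(r"(-\w+)(?:\s*=\s*(\w+))?")

def build_vocab(flag_strings):
    """Collect every distinct flag name and fix an ordering."""
    names = {name for s in flag_strings for name, _ in FLAG_RE.findall(s)}
    return sorted(names)

def encode(raw, vocab):
    """Fixed-size vector: one count per vocabulary entry.

    Raises KeyError for a flag that is not in the vocabulary.
    """
    index = {name: i for i, name in enumerate(vocab)}
    counts = [0] * len(vocab)
    for name, _ in FLAG_RE.findall(raw):
        counts[index[name]] += 1
    return counts

def decode(counts, vocab):
    """Recover the flag names (with repeats) from a count vector."""
    return [name for name, c in zip(vocab, counts) for _ in range(c)]

rows = [
    " -a -b -c -d=10 -e -f =19 -c ",
    " -a -c -f=230 ",
    " -f=20 -z -f12 -f2f ",
]
vocab = build_vocab(rows)
vec = encode(rows[0], vocab)
recovered = decode(vec, vocab)
```

The vector length equals the number of distinct flags (one column per flag, not one per flag per position), and repeats such as the double -c show up as a count of 2.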
Topic feature-map data-science-model feature-engineering r machine-learning
Category Data Science