Handling different length string features and prediction of these based on other features

Obiii

July 17, 2019, 15:28

I am currently working on a problem where the dataset contains 200+ features (let's call them the code features, e.g. no.of.loops, memoryInst, loadInst, etc.) together with the flags that were used to compile code with those characteristics.

The flags are represented as strings:

This is just dummy data.

   snippet       FlagsUsed                            no.of.loops   loadInst   memoryInst
1  Mergesort     " -a -b -c -d=10 -e -f =19  -c "     1             0          10
2  Bubblesort    " -a -c -f=230 "                     2             5          3
3  MatrixMulti   " -f=20 -z -f12 -f2f "               0             10         4
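To make the structure of these strings concrete, here is a minimal Python sketch that splits a flag string into ordered (name, value) pairs. The helper name `parse_flags` is hypothetical, and the sketch assumes that a stray space before `=` (as in `-f =19`) is a typo and that a flag without `=` is a plain toggle; neither assumption is confirmed by the data.

```python
import re

def parse_flags(flag_string):
    """Split a raw flag string into (name, value) pairs, keeping order and repeats."""
    # Assumption: spaces around '=' (e.g. "-f =19") are typos, so glue them back together.
    normalized = re.sub(r"\s*=\s*", "=", flag_string.strip())
    pairs = []
    for token in normalized.split():
        # partition() returns (name, "=", value), or (token, "", "") when there is no "=".
        name, sep, value = token.partition("=")
        pairs.append((name, value if sep else None))
    return pairs

print(parse_flags(" -a -b -c -d=10 -e -f =19  -c "))
```

Toggle flags come back with a value of `None`, and the repeated `-c` is kept as a second entry, so nothing about order or repetition is lost at this stage.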

I need some help with how these flags should be represented in the data. I have tried one-hot encoding and the dummy_variable method, but both have disadvantages:

1) One-hot encoding: with this method the number of features becomes huge. There are 150+ flags, and creating a one-hot vector for each one of them would result in 22,500 features (150 x 150), which is not feasible.

   flag1    flag2    flag3  flag4
1  -a       -b       -c      NA
2  -c       -a       -b      -z

I would need to create a long one-hot vector for each of the above features flag1, flag2, ..., flag4.

2) Dummy_variable method: there are 200+ different types/levels/factors of these flags, and the dummy_variable method creates a feature for each level/factor, which is not feasible.

Also, the flags can repeat in a single string (1st snippet, -c repeats).
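For comparison, a count-based encoding keeps the repeats while using only one column per distinct flag. The sketch below is an illustration rather than something I have tried on the real data: `count_vector` and the small vocabulary are hypothetical, built from the dummy rows above, and flag values such as `-d=10` are ignored.

```python
from collections import Counter

def count_vector(flag_names, vocabulary):
    """One column per known flag; each entry is how often that flag appears."""
    counts = Counter(flag_names)  # a Counter returns 0 for flags that are absent
    return [counts[name] for name in vocabulary]

# Hypothetical vocabulary, built from the dummy data only.
vocabulary = ["-a", "-b", "-c", "-d", "-e", "-f", "-z"]
snippet_1 = ["-a", "-b", "-c", "-d", "-e", "-f", "-c"]
print(count_vector(snippet_1, vocabulary))  # [1, 1, 2, 1, 1, 1, 0]
```

The repeated `-c` shows up as a count of 2, but the order of the flags is lost, so this only helps if the sequence does not matter.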

I am thinking of some clever hashing that would preserve information about the sequence of the flags and their values (toggle flag = 0/1, threshold flag = {lower, upper}). The problem with hashing is that, in the future, I have to predict these flags from the other features (the code features), and if I hash the flags I won't be able to reverse the hash.

I am thinking of some fixed-size vector representation that can be reversed, so that I can recover the flag from a numeric or hex value.

Can anyone please guide me or point me in the right direction? I would be thankful!
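A minimal sketch of what such a reversible, fixed-size representation could look like is an integer ID per flag with padding up to a fixed length. Everything here is an assumption for illustration: the helper names (`build_vocab`, `encode`, `decode`), the choice of ID 0 for padding, and the fixed length of 7. Flag values such as `-d=10` are left out and would need a parallel vector.

```python
PAD = 0  # ID 0 is reserved for "no flag in this slot"

def build_vocab(sequences):
    """Assign each distinct flag name a positive integer ID, in first-seen order."""
    vocab = {}
    for seq in sequences:
        for name in seq:
            vocab.setdefault(name, len(vocab) + 1)
    return vocab

def encode(seq, vocab, length):
    """Turn a flag sequence into a fixed-length list of IDs, padded with PAD."""
    if len(seq) > length:
        raise ValueError("sequence longer than the fixed length")
    return [vocab[name] for name in seq] + [PAD] * (length - len(seq))

def decode(ids, vocab):
    """Invert encode(): map IDs back to flag names and drop the padding."""
    names = {i: name for name, i in vocab.items()}
    return [names[i] for i in ids if i != PAD]

sequences = [["-a", "-b", "-c", "-d", "-e", "-f", "-c"], ["-a", "-c", "-f"]]
vocab = build_vocab(sequences)
encoded = encode(sequences[1], vocab, length=7)
print(encoded)                 # [1, 3, 6, 0, 0, 0, 0]
print(decode(encoded, vocab))  # ['-a', '-c', '-f']
```

Unlike a hash, the mapping is a lookup table, so it can be inverted exactly, and both order and repeats survive. The IDs are arbitrary category labels rather than magnitudes, so a model would have to treat each slot as categorical.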

Topic feature-map data-science-model feature-engineering r machine-learning

Category Data Science
