Create sequences from non-dictionary words

I have a few word vectors:

recvfrom,sendto,epoll_pwait,recvfrom,sendto,epoll_pwait 

getuid,recvfrom,writev,getuid,epoll_pwait,getuid

Now I want to tokenize them and turn them into sequences to feed into the model.

For a standard word vector I would do something like this:

### Create sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

vocabulary_size = 20000
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])
data = pad_sequences(sequences, maxlen=50)

But my data contains non-dictionary words as well as repeated words. How do I convert this data into sequences?



This can be done in multiple ways. I'm not 100% sure I understand the question, since it's a bit ambiguous, but there are three options:

1. If you are feeding a neural network, use a Keras Embedding layer; to create the sequences for it, you can use `one_hot` and padding from `keras.preprocessing`. (The sequences can be used regardless of whether the model is a neural network.)
2. Use gensim to train a word2vec model and build a list of word vectors.
3. If you want the most control, build the index yourself with something like:

input_characters = set()
target_characters = set()

# Collect every distinct character seen in the input and target texts
for text in input_texts:
    for char in text:
        if char not in input_characters:
            input_characters.add(char)

for text in target_texts:
    for char in text:
        if char not in target_characters:
            target_characters.add(char)
and then:

num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)

# Map each character to a (scaled) index
input_token_index = dict(
    [(char, i / num_encoder_tokens) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, i / num_decoder_tokens) for i, char in enumerate(target_characters)])

In your scenario, you can simply iterate through each sentence with a for loop (splitting on the commas), build a dictionary mapping each distinct word to an integer, run the result through `one_hot` and padding from `keras.preprocessing`, and finally feed the padded sequences into an embedding layer. Repeated and non-dictionary words are not a problem: every occurrence of a word just maps to the same integer.
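Putting that together for the comma-separated syscall strings in the question, a minimal pure-Python sketch of the build-a-dictionary-then-pad approach (variable names are illustrative; `pad_sequences` would do the padding step for you) might look like:

```python
# The two "sentences" from the question
texts = [
    "recvfrom,sendto,epoll_pwait,recvfrom,sendto,epoll_pwait",
    "getuid,recvfrom,writev,getuid,epoll_pwait,getuid",
]

# Build the vocabulary: one integer id per distinct token
# (0 is reserved for padding, so ids start at 1)
vocab = {}
for text in texts:
    for token in text.split(","):
        if token not in vocab:
            vocab[token] = len(vocab) + 1

# Integer-encode each text; repeated tokens reuse the same id
sequences = [[vocab[t] for t in text.split(",")] for text in texts]

# Left-pad every sequence to a fixed length, like pad_sequences would
maxlen = 8
padded = [[0] * (maxlen - len(s)) + s for s in sequences]
```

These padded integer sequences are exactly what a Keras Embedding layer expects as input.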
