Create sequences of non-dictionary words
I have a few word sequences (system-call names):
recvfrom,sendto,epoll_pwait,recvfrom,sendto,epoll_pwait
getuid,recvfrom,writev,getuid,epoll_pwait,getuid
Now I want to tokenize them and then turn them into sequences to feed into the model.
For a standard word vector I would do something like this:
### Create sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocabulary_size = 20000
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])
data = pad_sequences(sequences, maxlen=50)
But my data contains non-dictionary words, and some words repeat. How do I convert this data into sequences?
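For what it's worth, the Tokenizer does not care whether words are in a dictionary, and repeated words simply map to the same integer id. One detail to watch in my case: the default `filters` argument strips underscores, which would split a name like `epoll_pwait` into two tokens, and my entries are separated by commas rather than spaces. A minimal sketch (assuming `df['text']` holds the comma-separated strings above) that overrides `filters` and `split`:

```python
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample data from the question: comma-separated system-call names.
df = pd.DataFrame({'text': [
    "recvfrom,sendto,epoll_pwait,recvfrom,sendto,epoll_pwait",
    "getuid,recvfrom,writev,getuid,epoll_pwait,getuid",
]})

vocabulary_size = 20000
# filters='' keeps underscores intact; split=',' matches the delimiter.
tokenizer = Tokenizer(num_words=vocabulary_size, filters='', split=',')
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])
data = pad_sequences(sequences, maxlen=50)

print(tokenizer.word_index)  # each distinct syscall name gets one integer id
print(data.shape)            # (2, 50): two rows, each padded to length 50
```

With these two arguments changed, the rest of the standard pipeline above works unchanged; repeated names such as `getuid` all map to the same id.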
Tags: tokenization, keras, scikit-learn, nlp
Category: Data Science