TensorFlow text Tokenizer incorrect tokenization
I am trying to use the TF Tokenizer for an NLP model:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=200, split=' ')
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
Output:
[[1, 7, 8, 9]]
Word index:
print(tokenizer.index_word[8])  # 'ab'
print(tokenizer.index_word[9])  # 'cdefghijklmnopqrstuvwxyz'
The problem is that the tokenizer creates tokens based on the . in this case. Since I am passing split=' ' to the Tokenizer, I expect the following output:
[[1,7,8]], where tokenizer.index_word[8] should be 'ab.cdefghijklmnopqrstuvwxyz'
In other words, I want the tokenizer to split the text into words on the space character (' ') only, and not on any special characters. How do I make the tokenizer create tokens only on spaces?
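From skimming the documentation, I suspect the default filters argument of Tokenizer (which, if I understand correctly, treats punctuation such as . as a separator rather than keeping it inside a token) might be the cause, not split itself. Below is a sketch of what I am considering, assuming filters is the right parameter to override; custom_filters is just my own name for the default filter string with the . removed:

from tensorflow.keras.preprocessing.text import Tokenizer

# Default punctuation filter string minus the '.' character, so periods stay inside tokens
# (assumption: the default filters, not split, is what breaks 'ab.cdefghijklmnopqrstuvwxyz' apart).
custom_filters = '!"#$%&()*+,-/:;<=>?@[\\]^_`{|}~\t\n'

tokenizer = Tokenizer(num_words=200, split=' ', filters=custom_filters)
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
# Hoping this prints [[1, 7, 8]] with tokenizer.index_word[8] == 'ab.cdefghijklmnopqrstuvwxyz'

Is overriding filters like this the intended way to get space-only tokenization, or is there a cleaner option (e.g. filters='')?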
Topic tokenization keras tensorflow deep-learning
Category Data Science