TensorFlow text Tokenizer incorrect tokenization
I am trying to use the TF Tokenizer for an NLP model:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=200, split=' ')
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
Output:
[[1, 7, 8, 9]]
Word index:
print(tokenizer.index_word[8])  # 'ab'
print(tokenizer.index_word[9])  # 'cdefghijklmnopqrstuvwxyz'
The problem is that the tokenizer creates tokens based on the . in this case. Since I am passing split=' ' to the Tokenizer, I expect the following output:
[[1,7,8]], where tokenizer.index_word[8] should be 'ab.cdefghijklmnopqrstuvwxyz'
In other words, I want the tokenizer to split the text into words on the space character (' ') only, and not on any special characters. How do I make the tokenizer create tokens only on spaces?
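From skimming the documentation, I suspect the default filters argument of Tokenizer (which, if I understand correctly, treats punctuation such as . as a separator rather than keeping it inside a token) might be the cause, not split itself. Below is a sketch of what I am considering, assuming filters is the right parameter to override; custom_filters is just my own name for the default filter string with the . removed:

from tensorflow.keras.preprocessing.text import Tokenizer

# Default punctuation filter string minus the '.' character, so periods stay inside tokens
# (assumption: the default filters, not split, is what breaks 'ab.cdefghijklmnopqrstuvwxyz' apart).
custom_filters = '!"#$%&()*+,-/:;<=>?@[\\]^_`{|}~\t\n'

tokenizer = Tokenizer(num_words=200, split=' ', filters=custom_filters)
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
# Hoping this prints [[1, 7, 8]] with tokenizer.index_word[8] == 'ab.cdefghijklmnopqrstuvwxyz'

Is overriding filters like this the intended way to get space-only tokenization, or is there a cleaner option (e.g. filters='')?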
Topic tokenization keras tensorflow deep-learning
Category Data Science