How does Keras Tokenizer choose tokens given a sentence?

I tried to find the answer to this question but couldn't find anything, so I'm asking here: how does the Keras Tokenizer choose tokens given a sentence?

To be more precise with what I want to know, given this simple example:

#Import module
from keras.preprocessing.text import Tokenizer
# define a document
doc = ['The cat sat on the mat']
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the document
tokenizer.fit_on_texts(doc)

print('word_index : ', tokenizer.word_index)

The fit_on_texts method creates the vocabulary index based on word frequency. After that, texts_to_sequences takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary.

So, in the step where the tokenizer is fit on the document (I think it is this step), it decides that the tokens are the words of the sentence. Why? Is it possible to change this choice and use the letters of the sentence as tokens instead?

Thank you in advance.


This is simply how the tokenizer works with its default arguments (see the documentation). By default the value of the split argument is ' ', meaning that each sentence is split on every space character to get its tokens. You can change this separator to get other multi-character tokens from a sentence. In addition, there is the char_level keyword, which creates a token for each character instead of for each group of characters.
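To make the default behaviour concrete, here is a minimal plain-Python sketch that imitates it: lowercase the text, replace punctuation with the separator, split on the separator, and number the tokens by descending frequency starting at 1. The helper name build_word_index is made up for illustration and is not part of Keras, and string.punctuation is only an approximation of the default filters, so treat this as a rough model rather than the library's actual code.

```python
import string
from collections import Counter

# Hypothetical helper (not a Keras function): a simplified imitation of
# word-level tokenization with split=' ' and lowercasing enabled.
def build_word_index(texts, split=" ", lower=True):
    # Assumption: string.punctuation plus tab/newline approximates the filters.
    table = str.maketrans({ch: split for ch in string.punctuation + "\t\n"})
    counts = Counter()
    for text in texts:
        if lower:
            text = text.lower()
        # Filtered characters become separators, then the text is split.
        tokens = [tok for tok in text.translate(table).split(split) if tok]
        counts.update(tokens)
    # Most frequent token gets index 1; ties keep first-seen order.
    ordered = [tok for tok, _ in counts.most_common()]
    return {tok: i for i, tok in enumerate(ordered, start=1)}

word_index = build_word_index(["The cat sat on the mat"])
print(word_index)

# Replace each word with its integer, as texts_to_sequences would.
sequence = [word_index[w] for w in "the cat sat on the mat".split(" ")]
print(sequence)
```

Because "the" occurs twice it receives the smallest index, and the remaining words are numbered in the order they first appear.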

Words are often used as tokens because they carry meaning. This meaning is translated into a "machine-readable" format, which happens to be a number. So one distinct word becomes one distinct token (or variable, if you prefer).

Per the docs, you can change the TF/Keras default behaviour of "choosing words" by adding the option char_level=True. So in your case:

tokenizer = Tokenizer(char_level=True)

Character level tokens are sometimes used in "sequence-to-sequence" models in which the (sequential) prediction essentially happens at the character level.
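As a rough illustration of what character-level tokens look like, the following plain-Python sketch builds a character index for the sentence from the question. The function build_char_index is a hypothetical helper, not a Keras API, and it assumes that every character, including the space, counts as a token and that no characters are filtered out.

```python
from collections import Counter

# Hypothetical helper (not a Keras function): a simplified imitation of
# character-level tokenization, where each character is one token.
def build_char_index(texts, lower=True):
    counts = Counter()
    for text in texts:
        # Iterating over a string yields its characters one by one.
        counts.update(text.lower() if lower else text)
    # Most frequent character gets index 1; ties keep first-seen order.
    ordered = [ch for ch, _ in counts.most_common()]
    return {ch: i for i, ch in enumerate(ordered, start=1)}

char_index = build_char_index(["The cat sat on the mat"])
print(char_index)

# Encode a word character by character.
encoded = [char_index[ch] for ch in "cat"]
print(encoded)
```

The vocabulary is much smaller than with word tokens (10 entries here, one of them the space), but each sentence turns into a longer sequence.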

