Why does my char-level Keras tokenizer add spaces when converting sequences to texts?

I create a tokenizer with

import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer(split='', char_level=True, ...)
tokenizer.fit_on_texts(...)

But when I convert sequences of tokens to texts, the result contains a space after each character (except for the last one):

test_text = 'this is a test'
seq = tokenizer.texts_to_sequences([test_text])
r = tokenizer.sequences_to_texts(seq)[0]
assert r == ' '.join(test_text)  # r == 't h i s   i s   a   t e s t'

Is there a way to avoid these added spaces? Am I missing some configuration parameter?

Topic: tokenization, python-3.x, keras

Category: Data Science


This is a consequence of how the character-level tokenizer works in Keras (arguably a quirk rather than a bug): sequences_to_texts always joins the recovered tokens with a single space, even when char_level=True and every token is a single character.

A simple way to correct the output is to delete every second character of each decoded string; because every token is exactly one character, the inserted separators all fall at the odd indices:

texts = tokenizer.sequences_to_texts(seq)
texts_no_spaces = [text[::2] for text in texts]
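Alternatively, you can skip sequences_to_texts entirely and map each token id back to its character via the tokenizer's index_word dictionary, joining with the empty string. A minimal sketch, assuming a standard TensorFlow 2.x Tokenizer fitted as above (decode_chars is a hypothetical helper, not part of the Keras API):

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(['this is a test'])
seq = tokenizer.texts_to_sequences(['this is a test'])

def decode_chars(tokenizer, sequences):
    # Look up each integer id in index_word and concatenate the
    # characters directly, avoiding the ' '.join separator that
    # sequences_to_texts inserts between tokens.
    return [''.join(tokenizer.index_word[i] for i in s) for s in sequences]

assert decode_chars(tokenizer, seq) == ['this is a test']

This avoids relying on the [::2] trick, which only works because every token happens to be a single character long.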
