Why does my char-level Keras tokenizer add spaces when converting sequences to texts?

I create a tokenizer with

import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer(split='', char_level=True, ...)
tokenizer.fit_on_texts(...)

But when I convert sequences of tokens to texts, the result contains a space after each character (except for the last one):

test_text = 'this is a test'
seq = tokenizer.texts_to_sequences([test_text])
r = tokenizer.sequences_to_texts(seq)[0]
assert r == ' '.join(test_text)  # r == 't h i s   i s   a   t e s t'

Is there a way to avoid these added spaces? Am I missing some configuration parameter?

Topic: tokenization, python-3.x, keras

Category: Data Science


This is a consequence of how the character-level tokenizer works in Keras (arguably a quirk rather than a bug): sequences_to_texts always joins the recovered tokens with a single space, even when char_level=True and every token is a single character.

A simple way to correct the output is to delete every second character of each decoded string; because every token is exactly one character, the inserted separators all fall at the odd indices:

texts = tokenizer.sequences_to_texts(seq)
texts_no_spaces = [text[::2] for text in texts]
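Alternatively, you can skip sequences_to_texts entirely and map each token id back to its character via the tokenizer's index_word dictionary, joining with the empty string. A minimal sketch, assuming a standard TensorFlow 2.x Tokenizer fitted as above (decode_chars is a hypothetical helper, not part of the Keras API):

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(['this is a test'])
seq = tokenizer.texts_to_sequences(['this is a test'])

def decode_chars(tokenizer, sequences):
    # Look up each integer id in index_word and concatenate the
    # characters directly, avoiding the ' '.join separator that
    # sequences_to_texts inserts between tokens.
    return [''.join(tokenizer.index_word[i] for i in s) for s in sequences]

assert decode_chars(tokenizer, seq) == ['this is a test']

This avoids relying on the [::2] trick, which only works because every token happens to be a single character long.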
