How do I generate text from ids in Torchtext's sentencepiece_numericalizer?

The torchtext sentencepiece_numericalizer() returns a generator that yields, for each input sentence, the SentencePiece model indices corresponding to its tokens. From the generator, I can get the ids.

My question is: how do I get the text back from these ids after training?

For example

 sp_id_generator = sentencepiece_numericalizer(sp_model)
 list_a = ["sentencepiece encode as pieces", "examples to   try!"]
 list(sp_id_generator(list_a))
    [[9858, 9249, 1629, 1305, 1809, 53, 842],
     [2347, 13, 9, 150, 37]]

How do I convert the ids back to the text in list_a (i.e. "sentencepiece encode as pieces", "examples to   try!")?



Torchtext does not implement this, but you can use the SentencePiece package directly; it is installable from PyPI.

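import sentencepiece as spm

# Load the same model file that was used to build the numericalizer
sp = spm.SentencePieceProcessor(model_file='test/test_model.model')
# Decode one list of ids back into a string
sp.decode([9858, 9249, 1629, 1305, 1809, 53, 842])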
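Since the numericalizer yields one list of ids per input sentence, you will probably want to decode the whole batch. Here is a minimal sketch; the helper names `decode_batches` and `load_processor` are made up for illustration and are not part of torchtext or SentencePiece. It only assumes the processor object has a `decode` method that takes a list of ints, as `SentencePieceProcessor` does.

```python
from typing import Iterable, List


def decode_batches(processor, id_batches: Iterable[Iterable[int]]) -> List[str]:
    """Decode every id sequence in a batch back to text.

    `processor` is anything with a `decode(list_of_ints)` method,
    e.g. a sentencepiece.SentencePieceProcessor (hypothetical helper).
    """
    # Cast each id to a plain int so tensors or NumPy integers also work
    return [processor.decode([int(i) for i in ids]) for ids in id_batches]


def load_processor(model_path: str):
    """Load a SentencePiece model from disk (hypothetical helper)."""
    # Imported lazily so decode_batches can be used or tested on its own
    import sentencepiece as spm

    return spm.SentencePieceProcessor(model_file=model_path)
```

With the objects from the question, usage would look like `decode_batches(load_processor('test/test_model.model'), sp_id_generator(list_a))`. Note that the decoded text may not match the input character for character: with its default normalization settings SentencePiece collapses repeated spaces, so "examples to   try!" would likely come back as "examples to try!".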
