How do I generate text from ids in Torchtext's sentencepiece_numericalizer?

Question


Fhunmie

April 29, 2022 07:52

The torchtext sentencepiece_numericalizer() produces a generator of SentencePiece model indices corresponding to the tokens of each input sentence. From the generator, I can get the ids.

My question is: how do I get the text back from those ids after training?

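For example:

 from torchtext.data.functional import sentencepiece_numericalizer

 # sp_model is a SentencePiece model that was loaded beforehand
 sp_id_generator = sentencepiece_numericalizer(sp_model)
 list_a = ["sentencepiece encode as pieces", "examples to   try!"]
 list(sp_id_generator(list_a))
    [[9858, 9249, 1629, 1305, 1809, 53, 842],
     [2347, 13, 9, 150, 37]]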

How do I convert these ids back to the text in list_a (i.e. "sentencepiece encode as pieces" and "examples to try!")?

Topic bert transformer pytorch nlp python

Category Data Science

Jindřich · Accepted Answer · April 29, 2022 07:52

Torchtext does not implement this, but you can use the SentencePiece package directly. It is installable from PyPI.

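import sentencepiece as spm

# Load the SentencePiece model (use the model the ids were produced with)
sp = spm.SentencePieceProcessor(model_file='test/test_model.model')
# Decode one list of ids back into a string
sp.decode([9858, 9249, 1629, 1305, 1809, 53, 842])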
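
Since the numericalizer yields one list of ids per input sentence, the same call can be applied to each list in turn. The following is only a sketch: decode_batch and IdDecoder are hypothetical names introduced for illustration, not part of torchtext or SentencePiece, and it assumes the processor object exposes a decode(ids) method as in the snippet above.

```python
from typing import Iterable, List, Protocol


class IdDecoder(Protocol):
    """Hypothetical interface: any object with decode(ids) -> str,
    such as the SentencePieceProcessor used above."""

    def decode(self, ids: List[int]) -> str: ...


def decode_batch(processor: IdDecoder,
                 id_lists: Iterable[Iterable[int]]) -> List[str]:
    """Decode every id list (one per sentence) back into a string."""
    # list(ids) materializes each inner iterable, so generators work as well
    return [processor.decode(list(ids)) for ids in id_lists]
```

With the objects from the question and the answer, this would be called as decode_batch(sp, sp_id_generator(list_a)). Decoding returns the text as the model normalized it, so details such as repeated spaces may not come back exactly as typed.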
