BPE vs WordPiece Tokenization - when to use / which?

What's the general tradeoff between choosing BPE and WordPiece tokenization? When is one preferable to the other? Are there any differences in model performance between the two? I'm looking for a general overview, backed up with specific examples.

Topic: transformer, sentiment-analysis, machine-translation, nlp, machine-learning

Category: Data Science


Adding more info to noe's answer:

The difference between BPE and WordPiece lies in how symbol pairs are chosen for addition to the vocabulary. Instead of merging the most frequent pair, WordPiece merges the pair that maximises the likelihood of the training data: it builds a language model over the current vocabulary and, at each step, picks the pair whose merge increases that likelihood the most (equivalently, the pair with the highest score freq(ab) / (freq(a) × freq(b)), which favours pairs whose parts rarely occur outside the pair). The merged symbol is added to the vocabulary, the language model is re-estimated over the new vocabulary, and these steps are repeated until the desired vocabulary size is reached. A toy comparison of the two selection rules is sketched below.
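As a small illustration of the selection rules described above, the sketch below compares a single merge step of each algorithm on a toy corpus. The word frequencies and variable names are made up for the example, and the score freq(ab) / (freq(a) × freq(b)) is the commonly cited WordPiece criterion rather than code from any particular library.

```python
from collections import Counter

# Toy word frequencies; every word starts out split into characters (the base vocabulary).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
splits = {w: list(w) for w in word_freqs}

pair_counts, symbol_counts = Counter(), Counter()
for word, symbols in splits.items():
    freq = word_freqs[word]
    for s in symbols:
        symbol_counts[s] += freq
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq

# BPE: merge the most frequent adjacent pair.
bpe_pick = max(pair_counts, key=pair_counts.get)

# WordPiece: merge the pair with the highest likelihood-based score
#   score(a, b) = freq(ab) / (freq(a) * freq(b)),
# which favours pairs whose parts rarely occur apart from each other.
wp_pick = max(
    pair_counts,
    key=lambda p: pair_counts[p] / (symbol_counts[p[0]] * symbol_counts[p[1]]),
)

print("BPE would merge first:      ", bpe_pick)
print("WordPiece would merge first:", wp_pick)
```

On this toy corpus the two rules can pick different pairs, which is exactly the point: BPE rewards raw frequency, WordPiece rewards pairs that explain the data well relative to how often their parts appear on their own.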


(This answer was originally a comment)

You can find the algorithmic difference here. In practical terms, their main visible difference is that BPE marks subword boundaries with @@ at the end of a token, while WordPiece marks continuation pieces with ## at the beginning. The main performance difference usually comes not from the algorithm but from the specific implementation, e.g. SentencePiece offers a very fast C++ implementation of BPE. You can find fast Rust implementations of both in Hugging Face's tokenizers library; a short training example with it is sketched below.
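If you want to see the two schemes side by side, here is a minimal sketch using Hugging Face's tokenizers library mentioned above. The tiny in-memory corpus and the vocab_size of 60 are arbitrary placeholders chosen for the example.

```python
# pip install tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece
from tokenizers.trainers import BpeTrainer, WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = ["the lowest newest widest rates", "new lower rates are the newest"] * 100

# Train a BPE tokenizer. Note that Hugging Face's BPE does not add the @@ marker
# by default; that convention comes from the subword-nmt tooling.
bpe_tok = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tok.pre_tokenizer = Whitespace()
bpe_tok.train_from_iterator(corpus, BpeTrainer(vocab_size=60, special_tokens=["[UNK]"]))

# Train a WordPiece tokenizer; continuation pieces get the "##" prefix by default.
wp_tok = Tokenizer(WordPiece(unk_token="[UNK]"))
wp_tok.pre_tokenizer = Whitespace()
wp_tok.train_from_iterator(corpus, WordPieceTrainer(vocab_size=60, special_tokens=["[UNK]"]))

print("BPE:      ", bpe_tok.encode("newest rates").tokens)
print("WordPiece:", wp_tok.encode("newest rates").tokens)
```

The two tokenizers share the same training loop and differ only in the model and trainer classes, which is why switching between them in practice is mostly a one-line change.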
