BPE vs WordPiece Tokenization - when to use / which?

What's the general tradeoff between choosing BPE and WordPiece tokenization? When is one preferable to the other? Are there any differences in model performance between the two? I'm looking for a general overview, backed up with specific examples.

Topic: transformer, sentiment-analysis, machine-translation, nlp, machine-learning

Category: Data Science


Adding more info to noe's answer:

The difference between BPE and WordPiece lies in how symbol pairs are chosen for addition to the vocabulary. Instead of merging the most frequent pair, WordPiece merges the pair that maximises the likelihood of the training data: it builds a language model over the current vocabulary and, at each step, picks the pair whose merge increases that likelihood the most (equivalently, the pair with the highest score freq(ab) / (freq(a) × freq(b)), which favours pairs whose parts rarely occur outside the pair). The merged symbol is added to the vocabulary, the language model is re-estimated over the new vocabulary, and these steps are repeated until the desired vocabulary size is reached. A toy comparison of the two selection rules is sketched below.
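As a small illustration of the selection rules described above, the sketch below compares a single merge step of each algorithm on a toy corpus. The word frequencies and variable names are made up for the example, and the score freq(ab) / (freq(a) × freq(b)) is the commonly cited WordPiece criterion rather than code from any particular library.

```python
from collections import Counter

# Toy word frequencies; every word starts out split into characters (the base vocabulary).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
splits = {w: list(w) for w in word_freqs}

pair_counts, symbol_counts = Counter(), Counter()
for word, symbols in splits.items():
    freq = word_freqs[word]
    for s in symbols:
        symbol_counts[s] += freq
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq

# BPE: merge the most frequent adjacent pair.
bpe_pick = max(pair_counts, key=pair_counts.get)

# WordPiece: merge the pair with the highest likelihood-based score
#   score(a, b) = freq(ab) / (freq(a) * freq(b)),
# which favours pairs whose parts rarely occur apart from each other.
wp_pick = max(
    pair_counts,
    key=lambda p: pair_counts[p] / (symbol_counts[p[0]] * symbol_counts[p[1]]),
)

print("BPE would merge first:      ", bpe_pick)
print("WordPiece would merge first:", wp_pick)
```

On this toy corpus the two rules can pick different pairs, which is exactly the point: BPE rewards raw frequency, WordPiece rewards pairs that explain the data well relative to how often their parts appear on their own.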


(This answer was originally a comment)

You can find the algorithmic difference here. In practical terms, their main visible difference is that BPE marks subword boundaries with @@ at the end of a token, while WordPiece marks continuation pieces with ## at the beginning. The main performance difference usually comes not from the algorithm but from the specific implementation, e.g. SentencePiece offers a very fast C++ implementation of BPE. You can find fast Rust implementations of both in Hugging Face's tokenizers library; a short training example with it is sketched below.
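If you want to see the two schemes side by side, here is a minimal sketch using Hugging Face's tokenizers library mentioned above. The tiny in-memory corpus and the vocab_size of 60 are arbitrary placeholders chosen for the example.

```python
# pip install tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece
from tokenizers.trainers import BpeTrainer, WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = ["the lowest newest widest rates", "new lower rates are the newest"] * 100

# Train a BPE tokenizer. Note that Hugging Face's BPE does not add the @@ marker
# by default; that convention comes from the subword-nmt tooling.
bpe_tok = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tok.pre_tokenizer = Whitespace()
bpe_tok.train_from_iterator(corpus, BpeTrainer(vocab_size=60, special_tokens=["[UNK]"]))

# Train a WordPiece tokenizer; continuation pieces get the "##" prefix by default.
wp_tok = Tokenizer(WordPiece(unk_token="[UNK]"))
wp_tok.pre_tokenizer = Whitespace()
wp_tok.train_from_iterator(corpus, WordPieceTrainer(vocab_size=60, special_tokens=["[UNK]"]))

print("BPE:      ", bpe_tok.encode("newest rates").tokens)
print("WordPiece:", wp_tok.encode("newest rates").tokens)
```

The two tokenizers share the same training loop and differ only in the model and trainer classes, which is why switching between them in practice is mostly a one-line change.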
