Is it possible to add new vocabulary to BERT's tokenizer when fine-tuning?
I want to fine-tune BERT on a domain-specific dataset of my own. The domain includes many terms that probably weren't in the corpus BERT was originally trained on. I know I have to use BERT's tokenizer, since the model was trained on its embeddings. To my understanding, words unknown to the tokenizer are replaced with the [UNK] token. What if some of these words are common in my dataset? Does it make sense to add new IDs for them? Is that possible without interfering with the network's parameters and the existing embeddings? If so, how is it done?
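For context on the [UNK] concern: BERT's tokenizer is WordPiece-based, so a word missing from the vocabulary is usually split into known subword pieces rather than replaced wholesale; [UNK] only appears when no subword covers a position. A minimal sketch of that greedy longest-match behaviour, using a tiny hypothetical vocabulary (not BERT's real one):

```python
# Toy greedy longest-match subword tokenizer, illustrating how a
# WordPiece-style vocabulary handles words it has never seen.
# The vocabulary below is hypothetical, not BERT's actual vocab.
VOCAB = {"[UNK]", "bio", "##mark", "##er", "gene", "##s", "immuno", "##therapy"}

def wordpiece(word, vocab=VOCAB):
    """Split `word` into known subwords; fall back to [UNK] if impossible."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:              # non-initial pieces carry the ## prefix
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1                   # shrink the candidate and retry
        if cur is None:
            return ["[UNK]"]           # no known subword covers this position
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece("biomarkers"))  # covered by known subwords
print(wordpiece("xyzzy"))       # nothing matches -> [UNK]
```

If subword splits still seem too lossy for frequent domain terms, the Hugging Face `transformers` library does let you extend the vocabulary with `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))`, which appends freshly initialized rows to the embedding matrix while leaving the existing embeddings and the rest of the network's weights untouched.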
Topic: bert, finetuning, word-embeddings, nlp
Category: Data Science