Adding a new token to a transformer model without breaking tokenization of subwords

I'm running an experiment investigating the internal structure of large pre-trained models (BERT and RoBERTa, specifically). Part of this experiment involves fine-tuning a model on a made-up new word in a specific sentential context and observing its predictions for that novel word in other contexts after tuning. Because I am only trying to teach the model a new word, I freeze the embeddings of all other words during fine-tuning so that only the weights for the new word are updated. In other words, I want everything to behave exactly as it normally would, except that the new word is added to the model's vocabulary.
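To be concrete about the freezing step, this is a minimal sketch of one way it can be done (via a gradient hook on the embedding matrix; the variable names and the hook are only illustrative and not part of the MWE below):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tok = BertTokenizer.from_pretrained('bert-base-uncased')
mlm = BertForMaskedLM.from_pretrained('bert-base-uncased')
tok.add_tokens(['myword1'])
mlm.resize_token_embeddings(len(tok))
new_id = tok.convert_tokens_to_ids('myword1')

# freeze everything except the (tied) input embedding matrix
for p in mlm.parameters():
    p.requires_grad_(False)
emb = mlm.get_input_embeddings()
emb.weight.requires_grad_(True)

# zero out the gradient for every embedding row except the new token's,
# so only the new word's vector is updated during fine-tuning
def only_new_row(grad):
    mask = torch.zeros_like(grad)
    mask[new_id] = 1.0
    return grad * mask

emb.weight.register_hook(only_new_row)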

I've added the new words to the model and tokenizer as in this MWE (shown here for BERT):

from transformers import BertTokenizer, BertForMaskedLM

new_words = ['myword1', 'myword2']
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)

tokenizer.tokenize('myword1 myword2')
# verify the words do not already exist in the vocabulary
# result: ['my', '##word', '##1', 'my', '##word', '##2']

tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))

tokenizer.tokenize('myword1 myword2')
# result: ['myword1', 'myword2']

# a fresh tokenizer without the added tokens, for comparison:
# the trailing period is normally treated as a subword ('##.')
new_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)

new_tokenizer.tokenize('the period is a subword.')
# result: ['the', 'period', 'is', 'a', 'sub', '##word', '##.']

# but not when it comes right after an added token
tokenizer.tokenize('but not when it follows myword1.')
# result: ['but', 'not', 'when', 'it', 'follows', 'myword1', '.']

How can I add a new token and have it behave correctly, i.e., preserving the correct subword tokenization of the adjacent strings? Similar issues arise with RoBERTa, where the word following the added token does not appear to be tokenized correctly: it is tokenized without the 'Ġ' that indicates a preceding space, even though the 'Ġ' is present when the added word is replaced with an existing token. (The 'Ġ' is also missing from the added token itself, but I assume that as long as the added token never occurs at the beginning of a string, that wouldn't matter.)
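For reference, the RoBERTa comparison I have in mind can be reproduced with something along these lines (I'm not pasting the outputs here; the point is to compare whether the word after the added token keeps its 'Ġ' prefix, as it does after an ordinary word):

from transformers import RobertaTokenizer

roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
roberta_tokenizer.add_tokens(['myword1'])

# the word after an ordinary token vs. the word after the added token
print(roberta_tokenizer.tokenize('it follows something here.'))
print(roberta_tokenizer.tokenize('it follows myword1 here.'))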

Edit: After poking around a bit more, I've found that this is related to my setting do_basic_tokenize=False. If it is left at its default (True), the results come out as expected. Nevertheless, I'd prefer to keep it set to False if there's a way to fix this.
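For concreteness, the same comparison with the default settings would look something like this (default_tokenizer is just an illustrative name; new_words and the sentences are the ones from the MWE above). With basic tokenization on, punctuation is split off before WordPiece runs, so as far as I can tell the trailing period comes out as a bare '.' in both sentences and the inconsistency disappears:

default_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # do_basic_tokenize=True by default
default_tokenizer.add_tokens(new_words)

default_tokenizer.tokenize('the period is a subword.')
default_tokenizer.tokenize('but not when it follows myword1.')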

Topic: huggingface, tokenization

Category: Data Science


You can add the tokens as special tokens, similar to [SEP] or [CLS], using the add_special_tokens method. They will be separated out during pre-tokenization and not passed on for further (subword) tokenization.
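A minimal sketch of what that could look like for the setup in the question, reusing model, tokenizer, and new_words from the MWE (in recent versions of transformers, tokenizer.add_tokens(new_words, special_tokens=True) should be an equivalent route; either way the embeddings still need to be resized):

tokenizer.add_special_tokens({'additional_special_tokens': new_words})
model.resize_token_embeddings(len(tokenizer))

# re-check the problematic sentence after adding the words as special tokens
tokenizer.tokenize('but not when it follows myword1.')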
