Is it possible to add new vocabulary to BERT's tokenizer when fine-tuning?
I want to fine-tune BERT on a domain-specific dataset of my own. The domain includes many terms that probably weren't in the corpus BERT was originally trained on. I know I have to use BERT's tokenizer, since the model was trained on its embeddings. To my understanding, words unknown to the tokenizer are replaced with the [UNK] token. What if some of these words are common in my dataset? Does it make sense to add new IDs for them? Is that possible without interfering with the network's parameters and the existing embeddings? If so, how is it done?
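For context on the [UNK] concern: BERT's tokenizer is WordPiece-based, so a word missing from the vocabulary is usually split into known subword pieces rather than replaced wholesale; [UNK] only appears when no subword covers a position. A minimal sketch of that greedy longest-match behaviour, using a tiny hypothetical vocabulary (not BERT's real one):

```python
# Toy greedy longest-match subword tokenizer, illustrating how a
# WordPiece-style vocabulary handles words it has never seen.
# The vocabulary below is hypothetical, not BERT's actual vocab.
VOCAB = {"[UNK]", "bio", "##mark", "##er", "gene", "##s", "immuno", "##therapy"}

def wordpiece(word, vocab=VOCAB):
    """Split `word` into known subwords; fall back to [UNK] if impossible."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:              # non-initial pieces carry the ## prefix
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1                   # shrink the candidate and retry
        if cur is None:
            return ["[UNK]"]           # no known subword covers this position
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece("biomarkers"))  # covered by known subwords
print(wordpiece("xyzzy"))       # nothing matches -> [UNK]
```

If subword splits still seem too lossy for frequent domain terms, the Hugging Face `transformers` library does let you extend the vocabulary with `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))`, which appends freshly initialized rows to the embedding matrix while leaving the existing embeddings and the rest of the network's weights untouched.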
Topic: bert, finetuning, word-embeddings, nlp
Category: Data Science