Would there be any reason to pretrain BERT on specific texts?
So the official BERT English model is trained on Wikipedia and BookCorpus (source).
Now, for example, let's say I want to use BERT for movie tag recommendation. Is there any reason for me to pretrain a new BERT model from scratch on a movie-related dataset?
Could my model become more accurate because it was trained on movie-related texts rather than general texts? Is there an example of such usage?
To be clear, the question is about the importance of the context (not the size) of the dataset.
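For concreteness, this is roughly what I have in mind: pretraining a randomly initialized BERT with the masked-language-modeling objective on movie-related text. The sketch below uses the Hugging Face transformers and datasets libraries; movie_corpus.txt, the hyperparameters, and the reuse of the existing bert-base-uncased tokenizer are placeholders/simplifications, not a definitive setup.

```python
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Reuse the existing WordPiece tokenizer for simplicity; a true from-scratch
# setup would also train a new vocabulary on the movie corpus.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Randomly initialized BERT, i.e. pretraining "from scratch" rather than
# continuing from the Wikipedia/BookCorpus checkpoint.
config = BertConfig()
model = BertForMaskedLM(config)

# Placeholder corpus: one movie-related document per line.
dataset = load_dataset("text", data_files={"train": "movie_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Mask 15% of tokens, as in the original BERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-movies-mlm",
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()
```

The resulting checkpoint would then be fine-tuned on the actual tag-recommendation task, the same way one would fine-tune the official bert-base-uncased model.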
Topic pretraining bert transfer-learning language-model
Category Data Science