Is it okay to fine-tune BERT with a large context for sequence classification?

I want to create a BERT model for sequence classification. The input to the model will be two sentences. However, I want to fine-tune the model on large-context data consisting of multiple sentences (where the number of tokens could exceed 512). Is it okay if the length of the training inputs and the length of the actual inputs are different?

Thanks



There is a limiting factor here, which is the positional embeddings.

In BERT, the positional embeddings are trained (not sinusoidal) and cover a maximum of 512 positions. To exceed that sequence length, you would need to extend the positional embedding table and train the extra entries during fine-tuning. This, however, would probably degrade performance. So it is technically possible, but probably not OK.
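For illustration, here is a minimal sketch of what extending the positional embedding table could look like with the Hugging Face transformers library; the target length of 1024 is an arbitrary example, and the exact internals (`position_ids`, buffered `token_type_ids`) vary between library versions:

```python
# Sketch: extend BERT's positional embedding table from 512 to 1024 positions.
# Assumes Hugging Face transformers; details may differ across versions.
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
new_max_len = 1024

old_emb = model.bert.embeddings.position_embeddings  # nn.Embedding(512, hidden)
hidden_size = old_emb.weight.size(1)

# Build a larger table and copy over the 512 trained rows.
new_emb = torch.nn.Embedding(new_max_len, hidden_size)
with torch.no_grad():
    new_emb.weight[: old_emb.num_embeddings] = old_emb.weight
    # Rows 512..1023 keep their random initialization and must be learned
    # during fine-tuning, which is where degradation can come from.

model.bert.embeddings.position_embeddings = new_emb
model.bert.embeddings.register_buffer(
    "position_ids", torch.arange(new_max_len).expand((1, -1)), persistent=False
)
model.config.max_position_embeddings = new_max_len
# Note: depending on the transformers version, you may also need to pass
# token_type_ids explicitly for inputs longer than 512 tokens.
```

Keep in mind that even with extended embeddings, self-attention memory grows quadratically with sequence length, so very long inputs get expensive quickly.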

One option would be to keep only the first (or the last) 512 tokens of each sequence as the input to BERT and check whether the resulting performance is good enough for your purposes.
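A small sketch of both truncation strategies, again assuming the Hugging Face tokenizer (`long_text` is just a placeholder for your multi-sentence input):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
long_text = "... your multi-sentence document ..."

# Keep the first 512 tokens: the tokenizer truncates for you.
first_512 = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")

# Keep the last 512 tokens: tokenize without truncation, then slice manually,
# re-attaching the [CLS] token that BERT expects at the start.
ids = tokenizer(long_text, add_special_tokens=True)["input_ids"]
last_512 = [tokenizer.cls_token_id] + ids[-511:]
```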

Alternatively, you could use a pre-trained long-context transformer such as Longformer.
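For example, a minimal sketch using the `allenai/longformer-base-4096` checkpoint, which accepts inputs of up to 4096 tokens (the number of labels and the placeholder text are assumptions):

```python
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)

long_text = "... your multi-sentence document ..."
inputs = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")
logits = model(**inputs).logits
```

Longformer's sparse attention keeps memory roughly linear in sequence length, which is what makes the 4096-token window practical.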
