Why do BERT classification do worse with longer sequence length?

Question

Why do BERT classification do worse with longer sequence length?

Hooked

2022年3月11日 09:45

I've been experimenting using transformer networks like BERT for some simple classification tasks. My tasks are binary assignment, the datasets are relatively balanced, and the corpus are abstracts from PUBMED. The median number of tokens from pre-processing is about 350 but I'm finding a strange result as I vary the sequence length. While using too few tokens hampers BERT in a predictable way, BERT doesn't do better with more tokens. It looks like the optimal number of tokens is about 128 and consistently performs worse as I give it more of the abstract.

What could be causing this, and how can I investigate it further?

Topic bert transformer hyperparameter-tuning hyperparameter deep-learning

Category Data Science

Kasra Manshaei · Accepted Answer · 2022年3月11日 09:45

Some points to investigate

Same setting with same number of epochs may perform poorer on longer sequences. With longer sequences you are increasing the complexity of the data thus you may need to increase the complexity of model e.g. by letting model to train more or add more layers
If you are dealing with abstracts of papers, they are rich with keywords usually. You may have reached the capacity of your data by 128 words from an abstract which is already pretty covering the topic
In general make sure you are not applying a heavy pre-processing. In neural-based sequence models, models are able to cope with some pre-processing like stop-word removal and you better keep them in your sentence as this is how the pertained models like BERT were actually trained
In case you are destroying sentences i.e. you extract some textual features like keywords from the abstract and tokenise them they will (to some extent) work like "query". In this case there are studies that shows longer queries decrease BERT performance (See H4). More sequentially informative your input is, more you get use of sequence models. If sequence is destroyed then everything is basically reduced to a combination of word vectors thus nothing more than non-sequence methods (ranging from TF-IDF to Word Embeddings)

Why do BERT classification do worse with longer sequence length?

About