Why does BERT classification do worse with longer sequence lengths?

I've been experimenting with transformer networks like BERT for some simple classification tasks. The tasks are binary classification, the datasets are relatively balanced, and the corpus consists of abstracts from PubMed. The median number of tokens after pre-processing is about 350, but I'm finding a strange result as I vary the sequence length. While using too few tokens hampers BERT in a predictable way, BERT doesn't do better with more tokens. The optimal number of tokens seems to be about 128, and performance consistently gets worse as I give the model more of the abstract.
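Schematically, the sweep I'm running looks roughly like the sketch below (simplified; written with HuggingFace transformers for illustration, and `texts`/`labels` are placeholders for the real abstracts and binary labels):

    # Simplified sketch of the sequence-length sweep (illustrative only).
    # `texts` and `labels` stand in for the PubMed abstracts and binary labels.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    texts = ["..."]   # list of abstract strings (placeholder)
    labels = [0]      # matching binary labels (placeholder)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def make_loader(max_length, batch_size=16):
        # Tokenise with a fixed cap; longer abstracts get truncated here.
        enc = tokenizer(texts, truncation=True, padding="max_length",
                        max_length=max_length, return_tensors="pt")
        ds = TensorDataset(enc["input_ids"], enc["attention_mask"],
                           torch.tensor(labels))
        return DataLoader(ds, batch_size=batch_size, shuffle=True)

    for max_length in (64, 128, 256, 512):
        model = AutoModelForSequenceClassification.from_pretrained(
            "bert-base-uncased", num_labels=2)
        optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
        loader = make_loader(max_length)
        model.train()
        for epoch in range(3):                  # same budget for every length
            for input_ids, attention_mask, y in loader:
                out = model(input_ids=input_ids,
                            attention_mask=attention_mask, labels=y)
                out.loss.backward()
                optimizer.step()
                optimizer.zero_grad()
        # ...evaluate on a held-out split and record the score per max_length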

What could be causing this, and how can I investigate it further?

Topic bert transformer hyperparameter-tuning hyperparameter deep-learning

Category Data Science


Some points to investigate

  • The same settings with the same number of epochs may perform worse on longer sequences. With longer sequences you increase the complexity of the data, so you may need to increase the capacity of the model or its training budget, e.g. by training for more epochs or adding more layers.
  • If you are dealing with paper abstracts, they are usually rich in keywords. The first 128 tokens of an abstract may already cover the topic well enough that you have reached the capacity of your data at that point (see the token-coverage check after this list).
  • In general, make sure you are not applying heavy pre-processing. Neural sequence models do not need steps like stop-word removal; you are better off keeping stop words in your sentences, since that is how pretrained models like BERT were actually trained.
  • If you are destroying sentence structure, i.e. extracting textual features such as keywords from the abstract and tokenising those, the input will (to some extent) behave like a "query". There are studies showing that longer queries decrease BERT performance (see H4). The more sequentially informative your input is, the more you gain from sequence models; if the sequence structure is destroyed, everything is essentially reduced to a combination of word vectors and thus no better than non-sequence methods (ranging from TF-IDF to word embeddings). A simple non-sequence baseline, sketched below, can serve as a sanity check.
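On the second point, a quick diagnostic is to measure how much of each abstract actually survives a given token cap. A minimal sketch, assuming a HuggingFace tokenizer and with `abstracts` as a placeholder for your corpus:

    # How much of each abstract survives a given token cap?
    import numpy as np
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    abstracts = ["..."]  # placeholder for your PubMed abstracts

    lengths = np.array([len(tokenizer.encode(a, add_special_tokens=True))
                        for a in abstracts])

    for cap in (64, 128, 256, 512):
        covered = np.minimum(lengths, cap) / lengths      # fraction of tokens kept
        truncated = (lengths > cap).mean()                # share of abstracts cut
        print(f"max_length={cap:4d}: {truncated:5.1%} truncated, "
              f"median coverage {np.median(covered):.0%}")

If the scores plateau around the cap where most of the topical content is already covered, that supports the "capacity of the data" explanation rather than a problem with the model.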

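On the last point, if the task is largely keyword-driven, a non-sequence baseline should come close to BERT regardless of sequence length. A rough sketch, assuming scikit-learn, with `abstracts`/`labels` again standing in for your data:

    # Non-sequence baseline: if this gets close to BERT's score, the task is
    # largely keyword-driven and longer sequences add little.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    abstracts = ["..."]  # placeholder for your PubMed abstracts
    labels = [0]         # matching binary labels (placeholder)

    baseline = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(baseline, abstracts, labels, cv=5, scoring="f1")
    print(f"TF-IDF + LogReg F1: {scores.mean():.3f} ± {scores.std():.3f}")

Comparing this curve-free baseline against your BERT scores at each max_length gives you a sense of how much the sequence information is actually contributing.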