Why does BERT classification do worse with longer sequence lengths?
I've been experimenting with transformer networks like BERT on some simple classification tasks. The tasks are binary classification, the datasets are relatively balanced, and the corpus consists of abstracts from PubMed. The median abstract is about 350 tokens after pre-processing, but I'm seeing a strange result as I vary the maximum sequence length. Using too few tokens hampers BERT in a predictable way, yet BERT doesn't improve with more tokens: the optimal length appears to be around 128, and performance consistently degrades as I give the model more of each abstract.
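For reference, here is roughly how I vary the sequence length between runs; the checkpoint name, data, and hyperparameters below are placeholders, and the only thing that changes across experiments is max_length:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"  # placeholder; the actual checkpoint is a similar BERT model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Placeholder data; in practice these are PubMed abstracts with binary labels
abstracts = ["Example PubMed abstract text ...", "Another abstract ..."]
labels = torch.tensor([0, 1])

for max_len in (64, 128, 256, 512):
    # Truncate/pad every abstract to the same length and fine-tune as usual;
    # only the maximum sequence length differs between runs
    enc = tokenizer(
        abstracts,
        padding="max_length",
        truncation=True,
        max_length=max_len,
        return_tensors="pt",
    )
    out = model(**enc, labels=labels)
    print(max_len, out.loss.item())
```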
What could be causing this, and how can I investigate it further?
Topic bert transformer hyperparameter-tuning hyperparameter deep-learning
Category Data Science