Transformer model comparison for binary sentiment classification

I am comparing XLNet and BERT on a binary sentiment classification task across two independent datasets: a Twitter dataset, where the texts are short, and the IMDB review dataset, where the texts are long.
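For context, here is a minimal sketch of how such a comparison might be set up with the Hugging Face `transformers` and `datasets` libraries. The model names, hyperparameters, and the choice of the IMDB dataset loader are illustrative assumptions, not my exact configuration; the Twitter dataset would be loaded and tokenized analogously.

```python
# Sketch of the comparison setup (assumptions: HF transformers/datasets,
# illustrative hyperparameters). Not my exact training configuration.
from transformers import (AutoTokenizer,
                          AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset


def fine_tune(model_name, dataset, max_length):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Binary sentiment: two output labels (negative/positive).
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=max_length)

    encoded = dataset.map(tokenize, batched=True)
    args = TrainingArguments(output_dir=f"out-{model_name}",
                             num_train_epochs=3,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args,
                      train_dataset=encoded["train"],
                      eval_dataset=encoded["test"])
    trainer.train()
    return trainer.evaluate()


# IMDB reviews are long, so max_length=512; for tweets a much
# smaller max_length (e.g. 64) would suffice.
imdb = load_dataset("imdb")
for name in ("bert-base-uncased", "xlnet-base-cased"):
    print(name, fine_tune(name, imdb, max_length=512))
```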

On the Twitter dataset, BERT matches and slightly outperforms XLNet, while XLNet outperforms BERT on the IMDB dataset. I understand that XLNet captures longer-range dependencies thanks to its Transformer-XL architecture, which would explain its advantage on the long IMDB reviews; but what additional reasons might explain why one model outperforms the other on a given dataset? In particular, why is BERT comparable to, or even better than, XLNet at classifying social media sentiment?

Topics: binary-classification, bert, sentiment-analysis, language-model, nlp

Category: Data Science
