Working behavior of BERT vs Transformers vs Self-Attention+LSTM vs Attention+LSTM on a scientific STEM data classification task?

So I just used pre-trained BERT with Focal Loss to classify Physics, Chemistry, Biology and Mathematics questions and got a good macro F1 of 0.91 (a rough sketch of the focal loss setup I mean is included after the questions below). That result is reasonable, since the model only had to pick up broad signals from tokens like triangle, reaction, mitochondria, newton, etc. Now I also want to classify the Chapter Name. This is a much harder task: when I trained BERT on the 208 chapter classes, my score was almost 0. Why? I can see that the data also contains lots of domain-specific information like nacl (sodium chloride), the Bohr model, 9.8 m/sec, etc., which I think BERT is not trained on. I want to ask a few questions.

  1. Is BERT useful in these conditions? Is it trained on scientific terms? I mean, can it relate the Schrödinger equation to Planck's constant in context? If not, I don't think I should use it, because I don't have enough data to re-train BERT from scratch. Is there anything other than BERT I could use?
  2. Can I use FastText or GloVe? Can they capture the meaning or context of these terms?
  3. Or should I simply train my own embeddings in PyTorch/Keras, keep tokens like nacl, fe, ppm as they are, and hope that a Transformer or attention layer will capture their meaning?

Please help. I have a dataset of 120K questions/data points.
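For reference, the focal loss mentioned above is the standard multi-class variant; a minimal PyTorch sketch (the gamma value and the optional class weights here are illustrative, not my exact settings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Multi-class focal loss: down-weights easy examples so training
    focuses on hard or rare classes. gamma/weight are illustrative."""
    def __init__(self, gamma=2.0, weight=None):
        super().__init__()
        self.gamma = gamma
        self.weight = weight  # optional per-class weights, shape [num_classes]

    def forward(self, logits, targets):
        # per-example cross entropy, kept unreduced
        log_probs = F.log_softmax(logits, dim=-1)
        ce = F.nll_loss(log_probs, targets, weight=self.weight, reduction="none")
        pt = torch.exp(-ce)  # probability assigned to the true class
        loss = ((1.0 - pt) ** self.gamma) * ce
        return loss.mean()

# usage with logits from a BERT classification head over the 4 subjects:
# criterion = FocalLoss(gamma=2.0)
# loss = criterion(logits, labels)
```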

Topic attention-mechanism lstm deep-learning nlp machine-learning

Category Data Science


A1. BERT by itself may not be useful for scientific terms. You have two choices: either find a pre-trained model/embedding specific to scientific text, or use transfer learning and build upon BERT.
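For the first option, SciBERT (a BERT variant pre-trained on scientific papers) is one example of such a checkpoint. A minimal sketch of loading it with Hugging Face transformers, assuming the publicly published AllenAI checkpoint name and using your 208 chapter labels:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# SciBERT is pre-trained on scientific text, so its vocabulary and contexts
# cover terms like "Schrödinger" or "NaCl" better than vanilla BERT.
model_name = "allenai/scibert_scivocab_uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=208,  # one label per chapter
)

# tokenization and inference work exactly as with bert-base-uncased
inputs = tokenizer("State and derive the Schrödinger equation.",
                   return_tensors="pt", truncation=True, padding=True)
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, 208)
```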

A2. FastText or GloVe will have the same issue as BERT: the off-the-shelf vectors are trained on general-purpose corpora, not scientific text, so domain-specific terms are either missing or poorly represented.
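You can check this directly for your own vocabulary: load a general-purpose GloVe model and see which of your scientific tokens are even in it, and what their neighbours look like. A sketch assuming the gensim package and its pre-packaged `glove-wiki-gigaword-100` vectors:

```python
import gensim.downloader as api

# general-purpose GloVe vectors (Wikipedia + Gigaword)
glove = api.load("glove-wiki-gigaword-100")

for token in ["nacl", "mitochondria", "ppm", "schrodinger", "planck"]:
    if token in glove.key_to_index:
        print(token, "-> in vocabulary, nearest:", glove.most_similar(token, topn=3))
    else:
        print(token, "-> out of vocabulary")
```

FastText can at least produce subword-based vectors for out-of-vocabulary tokens, but those vectors are not guaranteed to carry the scientific meaning you need.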

A3. You have mentioned that BERT already works well at the subject level, so I would not advise you to create embeddings from scratch. Instead, use transfer learning to adapt the existing embeddings to your data.
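A minimal fine-tuning sketch, assuming the Hugging Face `Trainer` API and a dataset already tokenized with your 208 chapter labels (`train_ds`, `val_ds` and the hyperparameters below are placeholders):

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

model_name = "bert-base-uncased"  # or a scientific checkpoint, as in A1
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=208)

# optionally freeze the lower encoder layers and fine-tune only the top ones,
# which helps when the labelled data per class is limited
for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

args = TrainingArguments(
    output_dir="chapter-clf",
    learning_rate=2e-5,              # small LR: adapt, don't overwrite, the pre-training
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder: your tokenized 120K questions
    eval_dataset=val_ds,     # placeholder: held-out split
)
trainer.train()
```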
