Working behavior of BERT vs Transformer vs Self-Attention+LSTM vs Attention+LSTM on a scientific STEM data classification task?
So I just used a pre-trained BERT with Focal Loss to classify Physics, Chemistry, Biology and Mathematics questions and got a good macro F1 of 0.91. That is reasonable, given it only had to pick up tokens like `triangle`, `reaction`, `mitochondria`, `newton`, etc. in a broad way. Now I want to classify the Chapter Name as well. This is a harder task: when I trained BERT on the 208 chapter classes, my score was almost 0. Why? I can see there is a lot of domain-specific information, like `nacl` (sodium chloride), `bohr model`, `9.8 m/sec`, etc., which I think BERT was not trained on. I want to ask a few questions.
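For reference, the subject classifier was roughly the following setup (this is only a sketch of what I did; the model name, `gamma` value and labels below are placeholders, not my exact training script):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class FocalLoss(torch.nn.Module):
    """Focal loss: down-weights easy examples so harder/rarer classes get more gradient."""
    def __init__(self, gamma=2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce)                      # probability assigned to the true class
        return ((1.0 - pt) ** self.gamma * ce).mean()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4)           # Physics / Chemistry / Biology / Maths

batch = tokenizer(["Calculate the force on a 2 kg mass at 9.8 m/sec^2."],
                  return_tensors="pt", truncation=True, padding=True)
labels = torch.tensor([0])                       # 0 = Physics (example label only)

logits = model(**batch).logits
loss = FocalLoss(gamma=2.0)(logits, labels)
loss.backward()
```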
- Is BERT useful in these conditions? Is it trained on scientific terms? I mean, can it relate the context of `Schrödinger equation` to `Planck's constant`? If not, I don't think I should use it, because I don't have enough data to re-train BERT. If anything but BERT: can I use FastText or GloVe? Can they capture the meaning or context? (A quick sanity check for this is sketched after the list.)
- Or should I simply create my own embeddings in pytorch/keras, keep `nacl`, `fe`, `ppm` as they are, and hope that either a Transformer or an Attention+LSTM will capture it? (A minimal sketch of what I mean is also below.)
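Regarding FastText/GloVe, this is the kind of sanity check I have in mind: look at the nearest neighbours of a few scientific tokens in off-the-shelf vectors and see whether they carry any meaning at all. The gensim-data model names below are just examples of publicly available vectors, not a recommendation:

```python
import gensim.downloader as api

# Pretrained word vectors from the gensim-data catalogue (pick whichever you prefer).
wv = api.load("glove-wiki-gigaword-100")   # or "fasttext-wiki-news-subwords-300"

for term in ["mitochondria", "newton", "nacl", "ppm"]:
    if term in wv:
        print(term, "->", [w for w, _ in wv.most_similar(term, topn=5)])
    else:
        print(term, "-> out of vocabulary")
```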
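And this is roughly what I mean by "own embeddings plus Attention over an LSTM" for the 208 chapter classes. All sizes and names are placeholders; I only want to know whether this direction makes sense:

```python
import torch
import torch.nn as nn

class AttnLSTMClassifier(nn.Module):
    """Custom embeddings (tokens like nacl/fe/ppm kept as-is in the vocabulary),
    a BiLSTM encoder, and a simple learned attention pooling over time steps."""
    def __init__(self, vocab_size, emb_dim=300, hidden=256, num_classes=208):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)          # per-token attention scores
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):                     # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))         # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # (batch, seq_len, 1)
        context = (weights * h).sum(dim=1)            # attention-weighted sentence vector
        return self.out(context)                      # (batch, num_classes) chapter logits

model = AttnLSTMClassifier(vocab_size=50_000)
logits = model(torch.randint(1, 50_000, (2, 40)))     # two dummy questions, 40 tokens each
print(logits.shape)                                   # torch.Size([2, 208])
```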
Please help. I have about 120K questions/data points.
Topic attention-mechanism lstm deep-learning nlp machine-learning
Category Data Science