Question answering bot: EM > F1, does it make sense?
I am fine-tuning a question answering bot, starting from a pre-trained model from the HuggingFace repository.
The dataset I am using for fine-tuning contains a lot of questions with empty (unanswerable) answers. After fine-tuning, when I evaluate the model on that same dataset, I find that the EM score is (much) higher than the F1 score. (I know I must not use the same dataset for training and evaluation; this was just a quick sanity check that everything runs.)
I assume this happens because every question with no gold answer counts as an exact match whenever the model predicts no answer. But as a non-expert in NLP I wonder: does this make sense? Is it theoretically possible, or am I missing something big?
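To make my guess concrete, here is a minimal sketch of how I imagine the two metrics could diverge on empty answers. This is my own simplified reproduction of SQuAD-style token-overlap scoring, not the actual evaluation script I am running, so the exact behavior of my setup may differ:

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    """EM is 1 if the strings match exactly, 0 otherwise.
    Two empty strings therefore count as a match."""
    return float(prediction.strip() == gold.strip())

def naive_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1. A naive implementation like this returns 0
    when both answers are empty, because there are no common tokens
    to count (the 0/0 case falls through to 0)."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_common = sum(common.values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Empty gold answer, empty prediction:
print(exact_match("", ""))  # 1.0
print(naive_f1("", ""))     # 0.0
```

If the F1 in my evaluation behaves like `naive_f1` above, then averaging over a dataset full of empty answers would push EM above F1, which matches what I observe. (I believe the official SQuAD v2 script special-cases empty answers so that F1 equals EM there, but I am not sure my evaluation does the same.)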
Topic huggingface question-answering f1score nlp
Category Data Science