Question answering bot: EM>F1, does it make sense?

I am fine-tuning a Question Answering bot starting from a pre-trained model from the HuggingFace repository.

The dataset I am using for fine-tuning has a lot of empty answers. After fine-tuning, when I evaluate on the same dataset with the newly fine-tuned model, I find that the EM score is (much) higher than the F1 score. (I know I must not use the same dataset for training and evaluation; it was just a quick test to see that everything runs.)

I assume this happens because every question with no real answer counts as a match when the model does not predict an answer, but as a non-expert in NLP I wonder whether this makes sense: is it theoretically possible, or am I missing something big?

Topic: huggingface, question-answering, f1score, nlp

Category: Data Science


Metrics for Q&A

  • F1 score: captures precision and recall over the individual words of the predicted answer versus the words of the true answer.
  • EM score (exact match): the number of answers that are exactly correct (with the same start and end index). EM is 1 for an example when the model's prediction matches the true answer exactly, character for character, and 0 otherwise.

The above scores are computed on individual Q&A pairs. When multiple correct answers are possible for a given question, the maximum score over all possible correct answers is computed. Overall EM and F1 scores are computed for a model by averaging over the individual example scores.
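As a minimal sketch, assuming SQuAD-style scoring with simplified normalization (the official evaluation script also strips punctuation and articles), the two metrics for a single question–answer pair could be computed like this:

```python
# Simplified SQuAD-style EM and F1 for one prediction/gold pair.
from collections import Counter

def exact_match(prediction: str, gold: str) -> int:
    # 1 only if the normalized strings are identical (empty == empty counts as a match)
    return int(prediction.strip().lower() == gold.strip().lower())

def f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.strip().lower().split()
    gold_tokens = gold.strip().lower().split()
    # SQuAD convention: if either answer is empty, F1 is 1 only if both are empty
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("in 1905", "1905"), f1("in 1905", "1905"))  # 0, 0.666...
print(exact_match("", ""), f1("", ""))                        # 1, 1.0
```

Note that under this convention a per-example EM of 1 forces F1 to be 1 as well, so the averaged EM should never end up above the averaged F1; if yours does, it is worth checking how empty answers are scored in your evaluation code.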

Understanding the basics often answers the question we are asking. That being said, you mentioned two things:

  1. The dataset has a lot of empty answers.
  2. You used the same dataset for both training and evaluation [so the real performance of the model is yet to be estimated on a separate dataset].

Since no information about the dataset nor any sample code was provided from your end, it is up to you to find out why the EM score is (much) higher than the F1 score, by ruling out your assumptions one at a time:

  • Check with a dataset that has actual answers, but still use the same dataset for training and evaluation, to confirm whether the EM issue is caused by the dataset having a lot of empty answers (see the sketch after this list for one way to filter them out).
  • Check with separate datasets for training and evaluation, while keeping the questions with empty answers. Although I agree this is probably not the reason, what better way to rule it out than by confirming it.

Analyse the EM and F1 scores for both scenarios to completely rule out the assumptions.
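If the data follows the SQuAD v2 convention (an empty `answers["text"]` list for unanswerable questions), a minimal sketch of both checks with the HuggingFace `datasets` library might look like this; the dataset name here is just an illustrative placeholder:

```python
from datasets import load_dataset

# Placeholder dataset: substitute whatever dataset you are fine-tuning on
dataset = load_dataset("squad_v2")

def has_answer(example):
    # SQuAD-v2-style records keep the gold answers in answers["text"];
    # unanswerable questions have an empty list there
    return len(example["answers"]["text"]) > 0

# First check: drop the unanswerable questions entirely
answerable_only = dataset["train"].filter(has_answer)
print(len(dataset["train"]), "->", len(answerable_only))

# Second check: keep the empty answers but evaluate on a held-out split
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = split["train"], split["test"]
```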


EM (exact match) and F1 scores are typically calculated at different levels: EM at the character level (the whole predicted string must match exactly), and F1 at the individual word level.

Almost always, EM will be lower than F1. There is a good chance something is incorrect in the code.

You should confirm your assumption by calculating the EM and F1 scores separately for empty answers and non-empty answers.
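A minimal sketch of that breakdown, assuming you have parallel lists of predicted and gold answer strings (the lists below are toy placeholders), could look like this:

```python
from collections import Counter
from statistics import mean

def em(pred, gold):
    return float(pred.strip().lower() == gold.strip().lower())

def f1(pred, gold):
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:                      # empty answer on either side
        return float(p == g)
    same = sum((Counter(p) & Counter(g)).values())
    if same == 0:
        return 0.0
    prec, rec = same / len(p), same / len(g)
    return 2 * prec * rec / (prec + rec)

# Toy placeholders; plug in your model's predictions and the gold answers
predictions = ["1905", "", "Paris", ""]
golds       = ["in 1905", "", "Paris", "London"]

empty     = [(p, g) for p, g in zip(predictions, golds) if not g.strip()]
non_empty = [(p, g) for p, g in zip(predictions, golds) if g.strip()]

for name, pairs in [("empty gold answers", empty), ("non-empty gold answers", non_empty)]:
    print(name,
          "EM:", mean(em(p, g) for p, g in pairs),
          "F1:", mean(f1(p, g) for p, g in pairs))
```

If the empty-answer group shows EM of 1 but F1 of 0 (or similar), the discrepancy is in how the evaluation code handles empty predictions and references rather than in the model itself.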
