Measuring quality of answers from QnA systems
I have a question answering system built on a transformer (Seq2Seq-style) architecture. Given a question, it predicts a start position and an end position for the answer span, along with their logits.
The answer is formed by choosing the span with the best logits, and the final score is computed by summing the start and end logits.
Now the problem is that I get multiple candidate answers, and often the good answer sits in 2nd or 3rd place (after sorting on the sum of the start and end scores). Is there a metric from search-engine science that I can use to rank the best answers?
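For context, the span selection and ranking described above can be sketched like this (a minimal illustration with made-up logits, following the usual extractive-QA decoding; `best_spans` and `max_answer_len` are illustrative names, not from any particular library):

```python
def best_spans(start_logits, end_logits, top_k=3, max_answer_len=30):
    """Score every valid (start, end) span by start_logit + end_logit and
    return the top_k spans as (score, start, end), best first."""
    scored = []
    for s, sl in enumerate(start_logits):
        # Only consider spans that end at or after the start, within a length cap.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            scored.append((sl + end_logits[e], s, e))
    scored.sort(reverse=True)
    return scored[:top_k]

# Toy logits: position 2 is the strongest start, position 4 the strongest end.
start = [0.1, 0.2, 3.0, 0.5, 0.1]
end = [0.1, 0.3, 0.2, 1.0, 2.5]
spans = best_spans(start, end)  # best span is (2, 4) with score 3.0 + 2.5 = 5.5
```

The issue in the question is exactly that the top entry of `spans` is not always the best answer to a human reader.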
The following have been tried:
- cosine similarity between question words and answers - works in many cases but fails when the question's semantic meaning is complex
- TF-IDF - gives a good score but fails when the answer contains a synonym rather than an exact word match
- gensim semantic similarity - fails badly
- BLEU score and the newer BERTScore F1 were also tried
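To make the failure mode of the lexical metrics concrete, here is a minimal bag-of-words cosine similarity (standard library only; the example sentences are made up). It is purely lexical, so an answer phrased with synonyms scores zero against the question:

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between bag-of-words vectors of two strings.
    Purely lexical: a synonym-only answer scores 0, the failure mode above."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

question = "when was the bridge built"
answers = ["the bridge was built in 1932", "construction finished in 1932"]
# The second answer is just as correct, but shares no words with the question,
# so lexical cosine ranks it last.
ranked = sorted(answers, key=lambda a: -bow_cosine(question, a))
```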
There are a few terms I have heard of but doubt will work. Mean Reciprocal Rank, I think, measures search quality rather than answer quality, and it requires knowing the correct response (please correct me if I am wrong). PageRank is not valid in my case either, since in QnA the answer's semantic meaning matters more than document popularity.
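On the MRR point: it is indeed an evaluation metric over a labelled set, not a per-question re-ranker. For each question you need the correct answer, and MRR averages 1/rank of the first correct candidate. A minimal sketch (function name and toy data are illustrative):

```python
def mean_reciprocal_rank(ranked_lists, gold_answers):
    """MRR over a labelled eval set: for each question, take 1/rank of the
    first candidate matching the gold answer (0 if it never appears)."""
    total = 0.0
    for candidates, gold in zip(ranked_lists, gold_answers):
        rr = 0.0
        for rank, cand in enumerate(candidates, start=1):
            if cand == gold:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(gold_answers)

# Two questions: gold is ranked 1st for one and 2nd for the other,
# so MRR = (1/1 + 1/2) / 2 = 0.75.
mrr = mean_reciprocal_rank(
    [["a", "b", "c"], ["x", "y", "z"]],
    ["a", "y"],
)
```

This confirms the concern: MRR tells you how good the ranking is on average, but only once the correct answers are known.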
Kindly suggest other metrics that search engines generally use to rank answers.
Topic question-answering bert transformer search-engine
Category Data Science