Measuring quality of answers from QnA systems
I have a question answering system built on a transformer (Seq2Seq-style) architecture. Given a question, it predicts a start position and an end position for the answer span, along with their logits.
The answer is formed by choosing the span with the best logits, and the final score is computed by summing the start and end logits.
Now the problem is that I get multiple candidate answers, and often the good answer sits in 2nd or 3rd place (after sorting on the sum of the start and end scores). Is there a metric from search-engine science that I can use to rank the best answers?
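For context, the span selection and ranking described above can be sketched like this (a minimal illustration with made-up logits, following the usual extractive-QA decoding; `best_spans` and `max_answer_len` are illustrative names, not from any particular library):

```python
def best_spans(start_logits, end_logits, top_k=3, max_answer_len=30):
    """Score every valid (start, end) span by start_logit + end_logit and
    return the top_k spans as (score, start, end), best first."""
    scored = []
    for s, sl in enumerate(start_logits):
        # Only consider spans that end at or after the start, within a length cap.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            scored.append((sl + end_logits[e], s, e))
    scored.sort(reverse=True)
    return scored[:top_k]

# Toy logits: position 2 is the strongest start, position 4 the strongest end.
start = [0.1, 0.2, 3.0, 0.5, 0.1]
end = [0.1, 0.3, 0.2, 1.0, 2.5]
spans = best_spans(start, end)  # best span is (2, 4) with score 3.0 + 2.5 = 5.5
```

The issue in the question is exactly that the top entry of `spans` is not always the best answer to a human reader.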
The following have been tried:
- cosine similarity between question words and answers - works in many cases but fails when the question's semantic meaning is complex
- TF-IDF - gives a good score but fails when the answer contains a synonym rather than an exact word match
- gensim semantic similarity - fails badly
- BLEU score and the newer BERTScore F1 were also tried
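To make the failure mode of the lexical metrics concrete, here is a minimal bag-of-words cosine similarity (standard library only; the example sentences are made up). It is purely lexical, so an answer phrased with synonyms scores zero against the question:

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between bag-of-words vectors of two strings.
    Purely lexical: a synonym-only answer scores 0, the failure mode above."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

question = "when was the bridge built"
answers = ["the bridge was built in 1932", "construction finished in 1932"]
# The second answer is just as correct, but shares no words with the question,
# so lexical cosine ranks it last.
ranked = sorted(answers, key=lambda a: -bow_cosine(question, a))
```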
There are a few terms I have heard of but doubt will work. Mean Reciprocal Rank, I think, measures search quality rather than answer quality, and it requires knowing the correct response (please correct me if I am wrong). PageRank is not valid in my case either, since in QnA the answer's semantic meaning matters more than document popularity.
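On the MRR point: it is indeed an evaluation metric over a labelled set, not a per-question re-ranker. For each question you need the correct answer, and MRR averages 1/rank of the first correct candidate. A minimal sketch (function name and toy data are illustrative):

```python
def mean_reciprocal_rank(ranked_lists, gold_answers):
    """MRR over a labelled eval set: for each question, take 1/rank of the
    first candidate matching the gold answer (0 if it never appears)."""
    total = 0.0
    for candidates, gold in zip(ranked_lists, gold_answers):
        rr = 0.0
        for rank, cand in enumerate(candidates, start=1):
            if cand == gold:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(gold_answers)

# Two questions: gold is ranked 1st for one and 2nd for the other,
# so MRR = (1/1 + 1/2) / 2 = 0.75.
mrr = mean_reciprocal_rank(
    [["a", "b", "c"], ["x", "y", "z"]],
    ["a", "y"],
)
```

This confirms the concern: MRR tells you how good the ranking is on average, but only once the correct answers are known.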
Kindly suggest other metrics that search engines generally use to rank answers.
Topic question-answering bert transformer search-engine
Category Data Science