Options to find the most similar question in a dataset of question-answer pairs?

I am building a chatbot that will only handle FAQs, but these FAQs are very specific to an organisation, so I cannot use any existing off-the-shelf solutions, or connect to question-answering APIs.

I have a dataset which consists of questions, intents, and answers.

Let's say there are 100 intents, which group questions into general categories (e.g. fee_payment). Each intent has 50 different specific answers (e.g. 'Fees are paid through the online portal' or 'Fees are due on the 1st of every month'), and each answer has 10 sample questions (e.g. 'How do I pay my fees?' or 'What is the procedure for payment of fees?').

Here is an example of the structure of the dataset:

| intent      | answer                                  | question                                   |
------------------------------------------------------------------------------------------------------
| fee_payment | Fees are paid through the online portal | How do I pay my fees?                      |
| fee_payment | Fees are paid through the online portal | What is the procedure for payment of fees? |
| fee_payment | Fees are paid through the online portal | .......................................... |
| fee_payment | Fees are paid through the online portal | intent1_answer1_question10                 |
| fee_payment | Fees are due on the 1st of every month  | intent1_answer2_question1                  |
| fee_payment | Fees are due on the 1st of every month  | .......................................... |
| fee_payment | Fees are due on the 1st of every month  | intent1_answer2_question10                 |
| fee_payment | ....................................... | .......................................... |
| fee_payment | intent1_answer50                        | intent1_answer50_question1                 |
| fee_payment | intent1_answer50                        | .......................................... |
| fee_payment | intent1_answer50                        | intent1_answer50_question10                |
| ........... | ....................................... | .......................................... |
| intent100   | intent100_answer1                       | intent100_answer1_question1                |
| intent100   | intent100_answer1                       | .......................................... |
| intent100   | intent100_answer1                       | intent100_answer1_question10               |
| intent100   | ....................................... | .......................................... |
| intent100   | intent100_answer50                      | intent100_answer50_question1               |
| intent100   | intent100_answer50                      | .......................................... |
| intent100   | intent100_answer50                      | intent100_answer50_question10              |
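The same hierarchy (intent, then answer, then sample questions) could be held in memory as a nested mapping. This is only an illustrative sketch: the name `group_by_intent`, the tuple layout, and the three sample rows are assumptions, with the placeholder question kept exactly as it appears in the table.

```python
from collections import defaultdict

# Hypothetical flat rows in (intent, answer, question) order, mirroring the table.
table_rows = [
    ("fee_payment", "Fees are paid through the online portal", "How do I pay my fees?"),
    ("fee_payment", "Fees are paid through the online portal", "What is the procedure for payment of fees?"),
    ("fee_payment", "Fees are due on the 1st of every month", "intent1_answer2_question1"),
]


def group_by_intent(rows):
    """Build {intent: {answer: [questions]}} from flat (intent, answer, question) rows."""
    grouped = defaultdict(lambda: defaultdict(list))
    for intent, answer, question in rows:
        grouped[intent][answer].append(question)
    return grouped


grouped = group_by_intent(table_rows)
```

With this layout, `grouped["fee_payment"]` gives every answer under that intent together with its sample questions, which is the candidate set that the similarity search has to rank.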

I have started building the chatbot using Rasa, and the NLU part is trained to accurately classify any arbitrarily worded question as one of the 100 intents. I now need to find the best method to take all the questions in the dataset under the predicted intent and pick the one most similar or relevant to the user's question.

What are the best options to achieve this? Would something as powerful as BERT be effective here? What are some simpler lookup/search-based options, or supervised/unsupervised ML/DL options, or something else?
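To make the lookup step concrete, one of the simplest search-based baselines is a bag-of-words cosine similarity restricted to the questions of the predicted intent. The following is a minimal sketch under stated assumptions, not a recommended final solution: the names `tokenize`, `cosine`, and `most_similar`, the dict-based row layout, and the sample query are all hypothetical, and the intent is assumed to come from the Rasa NLU step.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens; \\w+ is Unicode-aware in Python 3."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(user_question, rows, intent):
    """Return (score, row) for the closest stored question within one intent, or None."""
    query = Counter(tokenize(user_question))
    candidates = [r for r in rows if r["intent"] == intent]
    scored = [(cosine(query, Counter(tokenize(r["question"]))), r) for r in candidates]
    return max(scored, key=lambda pair: pair[0]) if scored else None


# Hypothetical rows; the last question is the placeholder from the dataset table.
rows = [
    {"intent": "fee_payment", "answer": "Fees are paid through the online portal",
     "question": "How do I pay my fees?"},
    {"intent": "fee_payment", "answer": "Fees are paid through the online portal",
     "question": "What is the procedure for payment of fees?"},
    {"intent": "fee_payment", "answer": "Fees are due on the 1st of every month",
     "question": "intent1_answer2_question1"},
]

# The intent would normally be predicted by the NLU model; it is hard-coded here.
score, best = most_similar("how can I pay fees", rows, "fee_payment")
print(best["answer"], round(score, 3))
```

A baseline like this only rewards shared words, so it will miss paraphrases with no vocabulary overlap; that gap is where TF-IDF weighting, the keyword column, or sentence embeddings would presumably help.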

Note: I am doing this in English initially, but eventually it will need to be language-agnostic (Arabic will be implemented next).

EDIT: The dataset also has a column of keywords for each question.

Topic question-answering knowledge-base similarity search machine-learning

Category Data Science
