Measuring precision and recall

I am trying to improve my chat app:

Using previous (pre-processed) chat interactions from my domain, I have built a tool that offers the user 5 possible utterances to a given chat context, for example:

Raw: "Hi John."

Context: hi [[USER_NAME]]
Utterances: [Hi, Hello, How are you, Hi there, Hello again]


Of course, the results are not always relevant, for example:

Raw: "Hi John. How are you? I am fine, are you in the office?"

Context: hi [[USER_NAME]] how are you i am fine are you in the office
Utterances: [Yes, No, Hi, Yes i am, How are you]

I am using Elasticsearch with the TF/IDF similarity model and an index structured like so:

{
  "_index": "engagements",
  "_type": "context",
  "_id": "48",
  "_score": 1,
  "_source": {
    "context": "hi [[USER_NAME]] how are you i am fine are you in the office",
    "utterance": "Yes I am"
  }
}

Problem: I know for sure that for the context "hi [[USER_NAME]] how are you i am fine are you in the office" the utterance "Yes I am" is relevant. However, "Yes" and "No" are relevant too, because they appeared in similar contexts.

I am trying to use this excellent video as a starting point.

Q: How can I measure precision and recall, if all I know (from my raw data) is just one true utterance?



Precision and recall are "hard" metrics: they measure whether the model's prediction is exactly the same as the target label.

Systems like yours can often use a more flexible metric such as the top-5 error rate. The model is considered to have generated the correct response if the target label is among its top 5 predictions, and the error rate is the fraction of contexts for which it is not.
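
As a rough illustration, a top-k check along these lines could be sketched in Python as below. The function name `top_k_accuracy`, the case and whitespace normalization, and the target "Hello" for the first context are assumptions made for this example and are not part of the question; only the suggested utterance lists and the target "Yes I am" come from it.

```python
def top_k_accuracy(predictions, targets, k=5):
    """Fraction of contexts whose true utterance is among the first k suggestions.

    predictions: list of ranked utterance lists, one list per context.
    targets: list with the single known true utterance for each context.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0

    def normalize(text):
        # Assumption: compare utterances case-insensitively and ignore
        # extra whitespace, so "Yes i am" matches "Yes I am".
        return " ".join(text.lower().split())

    hits = 0
    for ranked, target in zip(predictions, targets):
        top_k = {normalize(utterance) for utterance in ranked[:k]}
        if normalize(target) in top_k:
            hits += 1
    return hits / len(targets)


# Suggested utterances taken from the two contexts in the question.
predictions = [
    ["Hi", "Hello", "How are you", "Hi there", "Hello again"],
    ["Yes", "No", "Hi", "Yes i am", "How are you"],
]
# "Hello" is a hypothetical target; "Yes I am" is the one given in the question.
targets = ["Hello", "Yes I am"]

top5_accuracy = top_k_accuracy(predictions, targets, k=5)
top5_error = 1.0 - top5_accuracy
```

Because only one true utterance is known per context, this counts a context as a hit when that utterance shows up anywhere in the top k, and it does not penalize other plausible suggestions such as "Yes" or "No".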
