Measuring precision and recall
I am trying to improve my chat app.
Using previous (pre-processed) chat interactions from my domain, I have built a tool that offers the user 5 possible utterances for a given chat context, for example:
Raw: "Hi John."
Context: hi [[USER_NAME]]
Utterances: [Hi, Hello, How are you, Hi there, Hello again]
Of course, the results are not always relevant. For example:
Raw: "Hi John. How are you? I am fine, are you in the office?"
Context: hi [[USER_NAME]] how are you i am fine are you in the office
Utterances: [Yes, No, Hi, Yes i am, How are you]
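For illustration, the Raw to Context step in the two examples above could be sketched roughly as follows. This is only a guess at the pre-processing: the function name `to_context` and the `user_names` set are hypothetical, and the actual pipeline may normalise differently.

```python
import re

PLACEHOLDER = "[[USER_NAME]]"


def to_context(raw, user_names):
    """Lowercase the text, drop punctuation, and mask known user names.

    `user_names` is an assumed, externally supplied set of names to mask.
    """
    known = {name.lower() for name in user_names}
    # Keep only word-like tokens; this discards punctuation such as "." and "?".
    tokens = re.findall(r"[A-Za-z0-9']+", raw)
    normalised = [
        PLACEHOLDER if token.lower() in known else token.lower()
        for token in tokens
    ]
    return " ".join(normalised)


context = to_context(
    "Hi John. How are you? I am fine, are you in the office?", {"John"}
)
```

Here `context` comes out in the same shape as the Context lines shown in the examples.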
I am using Elasticsearch with the TF/IDF similarity model and an index whose documents are structured like this:
{
  "_index": "engagements",
  "_type": "context",
  "_id": "48",
  "_score": 1,
  "_source": {
    "context": "hi [[USER_NAME]] how are you i am fine are you in the office",
    "utterance": "Yes I am"
  }
}
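The post does not show the query itself, so the following is only a sketch of how the 5 suggestions might be retrieved from such an index. It assumes a plain `match` query on the `context` field and the standard search response layout (`hits.hits[]._source`); the helper names `build_suggestion_query` and `extract_utterances` are hypothetical, and no Elasticsearch client call is made here.

```python
def build_suggestion_query(context, k=5):
    """Build a search body that scores indexed contexts against `context`.

    Assumption: a simple full-text `match` on the `context` field.
    """
    return {"size": k, "query": {"match": {"context": context}}}


def extract_utterances(response):
    """Pull the stored utterance out of each hit, in ranked order."""
    return [hit["_source"]["utterance"] for hit in response["hits"]["hits"]]


body = build_suggestion_query(
    "hi [[USER_NAME]] how are you i am fine are you in the office"
)

# A hand-written response in the shape of the document shown above.
sample_response = {
    "hits": {
        "hits": [
            {
                "_index": "engagements",
                "_id": "48",
                "_score": 1,
                "_source": {
                    "context": "hi [[USER_NAME]] how are you i am fine are you in the office",
                    "utterance": "Yes I am",
                },
            }
        ]
    }
}
suggestions = extract_utterances(sample_response)
```

The body would be sent to the `engagements` index with whichever client is in use; the ranked `utterance` values are then what gets shown to the user.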
Problem: I know for sure that for the context "hi [[USER_NAME]] how are you i am fine are you in the office" the utterance "Yes I am" is relevant. However, "Yes" and "No" are relevant too, because they appeared in similar contexts.
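To make the problem concrete, here is a minimal sketch of precision and recall over the top 5 suggestions, using the standard set-based definitions. The function name `precision_recall_at_k` and the case-insensitive matching are assumptions for illustration, not part of the existing tool.

```python
def precision_recall_at_k(suggested, relevant, k=5):
    """Set-based precision@k and recall@k with case-insensitive matching."""
    top_k = [s.strip().lower() for s in suggested[:k]]
    relevant_set = {r.strip().lower() for r in relevant}
    # Count each relevant utterance at most once, even if suggested twice.
    hits = len(set(top_k) & relevant_set)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


suggested = ["Yes", "No", "Hi", "Yes i am", "How are you"]

# Only the single utterance recorded in the raw data counts as relevant.
p_single, r_single = precision_recall_at_k(suggested, ["Yes I am"])

# "Yes" and "No" are also treated as relevant (labels I do not actually have).
p_full, r_full = precision_recall_at_k(suggested, ["Yes I am", "Yes", "No"])
```

With a single known utterance, at most one of the 5 suggestions can ever count as a hit, so precision is capped at 1/5 and "Yes" and "No" are scored as errors. With the fuller label set, the same suggestions score noticeably higher.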
I am trying to use this excellent video as a starting point.
Q: How can I measure precision and recall if all I know (from my raw data) is just one true utterance for each context?
Topic chatbot classification nlp data-mining
Category Data Science