reliability of human-level evaluation of the interpretability quality of a model

Question


Akshay Prabhakant

March 27, 2021 10:04

Christoph Molnar, in his book Interpretable Machine Learning, writes that

Human level evaluation (simple task) is a simplified application level evaluation. The difference is that these experiments are not carried out with the domain experts, but with laypersons. This makes experiments cheaper (especially if the domain experts are radiologists) and it is easier to find more testers. An example would be to show a user different explanations and the user would choose the best one.

(Chapter: Interpretability, section: Approaches for Evaluating the Interpretability Quality.)

Why would anyone pick or trust a human-backed (non-expert) model over, say, a domain-expert-backed model, or even a functionally evaluated model (i.e. one whose accuracy, precision, recall, F1-score, etc. are considerably good)?

Topic methodology machine-learning

Category Data Science

WBM · Accepted Answer · March 27, 2021 09:12

This is specifically for interpretability of outcomes, i.e. a task where non-expert humans outperform machines.

Collecting labels is a known problem in machine learning: labelling datasets is very expensive and time-consuming (because of the size of the datasets and the cost of experts' time).

So it's less about trust and more about practicality. Consider hiring a data scientist to develop an algorithm that automatically labels a dataset based on expert heuristics (e.g. "label the data as cancerous if it looks red"). It might take 6 months to collect data, plan, develop and test. For certain use cases, hiring 10 non-experts and telling them the heuristic might therefore be cheaper and faster.

The book uses the example "show a user different explanations and the user would choose the best one." In the context of radiology, it could be something like: "Look at the images of the patient, compare them to this dictionary of images and diagnoses, combine multiple sources, and then report what the diagnosis is."

Of course, if you have an algorithm that outperforms non-experts, you might just want some expert labels to validate your algorithm, and forget the non-experts.
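As a rough illustration of that last check, here is a minimal sketch that scores both an algorithm and a majority vote of non-experts against a small set of expert labels. Everything in it is hypothetical: the labels, the three-votes-per-item setup, and the helper names `majority_vote` and `accuracy` are made up for the example and do not come from the book.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label among one item's non-expert votes."""
    return Counter(votes).most_common(1)[0][0]


def accuracy(predicted, reference):
    """Fraction of items where the predicted label matches the reference."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical toy data: expert labels on a small validation set.
expert_labels = ["cancerous", "benign", "benign", "cancerous", "benign"]

# Hypothetical output of the labelling algorithm on the same items.
algorithm_labels = ["cancerous", "benign", "cancerous", "cancerous", "benign"]

# Hypothetical votes from three non-experts per item.
non_expert_votes = [
    ["cancerous", "cancerous", "benign"],
    ["benign", "benign", "benign"],
    ["cancerous", "cancerous", "benign"],
    ["benign", "benign", "cancerous"],
    ["benign", "cancerous", "benign"],
]

# Collapse each item's votes into a single non-expert label.
non_expert_labels = [majority_vote(votes) for votes in non_expert_votes]

algorithm_accuracy = accuracy(algorithm_labels, expert_labels)
non_expert_accuracy = accuracy(non_expert_labels, expert_labels)

print(f"algorithm: {algorithm_accuracy:.2f}, non-experts: {non_expert_accuracy:.2f}")
```

If the algorithm scores clearly higher on the expert-labelled set, the non-experts add little. If not, they may still be the cheaper option. A real comparison would need far more than five items to mean anything.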
