Reliability of human-level evaluation of a model's interpretability quality
Christoph Molnar, in his book Interpretable Machine Learning, writes:
Human level evaluation (simple task) is a simplified application level evaluation. The difference is that these experiments are not carried out with the domain experts, but with laypersons. This makes experiments cheaper (especially if the domain experts are radiologists) and it is easier to find more testers. An example would be to show a user different explanations and the user would choose the best one.
(Chapter: Interpretability, Section: Approaches for Evaluating the Interpretability Quality).
Why would anyone pick or trust a model validated by laypersons (human-level evaluation) over, say, a model validated by domain experts, or even a model that has simply been functionally evaluated (i.e., its accuracy, precision, recall, F1 score, etc. are considerably good)?
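For concreteness, here is a minimal sketch of what I mean by a "functionally evaluated" model: one judged purely on held-out predictive metrics, with no human (expert or layperson) in the loop. The dataset and classifier below are only illustrative, using scikit-learn.

    # Illustrative only: judge a model on held-out predictive metrics,
    # with no human involvement in the evaluation.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print("accuracy :", accuracy_score(y_test, y_pred))
    print("precision:", precision_score(y_test, y_pred))
    print("recall   :", recall_score(y_test, y_pred))
    print("f1       :", f1_score(y_test, y_pred))

If such metrics are already good, it is not obvious to me what a layperson study adds compared to an expert study or to the metrics themselves.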
Topic methodology machine-learning
Category Data Science