Why does measuring the score drop after removing the sentences with the most contributing words help show that a model's explanation is "*faithful*"?
I don't understand how computing the score after removing the sentences whose words contribute most to the result helps show to what extent an explanation is faithful to the model's reasoning process.
Indeed, a faithfulness score was proposed by Du et al. in 2019 to verify the importance of the identified contributing sentences or words to a given model's outputs. The assumption is that the predicted probability for the target class will drop significantly if the truly important inputs are removed. The score is calculated as:
$$S_{Faithfulness} = \frac{1}{N}\sum_{i=1}^{N}\left(y_{x^i} - y_{x^i_a}\right)$$
where $y_{x^i}$ is the predicted probability for a given target class with the original inputs and $y_{x^i_a}$ is the predicted probability for the target class when the important sentences/words are removed. This metric is available in AIX360.
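To make the metric concrete, here is a minimal sketch of the deletion-based score as described above. The toy model, the `faithfulness_score` helper, and the hand-picked "important" indices are all illustrative assumptions, not the AIX360 API:

```python
import numpy as np

def faithfulness_score(predict_proba, inputs, important_idx, target_class):
    """Mean drop in target-class probability when the most important
    tokens are removed, per the averaged-difference formula above.
    (Hypothetical helper for illustration, not the AIX360 interface.)"""
    drops = []
    for tokens, idx in zip(inputs, important_idx):
        y_full = predict_proba(tokens)[target_class]
        ablated = [t for j, t in enumerate(tokens) if j not in idx]
        y_ablated = predict_proba(ablated)[target_class]
        drops.append(y_full - y_ablated)
    return float(np.mean(drops))

# Toy "sentiment" model: class-1 probability grows with the count of
# positive words -- purely illustrative, stands in for a real classifier.
POSITIVE = {"great", "good", "excellent"}

def toy_predict_proba(tokens):
    k = sum(t in POSITIVE for t in tokens)
    p1 = k / (len(tokens) + 1) if tokens else 0.0
    return [1.0 - p1, p1]

docs = [["great", "movie", "truly", "excellent"],
        ["good", "plot", "but", "slow"]]
# Token indices that an explainer flagged as most important per document.
top_idx = [{0, 3}, {0}]

score = faithfulness_score(toy_predict_proba, docs, top_idx, target_class=1)
```

A high score means that deleting the words the explainer flagged causes a large probability drop, i.e. the model really was relying on them.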
Yet, if faithfulness measures how well an interpretation method relates to the actual reasoning process used by the model it is 'interpreting', I don't understand why it should be measured by such a deletion-based method, which seems to rely more on examining attention weights.
Topic explainable-ai metric nlp
Category Data Science