Why does measuring the prediction change after removing the most contributing sentences/words help show that an explanation is "*faithful*"?

I don't understand how computing a score by removing the sentences whose words contribute most to the result helps show to what extent an explanation method is faithful to the model's reasoning process.

A faithfulness score was proposed by Du et al. in 2019 to verify the importance of the identified contributing sentences or words to a given model’s outputs. The assumption is that the predicted probability for the target class will drop significantly if the truly important inputs are removed. The score is calculated as:

$$S_{Faithfulness} = \frac{1}{N}\sum_{i=1}^{N}\left(y_{x^i} - y_{x^i_{a}}\right)$$

where $y_{x^i}$ is the predicted probability for the target class on the original input $x^i$, and $y_{x^i_{a}}$ is the predicted probability for the same class after the significant sentences/words have been removed from the input. This metric is available in AIX360.
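For concreteness, the formula is just the mean probability drop across examples. A minimal sketch (the function name and the probability values are hypothetical, not from AIX360):

```python
import numpy as np

def faithfulness_score(orig_probs, ablated_probs):
    """Mean drop in the target-class probability when the
    identified important sentences/words are removed:
    S = (1/N) * sum_i (y_{x^i} - y_{x^i_a})."""
    orig_probs = np.asarray(orig_probs, dtype=float)
    ablated_probs = np.asarray(ablated_probs, dtype=float)
    return float(np.mean(orig_probs - ablated_probs))

# Hypothetical probabilities for N = 3 examples:
orig = [0.92, 0.85, 0.78]      # y_{x^i}: full input
ablated = [0.40, 0.55, 0.30]   # y_{x^i_a}: important parts removed
print(faithfulness_score(orig, ablated))  # ~0.433
```

A large positive score means the removed parts really did drive the prediction; a score near zero means the model barely used them, so the explanation was not faithful.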

Yet, if faithfulness measures how well an interpretation method reflects the actual reasoning process of the model it is 'interpreting', I don't see why such an ablation-based score captures this, since faithfulness evaluation seems to rely more on examining attention weights.

Topic explainable-ai metric nlp

Category Data Science
