How to interpret integrated gradients in an NLP toxic text classification use-case?

I am trying to understand how integrated gradients work in the NLP case.

Let $F: \mathbb{R}^{n} \rightarrow [0,1]$ be a function representing a neural network, $x \in \mathbb{R}^{n}$ an input, and $x' \in \mathbb{R}^{n}$ a reference (baseline). We consider the straight-line segment connecting $x'$ to $x$ and compute the gradient of $F$ at every point along it. The IG method integrates these gradients along the segment and scales the result by the difference $x - x'$. Thus, IG in the $i$-th dimension is given by the following formula:

$$ IG_{i}(x) = \left(x_{i} - x'_{i}\right) \int_{\alpha=0}^{1} \frac{\partial F\left(x' + \alpha\left(x - x'\right)\right)}{\partial x_{i}} \, d\alpha $$
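In practice the integral is approximated by a Riemann sum over a number of steps along the path. Here is a minimal NumPy sketch of that approximation; the `integrated_gradients` helper and the toy sigmoid model are hypothetical, purely for illustration. The final print also checks the completeness property, i.e. that the attributions sum to roughly $F(x) - F(x')$:

```python
import numpy as np

def integrated_gradients(grad_f, x, x_ref, steps=50):
    """Riemann-sum (midpoint) approximation of IG along the segment x' -> x.

    grad_f: function returning the gradient of F at a point (assumed helper).
    """
    alphas = (np.arange(steps) + 0.5) / steps        # midpoints in (0, 1)
    path = x_ref + alphas[:, None] * (x - x_ref)     # points on the segment
    grads = np.stack([grad_f(p) for p in path])      # gradient at each point
    return (x - x_ref) * grads.mean(axis=0)          # (x - x') * average gradient

# Toy model: F(x) = sigmoid(w . x), whose gradient is w * F(x) * (1 - F(x))
w = np.array([1.0, -2.0, 0.5])
F = lambda x: 1.0 / (1.0 + np.exp(-w @ x))
grad_F = lambda x: w * F(x) * (1.0 - F(x))

x = np.array([1.0, 1.0, 1.0])
x_ref = np.zeros(3)                                  # zero baseline
ig = integrated_gradients(grad_F, x, x_ref)
print(ig, ig.sum(), F(x) - F(x_ref))                 # sum(IG) ~= F(x) - F(x')
```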

The advantage that IG has over other existing attribution methods is that it satisfies two axioms: sensitivity (if the input and the baseline differ in exactly one feature and the network assigns them different predictions, that feature receives a non-zero attribution) and implementation invariance (two networks that compute the same function receive identical attributions, regardless of how they are implemented).

Now consider the NLP case, where $x$ may be (the embedding of) a text, $F$ a toxicity classifier, and $x'$ the reference. But what kind of reference should that be: a non-toxic text, or a toxic one?
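For what it's worth, the common convention (used in the original IG paper and in libraries such as Captum) seems to be a neutral, information-free baseline rather than a counter-example: for text, a sequence of the same length as the input whose positions are all the padding token (or all-zero embedding vectors). A minimal PyTorch sketch of that idea, where the token ids and `PAD_ID` are made up for illustration:

```python
import torch

PAD_ID = 0                                           # assumed padding-token id
input_ids = torch.tensor([[101, 2017, 2024, 2019, 10041, 102]])  # toy token ids
baseline_ids = torch.full_like(input_ids, PAD_ID)    # same shape, all PAD: "neutral" reference

emb = torch.nn.Embedding(30522, 16)                  # toy embedding table
x = emb(input_ids)                                   # embeddings of the actual text
x_ref = emb(baseline_ids)                            # embeddings of the neutral baseline
# IG is then computed between x and x_ref in embedding space, and the
# per-dimension attributions are summed per token to get one score per word.
```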

Tags: explainable-ai, gradient, gradient-descent, neural-network, nlp
