Self-Attention Summation and Loss of Information
In self-attention, the attention for a word is calculated as:
$$ A(q, K, V) = \sum_{i} \frac{\exp(q \cdot k^{i})}{\sum_{j} \exp(q \cdot k^{j})}\, v^{i} $$
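For concreteness, here is a minimal NumPy sketch of that computation for a single query; the function and variable names are just illustrative, not from any particular library:

```python
import numpy as np

def attention(q, K, V):
    """Attention output for one query vector q.

    q: (d,)   query vector of the word under consideration
    K: (n, d) one key vector k^i per word in the sequence
    V: (n, d) one value vector v^i per word in the sequence
    """
    scores = K @ q                                    # q . k^i for every word i
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over those scores
    return weights @ V                                # sum_i weights[i] * v^i -> one d-dim vector
```

The output is a single $d$-dimensional vector, regardless of how many words contributed to it.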
My question is: why do we sum over the softmax-weighted value vectors? Doesn't this lose information about which other words in particular are important to the word under consideration?
In other words, how does this summed vector indicate which words are relevant?
For example, consider two extreme scenarios: one where practically the entire output depends on the vector of word $x^{t}$, and one where it depends on the vector of word $x^{t+1}$. It's possible that $A(q, K, V)$ has the exact same values in both scenarios.
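A small made-up numerical illustration of that point: if the value vectors of $x^{t}$ and $x^{t+1}$ happen to coincide, then attending almost entirely to one or the other produces an identical output vector.

```python
import numpy as np

V = np.array([[0.2, 0.7],    # v^t
              [0.2, 0.7],    # v^{t+1} happens to equal v^t
              [0.9, 0.1]])   # value vector of some other word

w_t  = np.array([0.98, 0.01, 0.01])  # attention concentrated on x^t
w_t1 = np.array([0.01, 0.98, 0.01])  # attention concentrated on x^{t+1}

print(w_t @ V)   # ~[0.207, 0.694]
print(w_t1 @ V)  # identical output, even though a different word was attended to
```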
Topic: transformer, attention-mechanism, information-theory, deep-learning
Category: Data Science