Self-Attention Summation and Loss of Information

In self-attention, the attention for a word is calculated as:

$$ A(q, K, V) = \sum_{i} \frac{\exp(q \cdot k^{i})}{\sum_{j} \exp(q \cdot k^{j})}\, v^{i} $$

My question is: why do we sum over the softmax-weighted value vectors? Doesn't this lose information about which other words in particular are important to the word under consideration?

In other words, how does this summed vector point to which words are relevant?

For example, consider two extreme scenarios: one where practically the entire output comes from attending to word $x^{t}$, and one where it comes from attending to word $x^{t+1}$. It is possible for $A(q, K, V)$ to have exactly the same values in both scenarios.
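To make the concern concrete, here is a minimal NumPy sketch (the value vectors and attention weights are invented purely for illustration, not taken from any real model): two very different attention patterns can indeed collapse to the same summed output when the value vectors happen to line up.

```python
import numpy as np

# Made-up value vectors, purely to illustrate the concern in the question.
V = np.array([[1.0, 0.0],    # value for word x^t
              [0.0, 1.0],    # value for word x^{t+1}
              [0.5, 0.5]])   # value for some other word

alpha_1 = np.array([0.0, 0.0, 1.0])   # all attention on the third word
alpha_2 = np.array([0.5, 0.5, 0.0])   # attention split over the first two

# Two very different attention patterns, identical weighted sums:
print(alpha_1 @ V)   # [0.5 0.5]
print(alpha_2 @ V)   # [0.5 0.5]
```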

Topic: transformer, attention-mechanism, information-theory, deep-learning

Category: Data Science


Here is a paragraph from Speech and Language Processing, Ch. 9:

to make effective use of these scores, we’ll normalize them with a softmax to create a vector of weights, $\alpha_{ij}$, that indicates the proportional relevance of each input to the input element $i$ that is the current focus of attention.

My intuition: it's the same reason we normalize between layers. It speeds up and stabilizes the learning process by keeping the output (the input to the next layer) within a consistent range.
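A quick sketch of what the softmax buys you (the raw scores below are arbitrary example values, just to show the normalization):

```python
import numpy as np

# Raw dot-product scores q . k^<j> (arbitrary made-up values).
scores = np.array([3.2, -1.0, 0.5])

# Softmax turns them into non-negative weights that sum to 1, so the
# attention output stays a bounded, convex combination of the value
# vectors no matter how large the raw scores get.
weights = np.exp(scores) / np.exp(scores).sum()

print(weights)        # ~[0.92 0.01 0.06]
print(weights.sum())  # 1.0
```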


I think the question assumes that $A(q, K, V)$ represents the similarity between $q$ and the words in the sentence. Actually, $A(q, K, V)$ encodes the relationships from $q$ to the words in the sentence.

The information about which other words in particular are important to the word under consideration has already been encoded by the dot products $q \cdot k^{<i>}$.


Suppose there is a sentence "i love sushi", where $v^{<i>}$ is the positionally-encoded word-embedding vector of each word, and the word under consideration is $q = v^{<1>}$, which corresponds to "love":

$v^{<0>} = \text{i},\quad v^{<1>} = \text{love},\quad v^{<2>} = \text{sushi}$

$scale(i)=\frac{\exp(q \cdot k^{<i>})}{\sum_{j} \exp(q \cdot k^{<j>})}$ quantifies how strong the connection is between "love" and $v^{<i>}$.

$\frac{\exp(q \cdot k^{<i>})}{\sum_{j} \exp(q \cdot k^{<j>})}\,v^{<i>}$, or $scale(i)\,v^{<i>}$, scales each vector $v^{<i>}$ in proportion to how strongly $v^{<i>}$ is related to "love".

Aggregating all the scaled vectors, $\sum_{i} scale(i)\,v^{<i>}$, creates a new vector $A$, which encodes the relationships from $q$ to the words in the sentence. The sketch below puts the whole walkthrough into code.
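A minimal NumPy sketch of the walkthrough (the embeddings are random stand-ins rather than trained vectors, and the keys are taken to be the embeddings themselves, matching the simplified $q = v^{<1>}$ setup above):

```python
import numpy as np

# Random stand-in embeddings for "i", "love", "sushi" (not trained vectors).
rng = np.random.default_rng(0)
v = rng.normal(size=(3, 4))              # v^<0>="i", v^<1>="love", v^<2>="sushi"

q = v[1]                                 # the word under consideration: "love"
scores = v @ q                           # q . k^<i> for each word
scale = np.exp(scores) / np.exp(scores).sum()   # scale(i): non-negative, sums to 1

# Scale each v^<i> by its relevance to "love", then aggregate into A.
A = (scale[:, None] * v).sum(axis=0)

print(scale)   # how strongly "i", "love", "sushi" connect to "love"
print(A)       # the attention output A(q, K, V) for "love"
```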


The vector $A$ then flows through the rest of the network, and back-propagation through the scaling and aggregation operations is how the Transformer learns the relations between $q$ and the sentence.
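As a rough illustration of that last point (a PyTorch autograd sketch with random stand-in tensors, not a full Transformer): the scaling and aggregation are ordinary differentiable operations, so the gradient of the loss flows back through $A$ into the scores $q \cdot k^{<i>}$ and into the vectors that produced them.

```python
import torch

# Random stand-in embeddings (hypothetical, for illustration only).
v = torch.randn(3, 4, requires_grad=True)
q = v[1]                                    # "love", as above

scores = v @ q                              # q . k^<i>
scale = torch.softmax(scores, dim=0)        # scale(i)
A = (scale.unsqueeze(1) * v).sum(dim=0)     # aggregate

loss = A.sum()                              # placeholder loss
loss.backward()
print(v.grad)   # non-zero: the learning signal reaches the embeddings
                # through the scale-and-aggregate step
```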
