In transformers, why do the Value (V) vectors come from the encoder? And why is the attention output then added and normalized with the Query (Q) vectors?
In transformers, there is a residual connection step where the queries and the output of the attention are added and then normalized. Can someone please explain the motivation behind this? Or maybe I'm getting it wrong? It seems to me that the values shouldn't come from the encoder: the values are the vectors that we want to attend over. And if so, we should add and normalize the values from the previous state, not the queries... I'm confused.
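To make the question concrete, here is a minimal sketch of the decoder cross-attention sublayer as I understand it from "Attention Is All You Need" (PyTorch; `CrossAttentionBlock` is my own illustrative name, not a library class): Q comes from the decoder stream, K and V come from the encoder output, and the add & norm is applied to the query-side input.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Decoder cross-attention sublayer, post-norm style:
    Q from the decoder stream, K and V from the encoder output,
    residual add & norm applied to the query-side (decoder) stream."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, decoder_x: torch.Tensor, encoder_out: torch.Tensor) -> torch.Tensor:
        # Queries come from the decoder; keys and values from the encoder.
        attn_out, _ = self.attn(query=decoder_x, key=encoder_out, value=encoder_out)
        # Residual connection: add the query-side input, then layer-normalize.
        return self.norm(decoder_x + attn_out)

block = CrossAttentionBlock(d_model=64, n_heads=4)
dec = torch.randn(2, 10, 64)   # (batch, target_len, d_model)
enc = torch.randn(2, 20, 64)   # (batch, source_len, d_model)
out = block(dec, enc)          # (2, 10, 64): same shape as the decoder stream
```

Note that in this sketch the residual has to be `decoder_x + attn_out` rather than `encoder_out + attn_out`, since the encoder output has a different sequence length than the attention output. Is that shape constraint the whole motivation, or is there more to it?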
Topic transformer spatial-transformer nlp
Category Data Science