In transformers, why do the Value (V) vectors come from the encoder? And why is the attention output then added and normalized with the Query (Q) vectors?
In transformers, there is a residual connection step where the queries and the output of the attention are added and then normalized. Can someone please explain the motivation behind this? Or maybe I'm getting it wrong? It seems to me that the values shouldn't come from the encoder: the values are the vectors that we want to attend over. And if so, we should add and normalize the values from the previous state, not the queries... I'm confused.
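To make the question concrete, here is a minimal sketch of the decoder cross-attention sublayer as I understand it from "Attention Is All You Need" (PyTorch; `CrossAttentionBlock` is my own illustrative name, not a library class): Q comes from the decoder stream, K and V come from the encoder output, and the add & norm is applied to the query-side input.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Decoder cross-attention sublayer, post-norm style:
    Q from the decoder stream, K and V from the encoder output,
    residual add & norm applied to the query-side (decoder) stream."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, decoder_x: torch.Tensor, encoder_out: torch.Tensor) -> torch.Tensor:
        # Queries come from the decoder; keys and values from the encoder.
        attn_out, _ = self.attn(query=decoder_x, key=encoder_out, value=encoder_out)
        # Residual connection: add the query-side input, then layer-normalize.
        return self.norm(decoder_x + attn_out)

block = CrossAttentionBlock(d_model=64, n_heads=4)
dec = torch.randn(2, 10, 64)   # (batch, target_len, d_model)
enc = torch.randn(2, 20, 64)   # (batch, source_len, d_model)
out = block(dec, enc)          # (2, 10, 64): same shape as the decoder stream
```

Note that in this sketch the residual has to be `decoder_x + attn_out` rather than `encoder_out + attn_out`, since the encoder output has a different sequence length than the attention output. Is that shape constraint the whole motivation, or is there more to it?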
Topic transformer spatial-transformer nlp
Category Data Science