Understanding the Transformer's self-attention calculation
I was going through this article: https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/?utm_source=blog&utm_medium=demystifying-bert-groundbreaking-nlp-framework#comment-160771
What exactly are the Key and Value vectors in the self-attention calculation of the Transformer model?
And is the Query vector simply the embedding vector of the word being queried?
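To make the question concrete, here is a minimal NumPy sketch of scaled dot-product self-attention as I currently understand it from the article. The toy dimensions and the projection matrices `W_q`, `W_k`, `W_v` are placeholders I made up for illustration, not values from the article:

```python
import numpy as np

# Toy setup: 3 tokens, embedding size 4, head size 4 (all sizes illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))    # token embeddings, one row per word
W_q = rng.normal(size=(4, 4))  # learned projection producing Queries
W_k = rng.normal(size=(4, 4))  # learned projection producing Keys
W_v = rng.normal(size=(4, 4))  # learned projection producing Values

Q = X @ W_q  # Queries: what each token is looking for
K = X @ W_k  # Keys: what each token offers to be matched against
V = X @ W_v  # Values: the content that actually gets mixed together

# Scaled dot-product scores, then a row-wise softmax
scores = Q @ K.T / np.sqrt(K.shape[-1])
exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)

output = weights @ V  # each row is a weighted sum of Value vectors
print(output.shape)   # (3, 4): one contextualized vector per token
```

In this picture, the Query, Key, and Value are not the raw embeddings themselves but three separate learned projections of them. Is that the correct reading?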
Also, is the attention calculated in an RNN encoder-decoder different from the self-attention in the Transformer?
Topic: transformer, attention-mechanism
Category: Data Science