Why is 10000 used as the denominator in Positional Encodings in the Transformer Model?

I was working through the Attention Is All You Need paper, and while the motivation for positional encodings makes sense, and the other Stack Exchange answers filled me in on the motivation for their structure, I still don't understand why $1/10000$ was used as the scaling factor for the $pos$ of a word. Why was this number chosen?

Topic: transformer, word-embeddings, machine-learning

Category: Data Science


This is my understanding, feel free to correct me; I find it helpful to look at how the base n (the 10000 in the original formula) visually affects the positional encoding matrix.

Here is the same positional encoding matrix (sequence length = 100, dimension = 512) for different values of n:

[Heatmaps of the positional encoding matrix for n = 10000, n = 100, n = 20, and n = 1]
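For reference, here is a minimal NumPy sketch of how such a matrix can be built, with the base n exposed as a parameter (the function name positional_encoding is my own; n = 10000 reproduces the original Transformer formula):

```python
import numpy as np

def positional_encoding(seq_len, d_model, n=10000.0):
    """Sinusoidal positional encoding with a configurable base n."""
    pos = np.arange(seq_len)[:, None]        # positions: (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # dimension-pair index: (1, d_model/2)
    angles = pos / n ** (2 * i / d_model)    # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions get cosine
    return pe

# The four matrices shown above
for n in (10000, 100, 20, 1):
    print(n, positional_encoding(100, 512, n=n).shape)
```

Each row of the returned matrix is the encoding of one position; plotting it as a heatmap (e.g. with matplotlib's imshow) gives images like the ones above.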

If you think about the purpose of positional encoding, namely providing a unique vector representation of each token's position, then each row represents a position, and the further apart two positions are, the greater the distance between their vectors should be.

Now, look what happens when you compute the cosine distance between the first vector and each remaining one:

[Plot of the cosine distance between the first position vector and each remaining one; orange: n = 10000, blue: n = 20, green: n = 1]
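Here is a sketch of that computation (restating the positional_encoding construction from above so the snippet runs on its own); it reproduces the three curves:

```python
import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(seq_len, d_model, n=10000.0):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / n ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def cosine_distance_to_first(pe):
    """Cosine distance between row 0 and every row of the encoding matrix."""
    first = pe[0]
    sims = pe @ first / (np.linalg.norm(pe, axis=1) * np.linalg.norm(first))
    return 1.0 - sims

for n, color in [(10000, "orange"), (20, "blue"), (1, "green")]:
    dist = cosine_distance_to_first(positional_encoding(100, 512, n=n))
    plt.plot(dist, color=color, label=f"n = {n}")

plt.xlabel("position")
plt.ylabel("cosine distance to position 0")
plt.legend()
plt.show()
```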

With large n, the distance is a monotonically increasing function of the position (in terms of index in the sequence), which makes sense since we are talking about token positions. But that is not the case for smaller values of n.

Now, I don't know whether there is any rule for choosing an appropriate n depending on the other parameters.


Amirhossein's blog post explains the intuition for positional encoding very well.

My takeaway from the blog is this: consider just a pair of sinusoids (a sine and a cosine). As long as we stay within one full cycle (e.g. 0 to 2π), the resulting encoding is essentially guaranteed to be unique, i.e. there is a one-to-one mapping from real numbers x (1, 1.5, 2, 2.34, etc.) to the pair (sin(x), cos(x)). Each position is thus encoded as this 2-tuple vector.
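To make that one-to-one mapping concrete, here is a tiny check (my own illustration, not from the blog): within a single cycle, x can be recovered from the pair via atan2, so no two values in that range share an encoding.

```python
import math

# Within one cycle [0, 2π), x can be recovered from (sin x, cos x),
# so the 2-tuple encoding is unique on that interval.
for x in (1.0, 1.5, 2.0, 2.34):
    s, c = math.sin(x), math.cos(x)
    recovered = math.atan2(s, c) % (2 * math.pi)
    print(x, recovered)   # each line prints the same value twice
```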

Therefore, the purpose of the 10000 is probably just to make sure the full cycle is extremely large.

If you plot sin(x/10000), you will see that completing one full cycle requires x ≈ 2π × 10000 ≈ 62,800. That is more than enough to encode typical input sequences, which usually contain fewer than 1,000 words.
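A quick arithmetic check of that cycle length (nothing here is from the paper, just the period of the lowest-frequency sinusoid):

```python
import math

period = 2 * math.pi * 10000      # full cycle of sin(x / 10000)
print(period)                     # ≈ 62831.85

# A 1000-token sequence covers only a small fraction of that cycle:
print(1000 / period)              # ≈ 0.016, i.e. about 1.6 % of one cycle
```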

Note that according to the original equations, the first pair of sinusoids is just sin(pos) and cos(pos). In this case the positional encoding is arguably not unique: there could be two positions (note that they are integers) with the same encoding, though this could only happen in a very long sequence where two positions end up with exactly the same values. One may argue, however, that since π is irrational, no two distinct integer positions can have exactly the same encoding, although they can come very close.
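As a rough numerical sanity check of that argument (my own experiment, not from the paper), one can measure the smallest distance between the (sin(pos), cos(pos)) pairs of distinct integer positions:

```python
import numpy as np

pos = np.arange(1000)                                  # integer positions 0..999
pairs = np.stack([np.sin(pos), np.cos(pos)], axis=1)   # highest-frequency pair

# Pairwise distances between the encodings of all positions
dist = np.linalg.norm(pairs[:, None, :] - pairs[None, :, :], axis=-1)
np.fill_diagonal(dist, np.inf)                         # ignore self-distances

print(dist.min())   # small but strictly positive: no exact collisions here
```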
