Why is 10000 used as the denominator in Positional Encodings in the Transformer Model?

I was working through the Attention Is All You Need paper, and while the motivation for positional encodings makes sense, and the other Stack Exchange answers filled me in on the motivation for their structure, I still don't understand why $1/10000$ was used as the scaling factor for the $pos$ of a word. Why was this number chosen?

Topic: transformer, word-embeddings, machine-learning

Category: Data Science


This is my understanding, feel free to correct me; I find it helpful to look at how the base n (the 10000 in the original formula) visually affects the positional encoding matrix.

Here is the same positional encoding matrix (sequence length = 100, dimension = 512) for different values of n:

[Heatmaps of the positional encoding matrix for n = 10000, n = 100, n = 20, and n = 1]
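For reference, here is a minimal NumPy sketch of how such a matrix can be built, with the base n exposed as a parameter (the function name positional_encoding is my own; n = 10000 reproduces the original Transformer formula):

```python
import numpy as np

def positional_encoding(seq_len, d_model, n=10000.0):
    """Sinusoidal positional encoding with a configurable base n."""
    pos = np.arange(seq_len)[:, None]        # positions: (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # dimension-pair index: (1, d_model/2)
    angles = pos / n ** (2 * i / d_model)    # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions get cosine
    return pe

# The four matrices shown above
for n in (10000, 100, 20, 1):
    print(n, positional_encoding(100, 512, n=n).shape)
```

Each row of the returned matrix is the encoding of one position; plotting it as a heatmap (e.g. with matplotlib's imshow) gives images like the ones above.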

If you think about the purpose of positional encoding, namely providing a unique vector representation of each token's position, then each row represents a position, and the further apart two positions are, the greater the distance between their vectors should be.

Now, look what happens when you compute the cosine distance between the first vector and each remaining one:

[Plot of the cosine distance between the first position vector and each remaining one; orange: n = 10000, blue: n = 20, green: n = 1]
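Here is a sketch of that computation (restating the positional_encoding construction from above so the snippet runs on its own); it reproduces the three curves:

```python
import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(seq_len, d_model, n=10000.0):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / n ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def cosine_distance_to_first(pe):
    """Cosine distance between row 0 and every row of the encoding matrix."""
    first = pe[0]
    sims = pe @ first / (np.linalg.norm(pe, axis=1) * np.linalg.norm(first))
    return 1.0 - sims

for n, color in [(10000, "orange"), (20, "blue"), (1, "green")]:
    dist = cosine_distance_to_first(positional_encoding(100, 512, n=n))
    plt.plot(dist, color=color, label=f"n = {n}")

plt.xlabel("position")
plt.ylabel("cosine distance to position 0")
plt.legend()
plt.show()
```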

With large n, the distance is a monotonically increasing function of the position (in terms of index in the sequence), which makes sense since we are talking about token positions. But that is not the case for smaller values of n.

Now, I don't know whether there is any rule for choosing an appropriate n depending on the other parameters.


Amirhossein's blog post explains the intuition for positional encoding very well.

My takeaway from the blog is this: consider just a pair of sinusoids (a sine and a cosine). As long as we stay within one full cycle (e.g. 0 to 2π), the resulting encoding is essentially guaranteed to be unique, i.e. there is a one-to-one mapping from real numbers x (1, 1.5, 2, 2.34, etc.) to the pair (sin(x), cos(x)). Each position is thus encoded as this 2-tuple vector.
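To make that one-to-one mapping concrete, here is a tiny check (my own illustration, not from the blog): within a single cycle, x can be recovered from the pair via atan2, so no two values in that range share an encoding.

```python
import math

# Within one cycle [0, 2π), x can be recovered from (sin x, cos x),
# so the 2-tuple encoding is unique on that interval.
for x in (1.0, 1.5, 2.0, 2.34):
    s, c = math.sin(x), math.cos(x)
    recovered = math.atan2(s, c) % (2 * math.pi)
    print(x, recovered)   # each line prints the same value twice
```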

Therefore, the purpose of the 10000 is probably just to make sure the full cycle is extremely large.

If you plot sin(x/10000), you will see that completing one full cycle requires x ≈ 2π × 10000 ≈ 62,800. That is more than enough to encode typical input sequences, which usually contain fewer than 1,000 words.
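A quick arithmetic check of that cycle length (nothing here is from the paper, just the period of the lowest-frequency sinusoid):

```python
import math

period = 2 * math.pi * 10000      # full cycle of sin(x / 10000)
print(period)                     # ≈ 62831.85

# A 1000-token sequence covers only a small fraction of that cycle:
print(1000 / period)              # ≈ 0.016, i.e. about 1.6 % of one cycle
```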

Note that according to the original equations, the first pair of sinusoids is just sin(pos) and cos(pos). In this case the positional encoding is arguably not unique: there could be two positions (note that they are integers) with the same encoding, though this could only happen in a very long sequence where two positions end up with exactly the same values. One may argue, however, that since π is irrational, no two distinct integer positions can have exactly the same encoding, although they can come very close.
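As a rough numerical sanity check of that argument (my own experiment, not from the paper), one can measure the smallest distance between the (sin(pos), cos(pos)) pairs of distinct integer positions:

```python
import numpy as np

pos = np.arange(1000)                                  # integer positions 0..999
pairs = np.stack([np.sin(pos), np.cos(pos)], axis=1)   # highest-frequency pair

# Pairwise distances between the encodings of all positions
dist = np.linalg.norm(pairs[:, None, :] - pairs[None, :, :], axis=-1)
np.fill_diagonal(dist, np.inf)                         # ignore self-distances

print(dist.min())   # small but strictly positive: no exact collisions here
```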
