Amirhossein's blog post explains the intuition for positional encoding very well.
My takeaway from the blog is this: consider just a single pair of sinusoids (sine and cosine). As long as we stay within one full cycle (e.g. 0 <= x < 2pi), the resulting encoding is guaranteed to be unique, i.e. there is a one-to-one mapping from a real number x (1, 1.5, 2, 2.34, etc.) to the pair (sin(x), cos(x)). Each position is thus encoded as a 2-tuple vector.
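To make the one-to-one claim concrete, here is a minimal sketch (the helper names `encode` and `decode` are mine, not from the blog or the paper). It assumes x lies in [0, 2pi) and shows that x can be recovered from its pair using atan2, which is only possible because the mapping is unique on that interval.

```python
import math

def encode(x):
    """Map a position x to the pair (sin(x), cos(x))."""
    return (math.sin(x), math.cos(x))

def decode(pair):
    """Recover x in [0, 2*pi) from (sin(x), cos(x))."""
    s, c = pair
    # atan2 returns an angle in (-pi, pi]; shift it into [0, 2*pi).
    return math.atan2(s, c) % (2 * math.pi)

for x in [1, 1.5, 2, 2.34]:
    pair = encode(x)
    print(x, pair, decode(pair))
```

Outside one full cycle the recovery breaks down: x and x + 2pi give the same pair up to floating-point rounding, so `decode` can only return the value inside [0, 2pi).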
Therefore, the purpose of the 10000 is probably just to make sure the full cycle is extremely large.
If you plot sin(1/10000*x), you will observe that completing one full cycle requires x > 50k (the period is 2pi*10000, roughly 62.8k). This is more than sufficient to encode the positions in a sentence, which would probably have fewer than 1k words.
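The cycle length can also be computed directly instead of read off a plot. This is just a quick check of the arithmetic for sin(x/10000); the function name `full_cycle` is my own.

```python
import math

def full_cycle(freq):
    """Length of one full cycle of sin(freq * x)."""
    return 2 * math.pi / freq

slowest = full_cycle(1 / 10000)
print(round(slowest))   # 62832
print(slowest > 1000)   # True: far longer than a ~1k-word sentence
```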
Note that according to the original equations, the 1st pair of sinusoids is just sin(pos) and cos(pos). In this case the full cycle is only 2pi, so the encoding from this pair alone is arguably not unique: in a very long sentence, two positions (note: they are integers) could end up with practically the same encoding. One may argue, though, that since pi is irrational, no two distinct integers differ by an exact multiple of 2pi, so integer positions never get exactly the same encoding; they can only get very close.
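As a rough numerical check of that argument, the sketch below looks only at the first pair (sin(pos), cos(pos)) and searches for the integer position up to 1000 whose encoding is closest to that of position 0. The helper names and the cutoff of 1000 are my own choices for illustration.

```python
import math

def encode(pos):
    """First pair of the encoding: (sin(pos), cos(pos))."""
    return (math.sin(pos), math.cos(pos))

def distance(a, b):
    """Euclidean distance between the encodings of positions a and b."""
    (sa, ca), (sb, cb) = encode(a), encode(b)
    return math.hypot(sa - sb, ca - cb)

def closest_to_zero(max_pos):
    """Position in 1..max_pos whose encoding is nearest to that of 0."""
    return min(range(1, max_pos + 1), key=lambda p: distance(0, p))

p = closest_to_zero(1000)
print(p, distance(0, p))
```

This should report position 710, which is very nearly 113 full cycles (710 / 113 is close to 2pi). Its distance from position 0 is small but not zero, which matches the irrationality argument: near-collisions exist, exact ones do not.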