Why transform embedding dimension in sin-cos positional encoding?
Positional encoding using sine-cosine functions is often used in transformer models.
Assume that $X \in \mathbb{R}^{l\times d}$ is the embedding of an example, where $l$ is the sequence length and $d$ is the embedding size. The positional encoding layer encodes the positions in a matrix $P \in \mathbb{R}^{l\times d}$ and outputs $P + X$.
The positional encoding $P$ is a 2-D matrix, where $i$ indexes the position in the sentence and $j$ indexes the embedding dimension. Each entry of $P$ is computed using the equations below:
$$P_{i, 2j} = \sin \bigg( \frac{i}{10000^{2j/d}} \bigg)$$
$$P_{i, 2j+1} = \cos \bigg( \frac{i}{10000^{2j/d}} \bigg)$$
for $i = 0, \ldots, l-1$ and $j = 0, \ldots, \lfloor (d-1)/2 \rfloor$.
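For concreteness, here is a minimal NumPy sketch of this layer as I understand it (the function name and implementation details are my own):

```python
import numpy as np

def sincos_positional_encoding(l, d):
    """Return the l x d matrix P defined by the equations above."""
    P = np.zeros((l, d))
    pos = np.arange(l)[:, None]                 # i = 0, ..., l-1
    even = np.arange(0, d, 2)[None, :]          # even column indices 2j
    angle = pos / np.power(10000.0, even / d)   # i / 10000^(2j/d)
    P[:, 0::2] = np.sin(angle)                  # P[i, 2j]   = sin(angle)
    P[:, 1::2] = np.cos(angle[:, : d // 2])     # P[i, 2j+1] = cos(angle)
    return P

X = np.zeros((3, 4))                            # a toy input with l = 3, d = 4
print(sincos_positional_encoding(3, 4) + X)     # the layer outputs P + X
```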
I understand the transformation across the time dimension $i$, but why do we need the transformation across the embedding size dimension $j$? Since we are adding the position, wouldn't sin-cos on the time dimension alone be sufficient to encode the position?
EDIT
Answer 1 - Making the embedding vector independent of the embedding size dimension would lead to having the same value in all positions, and this would reduce the effective embedding dimensionality to 1.
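As I read it, such a $j$-independent (time-only) encoding would look like this (a minimal NumPy sketch of my reading of that answer, using $\sin(i)$ purely for illustration):

```python
import numpy as np

l, d = 3, 4
pos = np.arange(l, dtype=float)[:, None]    # i = 0, 1, 2
P_time_only = np.tile(np.sin(pos), (1, d))  # P[i, j] = sin(i), the same for every j
print(P_time_only)
# approx:
# [[0.     0.     0.     0.    ]
#  [0.8415 0.8415 0.8415 0.8415]
#  [0.9093 0.9093 0.9093 0.9093]]
# every row of P is a constant vector, so the position can only shift an
# embedding along the all-ones direction, i.e. along a single dimension
```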
I still don't understand how the embedding dimensionality would be reduced to 1 if the same positional vector is added. Say we have an input $X$ of zeros with 4 dimensions, $d_0, d_1, d_2, d_3$, and 3 time steps, $t_0, t_1, t_2$:
$$\begin{array}{c|cccc}
 & d_0 & d_1 & d_2 & d_3\\
\hline
t_0 & 0 & 0 & 0 & 0\\
t_1 & 0 & 0 & 0 & 0\\
t_2 & 0 & 0 & 0 & 0
\end{array}$$
If $d_0$ and $d_2$ are the same vector, $[0, 0, 0]$, and the meaning of position, i.e. the time step, is the same, why do they need different positional vectors? Why can't $d_0$ and $d_2$ be the same after positional encoding if the inputs $d_0$ and $d_2$ are the same?
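For reference, this is what the standard encoding above produces on this exact zero input (again a minimal NumPy sketch):

```python
import numpy as np

l, d = 3, 4
X = np.zeros((l, d))                        # rows t_0..t_2, columns d_0..d_3, all zero

pos = np.arange(l)[:, None]
even = np.arange(0, d, 2)[None, :]
angle = pos / np.power(10000.0, even / d)   # i / 10000^(2j/d)
P = np.zeros((l, d))
P[:, 0::2] = np.sin(angle)
P[:, 1::2] = np.cos(angle)
out = X + P

print(out[:, 0])   # d_0 after encoding: approx [0, 0.841, 0.909]
print(out[:, 2])   # d_2 after encoding: approx [0, 0.010, 0.020] -- different from d_0
```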
As for the embedding dimensionality reducing to 1, I don't see why that would happen. Isn't the embedding dimensionality determined by the input matrix $X$? If I add constants to it, the dimensionality will not change, no?
I may be missing something more fundamental here and would like to know where I am going wrong.
Topic: encoder, transformer
Category: Data Science