Why do Transformers need positional encodings?
At least in the first self-attention layer of the encoder, inputs have a one-to-one correspondence with outputs, so I have the following questions.
- Isn't ordering already implicitly captured by the query vectors, which themselves are just transformations of the inputs?
- What do the sinusoidal positional encodings capture that the ordering of the query vectors doesn't already capture? (See the sketch below this list.)
- Am I perhaps mistaken in thinking that transformers take in the entire input at once?
- How are words fed in?
- If we feed in the entire sentence at once, shouldn't the ordering be preserved?
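
To make the question concrete, here is a rough NumPy sketch of what I understand a single self-attention layer and the sinusoidal encodings to be (toy dimensions, one head, random weights, and my own helper names `self_attention` and `sinusoidal_pe`, so this is an illustration rather than a faithful Transformer). Permuting the input rows just permutes the output rows, which seems to be what people mean when they say attention by itself is order-blind, whereas adding the sinusoidal encodings first turns the shuffled sentence into a genuinely different input:

```python
# Rough sketch with toy sizes (d_model=8, 5 "words", one head, random weights);
# not a real Transformer, just the bare attention math.
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 8, 5

X = rng.normal(size=(seq_len, d_model))  # toy token embeddings, one row per word
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def self_attention(X):
    """Single-head scaled dot-product self-attention."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

def sinusoidal_pe(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

perm = np.array([4, 2, 0, 3, 1])  # shuffle the word order

# Without positional encodings: shuffling the input rows just shuffles the
# output rows the same way -- the layer cannot tell the two orderings apart.
print(np.allclose(self_attention(X[perm]), self_attention(X)[perm]))  # True

# With sinusoidal encodings added first, the shuffled sentence is a genuinely
# different input (positions stay put while words move), so the outputs differ.
pe = sinusoidal_pe(seq_len, d_model)
print(np.allclose(self_attention(X[perm] + pe), self_attention(X + pe)[perm]))  # False
```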
Topic: transformer, deep-learning, neural-network, nlp, machine-learning
Category: Data Science