Class token in ViT and BERT
I'm trying to understand the architecture described in the ViT paper, and noticed that it uses a class token like BERT does.
To the best of my understanding, this token is used to gather information from the entire image (or sentence, in BERT's case), and its final representation is then the only one used to predict the class of the image. My question is: why does this token exist as input to all the transformer blocks, and why is it treated the same as the word/patch tokens?
Treating the class token like the rest of the tokens means the other tokens can attend to it. I would have expected the class token to be able to attend to the other tokens, while they could not attend to it.
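To be concrete about what I would have expected (this is hypothetical, not what either paper actually does), the asymmetry could be expressed as an attention mask in the usual PyTorch convention (`True` = blocked), e.g. for `torch.nn.MultiheadAttention`'s `attn_mask`:

```python
import torch

seq_len = 5  # toy example: 1 class token (index 0) + 4 patch tokens
mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
mask[1:, 0] = True  # patch queries (rows 1..) are blocked from attending to the class token (key column 0)
# row 0 stays all-False, so the class token can still attend to every patch.
# In the actual ViT/BERT encoders no such mask is used: attention is full in both directions.
```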
Also, specifically in ViT, why does the class token receive a positional encoding? It represents the entire image and therefore has no specific spatial location.
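To illustrate what I mean, here is a minimal sketch of how I understand the input pipeline (my own PyTorch code, with ViT-Base-like dimensions; not taken from the paper's implementation): the class token is simply prepended to the patch tokens, and the positional embedding table has one extra slot for it.

```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    """Sketch of the ViT input pipeline: patch tokens + prepended class token + positional embeddings."""
    def __init__(self, num_patches=196, dim=768):
        super().__init__()
        # one learnable [class] token shared across the whole dataset
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # num_patches + 1 positional embeddings, i.e. the class token gets its own slot too
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, patch_tokens):               # patch_tokens: (batch, num_patches, dim)
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)      # (batch, 1, dim)
        x = torch.cat([cls, patch_tokens], dim=1)   # prepend -> (batch, num_patches + 1, dim)
        x = x + self.pos_embed                      # the class token receives pos_embed[:, 0] as well
        return x                                    # then fed to the transformer blocks like any other token
```

This is exactly the part that puzzles me: the class token goes through the same positional-embedding addition and the same attention as the patches.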
Thanks!
Topic attention-mechanism computer-vision deep-learning nlp machine-learning
Category Data Science