How Is the Transformer Bidirectional? - Machine Learning

I am asking this question in the Data Science forum, as it seems well suited for data-science-related questions: https://stackoverflow.com/questions/55158554/how-transformer-is-bidirectional-machine-learning/55158766?noredirect=1#comment97066160_55158766

I am coming from the context of Google BERT (Bidirectional Encoder Representations from Transformers). I have gone through the architecture and the code. People say it is bidirectional by nature, and that a mask has to be applied to its attention to make it unidirectional.

Basically, a Transformer takes keys, values, and queries as input, uses an encoder-decoder architecture, and applies attention to these keys, queries, and values. What I understood is that we need to pass the tokens explicitly, rather than the Transformer handling this by nature.
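
For reference, here is a rough NumPy sketch of the scaled dot-product attention step I am referring to (single head, no batching; the function name and the optional mask argument are just my own illustration):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V, mask=None):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)               # (seq_len, seq_len) similarity scores
        if mask is not None:
            scores = np.where(mask, scores, -1e9)     # blocked positions get ~zero weight
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                            # weighted sum of the value vectors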

Can someone please explain what makes the Transformer bidirectional by nature?

Answers received so far:
1. People confirmed that the Transformer is bidirectional by nature, rather than external code making it bidirectional.
2. My doubt: We pass Q, K, and V embeddings to the Transformer, which applies N layers of self-attention using scaled dot-product attention. The same thing could be done with a unidirectional approach as well. What part of my understanding am I missing? If someone could point to the code where it becomes bidirectional, that would be a great help.

Topic: bert, transformer, machine-learning

Category: Data Science


It is the encoder part of the Transformer model that is bidirectional in nature, not the whole model.

The full Transformer model has two parts: encoder and decoder. This encoder-decoder model is used for sequence-to-sequence tasks, like machine translation.

There are other tasks, however, that do not need the full model, but only one of its parts. For instance, for causal language modeling (e.g. GPT-2) we need the decoder. For masked language modeling (e.g. BERT), we need the encoder.

The decoder is designed so that each predicted token can only depend on the previous tokens. This is achieved with self-attention masking, and it is what makes the decoder unidirectional.
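
As a rough illustration (NumPy, not taken from any particular implementation), the mask is just a lower-triangular matrix over positions, so position i can attend only to positions j <= i:

    import numpy as np

    seq_len = 5
    # True where position i (row) is allowed to attend to position j (column).
    causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    print(causal_mask.astype(int))
    # [[1 0 0 0 0]
    #  [1 1 0 0 0]
    #  [1 1 1 0 0]
    #  [1 1 1 1 0]
    #  [1 1 1 1 1]]
    # In the decoder, the masked-out (0) scores are set to -inf before the softmax,
    # so each position can only attend to itself and earlier positions.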

The encoder does not have self-attention masking. It is therefore designed without any dependency limitation: the token representation obtained at one position depends on all the tokens in the input. This is what makes the Transformer encoder bidirectional.
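
To make the difference concrete, here is a small illustrative sketch (NumPy, random weights, a single unbatched self-attention layer, nothing BERT-specific): perturb only the last input token and check whether the representation at the first position changes.

    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, d = 4, 8
    X = rng.normal(size=(seq_len, d))                # token embeddings
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    def self_attention(X, mask=None):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d)
        if mask is not None:
            scores = np.where(mask, scores, -np.inf)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ V

    X_perturbed = X.copy()
    X_perturbed[-1] += 1.0                           # change only the LAST token

    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

    # Encoder-style (no mask): the FIRST token's representation changes,
    # because it also attends to tokens on its right -> bidirectional.
    print(np.allclose(self_attention(X)[0], self_attention(X_perturbed)[0]))                  # False

    # Decoder-style (causal mask): the first token's representation is unchanged,
    # because it can only attend to itself and tokens on its left -> unidirectional.
    print(np.allclose(self_attention(X, causal)[0], self_attention(X_perturbed, causal)[0]))  # True

Without the mask, the first position "sees" a change to its right, which is exactly the bidirectional behaviour BERT relies on; with the causal mask, it does not.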
