Why is the decoder not a part of BERT architecture?

I can't see how BERT makes predictions without using a decoder unit, which was a part of all models before it including transformers and standard RNNs. How are output predictions made in the BERT architecture without using a decoder? How does it do away with decoders completely?

Topic bert attention-mechanism machine-translation nlp

Category Data Science


First, we need to understand what problems BERT can solve, or what kind of inference/prediction it can achieve.


The encoder in the Transformer can itself learn:

  1. Relations among words (what word is most probable in a context). For instance, what word will fit in the BLANK in the context I take [BLANK] of the opportunity.

  2. Relations among sentences. For instance, A: "At a grocery store" can follow B: "I bought the ingredients".

Having these capabilities, BERT can predict a word that fits into a sequence of words, and it can classify whether a text is negative or positive. As long as you can achieve the predictions you want with the Encoder part only, you do not need the Decoder.
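As a concrete illustration, here is a minimal sketch of the fill-in-the-blank prediction described above, assuming the Hugging Face `transformers` package is installed (an assumption, not something the answer itself uses):

```python
# Minimal sketch: BERT's encoder-only "fill in the blank" prediction,
# using the Hugging Face fill-mask pipeline with the public
# bert-base-uncased checkpoint.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the special [MASK] token for the blank.
for pred in fill("I take [MASK] of the opportunity."):
    print(pred["token_str"], round(pred["score"], 3))
# The top prediction is very likely "advantage".
```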

Hence it would be better to focus on what problems require a Decoder, or on what problems BERT cannot solve.


BERT is a stack of deep bidirectional Transformer encoder layers that read the input sequence and generate contextual representations, called embeddings, for each token. It uses multi-head self-attention to compute these representations.
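A minimal sketch of this, assuming the Hugging Face `transformers` and `torch` packages (an assumption on my part), showing the encoder stack producing one embedding per input token:

```python
# Minimal sketch: the BERT encoder stack turns an input sentence into
# one contextual embedding per token.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I take advantage of the opportunity.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional embedding per token ([CLS], the word pieces, [SEP]).
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 9, 768])
```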


Whether you need an encoder and/or a decoder depends on what your predictions are conditioned on, e.g.:

  • In causal (traditional) language models (LMs), each token is predicted conditioned on the previous tokens. Given that the previous tokens are received by the decoder itself, you don't need an encoder.
  • In Neural Machine Translation (NMT) models, each token of the translation is predicted conditioned on the previous tokens and the source sentence. The previous tokens are received by the decoder, but the source sentence is processed by a dedicated encoder. Note that this is not necessarily the case, as there are some decoder-only NMT architectures, like this one.
  • In masked LMs, like BERT, each masked token prediction is conditioned on the rest of the tokens in the sentence. These are received by the encoder, therefore you don't need a decoder. This, again, is not a strict requirement, as there are other masked LM architectures, like MASS, that are encoder-decoder. (The attention-mask sketch after this list illustrates the difference in conditioning.)
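Here is a small, purely illustrative sketch (plain PyTorch, my own construction rather than anything from the answer) contrasting the attention masks behind the two conditioning schemes: a causal LM lets position i attend only to positions <= i, while a masked LM like BERT lets every position attend to every other position.

```python
# Minimal sketch: causal (decoder-style) vs. bidirectional (BERT-style)
# attention masks for a toy sequence length.
import torch

seq_len = 5

# Decoder-style causal mask: lower-triangular, blocks attention to future tokens.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Encoder-style (BERT) mask: full bidirectional attention, no restriction.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())
print(bidirectional_mask.int())
```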

In order to make predictions, BERT needs some tokens to be masked (i.e. replaced with a special [MASK] token). The output is generated non-autoregressively (every token at the output is computed at the same time, without any causal self-attention mask), conditioned on the non-masked tokens, which are present in the same input sequence as the masked tokens.
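A minimal sketch of this non-autoregressive prediction, again assuming Hugging Face `transformers` and `torch`: the logits for every position come out of a single forward pass, conditioned on the unmasked tokens.

```python
# Minimal sketch: BERT predicts the masked token in one forward pass,
# producing logits for all positions simultaneously.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("I take [MASK] of the opportunity.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, seq_len, vocab_size), all at once

# Locate the [MASK] position and read off its most likely token.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_pos].argmax().item()
print(tokenizer.decode([predicted_id]))  # likely "advantage"
```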


BERT is a pretrained model intended for downstream tasks such as question answering, NLI, and other language tasks. It just needs to encode language representations so that they can be used for other tasks; that's why it consists of encoder parts only. You can add a decoder (or any task-specific head) when working on your specific task, and this component can be anything suited to that task.
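For example, a minimal sketch (assuming Hugging Face `transformers`) of putting a task-specific head on top of the pretrained encoder, here a two-class sequence classifier; the head is randomly initialised and would still need fine-tuning on labelled data:

```python
# Minimal sketch: the same BERT encoder plus a small classification head
# for a downstream task such as sentiment classification.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # e.g. negative / positive
)
# Under the hood this is the pretrained BERT encoder with one linear
# layer applied to the pooled [CLS] representation.
```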
