Why is the decoder not part of the BERT architecture?
I can't see how BERT makes predictions without a decoder unit, which was part of the models that came before it, including the original Transformer and standard RNN sequence-to-sequence models. How are output predictions made in the BERT architecture without a decoder? How does it do away with the decoder completely?
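For concreteness, here is a minimal sketch of what I mean by "making a prediction" with BERT. I'm assuming a typical setup with the Hugging Face transformers library, where a masked token is filled in directly from the encoder output with no decoder in sight:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load a pretrained BERT with its masked-language-modeling head
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # logits has shape (batch, seq_len, vocab_size); it comes straight
    # from a projection on top of the encoder's hidden states
    logits = model(**inputs).logits

# Find the [MASK] position and take the most likely vocabulary token there
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

So the prediction clearly happens somehow, but I don't understand where a decoder would fit in, or why it isn't needed.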
Topic bert attention-mechanism machine-translation nlp
Category Data Science