What is the difference between BERT architecture and vanilla Transformer architecture
I'm doing some research for the summarization task and found out BERT is derived from the Transformer model. In every blog about BERT that I have read, they focus on explaining what is a bidirectional encoder, So, I think this is what made BERT different from the vanilla Transformer model. But as far as I know, the Transformer reads the entire sequence of words at once, therefore it is considered bidirectional too. Can someone point out what I'm missing?
Topic encoder bert transformer nlp
Category Data Science