BERT vs GPT architectural, conceptual and implementational differences

In the BERT paper, I learnt that BERT is an encoder-only model, i.e. it involves only transformer encoder blocks.

In the GPT paper, I learnt that GPT is a decoder-only model, i.e. it involves only transformer decoder blocks.

I was wondering what the difference is. I know the following difference between encoder and decoder blocks: the GPT decoder attends only to previously generated tokens (the tokens to its left) and learns from them, not from tokens on the right side. The BERT encoder attends to tokens on both sides.

But I have the following doubts:

Q1. GPT-2/3 focus on few/one/zero-shot learning. Can't we build a few/one/zero-shot learning model with an encoder-only architecture like BERT?

Q2. Hugging Face's GPT2Model contains a forward() method. I guess feeding a single data instance to this method is like doing one-shot learning?

Q3. I have implemented a neural network model which utilizes the output of BertModel from Hugging Face. Can I simply swap out the BertModel class for GPT2Model? The return value of GPT2Model.forward() does indeed contain last_hidden_state, similar to BertModel.forward(). So I guess swapping out BertModel for GPT2Model will indeed work, right?

Q4. Apart from being decoder-only vs. encoder-only, autoregressive vs. non-autoregressive, and whether or not they accept the tokens generated so far as input, what high-level architectural / conceptual differences do GPT and BERT have?

Topic openai-gpt bert transformer nlp machine-learning

Category Data Science


To start with your last question: you correctly say that BERT is an encoder-only model trained with the masked language-modeling objective, which operates non-autoregressively. GPT-2 is a decoder-only model trained with the left-to-right language-modeling objective, which operates autoregressively. Other than that, there are only technical differences in hyper-parameters, but no further conceptual differences.

BERT (and other masked LMs) can also be used for zero- or few-shot learning, but in a slightly different way. There is a method called PET (Pattern-Exploiting Training). It uses the language-modeling abilities of BERT via templates. E.g., for sentiment analysis, you can do something like:

<...text of the review...>  <...template...>   <  ?  >
The pizza was fantastic.    The restaurant is [MASK].

Then you check what scores the words good and bad get at the position of the [MASK] token.
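A minimal sketch of this idea with the Hugging Face transformers library (the model name and the verbalizer words "good"/"bad" are just illustrative choices, not part of PET itself):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "The pizza was fantastic. The restaurant is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the position of the [MASK] token in the input sequence
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]

# Compare the scores that the verbalizer words get at the masked position
for word in ["good", "bad"]:
    word_id = tokenizer.convert_tokens_to_ids(word)
    print(word, logits[0, mask_pos, word_id].item())
```

Whichever word gets the higher score at the [MASK] position is taken as the predicted sentiment label.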

Working with the GPT-2 model is not as straightforward as with BERT. Calling the forward method returns the hidden states of GPT-2 for the input you provided, which can be further used in a model. You can use the hidden states of GPT-2 as contextual embeddings, the same way that you use the output of BERT; however, this is not how GPT-2 is usually used.
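As a sketch of what that swap looks like (using the small "gpt2" checkpoint as an example), both classes return an output object with a last_hidden_state field of shape (batch_size, sequence_length, hidden_size); just keep in mind that GPT-2's hidden states only see left context, unlike BERT's:

```python
from transformers import GPT2Tokenizer, GPT2Model
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

inputs = tokenizer("The pizza was fantastic.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Same layout as BertModel's last_hidden_state: (batch, seq_len, hidden_size)
print(outputs.last_hidden_state.shape)
```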

The usual way of using GPT-2 is sampling from the model. This means that you provide a prompt (as plain text) and hope that the model will continue it in a reasonable way. There are many tutorials on how to generate from GPT-2 models, e.g., this blog post by Hugging Face.
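A minimal sketch of that workflow, assuming the small "gpt2" checkpoint and typical sampling settings (the prompt and decoding parameters are arbitrary examples):

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The pizza was fantastic. The restaurant is"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation of the prompt; do_sample/top_k/max_length are common knobs
output_ids = model.generate(
    **inputs,
    max_length=40,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Note that generation uses GPT2LMHeadModel (which adds the language-modeling head) rather than the bare GPT2Model.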
