What's the right input format for GPT-2 in NLP?

I'm fine-tuning pre-trained GPT-2 for text summarization. The dataset contains a 'text' and a 'reference summary' for each example. My question is how to add special tokens to get the right input format. Currently I'm thinking of doing it like this:

example1: BOS text SEP reference summary EOS

example2: BOS text SEP reference summary EOS

...

Is this correct? If so, a follow-up question: does the max token length (i.e. 1024 for GPT-2) also apply to the concatenated length of text and reference summary?

Any comment would be very much appreciated!

Tags: openai-gpt, transformer, data-science-model, nlp

Category: Data Science


Updated answer

After reading @Jessica's answer, I carefully read the original GPT-2 paper and can confirm that the authors do not add special tokens, but simply append the text TL;DR: (be careful to include the :, which is not present in the referenced answer) after the text to summarize. Therefore, the format would be:

<|endoftext|> + article + " TL;DR: " + summary + <|endoftext|>

Note that there is no <|startoftext|> token in the original GPT-2 vocabulary, nor is there one in the Hugging Face implementation.
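
If it helps, here is a minimal sketch of how one such training example could be built with the Hugging Face transformers library. The article and summary strings are placeholders; only the TL;DR format itself comes from the paper.

```python
# A minimal sketch of building one training example in the TL;DR format,
# assuming the Hugging Face "transformers" library and the "gpt2" checkpoint.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

article = "Some long article to be summarized ..."  # placeholder text
summary = "A short reference summary."              # placeholder text

# <|endoftext|> is the only special token GPT-2 knows; it is exposed
# as tokenizer.eos_token in the Hugging Face implementation.
example = tokenizer.eos_token + article + " TL;DR: " + summary + tokenizer.eos_token

# Tokenize and truncate to GPT-2's 1024-token context window.
encoded = tokenizer(example, truncation=True, max_length=1024)
print(len(encoded["input_ids"]))
```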

Old answer

GPT-2 is a causal language model. This means that, by default, it receives either no input at all or the initial tokens of a sentence/paragraph. It then completes whatever it was passed as input. Therefore, it is not meant to be used the way you are trying to do it.
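
As a small illustration of that default behaviour, here is a sketch using the Hugging Face transformers library (the prompt is just an arbitrary example): the model simply continues the prefix it is given, with no notion of a separate "source" text.

```python
# GPT-2 as a plain causal language model: it continues whatever prefix it gets.
# Assumes the Hugging Face "transformers" library and the "gpt2" checkpoint.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy continuation of the prompt.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```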

Normally, in order to do conditional text generation, people use an encoder-decoder architecture, that is, a full encoder-decoder Transformer instead of GPT-2, which only has the decoder part.

Nevertheless, while GPT-2 was not meant to be used this way, your approach may still work. This kind of thing has been done before, for instance in this NeurIPS 2018 article, which uses only a Transformer decoder for machine translation, concatenating the source and target sides just as you do:

[Figure from the referenced paper: source and target sequences concatenated into a single input for a decoder-only Transformer]

You would need, nevertheless, to perform some adaptations. Specifically, the original GPT-2 vocabulary does not have the special tokens you use. Instead, it only has <|endoftext|> to mark the end. This means that if you want to use your special tokens, you would need to add them to the vocabulary and get them trained during fine-tuning, as sketched below. Another option is to simply use <|endoftext|> in place of your <BOS>, <SEP> and <EOS>.
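
Here is a sketch of the first option with the Hugging Face transformers library; the token strings are whatever you choose, and the new embeddings are only learned during your fine-tuning run.

```python
# Adding custom special tokens to the GPT-2 vocabulary so they can be
# trained during fine-tuning. Assumes the Hugging Face "transformers" library.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

special_tokens = {
    "bos_token": "<BOS>",
    "eos_token": "<EOS>",
    "sep_token": "<SEP>",
}
num_added = tokenizer.add_special_tokens(special_tokens)

# The embedding matrix must be resized to make room for the new tokens;
# their embeddings start out randomly initialized and are learned while fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```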

For GPT-2, there is only a single sequence, not two. Therefore, the maximum token length applies to the concatenation of text and reference summary.
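
One practical consequence is that naive truncation of the concatenated sequence could cut off the summary. A sketch of one way to handle this, assuming the Hugging Face transformers library and illustrative variable names, is to truncate the article so that the separator and the full summary still fit:

```python
# Enforcing the 1024-token limit on the concatenated sequence by truncating
# the article, not the summary. Assumes the Hugging Face "transformers" library.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
MAX_LEN = 1024

article = "Some long article ..."            # placeholder text
summary = "A short reference summary."       # placeholder text

summary_ids = tokenizer(" TL;DR: " + summary + tokenizer.eos_token)["input_ids"]
budget = MAX_LEN - len(summary_ids) - 1      # reserve 1 token for the leading <|endoftext|>

article_ids = tokenizer(article)["input_ids"][:budget]
input_ids = [tokenizer.eos_token_id] + article_ids + summary_ids
assert len(input_ids) <= MAX_LEN
```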

P.S.: I think that your use of <SEP> comes from the fact that other, non-generative models like BERT use similar special tokens ([SEP], [CLS]) and are specifically designed to receive two concatenated segments as input. However, BERT is not a generative language model, in the sense that it was not trained in an autoregressive manner, but with a masked LM loss:

[Figure: BERT's masked language modeling training objective]
