Training objective of the language model in GPT-3

On page 34 of OpenAI's GPT-3 paper, there is a sentence describing a limitation of the objective function:

Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important.

I am not sure if I understand this correctly. In my understanding, the objective function maximizes the log-likelihood of the token to predict given the current context, i.e., $\max L \sim \sum_{i} \log P(x_{i} \mid x_{<i})$. Although we aim to predict every token that appears in the training text, the tokens follow a certain distribution based on their frequency in human literature, so we do not actually assign equal weight to every token during loss optimization.
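For concreteness, here is a minimal sketch (in PyTorch, with dummy tensors for illustration) of how this standard objective is usually computed: each token position contributes one negative log-likelihood term, and the terms are summed or averaged with no per-token weighting.

```python
import torch
import torch.nn.functional as F

# Dummy shapes for illustration only:
# logits:  (batch, seq_len, vocab_size) -- model predictions at each position
# targets: (batch, seq_len)             -- the actual next tokens x_i
logits = torch.randn(2, 10, 50257)
targets = torch.randint(0, 50257, (2, 10))

# Negative log-likelihood per token: -log P(x_i | x_<i)
per_token_nll = F.cross_entropy(
    logits.reshape(-1, logits.size(-1)),  # flatten batch and time dimensions
    targets.reshape(-1),
    reduction="none",
)

# The standard LM loss simply averages these terms, so every token
# position receives exactly the same weight, regardless of how
# "important" that token is to predict.
loss = per_token_nll.mean()
```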

Also, what would be an example of a model having a notion of what is important and what is not? What does "importance" refer to here? For example, does it mean that "the" is less important than a less common noun, or does it mean that the task we are interested in is more important than scenarios we are not interested in?

Any idea how to interpret this sentence from OpenAI?

Tags: openai-gpt, language-model, nlp



This may be best understood with a bit more context from the article:

A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the pretraining objective. Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest.

I think that the relevant part of the reference [RRS20] is this paragraph:

Recently, Guu et al.(2020) found that a “salient span masking” (SSM) pre-training objective produced substantially better results in open-domain question answering. This approach first uses BERT (Devlin et al., 2018) to mine sentences that contain salient spans (named entities and dates) from Wikipedia. The question answering model is then pre-trained to reconstruct masked-out spans from these sentences, which Guu et al. (2020) hypothesize helps the model “focus on problems that require world knowledge”. We experimented with using the same SSM data and objective to continue pretraining the T5 checkpoints for 100,000 additional steps before fine-tuning for question answering.

With that context in mind, I understand the sentence in the GPT-3 paper to mean that in normal language models, the prediction of every token has the same importance weight in the computation of the loss, because the individual token losses are added together in an unweighted manner. This is in contrast to the salient span masking approach, which identifies tokens that are important to predict by means of BERT-based preprocessing.
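To make the contrast concrete, here is a hedged sketch of what a weighted per-token loss could look like. The `salience_weights` tensor is hypothetical: in an SSM-style setup it would come from something like the BERT-based salient-span mining described in [RRS20], which is exactly the "notion of importance" the plain objective lacks.

```python
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, targets, salience_weights):
    """Per-token cross-entropy where each position carries its own weight.

    logits:           (batch, seq_len, vocab_size) model outputs
    targets:          (batch, seq_len) next-token ids
    salience_weights: (batch, seq_len), e.g. 1.0 for tokens inside a
                      salient span (named entity, date) and 0.0 or a
                      small value elsewhere.
    """
    per_token_nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    # Weighted average: tokens deemed important dominate the loss,
    # unlike the unweighted mean used by the standard LM objective.
    return (salience_weights * per_token_nll).sum() / salience_weights.sum()
```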
