Training objective of the language model in GPT-3
On page 34 of OpenAI's GPT-3 paper, there is a sentence pointing out a limitation of the objective function:
Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important.
I am not sure I understand this correctly. In my understanding, the objective is to maximize the log-likelihood of the next token given the current context, i.e., $\max L \sim \sum_{i} \log P(x_{i} \mid x_{<i})$. Although we aim to predict every token that appears in the training text, the tokens follow a certain distribution based on their frequency in human literature, and therefore we do not actually assign equal weight to every token during loss optimization.
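To make the "equal weight" point concrete, here is a minimal sketch (using PyTorch with random placeholder tensors, not GPT-3 itself) of how the standard language-modelling loss averages the per-token negative log-likelihoods with uniform weight across positions:

```python
import torch
import torch.nn.functional as F

# Toy illustration: the standard LM objective is the mean cross-entropy over
# all positions, so every token position contributes with the same weight.
# (logits and targets here are random placeholders, not a real model.)
vocab_size, seq_len = 100, 8
logits = torch.randn(seq_len, vocab_size)           # model outputs at each position
targets = torch.randint(0, vocab_size, (seq_len,))  # the "next token" at each position

# Per-position negative log-likelihood: -log P(x_i | x_<i)
per_token_nll = F.cross_entropy(logits, targets, reduction="none")

# The training loss averages these with equal weight, regardless of whether
# the target is a frequent function word ("the") or a rare content word.
loss = per_token_nll.mean()
print(per_token_nll, loss)
```

So while rare tokens may incur a higher loss on average, each position still contributes equally to the sum, which is how I read the "weights every token equally" statement.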
And what would be an example of a model having a notion of what is important and what is not? What does "importance" refer to here? For example, does it mean that "the" is less important than a less common noun, or does it mean that the task we are currently interested in is more important than a scenario we are not interested in?
Any ideas on how to interpret this sentence from OpenAI?
Topic openai-gpt language-model nlp
Category Data Science