(GPT-3) Language Models are Few-Shot Learners Paper Reading

Paper address: https://arxiv.org/pdf/2005.14165v4.pdf

Summary


        Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text and then fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or simple instructions, something current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question answering, and cloze tasks, as well as tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, and some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss this finding and the broader societal impacts of GPT-3 in general.

1 Introduction

        In recent years there has been a trend in NLP systems toward pre-trained language representations used for downstream transfer in increasingly flexible and task-agnostic ways. First, single-layer representations were learned using word vectors and fed to task-specific architectures; then RNNs with multiple layers of representations and contextual state were used to form stronger representations (though still applied to task-specific architectures); and more recently, pre-trained recurrent or transformer language models have been fine-tuned directly, removing the need for task-specific architectures entirely.
        This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, and textual entailment, and continues to advance with new architectures and algorithms. However, a major limitation of this approach is that while the architecture is task-agnostic, it still requires task-specific datasets and task-specific fine-tuning: achieving strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples. Removing this limitation would be desirable, for several reasons.
        First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. There exists a very wide range of possibly useful language tasks, from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated for every new task.
        Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. This can create problems for the pre-training plus fine-tuning paradigm, where models are designed to be large in order to absorb information during pre-training but are then fine-tuned on very narrow task distributions. For instance, it has been observed that larger models do not necessarily generalize better out of distribution. There is evidence that the generalization achieved under this paradigm can be poor because the model is overly specific to the training distribution and does not generalize well outside it. Thus, the performance of fine-tuned models on specific benchmarks, even when nominally at human level, may exaggerate actual performance on the underlying task.

        Third, humans do not require large supervised datasets to learn most language tasks: a brief directive in natural language (e.g. "tell me whether this sentence describes something happy or something sad") or at most a tiny number of demonstrations (e.g. "here are two examples of people acting bravely; please give a third example of bravery") is often sufficient to enable a person to perform a new task with at least reasonable competence. Aside from pointing to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages: it allows humans to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality.

Figure 1.1: Language model meta-learning. During unsupervised pre-training, a language model develops a broad set of skills and pattern-recognition abilities. It then uses these abilities at inference time to rapidly adapt to or recognize the desired task. We use the term "in-context learning" to describe the inner loop of this process, which occurs within the forward pass on each sequence. The sequences in this figure are not intended to be representative of the data the model sees during pre-training, but to show that repeated sub-tasks are sometimes embedded within a single sequence.


        One potential route toward addressing these problems is meta-learning, which in the context of language models means the model develops a broad set of skills and pattern-recognition abilities at training time and then uses those abilities at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work attempts to do this via what we call "in-context learning", using the text input of a pretrained language model as a form of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task and is then expected to complete further instances of the task simply by predicting what comes next.
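        To make this mechanism concrete, below is a minimal sketch of in-context learning as pure prompt construction. The `build_prompt` helper and the `=>` separator are illustrative assumptions, not anything specified by the paper; the resulting string would simply condition a single forward pass of the model, with no weight updates anywhere.

```python
# A minimal sketch of in-context learning as pure prompt construction.
# The helper name and the "=>" separator are illustrative assumptions;
# the resulting string is what a single forward pass would condition on.

def build_prompt(instruction, demonstrations, query):
    """Serialize an instruction, some demonstrations, and a final query."""
    lines = [instruction]
    for context, completion in demonstrations:
        lines.append(f"{context} => {completion}")
    lines.append(f"{query} =>")  # the model predicts what comes next
    return "\n".join(lines)

# Echoing Figure 1.1's idea of repeated sub-tasks embedded in a sequence:
print(build_prompt("Compute the sum.", [("5 + 8", "13"), ("7 + 2", "9")], "1 + 3"))
```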
        While this has shown some initial promise, the approach still achieves results far inferior to fine-tuning: for example, it achieves only 4% on Natural Questions, and even its 55 F1 CoQA result is now more than 35 points behind the state of the art. Meta-learning clearly requires substantial improvement before it can become a practical method for solving language tasks.
        Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters [DCLT18], to 1.5 billion parameters [RWC+19], to 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and finally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream NLP tasks, and there is evidence that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale [KMH+20]. Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.
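        The smooth trend cited from [KMH+20] is a power law in (non-embedding) parameter count. The sketch below plugs a few of the model sizes above into that form; the constants are the approximate values reported in [KMH+20] and should be treated as assumptions of this sketch, not numbers from the GPT-3 paper itself.

```python
# Power-law form L(N) = (N_c / N) ** alpha_N from [KMH+20], where N is the
# non-embedding parameter count. The constants below are approximate values
# from that paper, treated here as assumptions.

ALPHA_N = 0.076
N_C = 8.8e13

def predicted_loss(n_params):
    """Predicted language-modeling loss (nats/token) for a model of n_params."""
    return (N_C / n_params) ** ALPHA_N

for n in [1e8, 1.5e9, 1.7e10, 1.75e11]:
    print(f"N = {n:8.2e} -> L(N) = {predicted_loss(n):.3f}")
```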

        In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call GPT-3, and measuring its in-context learning abilities. Specifically, we evaluate GPT-3 on more than 20 NLP datasets, as well as several novel tasks designed to test rapid adaptation to tasks unlikely to be directly contained in the training set. For each task, we evaluate GPT-3 under 3 conditions: (a) "few-shot learning", or in-context learning, where we allow as many demonstrations as will fit into the model's context window (typically 10 to 100), (b) "one-shot learning", where we allow only one demonstration, and (c) "zero-shot" learning, where no demonstrations are allowed and only a natural language instruction is given to the model. GPT-3 could also in principle be evaluated in the traditional fine-tuning setting, but we leave this to future work.
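        As a rough sketch of what these three conditions look like in practice, the snippet below builds prompts in the English-to-French translation format of Figure 2.1. The `make_prompt` helper and the `=>` separator are illustrative assumptions; in every setting the model only receives the prompt, with no gradient updates.

```python
# Illustrative prompts for the three evaluation conditions, using the
# English -> French translation format of Figure 2.1. No weights are
# updated in any setting; only the prompt changes.

def make_prompt(task_description, examples, query):
    parts = [task_description]
    for src, tgt in examples:  # K = 0, 1, or roughly 10-100 demonstrations
        parts.append(f"{src} => {tgt}")
    parts.append(f"{query} =>")
    return "\n".join(parts)

examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = make_prompt("Translate English to French:", [], "cheese")
one_shot = make_prompt("Translate English to French:", examples[:1], "cheese")
few_shot = make_prompt("Translate English to French:", examples, "plush giraffe")
```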

Figure 1.2: Larger models make increasingly efficient use of in-context information. We show in-context learning performance on a simple task requiring the model to remove random symbols from a word, both with and without a natural language task description (see Section 3.9.2). The steeper "in-context learning curves" for large models demonstrate an improved ability to learn a task from contextual information. We see qualitatively similar behavior across a wide range of tasks.


        Figure 1.2 illustrates the conditions we study and shows few-shot learning of a simple task requiring the model to remove extraneous symbols from a word. Model performance improves with the addition of a natural language task description, and with K, the number of demonstrations in the model's context. Few-shot learning also improves dramatically with model size. Though the results in this case are particularly striking, the general trends with both model size and number of in-context demonstrations hold for most tasks we study. We emphasize that these "learning" curves involve no gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning.
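        A hedged sketch of this symbol-removal task is below; the symbol set, the one-symbol-per-character corruption scheme, and the `->` separator are assumptions made for illustration, and the paper's exact construction (Section 3.9.2) may differ.

```python
import random

# Illustrative version of the Figure 1.2 task: strip random symbols from a
# word. The symbol set and one-symbol-per-character corruption are
# assumptions; Section 3.9.2 describes the paper's actual setup.

SYMBOLS = "!@#$%^&*"

def corrupt(word, rng):
    """Insert a random symbol after each character of the word."""
    return "".join(ch + rng.choice(SYMBOLS) for ch in word)

def symbol_removal_prompt(demo_words, query, k, describe=True):
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    lines = ["Remove the symbols from the word:"] if describe else []
    for w in demo_words[:k]:  # K in-context demonstrations
        lines.append(f"{corrupt(w, rng)} -> {w}")
    lines.append(f"{corrupt(query, rng)} ->")
    return "\n".join(lines)

print(symbol_removal_prompt(["succession", "inevitably"], "apple", k=2))
```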
        Broadly, on NLP tasks GPT-3 achieves promising results in the zero-shot and one-shot settings, and in the few-shot setting is sometimes competitive with, or even occasionally surpasses, the state of the art (despite the state of the art being held by fine-tuned models). For example, GPT-3 achieves 81.5 F1 on CoQA in the zero-shot setting, 84.0 F1 in the one-shot setting, and 85.0 F1 in the few-shot setting. Similarly, GPT-3 achieves 64.3% accuracy on TriviaQA in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, the last of which is state-of-the-art relative to fine-tuned models operating in the same closed-book setting.
        GPT-3 also displays one-shot and few-shot proficiency at tasks designed to test rapid adaptation or on-the-fly reasoning, including unscrambling words, performing arithmetic, and using novel words in a sentence after seeing them defined only once. We also show that in the few-shot setting, GPT-3 can generate synthetic news articles which human evaluators have difficulty distinguishing from human-generated articles.
        At the same time, we also find some tasks on which few-shot performance struggles, even at the scale of GPT-3. These include natural language inference tasks like the ANLI dataset, and some reading comprehension datasets like RACE or QuAC. By presenting a broad characterization of GPT-3's strengths and weaknesses, including these limitations, we hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed.

Figure 1.3: Aggregate performance across all 42 accuracy-based benchmarks. While zero-shot performance improves steadily with model size, few-shot performance increases more rapidly, suggesting that larger models are more proficient at in-context learning. See Figure 3.8 for a more detailed analysis of SuperGLUE, a standard NLP benchmark suite.


        A heuristic sense of the overall results can be seen in Figure 1.3, which aggregates the various tasks (though it should not be seen as a rigorous or meaningful benchmark in itself).

        We also undertake a systematic study of "data contamination", a growing problem when training high-capacity models on datasets such as Common Crawl, which can potentially include content from test datasets simply because such content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify its distorting effects. Although we find that data contamination has minimal effect on GPT-3's performance on most datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these datasets or note them with an asterisk, depending on the severity.
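        The paper's contamination tooling is based on n-gram overlap between benchmark examples and the training corpus; the minimal sketch below is only illustrative, with the 13-gram length, lower-casing, and whitespace tokenization all being simplifying assumptions rather than the paper's exact procedure.

```python
# Minimal sketch of contamination checking via n-gram overlap. The n-gram
# length, lower-casing, and whitespace tokenization here are simplifying
# assumptions, not the paper's exact procedure.

def ngrams(text, n=13):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(test_example, train_ngrams, n=13):
    """Flag a test example if any of its n-grams also occurs in training data."""
    return not ngrams(test_example, n).isdisjoint(train_ngrams)

# Usage: build train_ngrams once over the training corpus, then filter or
# asterisk benchmark examples that collide with it.
```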
        In addition to all of the above, we also train a series of smaller models (ranging from 125 million to 13 billion parameters) in order to compare their performance to GPT-3 in the zero-, one-, and few-shot settings. Broadly, for most tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.
        Finally, given the broad spectrum of capabilities displayed by GPT-3, we discuss concerns about bias, fairness, and broader societal impacts, and attempt a preliminary characterization of GPT-3 in this regard.
        The remainder of this paper is organized as follows. In Section 2, we describe our approach and methods for training GPT-3 and evaluating it. Section 3 presents results on the full range of tasks in the zero-, one-, and few-shot settings. Section 4 addresses questions of data contamination (train-test overlap). Section 5 discusses limitations of GPT-3. Section 6 discusses broader impacts. Section 7 reviews related work, and Section 8 concludes.

2 Methods

        Our basic pre-training approach, including model, data, and training, is similar to the process described in [RWC+19], with relatively straightforward scaling up of model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to [RWC+19], but in this work we systematically explore different settings for learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings in which we will evaluate GPT-3, or could in principle evaluate it. These settings can be seen as lying on a spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this spectrum (see Figure 2.1 for an illustration):

Figure 2.1: Zero-shot, one-shot, and few-shot, contrasted with traditional fine-tuning. The panels above show four methods for performing a task with a language model; fine-tuning is the traditional method, while the zero-shot, one-shot, and few-shot settings we study in this work require the model to perform the task with only a forward pass at test time. We typically present the model with a few dozen examples in the few-shot setting. Exact phrasings for all task descriptions, examples, and prompts can be found in Appendix G.

  • Fine-tuning (FT) has been the most common approach in recent years, and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out of distribution [MPL19], and the potential to exploit spurious features of the training data [GSL+18, NK19], potentially resulting in an unfair comparison with human performance. In this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be fine-tuned in principle, and this is a promising direction for future work.
  • Few-Shot (FS) is the term we will use in this work to refer to the setting where the model is given a few demonstrations of the task at inference time as conditioning [RWC+19], but no weight updates are allowed. As shown in Figure 2.1, for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving K examples of context and completion, and then one final example of context, with the model expected to provide the completion. We typically set K in the range of 10 to 100, as this is how many examples can fit in the model's context window (nctx = 2048); a minimal prompt-packing sketch follows this list. The main advantage of few-shot is a major reduction in the need for task-specific data and a reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task-specific data is still required. As indicated by the name, few-shot learning as described here for language models is related to few-shot learning as used in other contexts in ML [HYC01, VBL+16]: both involve learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task.
  • One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task, as shown in Figure 2.1. The reason to distinguish one-shot from few-shot and zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans. For example, when asking humans to generate a dataset on a human worker service such as Mechanical Turk, it is common to give one demonstration of the task. By contrast, it is sometimes difficult to communicate the content or format of a task if no examples are given.
  • Zero-Shot (0S) is the same as one-shot except that no demonstrations are allowed, and the model is only given a natural language instruction describing the task. This method provides maximum convenience, potential for robustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus of pre-training data), but is also the most challenging setting. In some cases it may even be difficult for humans to understand the format of the task without prior examples, so this setting is in some cases "unfairly hard". For example, if someone is asked to "make a table of world records for the 200m dash", the request can be ambiguous, as it may not be clear exactly what format the table should have or what should be included (and even with careful clarification, understanding precisely what is desired can be difficult). Nevertheless, for at least some settings zero-shot is closest to how humans perform tasks; for example, in the translation example in Figure 2.1, a human would likely know what to do from just the text instruction.
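        Below is the prompt-packing sketch referenced in the few-shot item above: choosing how many demonstrations fit within the context window. The whitespace token count is a stand-in assumption for the model's actual BPE tokenizer, and the greedy packing is illustrative rather than a procedure stated in the paper.

```python
# Sketch of fitting K demonstrations into the context window (n_ctx = 2048).
# Whitespace token counting stands in for the model's real BPE tokenizer,
# and the greedy packing is an illustrative assumption.

N_CTX = 2048

def count_tokens(text):
    return len(text.split())  # stand-in for a real tokenizer

def pack_demonstrations(demos, query, budget=N_CTX):
    """Greedily add demonstrations until the prompt would exceed the budget."""
    packed = []
    used = count_tokens(query)
    for demo in demos:  # demos drawn from the task's training split
        cost = count_tokens(demo)
        if used + cost > budget:
            break
        packed.append(demo)
        used += cost
    return "\n".join(packed + [query])
```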

        Figure 2.1 shows the four methods using the example of translating English to French. In this paper we focus on zero-shot, one-shot, and few-shot, with the aim of comparing them not as competing alternatives, but as different problem settings which offer a varying trade-off between performance on specific benchmarks and sample efficiency. We especially highlight few-shot results, as many of them are only slightly behind state-of-the-art fine-tuned models. Ultimately, however, one-shot, or even sometimes zero-shot, seems like the fairest comparison to human performance, and is an important target for future work.
        Sections 2.1-2.3 below detail our model, training data, and training process, respectively. Section 2.4 discusses the details of how we perform few-shot, one-shot, and zero-shot evaluations.

Table 2.1: Sizes, architectures, and learning hyperparameters (batch size in tokens and learning rate) of the models we trained. All models were trained for a total of 300 billion tokens.

 2.1 Model and Architecture
