[Paper Reading] Language Models are Few-Shot Learners (GPT-3)

Foreword

This post briefly introduces the background, model architecture, training data, and training methods of GPT-3. The paper contains many specific training details and experimental results, which you can consult in the original when you need them.

Intro

The paper analyzes the problems of the pretrain-finetune paradigm:

  • For each new task, a large amount of labeled data is required
  • Fine-tuning a highly expressive model (pre-training requires a large model) on a narrow data distribution is problematic: the gains of the large model do not generalize to out-of-distribution (OOD) data
  • Humans do not need large numbers of training samples when facing a new task; a description of the task or a few examples is enough. We would like NLP models to switch between many tasks just as seamlessly

Possible solutions to the above problems:

  • meta-learning: the model acquires a broad set of skills and abilities during pre-training; at prediction time, this ability is leveraged to adapt quickly to the downstream task.

    • In-context learning has already been tried along these lines, but the results so far are not strong

    [Figure: in-context learning within the outer loop of pre-training (language model meta-learning)]

  • LLM: every increase in the parameter count of Transformer language models improves text understanding and other downstream NLP tasks, and there is evidence that the log loss follows a smooth trend as model size grows (see the form sketched below). We believe that in-context learning ability will also improve as model parameters increase
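
For reference, the smooth trend mentioned above is the power-law relation between validation loss and model size reported in the scaling-law work the paper builds on; with N the number of (non-embedding) parameters it takes roughly the form below (the constants N_c and alpha_N are empirical fits, quoted here only for illustration):

```latex
% Power-law scaling of language-model loss with parameter count
% (form from the scaling-law literature GPT-3 builds on; the constants are empirical fits)
L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}
```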

We trained GPT-3, a 175B-parameter model, and tested its performance under 3 settings:

  • few-shot learning (in-context learning): a few examples (typically 10 to 100) are allowed in the model's input
  • one-shot learning: only one example is allowed
  • zero-shot learning: no examples are allowed, only a natural language instruction is provided

The figure below shows the performance of the model above on the task of removing extra symbols from words

[Figure: performance on the symbol-removal task under the zero-shot, one-shot, and few-shot settings]

  • GPT-3 achieves good results in the zero-shot and one-shot settings, and in the few-shot setting it sometimes matches or even exceeds fine-tuned SOTA models
  • In the zero-shot and one-shot settings, GPT-3 also does well on tasks that test rapid adaptation and on-the-fly reasoning (unscrambling words, arithmetic, and using a novel word in a sentence after seeing it defined only once)
  • In the few-shot setting, GPT-3 can generate news articles that humans find difficult to distinguish from human-written ones
  • In the few-shot setting, GPT-3's performance on some natural language inference tasks (the ANLI dataset) and reading comprehension tasks (RACE, QuAC) still leaves room for improvement
  • The overall performance on different benchmarks is shown in the figure below

[Figure: aggregate performance across benchmarks as a function of model size under the three settings]

We also trained a series of smaller models (from 125 million to 13 billion parameters) for comparison with GPT-3. For most tasks, performance scales relatively smoothly with model size under all 3 settings. However, as model capacity grows, the few-shot setting pulls ahead of one-shot and zero-shot by a widening margin, suggesting that larger models are more proficient meta-learners.

Approach

The pre-training approach is similar to GPT-2, except that a larger model, more data, greater data diversity, and longer training are used; the in-context learning setup is also similar. However, this paper systematically explores different settings for in-context learning, which can be arranged along a spectrum of how much task-specific data they depend on:

  • Fine-tuning: this paper does not fine-tune GPT-3, because the main focus is on task-agnostic performance
  • Few-shot: a few examples are provided at prediction time, but no parameter updates are performed. The number of examples is 10 to 100 (however many fit in the context window)
  • One-shot: only one example is provided
  • Zero-shot: no examples are provided, only a natural language instruction describing the task

The figure below shows the prompt format under the different settings for the task of translating English into French

[Figure: example prompts for English-to-French translation under the zero-shot, one-shot, and few-shot settings]
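
As a rough illustration of how these prompts are assembled (a minimal sketch; the task description, the "=>" separator, and the example pairs are illustrative stand-ins, not the paper's exact formatting):

```python
def build_prompt(task_description, examples, query):
    """Assemble a zero-/one-/few-shot prompt for in-context learning.

    No gradient updates are involved: the examples are simply concatenated
    into the context window ahead of the query to be completed.
    """
    parts = [task_description]
    for source, target in examples:  # 0 examples = zero-shot, 1 = one-shot, K = few-shot
        parts.append(f"{source} => {target}")
    parts.append(f"{query} =>")      # the model continues the text from here
    return "\n".join(parts)

# Hypothetical few-shot prompt for English-to-French translation
prompt = build_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(prompt)
```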

The different settings in this paper are not meant to be compared head-to-head or to replace one another. Rather, they are different problem formulations that, on specific benchmarks, trade off performance against sample efficiency.

Model and Architectures

The model structure, initialization method, pre-normalization, and tokenization are the same as in GPT-2, except that the Transformer layers use alternating dense and locally banded sparse attention patterns similar to the Sparse Transformer (a simplified sketch of a banded attention mask follows below). The parameter settings of the different models are shown in the table below

[Table: sizes, architectures, and learning hyperparameters of the GPT-3 model family]

  • The context window size of all models is 2048 tokens
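
For intuition about the locally banded sparse attention mentioned above, here is a minimal NumPy sketch of a causal, banded attention mask (a simplified illustration of the idea, not GPT-3's actual alternating dense/sparse implementation):

```python
import numpy as np

def banded_causal_mask(seq_len, window):
    """Boolean mask: position i may attend to positions j with i - window < j <= i.

    A dense causal layer corresponds to window = seq_len; a locally banded
    sparse layer restricts attention to a recent window of tokens.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = banded_causal_mask(seq_len=8, window=3)
print(mask.astype(int))  # each row shows which earlier positions that token can attend to
```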

Training Dataset

The Common Crawl dataset contains nearly a trillion words, and one pass through the dataset is enough to train our largest models.

  • However, the raw dataset is of low quality without cleaning, so the following three steps were used to clean the data:
    • Download a version of the dataset and filter it based on similarity to a range of high-quality reference corpora
    • Perform fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and to keep the held-out validation set an accurate measure of overfitting (see the sketch after this list)
    • Add known high-quality reference corpora to the training mix to augment Common Crawl and increase its diversity
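
The paper does not publish its cleaning code; below is a minimal sketch of what document-level fuzzy deduplication with MinHash signatures can look like. The shingle size, number of hash functions, and similarity threshold are illustrative assumptions, not the paper's settings.

```python
import hashlib

def shingles(text, n=5):
    """Word n-gram shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(doc_shingles, num_hashes=64):
    """MinHash signature: for each hash seed, keep the minimum hash over all shingles."""
    if not doc_shingles:  # guard against documents shorter than one shingle
        return [0] * num_hashes
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in doc_shingles)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def fuzzy_dedup(docs, threshold=0.8):
    """Keep a document only if it is not too similar to any previously kept document."""
    kept, signatures = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, s) < threshold for s in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept

print(fuzzy_dedup([
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "an entirely different document about language model training data",
]))
```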

The mixing ratios of the training data are shown in the table below

[Table: training-data mixture and sampling weights]

  • The data is not sampled in proportion to dataset size during training; higher-quality datasets are sampled more often (see the sketch below)
  • Common Crawl and Books2 are seen less than once over the course of training, while the other datasets are seen 2-3 times
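
A minimal sketch of this kind of non-proportional mixture sampling; the dataset names and weights below roughly follow the proportions in the paper's table, but treat them as illustrative:

```python
import random

# Sampling weight = fraction of training examples drawn from each source.
# Because the weights are not proportional to dataset size, a small
# high-quality corpus can be repeated 2-3 times while a huge corpus
# (e.g. filtered Common Crawl) is seen less than once.
mixture = {
    "common_crawl_filtered": 0.60,  # illustrative weights
    "webtext2": 0.22,
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.03,
}

def sample_source(mixture):
    """Pick the source dataset for the next training document."""
    names, weights = zip(*mixture.items())
    return random.choices(names, weights=weights, k=1)[0]  # weights are normalized internally

counts = {name: 0 for name in mixture}
for _ in range(10_000):
    counts[sample_source(mixture)] += 1
print(counts)  # roughly matches the mixture weights
```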

Training Process

  • Prior studies show that larger models can typically use larger batch sizes but require smaller learning rates. This paper measures the gradient noise scale during training to guide the choice of batch size (a sketch of this estimate follows this list)

  • Distributed training uses a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network

  • Training was done on V100 GPUs
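
The gradient-noise-scale heuristic in the first bullet comes from earlier work on large-batch training that the paper cites; below is a minimal sketch of the "simple" noise-scale estimate computed from per-example gradients (the variable names and the per-example formulation are my own illustration, not the paper's code):

```python
import numpy as np

def simple_noise_scale(per_example_grads):
    """Estimate the 'simple' gradient noise scale B ~= tr(Sigma) / |g|^2.

    per_example_grads: array of shape (batch, num_params), one flattened
    gradient per training example. A large noise scale suggests that a
    larger batch size would still meaningfully reduce gradient noise.
    """
    g = per_example_grads.mean(axis=0)                 # estimate of the true gradient
    trace_sigma = per_example_grads.var(axis=0).sum()  # sum of per-parameter gradient variances
    return trace_sigma / (np.dot(g, g) + 1e-12)

# Toy usage with random numbers standing in for real per-example gradients
grads = 0.05 + 0.1 * np.random.randn(32, 1000)
print(simple_noise_scale(grads))
```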

Origin blog.csdn.net/qq_52852138/article/details/131135947