Pre-training of large language models [6]: Chain-of-Thought (CoT), Zero-shot CoT, Few-shot CoT, and their application to LLMs

1. Definition of chain of thought

  • Background

Between 2017 and 2019, with the introduction of the Transformer architecture and the continued growth of computing resources and large-scale corpora, the field of natural language processing changed dramatically. The traditional fully supervised learning paradigm gradually hit a bottleneck, and it became difficult to achieve substantial gains with conventional training methods. The emergence of large-scale pre-trained models such as BERT and RoBERTa then shifted research toward the paradigm of pre-trained model plus downstream-task fine-tuning.

However, as language models kept growing, the cost of fine-tuning rose with them. Take GPT-3 as an example: its parameter count reached an astonishing 175B. At that scale it is difficult to adapt the model effectively through traditional fine-tuning alone, and gradient backpropagation over so many parameters becomes prohibitively expensive. Against this background, prompt learning emerged. Prompt learning reformulates downstream tasks and injects expert knowledge so that the inputs and outputs of the target task more closely match the data on which the language model was originally trained.

In 2021, prompt learning went through several stages, beginning with discrete prompt learning (combinations of prompt words), reviving with continuous prompt learning (representations in a continuous space), and gradually reaching a climax. However, prompt learning based on continuous spaces also has many limitations, such as heavy resource consumption and unstable training. During this period, although most researchers agreed that prompt learning would bring the next revolution in natural language processing, most of the research work still focused on model training or new language-model architectures.

By 2022, the gains from large-scale language models had become visible to the naked eye. As model size kept increasing, models also responded better to prompts, with breakthroughs on tasks that previously could not be handled well. Yet large models still fell short on arithmetic reasoning, commonsense reasoning, and symbolic reasoning. Their in-context few-shot ability is extremely strong, but writing many intermediate steps for supervised fine-tuning is very time-consuming, and traditional prompting methods perform poorly on mathematical calculation and commonsense reasoning. How to combine in-context few-shot learning with intermediate steps to improve arithmetic, commonsense, and symbolic reasoning was an open problem. The line of work on chain of thought was born in this environment.

  • Definition

The concept of Chain-of-Thought (CoT) was first proposed in Google's paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". CoT is an improved prompting strategy for boosting the performance of LLMs on complex reasoning tasks such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning. Instead of simply constructing prompts from input-output pairs as in ICL, CoT incorporates the intermediate reasoning steps that lead to the final output into the prompt. Put simply, chain of thought is a form of discrete prompt learning. More specifically, it builds on in-context learning with a large model: without any training, examples are prepended to the current input, and the model completes the task from the text x1, y1, x2, y2, ..., xtest, producing ytest.

It can be seen that for such arithmetic problems, chain-of-thought prompting makes the model give the reasoning steps before the answer:

"Roger first has 5 balls, 2 cans and 3 tennis balls equals 6, 5 + 6 = 11" "There were 23 apples in the
cafeteria, 20 were used for lunch, 23-20=3; 6 more apples were bought, 3+6=9"

The chain-of-thought prompt yields the correct answer, whereas traditional prompt learning, which produces the answer directly, gets it wrong; the model cannot even handle very basic arithmetic. Simply put, it is difficult for a language model to translate all the semantics of a problem into a single equation at once, since that is a complex thinking process, but it can reason about each part of the problem well through intermediate steps.
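To make the contrast concrete, here is a minimal Python sketch that builds a standard few-shot prompt and a CoT prompt for the cafeteria question above. It is an illustration rather than the paper's code, and it only constructs the prompt strings; sending them to a model is left to whatever completion API is available.

```python
# A minimal sketch comparing a standard few-shot prompt with a
# chain-of-thought prompt for the same question. It only builds the
# prompt strings; pass them to any completion API.

QUESTION = (
    "The cafeteria had 23 apples. If they used 20 to make lunch "
    "and bought 6 more, how many apples do they have?"
)

# Standard few-shot prompt: the demonstration maps input straight to output.
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"
    f"Q: {QUESTION}\nA:"
)

# CoT prompt: the same demonstration, augmented with intermediate steps.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
    f"Q: {QUESTION}\nA:"
)

print(standard_prompt)  # in the paper, this style of prompt led to "27" (wrong)
print(cot_prompt)       # this style led the model to reason its way to "9"
```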

An effective chain of thought should have the following characteristics:

  • Logical: each step in the chain of thought should follow logically from the others, so that the steps connect into a complete reasoning process.
  • Comprehensive: the chain of thought should consider the problem as thoroughly and carefully as possible, ensuring that no relevant factor or influence is overlooked.
  • Feasible: every step in the chain of thought should be feasible, that is, practical and implementable.
  • Verifiable: every step in the chain of thought should be verifiable, that is, its correctness and validity can be checked against actual data and facts.

2. Chain-of-thought methods for in-context learning (ICL)

2.1 Few-shot CoT

Few-shot CoT is a special case of ICL that augments each demonstration 〈input, output〉 into 〈input, CoT, output〉 by inserting the CoT reasoning steps.

  • [Designing the CoT prompt]

    • As a straightforward approach, it has been shown that using diverse CoTs (i.e., multiple reasoning paths per question) can effectively improve performance.
    • Another intuitive idea is that prompts with more complex reasoning paths are more likely to elicit the reasoning ability of the LLM, which can lead to higher accuracy in generating the correct answer.
      However, both methods rely on annotated CoT datasets, which limits their application in practice. To overcome this limitation, Auto-CoT proposes to leverage Zero-shot-CoT, eliminating manual effort by prompting the LLM itself to generate CoT reasoning paths. To improve performance, Auto-CoT further divides the questions in the training set into clusters and then selects the question closest to the center of each cluster, which should represent the training questions well (a clustering sketch appears after this list). Although few-shot CoT can be viewed as a special case of ICL prompting, the order of demonstrations seems to matter relatively little compared with standard ICL prompting: reordering the demonstrations changes performance by less than 2% on most tasks.
  • [Enhanced CoT strategies]
    Beyond enriching the contextual information, CoT prompting offers more options for inferring the answer to a given question. Existing research mainly focuses on generating multiple reasoning paths and looking for consensus among the derived answers. For example, self-consistency has been proposed as a new decoding strategy for generating the CoT and the final answer: it first samples several reasoning paths and then aggregates all the answers, e.g., by voting across these paths for the most consistent answer (a minimal sketch follows this list). Self-consistency improves CoT reasoning performance considerably, and even improves some tasks where CoT prompting is normally worse than standard prompting. Furthermore, the self-consistency strategy can be extended to a more general ensemble framework (ensembling over prompts), and finding diverse reasoning paths turns out to be the key to improving CoT reasoning performance.
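A minimal sketch of self-consistency decoding follows, assuming two hypothetical helpers: `sample_cot`, which returns one stochastically sampled CoT completion per call, and `extract_answer`, which parses the final answer out of a completion.

```python
# Minimal sketch of self-consistency decoding: sample several reasoning
# paths, then majority-vote over the extracted final answers.
# `sample_cot` and `extract_answer` are hypothetical helpers.
from collections import Counter

def self_consistent_answer(prompt, sample_cot, extract_answer, n_paths=10):
    answers = []
    for _ in range(n_paths):
        completion = sample_cot(prompt)        # stochastic (temperature > 0)
        answers.append(extract_answer(completion))
    # the most consistent answer across paths wins
    return Counter(answers).most_common(1)[0][0]
```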
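Auto-CoT's demonstration-selection step can likewise be sketched. This is an illustrative reconstruction rather than the authors' code, and it assumes a hypothetical `embed` function mapping a question to a fixed-size vector.

```python
# Illustrative reconstruction of Auto-CoT's demonstration selection:
# cluster the training questions, then take the question nearest to each
# cluster center as a representative demonstration.
# `embed` (question -> fixed-size vector) is an assumed encoder.
import numpy as np
from sklearn.cluster import KMeans

def select_demo_questions(questions, embed, k=8):
    vectors = np.stack([embed(q) for q in questions])
    km = KMeans(n_clusters=k, n_init=10).fit(vectors)
    demos = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(vectors[members] - km.cluster_centers_[c], axis=1)
        demos.append(questions[members[np.argmin(dists)]])
    # Zero-shot-CoT is then used to generate a rationale for each question
    return demos
```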

2.2 Zero-shot CoT

Unlike Few-shot CoT, Zero-shot CoT does not include human-annotated task demonstrations in the prompt. Instead, it directly generates the reasoning steps and then uses the generated CoT to derive the answer: the LLM first generates the reasoning steps from the prompt "Let's think step by step", and then derives the final answer from the prompt "Therefore, the answer is". The authors found that this strategy greatly improves performance once the model exceeds a certain size, but is ineffective for small models, showing a clear pattern of emergent ability.
To unlock CoT ability on more tasks, Flan-T5 and Flan-PaLM further perform instruction tuning on CoT annotations, which improves zero-shot performance on unseen tasks.
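The two-stage prompting described above can be sketched as follows; `complete` is a hypothetical stand-in for a text-completion call (prompt in, continuation out).

```python
# Minimal sketch of the two-stage Zero-shot-CoT procedure.
# `complete` is a hypothetical text-completion call (prompt -> continuation).

def zero_shot_cot(question, complete):
    # Stage 1: elicit the reasoning steps.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = complete(reasoning_prompt)

    # Stage 2: append the reasoning and ask for the final answer.
    answer_prompt = f"{reasoning_prompt}{reasoning}\nTherefore, the answer is"
    return complete(answer_prompt)
```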

3. Conclusion

  • CoT has little effect on small models; gains appear only once the model reaches roughly 10B parameters, and become pronounced around 100B. Moreover, the outputs of small models show that they mostly produce fluent but illogical CoTs and therefore reach wrong answers.

  • The performance gain from CoT is larger on complex problems: for example, GPT-3 and PaLM more than double their performance on GSM8K (the hardest task, with the lowest baseline), while on MAWPS-SingleOp (a simpler task) the improvement is very small or even negative.

  • With CoT, PaLM 540B exceeds the state-of-the-art results of task-specific models trained with supervised learning. Without CoT, the results of LLMs on GSM8K and MAWPS are not comparable to the best supervised models.

A chain of thought is the typical series of steps a human follows when solving a reasoning task. It helps decompose a problem into a series of sub-problems, which can then be solved one by one to reach the final answer. In large language models, chains of thought can be used to elicit reasoning. The chain-of-thought approach brings the following benefits:

  • CoT allows the model to decompose multi-step reasoning problems into intermediate steps, which means additional computation can be allocated to complex problems that require reasoning;
  • CoT makes large language models more interpretable and more trustworthy, and provides an opportunity to debug errors in the reasoning path;
  • CoT reasoning can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation, and is potentially applicable to any problem that humans solve through language;
  • CoT can elicit reasoning in sufficiently large language models simply by including chain-of-thought examples in few-shot prompts.

The current chain-of-thought approach still has many limitations:

  • First, although a designed chain of thought simulates the human reasoning process, whether the model has really learned to reason still needs further verification.
  • Manually designing chains of thought remains too expensive, and large-scale manual annotation of chains of thought is not feasible.
  • Chains of thought are only effective on large-scale models (above roughly 10B parameters).

4. Thoughts on the future of chain of thought

  • (1) When CoT is useful for LLMs

Since CoT is an emergent ability, it has a positive impact only on sufficiently large models (typically 10B parameters or more) and no effect on small models. Moreover, because CoT augments standard prompts with intermediate reasoning steps, it is mainly effective for tasks that require step-by-step reasoning, such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning. For tasks that do not rely on complex reasoning, such as MNLI-m/mm, SST-2, and QQP from GLUE, it may perform worse than standard prompting.

  • (2) Why LLMs can perform CoT reasoning

Regarding the origin of the CoT ability, it is widely hypothesized to come from training on code, since models trained on code show strong reasoning ability. Intuitively, code data is well organized by algorithmic logic and programming flow, which may help improve the reasoning performance of LLMs. However, this hypothesis still lacks publicly reported ablation evidence. Furthermore, instruction tuning does not appear to be the key to obtaining the CoT ability, since empirical results show that instruction tuning on non-CoT data does not improve performance on held-out CoT benchmarks.

In conclusion, CoT prompting provides a general and flexible way to elicit the reasoning ability of LLMs. There have also been initial attempts to extend the technique to multimodal and multilingual tasks. Besides using LLMs directly with ICL and CoT, some recent studies explore how to specialize the abilities of an LLM for specific tasks, known as model specialization. For example, researchers have studied the mathematical reasoning ability of LLMs by fine-tuning a small Flan-T5 on CoT reasoning paths generated by a large LLM. Model specialization can also be applied to tasks such as question answering, code synthesis, and information retrieval.

5. Key knowledge points

  1. The characteristics an effective chain of thought should have are: logic, comprehensiveness, feasibility, and verifiability.

  2. Chains of thought can only work in large language models.

  3. Few-shot CoT is a special case of ICL.

  4. Zero-shot CoT does not include human-labeled task demonstrations in the prompt.

  5. CoT makes large language models more interpretable and more trustworthy.
