Complex Reasoning: The "North Star" Capabilities of Large Language Models

(Figure: a long-exposure star-trail photograph. Polaris sits at the center of the trails, always pointing toward true north; in ancient times it guided travelers.)


Author | Fu Yao

PhD student at the University of Edinburgh

Recently, many studies on smaller models have achieved exciting dialogue abilities, leading people to wonder whether smaller models can perform comparably to large models like GPT-3.5. In general, language models have capabilities along many dimensions, which makes comparing models difficult. Finding the right metric is critical for developing strong language models. At the current stage, researchers are eager to know which key factors measure the potential of large language models.

In the blog post released alongside GPT-4, the authors wrote: "In a casual conversation, the difference between GPT-3.5 and GPT-4 can be subtle. The difference shows up when the complexity of the task reaches a sufficient threshold." This means that complex tasks are likely to be the key differentiator between large and small language models.

More importantly, complex reasoning opens up the opportunity to build a large number of applications on top of language models, giving language models the chance to become the next-generation computing platform/operating system. This has the potential to fundamentally change the way humans interact with machines and reshape the entire computing ecosystem.

In this article, we take a careful look at how to equip large language models with strong complex reasoning capabilities.

1. Motivation: Large Language Models as a Next-Generation Computing Platform

We study complex reasoning for two reasons:

  • As mentioned above, complex reasoning is the key factor that marks the difference between small and large models, as discussed in the GPT-4 release blog post.

  • Complex reasoning is the core capability a model needs in order to become a next-generation operating system.

The vision of language models as a next-generation operating system is particularly interesting, as it opens up countless possibilities for building new applications and creating a language-model-based computing ecosystem that may offer even greater opportunities than super-apps such as ChatGPT. Complex reasoning ability is fundamental, because if we want the model to become the new operating system, it needs to be able to complete complex instructions by interacting with tools, users, and every element of the external environment.

This article examines how to train models with strong complex reasoning ability, how to do prompt engineering to fully exploit the model's reasoning ability, and how to evaluate the model's reasoning performance. The content is divided into the following parts:

  • In Section 2, we discuss existing methods for building language models with strong complex reasoning ability. The recipe for complex reasoning is similar to the recipe for general large language model (LLM) development and consists of three stages: continual training, instruction fine-tuning, and reinforcement learning. We further discuss the surprising coupling between code and reasoning.

  • In Section 3, we discuss prompt engineering techniques for complex reasoning. When language models become the kernel of a new generation of operating systems, prompt engineering / in-context learning will become the new generation of scripting (shell scripting).

  • In Section 4, we discuss how to evaluate the reasoning ability of large language models. We introduce the Chain-of-thought Hub, a collection of more than 100 reasoning tasks that clearly marks the difference between large and small models. We highlight the excellent performance of LLaMA 65B and argue that it has very strong potential as a base model for reproducing a ChatGPT-level (GPT-3.5) model.

2. A Recipe for Improving the Reasoning Ability of Large Language Models

The recipe for reasoning is closely related to the recipe for building general large language models and chatbots. There are three stages in total:

  • Pre-training / continual training: in this stage, we typically train a large model on large corpora such as scientific literature or code data.

  • Supervised fine-tuning: in this stage, we fine-tune the model so that it follows instructions for complex tasks.

  • Reinforcement learning: in this stage, we use signals such as whether the task was fully or partially completed as rewards.

We further review the hypothesis that training on code also improves reasoning ability. Therefore, our literature analysis covers both reasoning and coding. As we shall see, the two are surprisingly correlated in terms of their training recipes.

2.1 Pre-training and continual training

We analyze the following studies:

  • Lewkowycz et al. 2022. Minerva: Solving Quantitative Reasoning Problems with Language Models

    • Continue training PaLM 540B on 38.5B tokens from arXiv papers.

    • On MATH (a difficult dataset whose answers must be written in LaTeX), it scores 33.6 (GPT-4 scores 42.5).

  • Taylor et al. 2022. Galactica: A Large Language Model for Science

    • Pre-train a 120B language model on 106B tokens of papers, code, reference material, knowledge bases, and other scientific content.

    • MATH performance: 20.4 (Minerva: 33.6, GPT-4: 42.5).

  • Chen et al. 2021. Codex: Evaluating Large Language Models Trained on Code

    • Continue training a 12B GPT-3 model on 159GB of code data, significantly improving coding performance on the HumanEval dataset.

These studies find that training on a large amount of scientific literature / code significantly improves the reasoning / coding ability of the base model.

2.2 Supervised fine-tuning

We analyze:

  • Chung et al. 2022. Scaling Instruction-Finetuned Language Models

    • Using diverse instructions significantly improves the model's zero-shot generalization ability.

    • Mixing chain-of-thought data into the instruction collection (discussed further in the Flan Collection paper) significantly improves the model's chain-of-thought ability.

    • Note: although the Flan Collection data boosts the base model's capabilities along many dimensions, these instructions do not come from real chatbot user interactions, so they may not translate directly into better chat performance.

  • Fu et al. 2023. Specializing Smaller Language Models towards Multi-Step Reasoning

    • Distilling chain-of-thought reasoning ability into smaller-scale (≤10B) models. Typically, models at the 10B scale are well suited for deployment (larger models are too expensive, smaller models are too weak).

    • This paper discusses many engineering details such as data engineering, capability balancing, and the differences between small and large models.

  • Li et al. 2022. Competition-Level Code Generation with AlphaCode

    • Pre-train a 41B model on 715GB of GitHub code, then fine-tune on the CodeContests dataset of 13k competition problems.

    • At test time, sample many solutions and filter them by whether they pass the example tests (a minimal sketch of this filtering appears after this list). In a sense, this approach is similar to self-consistency for reasoning problems.
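As a rough sketch of this sample-then-filter idea (not AlphaCode's actual pipeline; `sample_program` is a hypothetical placeholder for a model call, and candidates are assumed to be Python programs that read stdin and write stdout):

```python
import subprocess
import sys

def sample_program(problem_statement):
    """Placeholder: sample one candidate program (as a string) from the model."""
    raise NotImplementedError

def passes_example_tests(program_src, example_tests):
    """Run a candidate program against the visible example tests.

    example_tests: list of (stdin, expected_stdout) pairs.
    """
    for stdin, expected in example_tests:
        result = subprocess.run([sys.executable, "-c", program_src],
                                input=stdin, capture_output=True,
                                text=True, timeout=5)
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

def generate_and_filter(problem_statement, example_tests, n_samples=100):
    """Sample many candidates and keep only those that pass the example tests."""
    survivors = []
    for _ in range(n_samples):
        candidate = sample_program(problem_statement)
        try:
            if passes_example_tests(candidate, example_tests):
                survivors.append(candidate)
        except Exception:
            continue  # crashes and timeouts disqualify the candidate
    return survivors
```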

The current understanding of instruction fine-tuning is:

  • With data in a conversational format, it is relatively easy to fine-tune a base model into a chatbot (see great examples such as Alpaca and MOSS). However, the ability to chat does not translate into the ability to perform complex tasks. In this respect, models are like humans: talk is cheap, and the code tells the truth.

  • In fact, instruction tuning is a data-mixing problem: how to best mix instruction data from different sources so that model performance improves uniformly across all dimensions (rather than improving one dimension at the expense of another, as discussed in the CoT specialization and Flan Collection papers).

  • A simple starting point for data mixing: use 10-20 non-chain-of-thought data points (to keep the other dimensions balanced), but use as much chain-of-thought data as possible (to maximize reasoning ability); a minimal sketch follows.
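As a concrete illustration of this starting point, here is a minimal sketch of assembling such a mixed instruction-tuning set. The file names, the `task` field, and the reading of the 10-20 figure as a per-task cap are all assumptions for illustration, not a prescription:

```python
import json
import random

def load_jsonl(path):
    """Load one instruction-tuning example per line, e.g. {"task", "instruction", "response"}."""
    with open(path) as f:
        return [json.loads(line) for line in f]

# Hypothetical file names; substitute your own instruction collections.
cot_data = load_jsonl("cot_instructions.jsonl")           # chain-of-thought rationales
non_cot_data = load_jsonl("general_instructions.jsonl")   # direct-answer instructions

# Keep a modest slice of non-CoT data per task to balance general abilities,
# but take all available chain-of-thought data to maximize reasoning.
NON_COT_PER_TASK = 15  # somewhere in the 10-20 range suggested above

by_task = {}
for example in non_cot_data:
    by_task.setdefault(example.get("task", "default"), []).append(example)

mixed = list(cot_data)
for task, examples in by_task.items():
    random.shuffle(examples)
    mixed.extend(examples[:NON_COT_PER_TASK])

random.shuffle(mixed)
with open("mixed_instruction_data.jsonl", "w") as f:
    for example in mixed:
        f.write(json.dumps(example) + "\n")
```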


2.3 Reinforcement Learning

We analyze:

  • Uesato et al. 2022. Solving math word problems with process- and outcome-based feedback

    • Build reward models based on intermediate reasoning steps and on the final reasoning result.

  • Le et al. 2022. CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning

    • Train reward models based on signals such as compilation errors, runtime errors, or passing tests.

Both works use intermediate signals (for reasoning, whether the intermediate steps are correct; for coding, whether the code compiles) and final signals (for reasoning, whether the final answer is correct; for coding, whether the code passes the tests) as rewards. Note that this kind of reinforcement learning differs from reinforcement learning from human feedback (RLHF), because it does not require human feedback.
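As a rough illustration of how such rewards could be wired up (a sketch of the general idea, not the exact reward schemes of the two papers above; the weights and signal definitions are assumptions):

```python
def reasoning_reward(step_scores, final_answer_correct,
                     step_weight=0.5, final_weight=0.5):
    """Combine a process-based signal and an outcome-based signal into one reward.

    step_scores: 0/1 judgments for each intermediate reasoning step
                 (e.g., from a learned process reward model).
    final_answer_correct: whether the final answer matches the reference.
    """
    process_term = sum(step_scores) / len(step_scores) if step_scores else 0.0
    outcome_term = 1.0 if final_answer_correct else 0.0
    return step_weight * process_term + final_weight * outcome_term

def coding_reward(compiles, runs_without_error, tests_passed, tests_total):
    """Analogous reward for code: compilation and runtime are intermediate
    signals, and the unit-test pass rate is the final signal."""
    if not compiles:
        return -1.0
    if not runs_without_error:
        return -0.5
    return tests_passed / max(tests_total, 1)

# Example: a solution with 3 of 4 correct steps and a correct final answer.
print(reasoning_reward([1, 1, 1, 0], final_answer_correct=True))  # 0.875
```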

2.4 Coupling of Reasoning Ability and Code Ability

In our previous discussion, we hypothesized that training on code might improve reasoning ability, for the following reasons:


  • Code comments are naturally occurring chain-of-thought data.

  • Procedure-oriented programming is similar to solving a task step by step. This suits tasks of low and medium complexity.

  • Object-oriented programming is akin to decomposing a task into smaller sub-tasks and solving them individually. This suits tasks of higher complexity.
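To make the analogy concrete, here is a toy illustration (our own example, not training data from any of the papers above): the commented, step-by-step function reads like a chain of thought, while the decomposed version mirrors splitting a hard problem into sub-problems.

```python
# Procedure-oriented style: the comments read like a chain of thought.
def average_speed(distance_km, hours, minutes):
    # Step 1: convert the travel time into hours.
    total_hours = hours + minutes / 60
    # Step 2: divide distance by time to get the speed.
    return distance_km / total_hours

# Decomposed style: split the task into sub-tasks, solve each one, then compose.
def to_hours(hours, minutes):
    return hours + minutes / 60

def speed(distance_km, total_hours):
    return distance_km / total_hours

def average_speed_decomposed(distance_km, hours, minutes):
    return speed(distance_km, to_hours(hours, minutes))

print(average_speed(150, 2, 30))             # 60.0
print(average_speed_decomposed(150, 2, 30))  # 60.0
```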

Based on this remarkable agreement, we see that improving reasoning ability is very similar to improving coding ability. Here, we deepen this hypothesis by highlighting the similarity between the recipes for training large language models for reasoning and for coding.

Both reasoning and coding go through the same stages:

  • In the continual training stage, the base model's abilities can be boosted with code and scientific literature data.

  • In the supervised fine-tuning stage, the model can be fine-tuned on instructions for complex tasks or on code that completes complex tasks.

  • In the reinforcement learning stage, the correctness of intermediate reasoning steps / compilation, together with the correctness of the final answer / the test pass rate, serve as rewards.

  • During decoding, both reasoning and coding sample multiple solutions and then select the best one from the decoding space.

These similarities make the connection between code and reasoning very interesting.

3. Prompt Engineering for Complex Reasoning

Having discussed how to build models with strong reasoning ability, in this section we discuss how to prompt the model effectively so as to fully unleash its potential.

3.1 Basic chain-of-thought prompting

The following papers are recommended for beginners:

  • Wei et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.

    • This paper was the first to find that, when prompted with chain-of-thought, there is a phase transition in which large models substantially outperform smaller ones, which further led to the discovery of emergent abilities.

  • Wang et al. 2022. Self-Consistency Improves Chain of Thought Reasoning in Language Models

    • Majority voting over multiple sampled CoT reasoning paths significantly improves reasoning performance (a minimal sketch combining CoT prompting with self-consistency appears after this list).

  • Suzgun et al. 2022. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    • Applies CoT to the hard tasks in BIG-Bench. A useful by-product of this paper is the BIG-Bench Hard (BBH) dataset, which is very effective for testing model reasoning ability.
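The sketch below combines the two basic ideas above: a few-shot chain-of-thought prompt plus self-consistency, i.e., sample several reasoning paths and take a majority vote over the extracted answers. The `sample_completion` function is a hypothetical placeholder for whatever LLM API you use, and the exemplar and answer-extraction heuristic are illustrative only.

```python
import re
from collections import Counter

COT_PROMPT = """Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. How many tennis balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

def sample_completion(prompt, temperature=0.7):
    """Placeholder: call your LLM API here with sampling enabled and return the text."""
    raise NotImplementedError

def extract_answer(text):
    """Heuristic: take the last number in the completion as the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def self_consistency(question, n_samples=10):
    prompt = COT_PROMPT.format(question=question)
    answers = []
    for _ in range(n_samples):
        completion = sample_completion(prompt)   # one sampled reasoning path
        answer = extract_answer(completion)
        if answer is not None:
            answers.append(answer)
    # Majority vote over the sampled answers.
    return Counter(answers).most_common(1)[0][0] if answers else None
```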


3.2 Advanced techniques and analysis

The following papers discuss advanced CoT prompting practices:

  • Fu et al. 2023. Complexity-Based Prompting for Multi-Step Reasoning

    • Use complex chains rather than simple chains as in-context examples.

  • Khot et al. 2023. Decomposed Prompting: A Modular Approach for Solving Complex Tasks

    • Break down complex tasks into simpler tasks and tackle them one by one

In general, for a complex task, first decompose it into simpler sub-tasks, then solve the sub-tasks one by one; a minimal sketch follows.
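A minimal sketch of this decompose-then-solve pattern might look like the following, where both the decomposition and the sub-task solving are delegated to the model (the `llm` function and the prompts are illustrative assumptions, not the prompts used in the paper):

```python
def llm(prompt):
    """Placeholder for an LLM call that returns plain text."""
    raise NotImplementedError

def decompose(task):
    """Ask the model to break a complex task into numbered sub-tasks."""
    plan = llm(f"Break the following task into numbered sub-tasks:\n{task}")
    sub_tasks = []
    for line in plan.splitlines():
        line = line.strip()
        if line and line[0].isdigit() and "." in line:
            sub_tasks.append(line.split(".", 1)[1].strip())
    return sub_tasks

def solve(task):
    """Solve each sub-task in turn, feeding earlier answers back in as context."""
    context = ""
    for sub_task in decompose(task):
        answer = llm(f"{context}\nSub-task: {sub_task}\nAnswer:")
        context += f"\nSub-task: {sub_task}\nAnswer: {answer}"
    return llm(f"{context}\n\nNow give the final answer to the original task: {task}")
```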

The following papers discuss why in-context learning works:

  • Xie et al. 2021. An Explanation of In-context Learning as Implicit Bayesian Inference

    • The language model infers a latent concept shared by the examples in the prompt and enters the corresponding task mode.

  • Wei et al. 2023. Larger language models do in-context learning differently

    • When in-context examples contradict the model's prior knowledge, large models can override their semantic priors and follow the examples in the prompt, even though they may hold stronger semantic priors.

In a nutshell, the gist of in-context learning is that the examples in the prompt put the model into the corresponding task mode, and it then performs the task.

The following papers discuss the model's behavior when doing chain-of-thought reasoning:

  • Min et al. 2022. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

    • When some labels are wrong, the model can still make correct predictions. This suggests that the model is more influenced by the [format] of the prompt than by the [meaning] of the prompt.

  • Wang et al. 2022. Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters

    • Even if the reasoning in the prompt exemplars is wrong, the model can still reason correctly; the relevance of the exemplars and the order of the reasoning steps matter more. Again, the model is influenced more by the [format] of the prompt than by its [meaning].

  • Madaan and Yazdanbakhsh 2022. Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango.

    • A detailed analysis showing that the format of the prompt improves CoT reasoning, while the correctness of the content may not play a strong role.

In short, the model mainly attends to the format of the prompt, and may not be strongly affected by the prompt's correctness. However, the extent to which the model is affected by the correctness of the prompt, or the extent to which the prompt can override the model's prior beliefs, remains an open question.

The following papers discuss how to improve model performance through refinement and feedback:

  • Madaan et al. 2023. Self-Refine: Iterative Refinement with Self-Feedback

    • The model can refine and improve its own output across multiple tasks (including code optimization, mathematical reasoning, dialogue response generation, etc.).

  • Madaan et al. 2023. Learning Performance-Improving Code Edits

    • Training on program trajectories improves coding ability.

In short, refinement and feedback in the form of natural language (rather than rewards, as in reinforcement learning) are very effective for further improving language model performance (whether via in-context learning or fine-tuning).
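As an illustration of the general generate-feedback-refine loop (a sketch only; the stopping criterion and prompts are assumptions, not the ones used in Self-Refine):

```python
def llm(prompt):
    """Placeholder for an LLM call that returns plain text."""
    raise NotImplementedError

def self_refine(task, max_iters=3):
    """Generate an answer, then repeatedly critique and refine it in natural language."""
    answer = llm(f"Task: {task}\nAnswer:")
    for _ in range(max_iters):
        # Ask the model to critique its own answer in natural language.
        feedback = llm(f"Task: {task}\nAnswer: {answer}\n"
                       "Point out any mistakes in this answer, or say 'no mistakes'.")
        if "no mistakes" in feedback.lower():
            break
        # Fold the feedback back in and ask for an improved answer.
        answer = llm(f"Task: {task}\nPrevious answer: {answer}\n"
                     f"Feedback: {feedback}\nImproved answer:")
    return answer
```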

4. Evaluating the Reasoning Ability of Large Language Models

Having discussed the recipes for training strong models and the prompting techniques, we now discuss how to evaluate the reasoning ability of language models.

4.1 Basics of evaluation methods

When talking about evaluation, there are three important factors to consider: the data format, the type of capability, and the type of model. First, there are four data formats for prompting:

  • In-context refers to attaching a series of contextual examples before the test question.

  • Zero-shot refers to directly feeding test questions into the model without contextual examples.

  • Chain-of-thought refers to generating reasoning before answering.

  • Answer-only means there is no chain-of-thought; the answer is given directly.
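Concretely, the four formats differ only in how the prompt is assembled. The templates below (using a made-up grade-school math question and exemplar) show one way to lay them out:

```python
QUESTION = "A farm has 3 hens and each hen lays 2 eggs a day. How many eggs per day?"

EXEMPLAR_Q = "Tom has 2 boxes with 4 apples each. How many apples does he have?"
EXEMPLAR_COT = "Each box has 4 apples and there are 2 boxes. 2 * 4 = 8. The answer is 8."
EXEMPLAR_ANS = "The answer is 8."

# Zero-shot, answer-only: just the test question.
zero_shot_answer_only = f"Q: {QUESTION}\nA:"

# Zero-shot, chain-of-thought: add a reasoning trigger ("Let's think step by step").
zero_shot_cot = f"Q: {QUESTION}\nA: Let's think step by step."

# In-context, answer-only: prepend exemplars with direct answers.
in_context_answer_only = f"Q: {EXEMPLAR_Q}\nA: {EXEMPLAR_ANS}\n\nQ: {QUESTION}\nA:"

# In-context, chain-of-thought: prepend exemplars with worked-out reasoning.
in_context_cot = f"Q: {EXEMPLAR_Q}\nA: {EXEMPLAR_COT}\n\nQ: {QUESTION}\nA:"
```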

For model capabilities, there are two roughly orthogonal types of capabilities:

  • Knowledge: whether the model knows about the world.

  • Reasoning: whether the model can reason over its knowledge.

These two capabilities are not strictly orthogonal, since some rules of reasoning can also be viewed as a form of knowledge. However, when evaluating, the two show clear differences:

  • Some datasets focus more on assessment of knowledge, such as MMLU, which tests whether a model has knowledge up to university level.

  • Some datasets focus more on evaluating reasoning, such as BBH, which tests whether the model can solve problems step by step.

  • For knowledge, chain-of-thought performs similarly to answer-only (see the FlanPaLM paper).

  • For reasoning, chain-of-thought performs significantly better than answer-only (see the original CoT paper, then the FlanPaLM paper).

In practice, because CoT matches or exceeds answer-only performance, and because CoT is more user-friendly (it shows the user the thought process), modern chatbots always deploy CoT (whatever you ask ChatGPT, it will tell you a pile of its reasoning).

Finally, in terms of evaluation, we distinguish two types of models: checkpoint after pre-training and checkpoint after instruction fine-tuning.

  • Pre-trained checkpoints have the ability to do in-context learning. Most pre-trained models can do in-context answer-only, and some better models can do in-context chain-of-thought (but it is not clear why some pre-trained models can do CoT and others cannot). However, pre-trained checkpoints may not be able to do zero-shot, because they are not trained to do so (though some pre-trained checkpoints can still do zero-shot CoT; see the "Let's Think Step by Step" paper).

  • Checkpoints after instruction fine-tuning have both zero-shot and in-context abilities. Note that if the tuning is not done properly, in-context performance may drop slightly after instruction fine-tuning.

In summary, we recommend in-context chain-of-thought prompting for evaluation:

  • In-context is a better way to evaluate pre-trained checkpoints because it better reveals the model's potential. Zero-shot may underestimate model performance, especially for models that do not support zero-shot chain-of-thought ("let's think step by step").

  • Chain-of-thought prompting is a better way to evaluate reasoning ability because it exploits the model's reasoning performance more fully than answer-only prompting.
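Putting these recommendations together, an evaluation loop over a grade-school-math style test set might look like the following sketch (the `llm` call is a placeholder, the exemplar is made up, and answer extraction uses the common take-the-last-number heuristic):

```python
import re

def llm(prompt):
    """Placeholder for a greedy-decoded LLM call that returns plain text."""
    raise NotImplementedError

FEW_SHOT_COT = (
    "Q: A pencil costs 3 dollars and a pen costs 5 dollars. "
    "How much do 2 pencils and 1 pen cost?\n"
    "A: 2 pencils cost 2 * 3 = 6 dollars. Adding one pen: 6 + 5 = 11. The answer is 11.\n\n"
)

def last_number(text):
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def evaluate(test_set):
    """test_set: list of {"question": str, "answer": str} dicts."""
    correct = 0
    for example in test_set:
        prompt = FEW_SHOT_COT + f"Q: {example['question']}\nA:"
        prediction = last_number(llm(prompt))
        correct += int(prediction == last_number(example["answer"]))
    return correct / max(len(test_set), 1)
```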


4.2 Introduction to Chain-of-thought Hub

https://github.com/FranxYao/chain-of-thought-hub

After covering the evaluation basics, we introduce the Chain-of-thought Hub, a work in progress that aims to become a unified platform for evaluating the reasoning ability of language models. We compiled a suite of complex reasoning tasks including math (GSM8K), science (MATH), symbolic reasoning (BBH), knowledge (MMLU), and more, to measure which models are truly better. Below is the current leaderboard. Although many numbers are still missing, the current results already give a rough ranking of the models:

(Figure: the current Chain-of-thought Hub leaderboard.)

In general:

  • We rank model performance by GSM8K, a classic benchmark measuring chain-of-thought mathematical reasoning. This is not the only metric, but a good interpretation is "how well the model does math while maintaining other general capabilities", which is also very hard.

  • GPT-4 significantly outperforms all other models on GSM8K and MMLU.

  • The 65B LLaMA is very close to text/code-davinci-002, which means that if SFT and RLHF are done correctly on top of it, there is a good chance of reproducing ChatGPT based on the 65B LLaMA.

  • Claude is the only model family comparable to the GPT series.

  • Smaller models, such as FlanT5 11B and LLaMA 7B, lag significantly behind on the leaderboard, implying that complex reasoning may be an ability only of large models.

Further, in the GitHub repository, we include:

  • Detailed experimental setup and analysis of results

  • Scripts to reproduce all results for GPT and Claude

    Give it a try :)

5. Conclusion

In this post, we discuss the reasoning ability of large language models. Complex reasoning matters not only because it is the core point of distinction between stronger and weaker models, but also because it is the foundational ability for models to become next-generation computing platforms/operating systems, making it possible to build a whole new ecosystem on top of large models.

We discuss how to build models with strong reasoning ability: pre-training, supervised fine-tuning, and reinforcement learning. We find that the recipe for improving reasoning is strongly correlated with the recipe for improving coding, which reinforces our earlier hypothesis about the close relationship between reasoning and code. We further discuss advanced prompt engineering techniques and analyses of model behavior during complex reasoning.

Finally, we discuss how to evaluate a model's reasoning ability and introduce the Chain-of-thought Hub, an ongoing project for unified evaluation of language models' reasoning performance.

We hope this article serves as a roadmap for building open-source models with powerful inference capabilities.

Only after millions of idle hours have passed does a truly historic moment appear, a shining hour of humankind. (Stefan Zweig, Decisive Moments in History)


Appendix: More Resources on Large Language Model Reasoning

  • Lil’Log 2023. Prompt Engineering

  • Microsoft Semantic Kernel

  • Prompt Engineering Guide

  • Huang and Chang 2022. Towards Reasoning in Large Language Models: A Survey
