Prompt Engineering for Large Language Models
A brief guide with examples for non-technical readers
Andrew Kean Gao, Stanford University

Original link:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4504303


Contents

Introduction
"Think Step by Step"
Few-shot Learning
Chain of Thought
Ask for Code
Role Prompting
Prompt Hacking
Considerations
Recommended Resources
References

Introduction

With the popularization of software like OpenAI's ChatGPT and Google's Bard, large language models (LLMs) have pervaded many aspects of life and work. For instance, ChatGPT can be used to provide customized recipes, suggesting substitutions for missing ingredients. It can be used to draft research proposals, write working code in many programming languages, translate text between languages, assist in policy making, and more (Gao 2023). Users interact with large language models through "prompts", or natural language instructions. Carefully designed prompts can lead to significantly better outputs. In this review, common strategies for LLM prompt engineering will be explained. Additionally, considerations, recommended resources, and current directions of research on LLM prompt engineering will be discussed. Prompt engineering strategies based on finetuning will not be covered. The goal of this article is to introduce practical and validated prompt engineering techniques to a non-technical audience.

"Think Step by Step"

One of the most famous (and easiest to implement) prompt engineering techniques is to simply add "Think step by step" to the end of a prompt. Researchers from the University of Tokyo and Google found that adding this phrase boosted the accuracy of GPT-3 (the text-davinci-002 model) on several tasks. For instance, it increased accuracy on the MultiArith test from 17.7% to 78.7% (Kojima 2022). MultiArith questions are arithmetic questions that require multiple steps to solve. Prystawski and collaborators suggest an explanation for why and how "think step by step" works so effectively (Prystawski 2023). It has been anecdotally observed that "think step by step" is less helpful (adds less value) on more advanced GPT models like GPT-4.

Figure 1. Example GPT-4 response after being prompted with "think step by step".
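
Below is a minimal sketch of how this technique can be applied programmatically, assuming the openai Python package (v1 or later) and an API key in the environment; the model name and the example question are illustrative, not taken from the paper.

```python
# Minimal sketch: append the trigger phrase to an ordinary prompt.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

question = (
    "A juggler has 16 balls. Half of the balls are golf balls, and half of "
    "the golf balls are blue. How many blue golf balls are there?"
)

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model choice
    messages=[{"role": "user", "content": question + "\nThink step by step."}],
)
print(response.choices[0].message.content)
```

Without the added phrase, models often jump straight to a (sometimes wrong) answer; with it, the response typically walks through the halving steps before concluding "4".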

Few-shot Learning

Few-shot learning is a fancy way of saying "give the LLM examples of what you want". By providing examples of the outputs you desire, the LLM is better able to produce the desired output (Zhao 2021, Brown 2020). This may be partially attributed to the fact that a prompt generally has many possible valid outputs (it is under-determined), so providing specific examples of what you are looking for helps constrain the potential output space. It is important to ensure that the examples are diverse and balanced. For instance, imagine you are prompting GPT to perform a sentiment classification task (predicting whether a sentence is positive or negative). If you provide eight examples and seven are positive, this could bias GPT towards predicting that sentences are positive. Additionally, it is important that the examples cover the scenarios you are interested in. For example, if you only instruct GPT with examples labeled "positive" and "negative", it may never classify sentences as neutral and instead force them into either "positive" or "negative".

Figure 2. GPT-4 provides a long response that is not in the desired format. This also consumes more tokens, taking more time and driving up costs.

Figure 3. By providing two examples, GPT-4 understands to provide a concise response that assigns the provided sentence to "Positive".
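
The sketch below shows one way to build a balanced few-shot prompt for this sentiment task, again assuming the openai Python package (v1 or later); the example sentences, labels, and model name are illustrative.

```python
# Minimal sketch of a balanced few-shot prompt for sentiment classification.
# One example per label (including "Neutral") avoids biasing the model.
from openai import OpenAI

client = OpenAI()

examples = [
    ("I loved the movie!", "Positive"),
    ("The service was terrible.", "Negative"),
    ("The package arrived on Tuesday.", "Neutral"),
]

# Demonstrate the exact output format, then leave the last slot blank.
prompt = "\n".join(f"Sentence: {s}\nSentiment: {label}" for s, label in examples)
prompt += "\nSentence: The food was delicious.\nSentiment:"

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected: Positive
```

Because the examples demonstrate the exact "Sentence:/Sentiment:" format, the model tends to reply with a single label rather than the long unformatted response shown in Figure 2.
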
Chain of Thought

Similar to "think step by step", Chain of Thought prompting guides LLMs to break complicated tasks down into multiple intermediate steps (Wei 2022, Wang 2022). It is inspired by how humans solve complex problems: by dividing them into simpler steps. A Chain of Thought prompt provides a walkthrough of a complex problem. For instance, to make an LLM better at solving math word problems, the user provides an example solution worked out step by step. The idea is that the LLM refers to that step-by-step reasoning when solving the new problem. In general, Chain of Thought prompting is useful for solving complex problems but provides minimal or no benefit on simple ones. Research suggests that separating each step in your sample reasoning with a new line gives significantly better results than separating steps with periods (Fu 2023).

Figure 4. GPT-4 will follow reasoning chains provided in the prompt and apply them to solve complex multi-step problems.

Ask for Code

While LLMs struggle to perform complex calculations accurately, they excel at writing code that can. A simple strategy is to ask the LLM to write code that solves the problem, then run that code in a development environment like Google Colab or Visual Studio Code (Weng 2023). Not all LLMs are trained to write code, however. LLMs also tend to be better at programming languages that are widely used and extensively documented on the Internet, such as Python, because they are able to learn from the countless Python examples in their training data. Conversely, LLMs tend to be weaker at more obscure languages such as OCaml. However, it is possible to finetune LLMs on a specific programming language, or to augment their training datasets with more examples of it.

Figure 5. While GPT-4 cannot factorize large numbers accurately, it can easily provide Python code that can.
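
As a concrete illustration of this strategy, the snippet below is the kind of program an LLM typically produces for the factorization task in Figure 5; the choice of the sympy library and the input number are assumptions for illustration.

```python
# Minimal sketch: run LLM-written code instead of trusting LLM arithmetic.
# Assumes `pip install sympy`.
from sympy import factorint

n = 991 * 997 * 1009  # an arbitrary large composite number
print(factorint(n))   # maps each prime factor to its exponent:
                      # {991: 1, 997: 1, 1009: 1}
```

The point of the strategy is that the user runs this code themselves (for example, in Google Colab), obtaining an exact answer the model could not reliably compute on its own.
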
Role Prompting

Some users report better results when telling the LLM that it is an expert in a relevant field (Learn Prompting 2023). For instance, "You are an expert in coding." is prepended to the prompt when the user desires code. In a very "hand-wavy" way, one theory is that this strategy helps the LLM focus and know which parts of its knowledge to "bubble up" to the top. Role prompting is also a simple way to nudge LLMs towards generating text in a specific creative style, such as that of a particular author.

Figure 6. GPT-4 writes a poem in the style of American poet E. E. Cummings, who was known for his unique syntax.

Prompt Hacking

Given that many LLMs, such as ChatGPT, are moderated and finetuned to prevent the generation of explicit or harmful content, many strategies have been developed to trick LLMs into bypassing their restrictions. Notably, users were able to get Bing's LLM, Sydney, to generate harmful content and reveal its hidden instructions (Warren 2023). Another popular example of prompt hacking, or prompt injection, is telling the model to "ignore all your previous instructions" and do something else (Shane 2022); this is used to discover what prompt an LLM is running behind the scenes. However, due to ethical concerns, and because these hacks are usually patched quickly and made irrelevant, specific prompt hacks will not be catalogued here. Some consider it an open question whether LLMs can ever become fully resistant to prompt injection. Greshake and collaborators provide a review of the different types of prompt hacks (Greshake 2023).

Considerations

One of the main limitations of prompting is the context length of LLMs, which is essentially how much input an LLM can consider and generate. Context lengths are rapidly increasing: GPT-4 offers a 32,000-token (~24,000-word) context length, and Anthropic's Claude offers a 100,000-token (~75,000-word) context length. Some users have reported deteriorating performance as more tokens are provided in the prompt. Another consideration for prompt engineering is cost. For example, few-shot prompting can multiply the length of a prompt several times over, leading to higher costs. OpenAI's GPT-4 model costs $0.03 USD per 1,000 input tokens, which can add up quickly. In commercial applications like LLM-powered educational technology, streamlining prompts to be as cost-effective as possible may be a priority. It also seems that prompting becomes less important as LLMs become more advanced (more parameters, more training data); it is not clear whether this trend will continue indefinitely or whether prompting will always be useful. Finally, a lot of anecdotal prompt engineering advice rests on weak foundations, so it is important to do diligent research and understand what works and what does not.
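
To make the cost consideration concrete, here is a rough back-of-the-envelope estimate using the $0.03 per 1,000 input tokens figure quoted above; the prompt size and traffic volume are hypothetical.

```python
# Rough cost sketch for input tokens only, at GPT-4's quoted input price.
PRICE_PER_1K_INPUT_TOKENS = 0.03  # USD, figure quoted in the text above

prompt_tokens = 2_500   # e.g., a few-shot prompt with several long examples
calls_per_day = 10_000  # hypothetical traffic for a commercial application

daily_cost = prompt_tokens / 1_000 * PRICE_PER_1K_INPUT_TOKENS * calls_per_day
print(f"${daily_cost:,.2f} per day")  # $750.00 per day, before output tokens
```

Trimming the same prompt to 500 tokens would cut that figure to $150 per day, which is why prompt streamlining matters at scale.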

Recommended Resources

Lilian Weng's prompt engineering guide is more technical but has many useful examples and references: https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/

Learn Prompting's open source prompt engineering course: https://learnprompting.org/docs/intro

References

Brown, Tom, et al. "Language Models Are Few-Shot Learners." Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1877–901, proceedings.neurips.cc/paper_files/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.

Fu, Yao, et al. "Complexity-Based Prompting for Multi-Step Reasoning." 2023, arxiv.org/pdf/2210.00720.pdf. Accessed 8 July 2023.

Gao, Andrew. "Implications of ChatGPT and Large Language Models for Environmental Policymaking." Social Science Research Network, 4 July 2023, https://doi.org/10.2139/ssrn.4499643. Accessed 8 July 2023.

Greshake, Kai, et al. "Not What You've Signed up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." ArXiv.org, 5 May 2023, https://doi.org/10.48550/arXiv.2302.12173. Accessed 8 July 2023.

Kojima, Takeshi, et al. "Large Language Models Are Zero-Shot Reasoners." ArXiv:2205.11916 [cs], May 2022, arxiv.org/abs/2205.11916.

Learn Prompting. "Learn Prompting: Your Guide to Communicating with AI." Learnprompting.org, 2023, learnprompting.org/docs/basics/roles. Accessed 8 July 2023.

Prystawski, Ben, et al. "Why Think Step by Step? Reasoning Emerges from the Locality of Experience." ArXiv.org, 19 May 2023, https://doi.org/10.48550/arXiv.2304.03843.

Shane, Janelle. "Ignore All Previous Instructions." AI Weirdness, 23 Sept. 2022, www.aiweirdness.com/ignore-all-previous-instructions/. Accessed 8 July 2023.