OpenAI API - Practice

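Contents

- Prompt engineering
  - Messages and roles
  - Using the Responses API
  - Using the Chat Completions API
  - Giving the model additional data to use for generation
- Six strategies for getting better results
  - Write clear instructions
  - Provide reference text
  - Split complex tasks into simpler subtasks
  - Give the model time to "think"
  - Use external tools
  - Test changes systematically
- Tactics
  - Strategy: Write clear instructions
    - Tactic: Include details in your query to get more relevant answers
    - Tactic: Ask the model to adopt a persona
    - Tactic: Use delimiters to clearly indicate distinct parts of the input
    - Tactic: Specify the steps required to complete a task
    - Tactic: Provide examples
    - Tactic: Specify the desired length of the output
  - Strategy: Provide reference text
    - Tactic: Instruct the model to answer using a reference text
    - Tactic: Instruct the model to answer with citations from a reference text
  - Strategy: Split complex tasks into simpler subtasks
    - Tactic: Use intent classification to identify the most relevant instructions for a user query
    - Tactic: For dialogue applications that require very long conversations, summarize or filter previous dialogue
    - Tactic: Summarize long documents piecewise and construct a full summary recursively
  - Strategy: Give the model time to "think"
    - Tactic: Instruct the model to work out its own solution before rushing to a conclusion
    - Tactic: Use inner monologue or a sequence of queries to hide the model's reasoning process
    - Tactic: Ask the model if it missed anything on previous passes
  - Strategy: Use external tools
    - Tactic: Use embeddings-based search to implement efficient knowledge retrieval
    - Tactic: Use code execution to perform more accurate calculations or call external APIs
  - Strategy: Test changes systematically
    - Tactic: Evaluate model outputs with reference to gold-standard answers
- Optimizing model outputs
- Other resources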

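Prompt engineering

Enhance results with prompt engineering strategies.

https://platform.openai.com/docs/guides/prompt-engineering

Prompt engineering is the process of writing prompts that get the right output from a model. You can improve output quality by giving precise instructions, examples, and any necessary context, such as private or specialized information that was not included in the model's training data.

Messages and roles

Create a prompt by providing a messages array that contains instructions for the model. Each message can have a different role, which influences how the model interprets the input.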

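Role	Description	Usage example
`user`	Instructions that request some output from the model. Similar to the messages you would type into ChatGPT as an end user.	Pass your end user's message to the model. `Write a haiku about programming.`
`developer`	Instructions to the model that take priority over user messages, following the chain of command. Previously called the `system` prompt.	Describe how the model should generally behave and respond. `You are a helpful assistant that answers programming questions in the style of a southern belle from the southeast United States.` Now, any response to a `user` message should have a southern belle personality and tone.
`assistant`	A message generated by the model, perhaps in a previous generation request (see the "Conversations" section below).	Provide examples to the model of how it should respond to the current request. For example, to get a model to respond correctly to knock-knock jokes, you might provide a full back-and-forth dialogue of a knock-knock joke.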

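Message roles may help you get better responses, especially if you want a model to follow hierarchical instructions. They are not deterministic, so the best way to use them is to try different approaches and see which ones give you good results.

Here is an example of a developer message that modifies the behavior of the model when it responds to a user message: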

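Using the Responses API

import OpenAI from "openai";

// The client reads the API key from the OPENAI_API_KEY environment variable.
const openai = new OpenAI();

const response = await openai.responses.create({
  model: "gpt-4o",
  input: [
    {
      role: "developer",
      content: "You are a helpful assistant that answers programming questions in the style of a southern belle from the southeast United States.",
    },
    { role: "user", content: "Are semicolons optional in JavaScript?" },
  ],
  store: true,
});

// output_text collects the text of the generated response.
console.log(response.output_text);

This prompt returns text output in the requested rhetorical style: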

Well, sugar, that's a fine question you've got there! Now, in the 
world of JavaScript, semicolons are indeed a bit like the pearls 
on a necklace – you might slip by without 'em, but you sure do look 
more polished with 'em in place. Technically, JavaScript has this little thing called "automatic 
semicolon insertion" where it kindly adds semicolons for you 
where it thinks they oughta go. However, it's not always perfect, 
bless its heart. Sometimes, it might get a tad confused and cause 
all sorts of unexpected behavior.

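Using the Chat Completions API

import OpenAI from "openai";

// The client reads the API key from the OPENAI_API_KEY environment variable.
const openai = new OpenAI();

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "developer",
      content: [
        {
          type: "text",
          text: "You are a helpful assistant that answers programming questions in the style of a southern belle from the southeast United States.",
        },
      ],
    },
    {
      role: "user",
      content: [{ type: "text", text: "Are semicolons optional in JavaScript?" }],
    },
  ],
  store: true,
});

// The generated text is in the first choice's message.
console.log(response.choices[0].message.content);

This prompt returns text output in the requested rhetorical style: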

Well, sugar, that's a fine question you've got there! Now, in the 
world of JavaScript, semicolons are indeed a bit like the pearls 
on a necklace – you might slip by without 'em, but you sure do look 
more polished with 'em in place. Technically, JavaScript has this little thing called "automatic 
semicolon insertion" where it kindly adds semicolons for you 
where it thinks they oughta go. However, it's not always perfect, 
bless its heart. Sometimes, it might get a tad confused and cause 
all sorts of unexpected behavior.

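Giving the model additional data to use for generation

You can also use the message types above to give the model information beyond its training data. You might want to include the results of a database query, a text document, or other resources to help the model generate a relevant response. This technique is often called retrieval-augmented generation, or RAG. Learn more about RAG techniques.

This guide shares strategies and tactics for getting better results from large language models (sometimes referred to as GPT models) such as GPT-4o. The methods described here can sometimes be combined for greater effect. We encourage you to experiment to find the methods that work best for you.

You can also explore example prompts that showcase what our models can do:

Prompt examples

Explore prompt examples to learn what GPT models can do

https://platform.openai.com/examples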

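Six strategies for getting better results

Write clear instructions

These models can't read your mind. If outputs are too long, ask for brief replies. If outputs are too simple, ask for expert-level writing. If you dislike the format, demonstrate the format you would like to see. The less the model has to guess at what you want, the more likely you are to get it.

Tactics: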

Include details in your query to get more relevant answers
Ask the model to adopt a persona
Use delimiters to clearly indicate distinct parts of the input
Specify the steps required to complete a task
Provide examples
Specify the desired length of the output

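Provide reference text

Language models can confidently invent fake answers, especially when asked about esoteric topics or for citations and URLs. In the same way that a sheet of notes can help a student do better on a test, providing reference text to these models can help them answer with fewer fabrications.

Tactics: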

Instruct the model to answer using a reference text
Instruct the model to answer with citations from a reference text

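Split complex tasks into simpler subtasks

Just as it is good practice in software engineering to decompose a complex system into a set of modular components, the same is true of tasks submitted to a language model. Complex tasks tend to have higher error rates than simpler tasks. Furthermore, complex tasks can often be redefined as a workflow of simpler tasks in which the outputs of earlier tasks are used to construct the inputs to later tasks.

Tactics: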

Use intent classification to identify the most relevant instructions for a user query
For dialogue applications that require very long conversations, summarize or filter previous dialogue
Summarize long documents piecewise and construct a full summary recursively

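Give the model time to "think"

If asked to multiply 17 by 28, you might not know the answer instantly, but you can still work it out with time. Similarly, models make more reasoning errors when they try to answer right away than when they take time to work out an answer. Asking for a "chain of thought" before an answer can help the model reason its way toward correct answers more reliably.

Tactics: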

Instruct the model to work out its own solution before rushing to a conclusion
Use inner monologue or a sequence of queries to hide the model’s reasoning process
Ask the model if it missed anything on previous passes

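Use external tools

Compensate for the weaknesses of the model by feeding it the outputs of other tools. For example, a text retrieval system (sometimes called RAG, or retrieval-augmented generation) can tell the model about relevant documents. A code execution engine like OpenAI's Code Interpreter can help the model do math and run code. If a task can be done more reliably or efficiently by a tool than by a language model, offload it to get the best of both.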

Tactics:

Use embeddings-based search to implement efficient knowledge retrieval
Use code execution to perform more accurate calculations or call external APIs
Give the model access to specific functions

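Test changes systematically

Improving performance is easier if you can measure it. In some cases a modification to a prompt will achieve better performance on a few isolated examples but lead to worse overall performance on a more representative set of examples. Therefore, to be sure that a change is net positive to performance, it may be necessary to define a comprehensive test suite (also known as an "eval").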

Tactic:

Evaluate model outputs with reference to gold-standard answers

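Tactics

Each of the strategies listed above can be instantiated with specific tactics. These tactics are meant to provide ideas for things to try. They are by no means fully comprehensive, and you should feel free to try creative ideas not represented here.

Strategy: Write clear instructions

Tactic: Include details in your query to get more relevant answers

To get a highly relevant response, make sure that requests provide any important details or context. Otherwise you are leaving it up to the model to guess what you mean.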


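Worse	Better
How do I add numbers in Excel?	How do I add up a row of dollar amounts in Excel? I want to do this automatically for a whole sheet of rows with all the totals ending up on the right in a column called "Total".
Who's president?	Who was the president of Mexico in 2021, and how frequently are elections held?
Write code to calculate the Fibonacci sequence.	Write a TypeScript function to efficiently calculate the Fibonacci sequence. Comment the code liberally to explain what each piece does and why it's written that way.
Summarize the meeting notes.	Summarize the meeting notes in a single paragraph. Then write a markdown list of the speakers and each of their key points. Finally, list the next steps or action items suggested by the speakers, if any.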

Tactic: Ask the model to adopt a persona

The system message can be used to specify the persona the model uses in its replies.

SYSTEM: When I ask for help to write something, you will reply with a document that contains at least one joke or playful comment in every paragraph.
USER: Write a thank you note to my steel bolt vendor for getting the delivery in on time and in short notice. This made it possible for us to deliver an important order.

According to internal evaluations, the gpt-4.5-preview model performs better when given a specific system message. Add your own system message content after it:

SYSTEM: You are a highly capable, thoughtful, and precise assistant. Your goal is to deeply understand the user's intent, ask clarifying questions when needed, think step-by-step through complex problems, provide clear and accurate answers, and proactively anticipate helpful follow-up information. Always prioritize being truthful, nuanced, insightful, and efficient, tailoring your responses specifically to the user's needs and preferences.

Tactic: Use delimiters to clearly indicate distinct parts of the input

Delimiters such as triple quotation marks, XML tags, and section titles can help demarcate sections of text that should be treated differently.

USER: Summarize the text delimited by triple quotes with a haiku.
"""insert text here"""

SYSTEM: You will be provided with a pair of articles (delimited with XML tags) about the same topic. First summarize the arguments of each article. Then indicate which of them makes a better argument and explain why.
USER: <article> insert first article here </article>
<article> insert second article here </article>

SYSTEM: You will be provided with a thesis abstract and a suggested title for it. The thesis title should give the reader a good idea of the topic of the thesis but should also be eye-catching. If the title does not meet these criteria, suggest 5 alternatives.
USER: Abstract: insert abstract here
Title: insert title here

For straightforward tasks such as these, using delimiters might not make a difference in output quality. However, the more complex a task is, the more important it is to disambiguate task details. Don't make the model work to understand exactly what you are asking of it.
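As a minimal sketch of the article-comparison prompt above, the helper below wraps each input in an XML-style tag before placing it in the user message. The function names `wrapInTag` and `buildArticleComparisonMessages` are illustrative, not part of any SDK, and the role name `system` simply mirrors the SYSTEM label used in the prompts.

```javascript
// Hypothetical helper: wraps a piece of text in an XML-style tag so the
// model can tell where each input section starts and ends.
function wrapInTag(tag, text) {
  return `<${tag}>\n${text.trim()}\n</${tag}>`;
}

// Builds a messages array for the article-comparison prompt shown above.
function buildArticleComparisonMessages(firstArticle, secondArticle) {
  return [
    {
      role: "system",
      content:
        "You will be provided with a pair of articles (delimited with XML tags) about the same topic. " +
        "First summarize the arguments of each article. " +
        "Then indicate which of them makes a better argument and explain why.",
    },
    {
      role: "user",
      content: [
        wrapInTag("article", firstArticle),
        wrapInTag("article", secondArticle),
      ].join("\n\n"),
    },
  ];
}

const comparisonMessages = buildArticleComparisonMessages(
  "Cats are better.",
  "Dogs are better."
);
console.log(comparisonMessages[1].content);
```

Keeping the delimiting logic in one place makes it harder to forget a closing tag when the inputs come from user-supplied text.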

Tactic: Specify the steps required to complete a task

Some tasks are best specified as a sequence of steps. Writing the steps out explicitly can make it easier for the model to follow them.

SYSTEM: Use the following step-by-step instructions to respond to user inputs.
Step 1 - The user will provide you with text in triple quotes. Summarize this text in one sentence with a prefix that says "Summary: ".
Step 2 - Translate the summary from Step 1 into Spanish, with a prefix that says "Translation: ".
USER: """insert text here"""

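Tactic: Provide examples

Providing general instructions that apply to all examples is generally more efficient than demonstrating every permutation of a task by example, but in some cases providing examples may be easier. For example, you may want the model to copy a particular style of responding to user queries that is difficult to describe explicitly. This is known as "few-shot" prompting.

SYSTEM: Answer in a consistent style.
USER: Teach me about patience.
ASSISTANT: The river that carves the deepest valley flows from a modest spring; the grandest symphony originates from a single note; the most intricate tapestry begins with a solitary thread.
USER: Teach me about the ocean.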
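The few-shot prompt above maps directly onto a messages array: each example becomes a `user` message followed by an `assistant` message, and the real query goes last. The sketch below assumes a hypothetical helper name, `buildFewShotMessages`, and an example shape of `{ user, assistant }` chosen for illustration.

```javascript
// Hypothetical helper: turns an instruction, a list of worked examples, and
// the real query into a messages array for few-shot prompting.
function buildFewShotMessages(instruction, examples, query) {
  const messages = [{ role: "system", content: instruction }];
  for (const example of examples) {
    // Each example is replayed as a user turn followed by the desired reply.
    messages.push({ role: "user", content: example.user });
    messages.push({ role: "assistant", content: example.assistant });
  }
  // The actual query comes last so the model answers it in the same style.
  messages.push({ role: "user", content: query });
  return messages;
}

const fewShotMessages = buildFewShotMessages(
  "Answer in a consistent style.",
  [
    {
      user: "Teach me about patience.",
      assistant:
        "The river that carves the deepest valley flows from a modest spring; " +
        "the grandest symphony originates from a single note; " +
        "the most intricate tapestry begins with a solitary thread.",
    },
  ],
  "Teach me about the ocean."
);
console.log(fewShotMessages.map((m) => m.role).join(" > "));
```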

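Tactic: Specify the desired length of the output

You can ask the model to produce outputs of a given target length. The target length can be specified as a count of words, sentences, paragraphs, bullet points, and so on. Note, however, that instructing the model to generate a specific number of words does not work with high precision. The model can more reliably generate outputs with a specific number of paragraphs or bullet points.

USER: Summarize the text delimited by triple quotes in about 50 words.
"""insert text here"""

USER: Summarize the text delimited by triple quotes in 2 paragraphs.
"""insert text here"""

USER: Summarize the text delimited by triple quotes in 3 bullet points.
"""insert text here"""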

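Strategy: Provide reference text

Tactic: Instruct the model to answer using a reference text

If we can provide a model with trusted information that is relevant to the current query, then we can instruct the model to use the provided information to compose its answer.

SYSTEM: Use the provided articles delimited by triple quotes to answer questions. If the answer cannot be found in the articles, write "I could not find an answer."
USER: <insert articles, each delimited by triple quotes>
Question: <insert question here>

Because all models have limited context windows, we need some way to dynamically look up information that is relevant to the question being asked. Embeddings can be used to implement efficient knowledge retrieval. See the tactic "Use embeddings-based search to implement efficient knowledge retrieval" for more details on how to implement this.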
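A minimal sketch of assembling the reference-text prompt above from a list of already retrieved articles. The helper name `buildReferenceMessages` is hypothetical, and how the articles are retrieved is left out of scope here.

```javascript
const TRIPLE_QUOTE = '"""';

// Hypothetical helper: delimits each article with triple quotes and appends
// the question, following the prompt layout shown above.
function buildReferenceMessages(articles, question) {
  const delimitedArticles = articles
    .map((article) => `${TRIPLE_QUOTE}${article.trim()}${TRIPLE_QUOTE}`)
    .join("\n\n");
  return [
    {
      role: "system",
      content:
        "Use the provided articles delimited by triple quotes to answer questions. " +
        'If the answer cannot be found in the articles, write "I could not find an answer."',
    },
    { role: "user", content: `${delimitedArticles}\n\nQuestion: ${question}` },
  ];
}

const referenceMessages = buildReferenceMessages(
  ["The bridge opened in 1932.", "The bridge is 503 meters long."],
  "When did the bridge open?"
);
console.log(referenceMessages[1].content);
```

One caveat with this sketch: if an article itself contains triple quotes, the delimiting becomes ambiguous, so a real application would need to escape or replace them first.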

Tactic: Instruct the model to answer with citations from a reference text

If the input has been supplemented with relevant knowledge, it is straightforward to ask the model to add citations to its answers by referencing passages from the provided documents. Note that citations in the output can then be verified programmatically by string matching within the provided documents.

SYSTEM: You will be provided with a document delimited by triple quotes and a question. Your task is to answer the question using only the provided document and to cite the passage(s) of the document used to answer the question. If the document does not contain the information needed to answer this question then simply write: "Insufficient information." If an answer to the question is provided, it must be annotated with a citation. Use the following format to cite relevant passages ({"citation": …}).
USER: """<insert document here>"""
Question: <insert question here>
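The string-matching check mentioned above could look roughly like the following sketch. It assumes the model followed the `{"citation": "..."}` format exactly, with the cited passage as a JSON string; the function names are illustrative.

```javascript
// Extracts every {"citation": "..."} annotation from the model's answer.
// The quoted passage is parsed as a JSON string so escapes are handled.
function extractCitations(answer) {
  const pattern = /\{\s*"citation"\s*:\s*("(?:[^"\\]|\\.)*")\s*\}/g;
  const citations = [];
  for (const match of answer.matchAll(pattern)) {
    citations.push(JSON.parse(match[1]));
  }
  return citations;
}

// Checks each cited passage against the source text by exact substring match.
function verifyCitations(answer, sourceText) {
  return extractCitations(answer).map((citation) => ({
    citation,
    found: sourceText.includes(citation),
  }));
}

const sourceText = "The bridge opened in 1932. It was repainted in 1988.";
const modelAnswer =
  'It opened in 1932 ({"citation": "The bridge opened in 1932."}) ' +
  'and was later rebuilt ({"citation": "It was rebuilt in 1990."}).';
console.log(verifyCitations(modelAnswer, sourceText));
```

Exact substring matching is strict: a citation that differs from the source only in whitespace or punctuation is reported as not found, so an application may want to normalize both sides before comparing.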

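Strategy: Split complex tasks into simpler subtasks

Tactic: Use intent classification to identify the most relevant instructions for a user query

For tasks that need many independent sets of instructions to handle different cases, it can be beneficial to first classify the type of query and then use that classification to determine which instructions are needed. This can be achieved by defining fixed categories and hardcoding the instructions that are relevant for handling tasks in a given category. This process can also be applied recursively to decompose a task into a sequence of stages. The advantage of this approach is that each query contains only the instructions required to perform the next stage of a task, which can result in lower error rates than using a single query to perform the whole task. It can also lower costs, since larger prompts cost more to run (see pricing information).

Suppose, for example, that for a customer service application, queries could be usefully classified as follows: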

SYSTEM: You will be provided with customer service queries. Classify each query into a primary category and a secondary category. Provide your output in json format with the keys: primary and secondary.

Primary categories: Billing, Technical Support, Account Management, or General Inquiry.

Billing secondary categories:
- Unsubscribe or upgrade
- Add a payment method
- Explanation for charge
- Dispute a charge

Technical Support secondary categories:
- Troubleshooting
- Device compatibility
- Software updates

Account Management secondary categories:
- Password reset
- Update personal information
- Close account
- Account security

General Inquiry secondary categories:
- Product information
- Pricing
- Feedback
- Speak to a human
USER: I need to get my internet working again.

Based on the classification of the customer query, a set of more specific instructions can be provided to a model so that it can handle the next steps. For example, suppose the customer requires help with "troubleshooting".

SYSTEM: You will be provided with customer service inquiries that require troubleshooting in a technical support context. Help the user by:
- Ask them to check that all cables to/from the router are connected. Note that it is common for cables to come loose over time.
- If all cables are connected and the issue persists, ask them which router model they are using
- Now you will advise them how to restart their device:
-- If the model number is MTD-327J, advise them to push the red button and hold it for 5 seconds, then wait 5 minutes before testing the connection.
-- If the model number is MTD-327S, advise them to unplug and replug it, then wait 5 minutes before testing the connection.
- If the customer's issue persists after restarting the device and waiting 5 minutes, connect them to IT support by outputting {"IT support requested"}.
- If the user starts asking questions that are unrelated to this topic then confirm if they would like to end the current chat about troubleshooting and classify their request according to the following scheme:

<insert primary/secondary classification scheme from above here>
USER: I need to get my internet working again.

Notice that the model has been instructed to emit special strings to indicate when the state of the conversation changes. This lets us turn the system into a state machine in which the state determines which instructions are injected. By keeping track of the state, which instructions are relevant in that state, and optionally which state transitions are allowed from it, we can put guardrails around the user experience that would be hard to achieve with a less structured approach.
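A minimal sketch of the routing and state tracking described above. It assumes the classifier returns a JSON object with `primary` and `secondary` keys, as the classification prompt requests. The instruction table, the fallback text, and the state names `troubleshooting` and `escalated` are hypothetical placeholders.

```javascript
// Hypothetical fallback used when the classifier output cannot be routed.
const FALLBACK_INSTRUCTIONS =
  "Ask the customer a clarifying question about what they need help with.";

// Hardcoded instructions per category, keyed as "primary/secondary".
// Only two entries are shown; the texts are abbreviated placeholders.
const INSTRUCTIONS_BY_CATEGORY = {
  "Technical Support/Troubleshooting":
    "You will be provided with customer service inquiries that require troubleshooting in a technical support context.",
  "Billing/Dispute a charge":
    "You will be provided with customer service inquiries about a disputed charge.",
};

// Picks the instruction set for the next query from the classifier's JSON.
function selectInstructions(classifierOutput) {
  let parsed;
  try {
    parsed = JSON.parse(classifierOutput);
  } catch {
    return FALLBACK_INSTRUCTIONS; // the model did not return valid JSON
  }
  if (parsed === null || typeof parsed !== "object") return FALLBACK_INSTRUCTIONS;
  const key = `${parsed.primary}/${parsed.secondary}`;
  return Object.prototype.hasOwnProperty.call(INSTRUCTIONS_BY_CATEGORY, key)
    ? INSTRUCTIONS_BY_CATEGORY[key]
    : FALLBACK_INSTRUCTIONS;
}

// State transition: the special string from the troubleshooting prompt
// moves the conversation into a (hypothetical) "escalated" state.
function nextState(currentState, modelOutput) {
  return modelOutput.includes("IT support requested") ? "escalated" : currentState;
}

console.log(selectInstructions('{"primary": "Technical Support", "secondary": "Troubleshooting"}'));
console.log(nextState("troubleshooting", '{"IT support requested"}'));
```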

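Tactic: For dialogue applications that require very long conversations, summarize or filter previous dialogue

Since models have a fixed context length, a dialogue between a user and an assistant in which the entire conversation is included in the context window cannot continue indefinitely.

There are various workarounds for this problem. One is to summarize previous turns in the conversation. Once the size of the input reaches a predetermined threshold, this can trigger a query that summarizes part of the conversation, and the summary of the prior conversation can be included as part of the system message. Alternatively, the prior conversation can be summarized asynchronously in the background throughout the conversation.

Another solution is to dynamically select the previous parts of the conversation that are most relevant to the current query. See the tactic "Use embeddings-based search to implement efficient knowledge retrieval".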
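The threshold-triggered summarization described above might be sketched as follows. Character count stands in for token count here, which is only a rough proxy, and `summarize` is an injected function; in a real application it would be a model query, while the `placeholderSummarize` stub below exists purely to keep the sketch runnable offline.

```javascript
// Replaces older turns with a summary once the conversation grows past
// maxChars, keeping the most recent keepRecent messages verbatim.
function compactConversation(messages, maxChars, keepRecent, summarize) {
  const totalChars = messages.reduce((sum, m) => sum + m.content.length, 0);
  if (totalChars <= maxChars || messages.length <= keepRecent) {
    return messages; // still under the threshold, nothing to do
  }
  const older = messages.slice(0, messages.length - keepRecent);
  const recent = messages.slice(messages.length - keepRecent);
  return [
    // The summary of the earlier turns travels in a system message.
    { role: "system", content: `Summary of the earlier conversation: ${summarize(older)}` },
    ...recent,
  ];
}

// Stand-in summarizer for illustration only.
function placeholderSummarize(messages) {
  return `${messages.length} earlier messages omitted.`;
}

const longConversation = [
  { role: "user", content: "aaaaaaaaaa" },
  { role: "assistant", content: "bbbbbbbbbb" },
  { role: "user", content: "cccccccccc" },
  { role: "assistant", content: "dddddddddd" },
];
const compacted = compactConversation(longConversation, 25, 2, placeholderSummarize);
console.log(compacted);
```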

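Tactic: Summarize long documents piecewise and construct a full summary recursively

Since models have a fixed context length, they cannot summarize, in a single query, a text longer than the context length minus the length of the generated summary.

To summarize a very long document such as a book, we can use a sequence of queries to summarize each section of the document. Section summaries can be concatenated and summarized, producing summaries of summaries. This process can proceed recursively until the entire document is summarized. If information about earlier sections is needed to make sense of later sections, a further useful trick is to include a running summary of the text that precedes any given point in the book while summarizing the content at that point. The effectiveness of this procedure for summarizing books has been studied in previous research by OpenAI using variants of GPT-3.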
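The recursive procedure above can be sketched as below. Chunking by a fixed character count is a simplification (a real application would more likely split on section boundaries and budget in tokens), and `summarize` is an injected function standing in for a model query. The running-summary variant is not shown.

```javascript
// Splits text into consecutive chunks of at most chunkSize characters.
function splitIntoChunks(text, chunkSize) {
  const chunks = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}

// Summarizes each chunk, joins the partial summaries, and recurses until
// the remaining text fits into a single summarize call.
function summarizeRecursively(text, chunkSize, summarize) {
  if (text.length <= chunkSize) {
    return summarize(text);
  }
  const partialSummaries = splitIntoChunks(text, chunkSize).map((chunk) => summarize(chunk));
  const combined = partialSummaries.join("\n");
  if (combined.length >= text.length) {
    // Guard against infinite recursion if the summaries do not shrink.
    throw new Error("Summaries are not shorter than the input.");
  }
  return summarizeRecursively(combined, chunkSize, summarize);
}

// Stand-in summarizer for illustration: keeps the first 10 characters.
const truncatingSummarizer = (text) => text.slice(0, 10);

const recursiveSummary = summarizeRecursively("a".repeat(100), 40, truncatingSummarizer);
console.log(recursiveSummary);
```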

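Strategy: Give the model time to "think"

Tactic: Instruct the model to work out its own solution before rushing to a conclusion

Sometimes we get better results when we explicitly instruct the model to reason from first principles before coming to a conclusion. Suppose, for example, that we want a model to evaluate a student's solution to a math problem. The most obvious approach is to simply ask the model whether the student's solution is correct.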

SYSTEM: Determine if the student's solution is correct or not.
USER: Problem Statement: I'm building a solar power installation and I need help working out the financials.
- Land costs $100 / square foot
- I can buy solar panels for $250 / square foot
- I negotiated a contract for maintenance that will cost me a flat $100k per year, and an additional $10 / square foot
What is the total cost for the first year of operations as a function of the number of square feet.

Student's Solution: Let x be the size of the installation in square feet.
1. Land cost: 100x
2. Solar panel cost: 250x
3. Maintenance cost: 100,000 + 100x
Total cost: 100x + 250x + 100,000 + 100x = 450x + 100,000
ASSISTANT: The student's solution is correct.

But the student's solution is actually not correct! We can get the model to notice this by prompting it to generate its own solution first.

SYSTEM: First work out your own solution to the problem. Then compare your solution to the student's solution and evaluate if the student's solution is correct or not. Don't decide if the student's solution is correct until you have done the problem yourself.
USER: Problem Statement: I'm building a solar power installation and I need help working out the financials.
- Land costs $100 / square foot
- I can buy solar panels for $250 / square foot
- I negotiated a contract for maintenance that will cost me a flat $100k per year, and an additional $10 / square foot
What is the total cost for the first year of operations as a function of the number of square feet.

Student's Solution: Let x be the size of the installation in square feet.
1. Land cost: 100x
2. Solar panel cost: 250x
3. Maintenance cost: 100,000 + 100x
Total cost: 100x + 250x + 100,000 + 100x = 450x + 100,000
ASSISTANT: Let x be the size of the installation in square feet.
1. Land cost: 100x
2. Solar panel cost: 250x
3. Maintenance cost: 100,000 + 10x
Total cost: 100x + 250x + 100,000 + 10x = 360x + 100,000

The student's solution is incorrect. They made an error in the maintenance cost calculation, using 100x instead of 10x. The correct total cost for the first year of operations is 360x + 100,000.

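Tactic: Use inner monologue or a sequence of queries to hide the model's reasoning process

The previous tactic demonstrates that it is sometimes important for the model to reason in detail about a problem before answering a specific question. For some applications, the reasoning process that a model uses to arrive at a final answer would be inappropriate to share with the user. For example, in tutoring applications we may want to encourage students to work out their own answers, but the model's reasoning about the student's solution could reveal the answer to the student.

Inner monologue is a tactic that can be used to mitigate this. The idea is to instruct the model to put the parts of the output that are meant to be hidden from the user into a structured format that makes them easy to parse. Then, before the output is presented to the user, it is parsed and only part of it is made visible.

SYSTEM: Follow these steps to answer the user queries.
Step 1 - First work out your own solution to the problem. Don't rely on the student's solution since it may be incorrect. Enclose all your work for this step within triple quotes (""").
Step 2 - Compare your solution to the student's solution and evaluate if the student's solution is correct or not. Enclose all your work for this step within triple quotes (""").
Step 3 - If the student made a mistake, determine what hint you could give the student without giving away the answer. Enclose all your work for this step within triple quotes (""").
Step 4 - If the student made a mistake, provide the hint from the previous step to the student (outside of triple quotes). Instead of writing "Step 4 - ..." write "Hint:".
USER: Problem Statement: <insert problem statement>
Student Solution: <insert student solution>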
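The parsing step for the inner-monologue prompt above could be sketched like this. It assumes the model enclosed all hidden work in triple quotes as instructed; returning `null` for an unbalanced delimiter is a design choice made here so that malformed output is withheld rather than shown, not something the source prescribes.

```javascript
const HIDDEN_DELIMITER = '"""';

// Removes every triple-quoted section and returns what is left for the user.
// Returns null if a stray delimiter remains, so that partially hidden
// reasoning is never displayed by accident.
function visibleToUser(modelOutput) {
  const stripped = modelOutput.replace(/"""[\s\S]*?"""/g, "");
  if (stripped.includes(HIDDEN_DELIMITER)) {
    return null;
  }
  return stripped.trim();
}

const rawTutorOutput =
  '"""My solution: 360x + 100,000"""\n' +
  '"""The student used 100x for maintenance instead of 10x."""\n' +
  '"""A hint could point at the per-square-foot maintenance rate."""\n' +
  "Hint: Re-read the maintenance cost per square foot.";
console.log(visibleToUser(rawTutorOutput));
```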

Alternatively, this can be achieved with a sequence of queries in which the output of every query except the last is hidden from the end user.

First, we can ask the model to solve the problem on its own. Since this initial query does not require the student's solution, the solution can be omitted. This has the additional advantage that the model's solution cannot be biased by the student's attempted solution.

USER: <insert problem statement>

Next, we can have the model use all available information to assess the correctness of the student's solution.

SYSTEM: Compare your solution to the student's solution and evaluate if the student's solution is correct or not.
USER: Problem statement: """<insert problem statement>"""
Your solution: """<insert model generated solution>"""
Student's solution: """<insert student's solution>"""

Finally, we can let the model use its own analysis to construct a reply in the persona of a helpful tutor.

SYSTEM: You are a math tutor. If the student made an error, offer a hint to the student in a way that does not reveal the answer. If the student did not make an error, simply offer them an encouraging comment.
USER: Problem statement: """<insert problem statement>"""
Your solution: """<insert model generated solution>"""
Student's solution: """<insert student's solution>"""
Analysis: """<insert model generated analysis from previous step>"""

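Tactic: Ask the model if it missed anything on previous passes

Suppose that we are using a model to list excerpts from a source document that are relevant to a particular question. After listing each excerpt, the model needs to decide whether to start writing another or to stop. If the source document is large, it is common for a model to stop too early and fail to list all relevant excerpts. In that case, better performance can often be obtained by prompting the model with follow-up queries to find any excerpts it missed on previous passes.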

SYSTEM: You will be provided with a document delimited by triple quotes. Your task is to select excerpts which pertain to the following question: "What significant paradigm shifts have occurred in the history of artificial intelligence."

Ensure that excerpts contain all relevant context needed to interpret them - in other words don't extract small snippets that are missing important context. Provide output in JSON format as follows:

[{"excerpt": "..."},
...
{"excerpt": "..."}]
USER: """<insert document here>"""
ASSISTANT: [{"excerpt": "the model writes an excerpt here"},
...
{"excerpt": "the model writes another excerpt here"}]
USER: Are there more relevant excerpts? Take care not to repeat excerpts. Also ensure that excerpts contain all relevant context needed to interpret them - in other words don't extract small snippets that are missing important context.

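Strategy: Use external tools

Tactic: Use embeddings-based search to implement efficient knowledge retrieval

A model can leverage external sources of information if they are provided as part of its input. This can help the model generate more informed and up-to-date responses. For example, if a user asks a question about a specific movie, it may be useful to add high-quality information about the movie (e.g. actors, director, etc.) to the model's input. Embeddings can be used to implement efficient knowledge retrieval, so that relevant information can be added to the model input dynamically at run time.

A text embedding is a vector that can measure the relatedness between text strings. Similar or relevant strings will be closer together than unrelated strings. This fact, along with the existence of fast vector search algorithms, means that embeddings can be used to implement efficient knowledge retrieval. In particular, a text corpus can be split into chunks, and each chunk can be embedded and stored. Then a given query can be embedded, and a vector search can be performed to find the embedded chunks of text from the corpus that are most related to the query (i.e. closest together in the embedding space).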
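To make the retrieval step concrete, here is a minimal sketch that ranks stored chunks by cosine similarity to a query embedding. The three-dimensional vectors are toy values chosen for illustration; real embeddings would come from an embedding model and have many more dimensions. The linear scan shown here is a brute-force stand-in for the fast vector search algorithms mentioned above.

```javascript
// Cosine similarity between two vectors of equal length.
// Note: returns NaN if either vector is all zeros.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every stored chunk against the query and returns the k best.
function topKChunks(queryEmbedding, chunkIndex, k) {
  return chunkIndex
    .map((entry) => ({
      text: entry.text,
      score: cosineSimilarity(queryEmbedding, entry.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy index: each chunk is stored together with its (made-up) embedding.
const movieChunkIndex = [
  { text: "The film was directed by Jane Doe.", embedding: [0.9, 0.1, 0.0] },
  { text: "Ticket prices rose in 2020.", embedding: [0.0, 0.2, 0.9] },
  { text: "The lead actor is John Roe.", embedding: [0.8, 0.3, 0.1] },
];

const retrieved = topKChunks([1, 0, 0], movieChunkIndex, 2);
console.log(retrieved.map((r) => r.text));
```

The retrieved chunk texts would then be added to the model input, for example as delimited reference text.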

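Example implementations can be found in the OpenAI Cookbook. See the tactic "Instruct the model to answer using a reference text" for an example of how to use knowledge retrieval to minimize the likelihood that a model will make up incorrect facts.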

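Tactic: Use code execution to perform more accurate calculations or call external APIs

Language models cannot be relied upon to perform arithmetic or long calculations accurately on their own. Where this is needed, a model can be instructed to write and run code instead of making its own calculations. In particular, a model can be instructed to put the code that is meant to be run into a designated format such as triple backticks. After an output is produced, the code can be extracted and run. Finally, if necessary, the output from the code execution engine (i.e. the Python interpreter) can be provided as input to the model for the next query.

SYSTEM: You can write and execute Python code by enclosing it in triple backticks, e.g. ```code goes here```. Use this to perform calculations.
USER: Find all real-valued roots of the following polynomial: 3*x**5 - 5*x**4 - 3*x**3 - 7*x - 10.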
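The extraction step described above might look like the following sketch, which pulls the contents of every triple-backtick block out of a model response. Only extraction is shown: actually running the extracted code requires a sandboxed execution environment, which is out of scope here. Treating a leading word followed by a newline as a language tag is a heuristic assumed for this sketch.

```javascript
// Built with repeat() so the delimiter does not appear literally in this listing.
const CODE_FENCE = "`".repeat(3);

// Returns the contents of every fenced block in the model output.
// An optional language tag on the opening line (e.g. "python") is dropped.
function extractCodeBlocks(modelOutput) {
  const pattern = new RegExp(
    CODE_FENCE + "(?:[a-zA-Z]+\\n)?([\\s\\S]*?)" + CODE_FENCE,
    "g"
  );
  return Array.from(modelOutput.matchAll(pattern), (match) => match[1].trim());
}

const sampleModelOutput =
  "I will compute this with code.\n" +
  CODE_FENCE + "python\nprint(2 + 2)\n" + CODE_FENCE +
  "\nThe result will be printed.";
console.log(extractCodeBlocks(sampleModelOutput));
```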

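Another good use case for code execution is calling external APIs. If a model is instructed in the proper use of an API, it can write code that makes use of it. A model can be instructed in how to use an API by providing it with documentation and/or code samples that show how to use the API.

SYSTEM: You can write and execute Python code by enclosing it in triple backticks. Also note that you have access to the following module to help users send messages to their friends:

```python
import message
message.write(to="John", message="Hey, want to meetup after work?")
```

WARNING: Executing code produced by a model is not inherently safe, and precautions should be taken in any application that seeks to do this. In particular, a sandboxed code execution environment is needed to limit the harm that untrusted code could cause.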

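Strategy: Test changes systematically

Sometimes it can be hard to tell whether a change, such as a new instruction or a new design, makes your system better or worse. Looking at a few examples may hint at which is better, but with small sample sizes it can be hard to distinguish a true improvement from random luck. Maybe the change helps performance on some inputs but hurts performance on others.

Evaluation procedures (or "evals") are useful for optimizing system designs. Good evals are:

Representative of real-world usage (or at least diverse)
Contain many test cases for greater statistical power (see the table below for guidelines)
Easy to automate or repeat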

Difference to detect	Sample size needed for 95% confidence
30%	~10
10%	~100
3%	~1,000
1%	~10,000

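Evaluation of outputs can be done by computers, humans, or a mix of both. Computers can automate evals with objective criteria (e.g. questions with a single correct answer) as well as with some subjective or fuzzy criteria, in which model outputs are evaluated by other model queries. OpenAI Evals is an open-source software framework that provides tools for creating automated evals.

Model-based evals can be useful when there is a range of possible outputs that would be considered equally high in quality (e.g. for questions with long answers). The boundary between what can realistically be evaluated with a model-based eval and what requires a human to evaluate is fuzzy and is constantly shifting as models become more capable. We encourage experimentation to figure out how well model-based evals can work for your use case.

Tactic: Evaluate model outputs with reference to gold-standard answers

Suppose it is known that the correct answer to a question should make reference to a specific set of known facts. Then we can use a model query to count how many of the required facts are included in the answer.

For example, using the following system message: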

SYSTEM: You will be provided with text delimited by triple quotes that is supposed to be the answer to a question. Check if the following pieces of information are directly contained in the answer:
- Neil Armstrong was the first person to walk on the moon.
- The date Neil Armstrong first walked on the moon was July 21, 1969.

For each of these points perform the following steps:
1 - Restate the point.
2 - Provide a citation from the answer which is closest to this point.
3 - Consider if someone reading the citation who doesn't know the topic could directly infer the point. Explain why or why not before making up your mind.
4 - Write "yes" if the answer to 3 was yes, otherwise write "no".

Finally, provide a count of how many "yes" answers there are. Provide this count as {"count": <insert count here>}.
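Once the evaluating model has responded to the system message above, the final `{"count": ...}` object has to be pulled out of its free-text reasoning. The sketch below shows one way to do that. The function name `parseFactCount` and the idea of reporting a fraction of required facts are illustrative assumptions, not part of the source.

```javascript
// Finds the last {"count": N} object in the evaluator's output and reports
// it together with the fraction of required facts that were found.
// Returns null if no count is present or the count is implausible.
function parseFactCount(evalOutput, totalFacts) {
  const matches = Array.from(evalOutput.matchAll(/\{\s*"count"\s*:\s*(\d+)\s*\}/g));
  if (matches.length === 0) {
    return null;
  }
  const count = Number(matches[matches.length - 1][1]);
  if (count > totalFacts) {
    return null; // more "yes" answers than facts: treat as a malformed reply
  }
  return { count, fraction: count / totalFacts };
}

const sampleEvalOutput =
  "1 - Neil Armstrong was the first person to walk on the moon. ... yes\n" +
  "2 - The date was July 21, 1969. ... no\n" +
  '{"count": 1}';
console.log(parseFactCount(sampleEvalOutput, 2));
```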

Here is an example input where both points are satisfied:

SYSTEM: <insert system message above>
USER: """Neil Armstrong is famous for being the first human to set foot on the Moon. This historic event took place on July 21, 1969, during the Apollo 11 mission."""

Here is an example input where only one point is satisfied:

SYSTEM: <insert system message above>
USER: """Neil Armstrong made history when he stepped off the lunar module, becoming the first person to walk on the moon."""

Here is an example input where neither point is satisfied:

SYSTEM: <insert system message above>
USER: """In the summer of '69, a voyage grand,
Apollo 11, bold as legend's hand.
Armstrong took a step, history unfurled,
"One small step," he said, for a new world."""

There are many possible variants of this type of model-based eval. Consider the following variation, which tracks the kind of overlap between the candidate answer and the gold-standard answer, and also tracks whether the candidate answer contradicts any part of the gold-standard answer.

SYSTEM: Use the following steps to respond to user inputs. Fully restate each step before proceeding. i.e. "Step 1: Reason...".
Step 1: Reason step-by-step about whether the information in the submitted answer compared to the expert answer is either: disjoint, equal, a subset, a superset, or overlapping (i.e. some intersection but not subset/superset).
Step 2: Reason step-by-step about whether the submitted answer contradicts any aspect of the expert answer.
Step 3: Output a JSON object structured like: {"type_of_overlap": "disjoint" or "equal" or "subset" or "superset" or "overlapping", "contradiction": true or false}

Here is an example input with a substandard answer that nonetheless does not contradict the expert answer:

SYSTEM: <insert system message above>
USER: Question: """What event is Neil Armstrong most famous for and on what date did it occur? Assume UTC time."""
Submitted Answer: """Didn't he walk on the moon or something?"""
Expert Answer: """Neil Armstrong is most famous for being the first person to walk on the moon. This historic event occurred on July 21, 1969."""

Here is an example input with an answer that directly contradicts the expert answer:

SYSTEM: <insert system message above>
USER: Question: """What event is Neil Armstrong most famous for and on what date did it occur? Assume UTC time."""
Submitted Answer: """On the 21st of July 1969, Neil Armstrong became the second person to walk on the moon, following after Buzz Aldrin."""
Expert Answer: """Neil Armstrong is most famous for being the first person to walk on the moon. This historic event occurred on July 21, 1969."""

Here is an example input with a correct answer that also provides a bit more detail than is necessary:

SYSTEM: <insert system message above>
USER: Question: """What event is Neil Armstrong most famous for and on what date did it occur? Assume UTC time."""
Submitted Answer: """At approximately 02:56 UTC on July 21st 1969, Neil Armstrong became the first human to set foot on the lunar surface, marking a monumental achievement in human history."""
Expert Answer: """Neil Armstrong is most famous for being the first person to walk on the moon. This historic event occurred on July 21, 1969."""
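To turn the overlap verdict from the variant above into a pass/fail signal, the evaluator's reply has to be parsed and checked. The sketch below assumes the JSON object from Step 3 is the last brace-delimited object in the reply. The pass rule in `isAcceptable` (no contradiction, and overlap of "equal" or "superset") is a hypothetical policy chosen for illustration; the source does not define one.

```javascript
const OVERLAP_TYPES = ["disjoint", "equal", "subset", "superset", "overlapping"];

// Parses the final JSON object from the evaluator's reply and validates it.
// Returns null if the reply does not contain a well-formed verdict.
function parseOverlapVerdict(evalOutput) {
  const start = evalOutput.lastIndexOf("{");
  const end = evalOutput.lastIndexOf("}");
  if (start === -1 || end < start) {
    return null;
  }
  let verdict;
  try {
    verdict = JSON.parse(evalOutput.slice(start, end + 1));
  } catch {
    return null;
  }
  const validType = OVERLAP_TYPES.includes(verdict.type_of_overlap);
  const validFlag = typeof verdict.contradiction === "boolean";
  return validType && validFlag ? verdict : null;
}

// Hypothetical pass rule: the answer covers the expert answer and does not
// contradict it. Other policies are equally possible.
function isAcceptable(verdict) {
  return (
    verdict !== null &&
    !verdict.contradiction &&
    (verdict.type_of_overlap === "equal" || verdict.type_of_overlap === "superset")
  );
}

const detailedReply =
  'Step 3: {"type_of_overlap": "superset", "contradiction": false}';
console.log(isAcceptable(parseOverlapVerdict(detailedReply)));
```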

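Optimizing model outputs

As you iterate on your prompts, you will continually aim to improve accuracy, cost, and latency. Below are some techniques for optimizing each goal.

	Goal	Available techniques
Accuracy	Ensure the model produces accurate and useful responses to your prompts.	Accurate responses require that the model has all the information it needs to generate a response, and knows how to go about creating a response (from interpreting input to formatting and styling). Often, this will require a mix of prompt engineering, RAG, and model fine-tuning. Learn more about optimizing for accuracy.
Cost	Drive down the total cost of model usage by reducing token usage and using cheaper models when possible.	To control costs, you can try to use fewer tokens or smaller, cheaper models. Learn more about optimizing for cost.
Latency	Decrease the time it takes to generate responses to your prompts.	Optimizing for low latency is a multifaceted process that includes prompt engineering and parallelism in your own code. Learn more about optimizing for latency.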