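I. Introduction
OpenAI recently released three building blocks for agent development: a set of built-in tools, the Responses API, and the open-source Agents SDK. Together they mark a step toward standardized AI agent development. Drawing on the official documentation and recent technical updates, this article walks through how to use these tools to quickly build AI agents capable of autonomous decision-making.
II. Core Components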
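- Technical architecture of the three built-in tools
  - Web search tool:
    - Built on a GPT-4o retrieval-augmented architecture that combines real-time web crawling with a vector database
    - New citation verification module: semantic analysis automatically checks how relevant each search result is to the query, with a configurable confidence threshold
    - Typical use case: in a financial intelligence system, combined with the o1 model to cross-check legal document clauses against market data. One platform reportedly used the tool to spot a "change of control" clause in an acquisition, avoiding USD 75 million of debt risk
  - File search tool:
    - Hybrid retrieval: vector search (based on Sentence-BERT) plus metadata filtering (with SQL-like query syntax)
    - Integrated RAG pipeline: retrieved results are injected into the prompt automatically, improving document reasoning efficiency by more than 4x (as in the Blue J tax platform case)
    - Enterprise deployment: distributed indexing gives second-level responses over PB-scale document sets, with support for dynamic hot updates
  - Computer Use tool:
    - Screen analysis engine based on Operator technology: integrates CV models to recognize UI elements and supports cross-platform recording and replay of operations
    - Keyboard and mouse sequence optimization: automatically generates the shortest action path, cutting wasted actions by more than 30%
    - Typical use case: automating the Unify ERP system, with a 300% efficiency gain in order processing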
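The citation verification idea described above (score each search result against the query, keep only results above a configurable threshold) can be pictured with a toy sketch. It uses Jaccard token overlap as a stand-in for real semantic analysis; the function names and the 0.2 default threshold are illustrative assumptions, not part of OpenAI's implementation.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(query: str, snippet: str) -> float:
    """Jaccard overlap between query and snippet tokens (0.0 to 1.0)."""
    q, s = tokenize(query), tokenize(snippet)
    if not q or not s:
        return 0.0
    return len(q & s) / len(q | s)


def filter_citations(query: str, snippets: list[str], threshold: float = 0.2) -> list[str]:
    """Keep only snippets whose relevance score reaches the threshold."""
    return [s for s in snippets if relevance(query, s) >= threshold]


# Hypothetical search results for illustration only
results = [
    "Change of control clause triggers debt repayment",
    "Weather forecast for the weekend",
]
kept = filter_citations("change of control clause", results)
print(kept)  # ['Change of control clause triggers debt repayment']
```

Raising the threshold makes the filter stricter; a production system would replace the token overlap with an embedding or model-based score.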
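- Evolution of the Responses API architecture
  - Core design principles:
    - Multi-turn conversation state management: supports nested tool-call chains, with a stated context-passing accuracy of 99.2%
    - Better observability: a tracking_id records the full decision path and supports generating a visual decision tree
    - Cost optimization: dynamic model selection (automatically switching between 4o/o1/3.5 based on task complexity)
  - Protocol comparison:

| Feature | Responses API | Chat Completions API |
| --- | --- | --- |
| Tool calling | Built-in (3 built-in tools + custom) | Requires external integration |
| Multi-turn collaboration | Native | Developer maintains context manually |
| Response mode | Streaming + async callbacks | Synchronous return |
| Billing unit | Tokens + tool calls | Tokens only |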
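The "native multi-turn" row in the comparison can be illustrated without any network call. In the Responses API a follow-up request can pass `previous_response_id` so the server supplies the earlier context, whereas with Chat Completions the client resends the message list itself. The class below is a hypothetical in-memory stand-in for that server-side behaviour, not OpenAI code.

```python
import itertools
from typing import Optional


class FakeStatefulAPI:
    """Hypothetical in-memory stand-in for a server that stores conversation state."""

    def __init__(self) -> None:
        self._history: dict[str, list[str]] = {}
        self._counter = itertools.count(1)

    def create(
        self, user_input: str, previous_response_id: Optional[str] = None
    ) -> tuple[str, list[str]]:
        """Store the new turn on top of the referenced one and return (id, context)."""
        context = list(self._history.get(previous_response_id, []))
        context.append(user_input)
        response_id = f"resp_{next(self._counter)}"
        self._history[response_id] = context
        return response_id, context


api = FakeStatefulAPI()
first_id, _ = api.create("Tell me a joke.")
second_id, context = api.create("Explain why it is funny.", previous_response_id=first_id)
print(second_id, context)
# resp_2 ['Tell me a joke.', 'Explain why it is funny.']
```

The client only keeps one identifier per conversation; the trade-off is that the stored state lives on the server rather than in the application.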
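- Enterprise features of the Agents SDK
  - Agent orchestration engine:
    - Workflow definitions following the BPMN 2.0 standard, with a visual editor for tool chains
    - Dynamic load balancing: tasks are assigned according to each agent's current load, raising throughput 2.5x
    - Circuit breaking: retry policies, degradation plans, and error isolation
  - Security control module:
    - Input validation: regex-based sensitive-word filtering plus an intent classification model
    - Output review: integrates an RLHF value-alignment model, with a misjudgment rate below 0.3%
    - Operation audit: end-to-end log tracing, with support for replaying and analyzing decision paths
  - Multi-agent collaboration modes: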
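The retry and degradation behaviour listed under the orchestration engine can be sketched in plain Python. This is a minimal illustration assuming a simple exponential backoff and a single fallback callable; `run_with_retry`, `flaky_agent`, and `always_failing_agent` are hypothetical names and do not come from the Agents SDK.

```python
import time
from typing import Callable


def run_with_retry(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 0.01,
) -> str:
    """Try `primary` up to `max_attempts` times, then degrade to `fallback`."""
    for attempt in range(max_attempts):
        try:
            return primary()
        except Exception:
            # Exponential backoff between attempts: base, 2*base, 4*base, ...
            if attempt < max_attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    return fallback()


calls = {"n": 0}


def flaky_agent() -> str:
    """Hypothetical agent call that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "primary answer"


def always_failing_agent() -> str:
    """Hypothetical agent call that never succeeds."""
    raise RuntimeError("agent unavailable")


print(run_with_retry(flaky_agent, lambda: "fallback answer"))           # primary answer
print(run_with_retry(always_failing_agent, lambda: "fallback answer"))  # fallback answer
```

Error isolation here simply means the exception never escapes the wrapper; a fuller circuit breaker would also stop calling the primary for a cooldown period after repeated failures.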
III. Quick Start
Environment setup
pip install openai-agents
Defining agents quickly
from agents import Agent, GuardrailFunctionOutput, InputGuardrail, Runner
from agents.exceptions import InputGuardrailTripwireTriggered
from pydantic import BaseModel
import asyncio


class HomeworkOutput(BaseModel):
    is_homework: bool
    reasoning: str


# Classifier agent used by the guardrail function below
guardrail_agent = Agent(
    name="Guardrail check",
    instructions="Check if the user is asking about homework.",
    output_type=HomeworkOutput,
)

math_tutor_agent = Agent(
    name="Math Tutor",
    handoff_description="Specialist agent for math questions",
    instructions="You provide help with math problems. Explain your reasoning at each step and include examples",
)

history_tutor_agent = Agent(
    name="History Tutor",
    handoff_description="Specialist agent for historical questions",
    instructions="You provide assistance with historical queries. Explain important events and context clearly.",
)


async def homework_guardrail(ctx, agent, input_data):
    # Run the classifier agent and trip the wire when the input is not homework
    result = await Runner.run(guardrail_agent, input_data, context=ctx.context)
    final_output = result.final_output_as(HomeworkOutput)
    return GuardrailFunctionOutput(
        output_info=final_output,
        tripwire_triggered=not final_output.is_homework,
    )


# Routes each question to a specialist agent after the guardrail passes
triage_agent = Agent(
    name="Triage Agent",
    instructions="You determine which agent to use based on the user's homework question",
    handoffs=[history_tutor_agent, math_tutor_agent],
    input_guardrails=[
        InputGuardrail(guardrail_function=homework_guardrail),
    ],
)


async def main():
    questions = [
        "who was the first president of the united states?",
        "what is life",
    ]
    for question in questions:
        try:
            result = await Runner.run(triage_agent, question)
            print(result.final_output)
        except InputGuardrailTripwireTriggered as e:
            # Raised when the guardrail decides the input is not homework
            print("Guardrail blocked this input:", e)


if __name__ == "__main__":
    asyncio.run(main())
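The control flow of the quick-start example (guardrail first, handoff only if the tripwire stays untriggered) can be traced offline with a toy simulation. The keyword checks below stand in for the LLM-backed agents, and `TripwireTriggered`, `GuardrailResult`, and `triage` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


class TripwireTriggered(Exception):
    """Hypothetical stand-in for the SDK's guardrail exception."""


@dataclass
class GuardrailResult:
    is_homework: bool
    reasoning: str


def homework_guardrail(question: str) -> GuardrailResult:
    """Keyword check standing in for the LLM-backed guardrail agent."""
    keywords = ("president", "war", "solve", "equation", "history")
    hit = any(word in question.lower() for word in keywords)
    return GuardrailResult(hit, "keyword match" if hit else "no homework keyword")


def triage(question: str) -> str:
    """Run the guardrail first, then pick a specialist by simple rules."""
    check = homework_guardrail(question)
    if not check.is_homework:  # same condition as tripwire_triggered in the example
        raise TripwireTriggered(check.reasoning)
    is_math = any(word in question.lower() for word in ("solve", "equation"))
    return "Math Tutor" if is_math else "History Tutor"


for q in ["who was the first president of the united states?", "what is life"]:
    try:
        print(q, "->", triage(q))
    except TripwireTriggered as exc:
        print(q, "-> blocked:", exc)
```

The first question is routed to the history specialist, while the second never reaches a specialist because the guardrail raises before any handoff happens.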
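IV. Using the Tools
4.1 FileSearchTool
FileSearchTool in the OpenAI Agents SDK is a retrieval tool designed for building AI agents. It helps developers pull key information out of large document collections quickly. Its main features are outlined below.
Core features
- Document retrieval
  - Retrieves information from many file formats (such as PDF, Excel, Word, plain text, and code files).
  - Combines vector search with keyword search to pinpoint relevant content.
- Advanced features
  - Metadata filtering: narrow results by file attributes (such as creation time, author, or tags).
  - Query optimization: queries are rewritten automatically to improve accuracy.
  - Custom ranking: order results by relevance or other metrics.
  - Direct search endpoint: query the vector store directly, reducing model preprocessing steps.
- Integration and flexibility
  - Integrates with OpenAI's Responses API and Agents SDK, which simplifies coordinating multiple tools.
  - Can be combined with other tools (such as web search and computer use).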
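To make the combination of vector search, keyword search, and metadata filtering concrete, here is a small self-contained sketch. It assumes two-dimensional toy embeddings, an `author` metadata field, and an equal weighting (`alpha=0.5`) of the two scores; none of these details come from the FileSearchTool implementation.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear as whole words in the text."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    words = text.lower().split()
    return sum(term in words for term in terms) / len(terms)


def hybrid_search(query, query_vec, docs, author=None, alpha=0.5, top_k=3):
    """Filter by metadata, then rank by a weighted mix of both scores."""
    candidates = [d for d in docs if author is None or d["author"] == author]
    scored = [
        (alpha * cosine(query_vec, d["vec"]) + (1 - alpha) * keyword_score(query, d["text"]), d["id"])
        for d in candidates
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical documents with toy embeddings
docs = [
    {"id": "a", "author": "herbert", "text": "arrakis is a desert planet", "vec": [1.0, 0.0]},
    {"id": "b", "author": "herbert", "text": "caladan is an ocean world", "vec": [0.0, 1.0]},
    {"id": "c", "author": "asimov", "text": "arrakis mentioned in passing", "vec": [0.9, 0.1]},
]
print(hybrid_search("desert planet", [1.0, 0.0], docs, author="herbert"))  # ['a', 'b']
```

Dropping the `author` argument lets document `c` back into the ranking, which shows how the metadata filter narrows candidates before any scoring happens.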
import asyncio

from agents import Agent, FileSearchTool, Runner, trace


async def main():
    agent = Agent(
        name="File searcher",
        instructions="You are a helpful agent.",
        tools=[
            FileSearchTool(
                max_num_results=3,
                vector_store_ids=["vs_67bf88953f748191be42b462090e53e7"],
                include_search_results=True,
            )
        ],
    )

    with trace("File search example"):
        result = await Runner.run(
            agent, "Be concise, and tell me 1 sentence about Arrakis I might not know."
        )
        print(result.final_output)
        """
        Arrakis, the desert planet in Frank Herbert's "Dune," was inspired by the scarcity of water
        as a metaphor for oil and other finite resources.
        """

        print("\n".join([str(out) for out in result.new_items]))
        """
        {"id":"...", "queries":["Arrakis"], "results":[...]}
        """


if __name__ == "__main__":
    asyncio.run(main())
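4.2 WebSearchTool
WebSearchTool in the OpenAI Agents SDK is a real-time web retrieval tool built on the same search technology as ChatGPT. It supports multi-turn conversations and complex queries, and returns answers with cited sources. It can be used through the Responses API or the Agents SDK, where it works with the gpt-4o and gpt-4o-mini models by default; in the Chat Completions API it requires the dedicated models gpt-4o-search-preview and gpt-4o-mini-search-preview. It can be attached to an agent without extra configuration and works alongside file search, computer use, and other tools, which suits real-time Q&A, dynamic data analysis, and content generation. The tool is currently in preview; retrieval is billed by input tokens, and pricing may change. Developers can call it from Python or through the REST API and tune results with custom parameters such as the user's location and the search context size, while keeping compliance requirements and preview-stage limitations in mind. The tool substantially strengthens an agent's ability to handle real-time information and supports agent applications in e-commerce, finance, and research.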
import asyncio

from agents import Agent, Runner, WebSearchTool, trace


async def main():
    agent = Agent(
        name="Web searcher",
        instructions="You are a helpful agent.",
        tools=[WebSearchTool(user_location={"type": "approximate", "city": "New York"})],
    )

    with trace("Web search example"):
        result = await Runner.run(
            agent,
            "search the web for 'local sports news' and give me 1 interesting update in a sentence.",
        )
        print(result.final_output)
        # The New York Giants are reportedly pursuing quarterback Aaron Rodgers after his ...


if __name__ == "__main__":
    asyncio.run(main())
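The custom parameters mentioned for web search (user location and search context size) can also be expressed as a plain tool entry for a Responses API request. The helper below only builds that dictionary and sends nothing; the field names follow the preview-stage web search documentation as understood at the time of writing and should be treated as an assumption, and `build_web_search_tool` is a hypothetical helper name.

```python
from typing import Literal, Optional

ContextSize = Literal["low", "medium", "high"]


def build_web_search_tool(
    city: Optional[str] = None,
    country: Optional[str] = None,
    search_context_size: ContextSize = "medium",
) -> dict:
    """Build a tool entry for the `tools` list of a Responses API request."""
    if search_context_size not in ("low", "medium", "high"):
        raise ValueError(f"unsupported search_context_size: {search_context_size}")
    tool: dict = {"type": "web_search_preview", "search_context_size": search_context_size}
    # Only include location fields that were actually provided
    location = {k: v for k, v in (("city", city), ("country", country)) if v}
    if location:
        tool["user_location"] = {"type": "approximate", **location}
    return tool


print(build_web_search_tool(city="New York", country="US", search_context_size="low"))
```

A smaller context size generally trades answer detail for lower cost and latency, so it is worth exposing as a parameter rather than hard-coding it.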
4.3 Responses API
The OpenAI API provides a simple interface to state-of-the-art AI models for text generation, natural language processing, computer vision, and more.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o",
    input="Write a one-sentence bedtime story about a unicorn."
)

print(response.output_text)
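The same call can be expressed as a plain HTTPS request, which helps when the Python client is not available. The sketch below only constructs the request with the standard library and does not send it; the placeholder key and the helper name `build_responses_request` are assumptions for illustration.

```python
import json
import os
import urllib.request


def build_responses_request(prompt: str, model: str = "gpt-4o") -> urllib.request.Request:
    """Build (but do not send) a POST request for the Responses endpoint."""
    # Falls back to a placeholder so the sketch runs without credentials
    api_key = os.environ.get("OPENAI_API_KEY", "sk-placeholder")
    body = json.dumps({"model": model, "input": prompt}).encode("utf-8")
    return urllib.request.Request(
        "https://api.openai.com/v1/responses",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_responses_request("Write a one-sentence bedtime story about a unicorn.")
print(req.get_method(), req.full_url)
print(json.loads(req.data))
# To actually send it: urllib.request.urlopen(req) (needs a valid key and network access)
```

Keeping request construction separate from sending makes the payload easy to inspect and unit-test before any tokens are spent.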
4.4 Computer Use Tool
Imports
import asyncio
import base64
from typing import Literal, Union

from playwright.async_api import Browser, Page, Playwright, async_playwright

from agents import (
    Agent,
    AsyncComputer,
    Button,
    ComputerTool,
    Environment,
    ModelSettings,
    Runner,
    trace,
)
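- asyncio: provides asynchronous programming support.
- base64: Base64-encodes the image data.
- Literal and Union from typing: used for type annotations.
- playwright.async_api: the asynchronous Playwright API, used to automate the browser.
- agents: the module that contains the agent, tool, and related classes.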
The main function
async def main():
    async with LocalPlaywrightComputer() as computer:
        with trace("Computer use example"):
            agent = Agent(
                name="Browser user",
                instructions="You are a helpful agent.",
                tools=[ComputerTool(computer)],
                # Use the computer-use model, and set truncation to auto because it is required
                model="computer-use-preview",
                model_settings=ModelSettings(truncation="auto"),
            )
            result = await Runner.run(agent, "Search for SF sports news and summarize.")
            print(result.final_output)
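- Create a LocalPlaywrightComputer instance with an async with statement.
- Create an Agent instance with a name, instructions, tools, and model settings.
- Call Runner.run so the agent searches for San Francisco sports news and summarizes it.
- Print the final result.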
The key mapping dictionary CUA_KEY_TO_PLAYWRIGHT_KEY
CUA_KEY_TO_PLAYWRIGHT_KEY = {
    "/": "Divide",
    "\\": "Backslash",
    "alt": "Alt",
    # other key mappings...
}
This dictionary maps the custom key names (the CUA side) to the key names that Playwright supports.
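The lookup pattern used with this dictionary in the full listing, `CUA_KEY_TO_PLAYWRIGHT_KEY.get(key.lower(), key)`, is case-insensitive and lets unknown keys pass through unchanged. A minimal sketch with a shortened map (`KEY_MAP` and `to_playwright_keys` are illustrative names):

```python
# Shortened, illustrative subset of the mapping
KEY_MAP = {"ctrl": "Control", "cmd": "Meta", "esc": "Escape", "enter": "Enter"}


def to_playwright_keys(keys: list[str], key_map: dict[str, str] = KEY_MAP) -> list[str]:
    """Map each key case-insensitively; unknown keys pass through unchanged."""
    return [key_map.get(key.lower(), key) for key in keys]


print(to_playwright_keys(["CTRL", "a", "Esc"]))  # ['Control', 'a', 'Escape']
```

Passing unknown keys through means single characters such as `a` need no entry in the map.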
The LocalPlaywrightComputer class
class LocalPlaywrightComputer(AsyncComputer):
    """A computer, implemented using a local Playwright browser."""

    def __init__(self):
        self._playwright: Union[Playwright, None] = None
        self._browser: Union[Browser, None] = None
        self._page: Union[Page, None] = None
- Inherits from AsyncComputer and implements the computer interface with a local Playwright browser.
- The __init__ method initializes the _playwright, _browser, and _page attributes to None.
The _get_browser_and_page method
    async def _get_browser_and_page(self) -> tuple[Browser, Page]:
        width, height = self.dimensions
        launch_args = [f"--window-size={width},{height}"]
        browser = await self.playwright.chromium.launch(headless=False, args=launch_args)
        page = await browser.new_page()
        await page.set_viewport_size({"width": width, "height": height})
        await page.goto("https://www.bing.com")
        return browser, page
- Launches the Chromium browser and creates a new page.
- Sets the page viewport size.
- Navigates to the Bing search page.
The __aenter__ and __aexit__ methods
    async def __aenter__(self):
        self._playwright = await async_playwright().start()
        self._browser, self._page = await self._get_browser_and_page()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self._browser:
            await self._browser.close()
        if self._playwright:
            await self._playwright.stop()
- __aenter__: starts Playwright and obtains the browser and page.
- __aexit__: closes the browser and stops Playwright.
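The point of pairing `__aenter__` with `__aexit__` is that cleanup runs even when the body of the `async with` block raises. The sketch below shows that ordering with a hypothetical `FakeBrowserSession` that records events instead of launching a browser.

```python
import asyncio


class FakeBrowserSession:
    """Hypothetical resource that records its lifecycle instead of launching a browser."""

    def __init__(self) -> None:
        self.events: list[str] = []

    async def __aenter__(self) -> "FakeBrowserSession":
        self.events.append("start")
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb) -> None:
        # Runs on normal exit and when the body raises; returning None re-raises
        self.events.append("stop")


async def demo() -> list[str]:
    session = FakeBrowserSession()
    try:
        async with session:
            session.events.append("work")
            raise RuntimeError("simulated failure")
    except RuntimeError:
        session.events.append("handled")
    return session.events


events = asyncio.run(demo())
print(events)  # ['start', 'work', 'stop', 'handled']
```

The "stop" event is recorded before the exception reaches the caller, which is why the browser in the real class is closed even if the agent run fails.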
Properties
    @property
    def playwright(self) -> Playwright:
        assert self._playwright is not None
        return self._playwright

    @property
    def browser(self) -> Browser:
        assert self._browser is not None
        return self._browser

    @property
    def page(self) -> Page:
        assert self._page is not None
        return self._page

    @property
    def environment(self) -> Environment:
        return "browser"

    @property
    def dimensions(self) -> tuple[int, int]:
        return (1024, 768)
- These properties expose the Playwright instance, browser, page, environment, and page dimensions.
Operation methods
    async def screenshot(self) -> str:
        png_bytes = await self.page.screenshot(full_page=False)
        return base64.b64encode(png_bytes).decode("utf-8")

    async def click(self, x: int, y: int, button: Button = "left") -> None:
        playwright_button: Literal["left", "middle", "right"] = "left"
        if button in ("left", "right", "middle"):
            playwright_button = button  # type: ignore
        await self.page.mouse.click(x, y, button=playwright_button)

    # other operation methods...
- These methods implement the screenshot, click, double-click, scroll, type, wait, move, keypress, and drag operations.
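Two of these operations are easy to check without a browser: the drag sequence (move to the start, press, trace the path, release) and the Base64 encoding of screenshot bytes. The sketch below uses a hypothetical `FakeMouse` that records calls in place of Playwright's mouse object; the four bytes passed to the encoder are just the start of a PNG signature, not a real image.

```python
import asyncio
import base64


class FakeMouse:
    """Hypothetical recorder standing in for a Playwright mouse object."""

    def __init__(self) -> None:
        self.events: list[tuple] = []

    async def move(self, x: int, y: int) -> None:
        self.events.append(("move", x, y))

    async def down(self) -> None:
        self.events.append(("down",))

    async def up(self) -> None:
        self.events.append(("up",))


async def drag(mouse: FakeMouse, path: list[tuple[int, int]]) -> None:
    """Move to the first point, press, trace the remaining points, release."""
    if not path:
        return
    await mouse.move(path[0][0], path[0][1])
    await mouse.down()
    for px, py in path[1:]:
        await mouse.move(px, py)
    await mouse.up()


def encode_screenshot(png_bytes: bytes) -> str:
    """Base64-encode raw image bytes into a UTF-8 string."""
    return base64.b64encode(png_bytes).decode("utf-8")


mouse = FakeMouse()
asyncio.run(drag(mouse, [(0, 0), (5, 5), (10, 10)]))
print(mouse.events)
print(encode_screenshot(b"\x89PNG"))  # iVBORw==
```

Recording calls on a fake object is a cheap way to test the ordering of mouse events before wiring the methods to a real page.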
Full code:
import asyncio
import base64
from typing import Literal, Union

from playwright.async_api import Browser, Page, Playwright, async_playwright

from agents import (
    Agent,
    AsyncComputer,
    Button,
    ComputerTool,
    Environment,
    ModelSettings,
    Runner,
    trace,
)

# Uncomment to see very verbose logs
# import logging
# logging.getLogger("openai.agents").setLevel(logging.DEBUG)
# logging.getLogger("openai.agents").addHandler(logging.StreamHandler())


async def main():
    async with LocalPlaywrightComputer() as computer:
        with trace("Computer use example"):
            agent = Agent(
                name="Browser user",
                instructions="You are a helpful agent.",
                tools=[ComputerTool(computer)],
                # Use the computer-use model, and set truncation to auto because it is required
                model="computer-use-preview",
                model_settings=ModelSettings(truncation="auto"),
            )
            result = await Runner.run(agent, "Search for SF sports news and summarize.")
            print(result.final_output)


CUA_KEY_TO_PLAYWRIGHT_KEY = {
    "/": "Divide",
    "\\": "Backslash",
    "alt": "Alt",
    "arrowdown": "ArrowDown",
    "arrowleft": "ArrowLeft",
    "arrowright": "ArrowRight",
    "arrowup": "ArrowUp",
    "backspace": "Backspace",
    "capslock": "CapsLock",
    "cmd": "Meta",
    "ctrl": "Control",
    "delete": "Delete",
    "end": "End",
    "enter": "Enter",
    "esc": "Escape",
    "home": "Home",
    "insert": "Insert",
    "option": "Alt",
    "pagedown": "PageDown",
    "pageup": "PageUp",
    "shift": "Shift",
    "space": " ",
    "super": "Meta",
    "tab": "Tab",
    "win": "Meta",
}


class LocalPlaywrightComputer(AsyncComputer):
    """A computer, implemented using a local Playwright browser."""

    def __init__(self):
        self._playwright: Union[Playwright, None] = None
        self._browser: Union[Browser, None] = None
        self._page: Union[Page, None] = None

    async def _get_browser_and_page(self) -> tuple[Browser, Page]:
        width, height = self.dimensions
        launch_args = [f"--window-size={width},{height}"]
        browser = await self.playwright.chromium.launch(headless=False, args=launch_args)
        page = await browser.new_page()
        await page.set_viewport_size({"width": width, "height": height})
        await page.goto("https://www.bing.com")
        return browser, page

    async def __aenter__(self):
        # Start Playwright and call the subclass hook for getting browser/page
        self._playwright = await async_playwright().start()
        self._browser, self._page = await self._get_browser_and_page()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self._browser:
            await self._browser.close()
        if self._playwright:
            await self._playwright.stop()

    @property
    def playwright(self) -> Playwright:
        assert self._playwright is not None
        return self._playwright

    @property
    def browser(self) -> Browser:
        assert self._browser is not None
        return self._browser

    @property
    def page(self) -> Page:
        assert self._page is not None
        return self._page

    @property
    def environment(self) -> Environment:
        return "browser"

    @property
    def dimensions(self) -> tuple[int, int]:
        return (1024, 768)

    async def screenshot(self) -> str:
        """Capture only the viewport (not full_page)."""
        png_bytes = await self.page.screenshot(full_page=False)
        return base64.b64encode(png_bytes).decode("utf-8")

    async def click(self, x: int, y: int, button: Button = "left") -> None:
        playwright_button: Literal["left", "middle", "right"] = "left"
        # Playwright only supports left, middle, right buttons
        if button in ("left", "right", "middle"):
            playwright_button = button  # type: ignore
        await self.page.mouse.click(x, y, button=playwright_button)

    async def double_click(self, x: int, y: int) -> None:
        await self.page.mouse.dblclick(x, y)

    async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
        await self.page.mouse.move(x, y)
        await self.page.evaluate(f"window.scrollBy({scroll_x}, {scroll_y})")

    async def type(self, text: str) -> None:
        await self.page.keyboard.type(text)

    async def wait(self) -> None:
        await asyncio.sleep(1)

    async def move(self, x: int, y: int) -> None:
        await self.page.mouse.move(x, y)

    async def keypress(self, keys: list[str]) -> None:
        for key in keys:
            mapped_key = CUA_KEY_TO_PLAYWRIGHT_KEY.get(key.lower(), key)
            await self.page.keyboard.press(mapped_key)

    async def drag(self, path: list[tuple[int, int]]) -> None:
        if not path:
            return
        await self.page.mouse.move(path[0][0], path[0][1])
        await self.page.mouse.down()
        for px, py in path[1:]:
            await self.page.mouse.move(px, py)
        await self.page.mouse.up()


if __name__ == "__main__":
    asyncio.run(main())
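This code implements an asynchronous browser-automation program. It uses the Playwright library to carry out the browser operations, so the agent can search and summarize within the browser.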