GPTCache: Cut LLM Query Costs by 10x and Speed Up Responses by 100x with Caching

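Preface

In traditional application development, query performance is usually improved with multi-level or distributed caching, and most middleware supports caching out of the box once it is enabled. Large language models (LLMs) are very powerful, but they are not cheap to use. GPT-4 in particular is several times more expensive, so every call needs to be made with care.

Is there a good way to cut the cost of calling LLM APIs while also making those calls faster? GPTCache was built for exactly this problem. For factual question answering there is no need to call the model API every time: the API can be called once and later requests served from the cache, which saves a great deal of money. For more creative scenarios, the temperature parameter controls whether a request uses the cache or goes straight to the LLM API.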

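1. Introduction

GPTCache is an open-source tool that improves the efficiency and speed of GPT-based applications by caching the responses that language models generate. It is a semantic caching layer for LLMs: it stores LLM responses and uses semantic matching to retrieve them. Users can customize the cache to their needs, including the embedding function, the similarity evaluation function, the storage location, and the eviction policy. GPTCache currently supports the OpenAI ChatGPT interface and the LangChain interface.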

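2. What is semantic caching?

Semantic caching differs from traditional caching in that it stores the meaning of a query or request rather than only the raw data. Reusing earlier queries and their results reduces the number of queries the server has to process. Traditional caches store data keyed on its literal form, without regard to what it means.

Because a semantic cache stores data by meaning, two queries with the same meaning return the same result, even if the underlying data has changed. This is useful for complex queries that span multiple tables or data sources. The most significant benefit, however, is reduced server load: caching LLM responses, for example, shortens data retrieval time, lowers API call costs, and improves scalability.

Customizing and monitoring the cache can make it more efficient still. Since the cache holds earlier queries and their results, it can answer a matching query immediately without reprocessing it, so response times drop and users see better application performance.

In short, semantic caching is a powerful technique for improving server efficiency and the user experience. Storing the meaning of queries and requests reduces the number of queries that must be processed and lets results be served quickly and accurately.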
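
To make the contrast with exact-match caching concrete, here is a minimal sketch. It is not GPTCache's implementation: it assumes a toy similarity measure (Jaccard overlap of lowercase word sets) in place of a real embedding model, and the names exact_lookup and semantic_lookup and the 0.5 threshold are hypothetical.

```python
import re


def tokens(text):
    """Lowercase word set; a crude stand-in for a real embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def exact_lookup(cache, query):
    """Traditional cache: hit only when the key matches character for character."""
    return cache.get(query)


def semantic_lookup(cache, query, threshold=0.5):
    """Similarity cache: return the answer stored under the closest key, if close enough."""
    best_key = max(cache, key=lambda k: jaccard(k, query), default=None)
    if best_key is not None and jaccard(best_key, query) >= threshold:
        return cache[best_key]
    return None


cache = {"what is github": "GitHub is a code hosting platform."}
query = "tell me what github is"

exact_hit = exact_lookup(cache, query)        # miss: the strings differ
semantic_hit = semantic_lookup(cache, query)  # hit: 3 of 5 distinct words overlap
```

Word overlap only approximates meaning; a production semantic cache would compare embeddings from a language model instead.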

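2. Why GPTCache?

Building a semantic cache such as GPTCache to store large language model (LLM) responses offers several advantages:

2.1 Improved performance

Storing LLM responses in a cache significantly reduces the time needed to retrieve a response that has been requested before and is already cached, which improves the overall performance of the application.

2.2 Lower costs

Most LLM services charge based on a combination of request count and token count. Caching LLM responses reduces the number of API calls made to the service and therefore the bill. This matters most under heavy traffic, where API fees can be substantial.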
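
The effect of the cache hit rate on the bill can be estimated with simple arithmetic. The sketch below assumes purely token-based pricing and ignores the cost of running the cache itself; the request volume, token count, and price are hypothetical numbers, not real quotes.

```python
def llm_cost(requests, tokens_per_request, price_per_1k_tokens, hit_rate=0.0):
    """Cost of the requests that still reach the LLM once cache hits are removed."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    billed_requests = requests * (1.0 - hit_rate)
    return billed_requests * tokens_per_request / 1000 * price_per_1k_tokens


# Hypothetical workload: 100,000 requests of 500 tokens at 0.002 per 1K tokens.
baseline = llm_cost(100_000, 500, 0.002)                  # no cache
with_cache = llm_cost(100_000, 500, 0.002, hit_rate=0.9)  # 90% served from cache

print(f"without cache: {baseline:.2f}, with cache: {with_cache:.2f}")
```

Under these assumptions a 90% hit rate reduces the bill tenfold, and the saving grows linearly with the hit rate.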

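2.3 Better scalability

Caching LLM responses reduces the load on the LLM service, which improves the scalability of the application. It helps avoid bottlenecks and lets the application handle a growing number of requests.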

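2.4 Lower development costs

A semantic cache can also reduce costs during the development phase of an LLM application. Even in development, an LLM application needs a connection to the LLM API, which can become expensive. GPTCache exposes the same interface as the LLM API and can store either LLM-generated or mocked data, so application behavior can be verified without connecting to the LLM API or even to the network.

2.5 Lower network latency

A semantic cache located closer to the user reduces the time needed to fetch data from the LLM service. Lower network latency improves the overall user experience.

2.6 Improved scalability and availability

LLM services often enforce rate limits, which cap the number of times a user or client can access the server within a given time window. Once the limit is reached, further requests are blocked until some time has passed, which interrupts service. With GPTCache, you can scale quickly to absorb a growing query volume and keep performance consistent as the application's user base grows.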

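3. How GPTCache works

GPTCache uses an embedding algorithm to convert queries into embeddings and a vector store to run similarity search over them. This lets GPTCache identify and retrieve similar or related queries from cache storage, as shown in the figure below.

GPTCache has a modular design, so users can easily customize their own semantic cache. Each module offers several options to choose from.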

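  • LLM adapter: The LLM adapter integrates different LLMs and standardizes on the OpenAI API, unifying their APIs and request protocols. This makes it easier to experiment with and test various LLMs, because you can switch between them without rewriting code or learning a new API. Supported:

    • OpenAI ChatGPT API

    • LangChain

    • Roadmap: Hugging Face Hub, Bard, Anthropic, and self-hosted models such as LLaMA

  • Embedding generator: The embedding generator produces embeddings from the request queue with the model of your choice so that similarity search can be performed. Supported models include the OpenAI embedding API, ONNX with the GPTCache/paraphrase-albert-onnx model, the Hugging Face embedding API, the Cohere embedding API, the fastText embedding API, and the SentenceTransformers embedding API.

  • Cache storage: Cache storage is where responses from LLMs such as ChatGPT are kept. Cached responses are retrieved to help evaluate similarity and are returned to the requester when the semantic match is good. GPTCache supports SQLite, PostgreSQL, MySQL, MariaDB, SQL Server, and Oracle. Supporting popular databases lets users pick the one that best fits their needs for performance, scalability, and cost.

  • Vector store: GPTCache includes a vector store module that finds the K most similar requests based on the embedding extracted from the input request, which helps evaluate similarity between requests. GPTCache provides a user-friendly interface to several vector stores, including Milvus, Zilliz Cloud, and FAISS. The choice of vector store can affect the efficiency and accuracy of similarity search, and supporting several of them lets GPTCache cover a wider range of use cases.

  • Eviction policy management: The cache manager in GPTCache controls the operation of the cache storage and vector store modules. When the cache is full, the replacement policy decides which data to evict to make room for new data. GPTCache currently supports two basic options:

    • LRU (least recently used) eviction policy

    • FIFO (first in, first out) eviction policy

    Both are standard eviction policies used in caching systems.

  • Similarity evaluator: The similarity evaluator module gathers data from cache storage and the vector store and uses one of several strategies to measure the similarity between the input request and the requests returned by the vector store. That similarity determines whether a request counts as a cache hit. GPTCache provides a standardized interface for integrating similarity strategies, along with a set of implementations, so cache matching can be adapted to different use cases and requirements.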
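
The difference between the two eviction policies listed above can be shown with a small bounded cache. This is an illustrative sketch built on Python's OrderedDict, not GPTCache's cache manager; the BoundedCache class and the run helper are hypothetical.

```python
from collections import OrderedDict


class BoundedCache:
    """Fixed-size cache that evicts in LRU or FIFO order."""

    def __init__(self, max_size, policy="LRU"):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        if policy not in ("LRU", "FIFO"):
            raise ValueError("policy must be 'LRU' or 'FIFO'")
        self.max_size = max_size
        self.policy = policy
        self._data = OrderedDict()  # the next entry to evict sits at the front

    def get(self, key):
        if key not in self._data:
            return None
        if self.policy == "LRU":
            self._data.move_to_end(key)  # a read refreshes the entry under LRU only
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data[key] = value
            if self.policy == "LRU":
                self._data.move_to_end(key)
            return
        if len(self._data) >= self.max_size:
            self._data.popitem(last=False)  # evict the entry at the front
        self._data[key] = value

    def keys(self):
        return list(self._data)


def run(policy):
    cache = BoundedCache(max_size=2, policy=policy)
    cache.put("q1", "a1")
    cache.put("q2", "a2")
    cache.get("q1")        # touch q1
    cache.put("q3", "a3")  # cache is full, so one entry is evicted
    return cache.keys()


lru_keys = run("LRU")    # q2 is the least recently used entry and is evicted
fifo_keys = run("FIFO")  # q1 was inserted first and is evicted despite the read
```

LRU tends to suit workloads where recently asked questions are likely to be asked again, while FIFO is simpler and ignores access patterns.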

GPTCache works with your application, your preferred LLM interface (ChatGPT, LangChain), cache storage (SQLite, PostgreSQL, MySQL, MariaDB, SQL Server, or Oracle), and vector store (FAISS, Milvus, or Zilliz Cloud).
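
How these modules fit together on a single request can be sketched in a few lines. The code below is a simplified illustration, not GPTCache's code: the word-count embedding, the in-memory lists standing in for the vector store and cache storage, the 0.7 threshold, and the ToySemanticCache and fake_llm names are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (word counts); a real setup would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


class ToySemanticCache:
    """Hypothetical wiring: embedding, vector search, evaluation, LLM fallback."""

    def __init__(self, llm, embed_fn=embed, threshold=0.7):
        self.llm = llm              # callable standing in for the LLM adapter
        self.embed_fn = embed_fn    # embedding generator
        self.threshold = threshold  # cut-off used by the similarity evaluator
        self.vectors = []           # vector store: one embedding per cached question
        self.answers = []           # cache storage: answers in the same order

    def ask(self, question):
        query_vec = self.embed_fn(question)
        # Vector search: find the most similar cached question (top-1 for brevity).
        best_score, best_idx = 0.0, None
        for idx, vec in enumerate(self.vectors):
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_idx = score, idx
        # Similarity evaluation: accept the candidate only above the threshold.
        if best_idx is not None and best_score >= self.threshold:
            return self.answers[best_idx], True
        # Cache miss: call the LLM and store the new question/answer pair.
        answer = self.llm(question)
        self.vectors.append(query_vec)
        self.answers.append(answer)
        return answer, False


calls = []


def fake_llm(question):
    """Records each call so cache misses can be counted."""
    calls.append(question)
    return f"answer to: {question}"


cache = ToySemanticCache(fake_llm)
first = cache.ask("what is github")           # miss: the cache is empty
second = cache.ask("what is github exactly")  # hit: cosine similarity is about 0.87
third = cache.ask("how do I bake bread")      # miss: no words in common
```

Each attribute corresponds to one module from the list, so swapping the embedding function or the threshold changes matching behavior without touching the rest of the flow.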

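4. GPTCache in practice

This section shows how to chat with GPT. The original example comes from the OpenAI examples; the difference is that here you will learn how to cache responses with gptcache using exact matching and similarity matching. It is very simple: the only extra step is initializing the cache.

Before running the examples, make sure the OPENAI_API_KEY environment variable is set by running echo $OPENAI_API_KEY. If it is not set, set it with export OPENAI_API_KEY=YOUR_API_KEY on Unix/Linux/macOS or set OPENAI_API_KEY=YOUR_API_KEY on Windows.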
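
The same check can be done from Python before any request is sent. The helper below is a small convenience sketch; require_api_key is a hypothetical name and is not part of gptcache or openai.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key from the environment, or raise with a hint on how to set it."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; run `export {name}=YOUR_API_KEY` (Unix/Linux/macOS) "
            f"or `set {name}=YOUR_API_KEY` (Windows) first."
        )
    return key


# Typical use at the top of a script:
#     require_api_key()
```

Failing early with a clear message avoids a less obvious authentication error on the first API call.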

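The code below demonstrates how gptcache is used and how much it speeds things up. It has three parts: the original OpenAI usage, exact-match caching, and similarity-match caching.

4.1 Original OpenAI API usage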

import time
import openai  # openai.ChatCompletion is the interface of openai versions before 1.0


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']


question = "what's github"

# Original OpenAI API usage, timed for comparison with the cached runs
start_time = time.time()
response = openai.ChatCompletion.create(
  model='gpt-3.5-turbo',
  messages=[
    {
        'role': 'user',
        'content': question
    }
  ],
)
print(f'Question: {question}')
print("Time consuming: {:.2f}s".format(time.time() - start_time))
print(f'Answer: {response_text(response)}\n')

Output:

Question: what's github
Time consuming: 6.04s
Answer: GitHub is a web-based platform used for version control and collaboration of coding projects. It allows individuals and teams to store, share, and collaborate on changes to code, software, and applications. It also provides features such as issue tracking, project management tools, and code review. It is one of the most popular and widely used online platforms for open-source projects.

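4.2 OpenAI API + GPTCache, exact-match cache

To run GPTCache, initialize the cache and import openai from gptcache.adapter. With the default settings, cache.init() sets up a map data manager that matches requests exactly. For more details, see "Build your cache" in the GPTCache documentation.

If you ask ChatGPT the exact same question twice, the answer to the second one is fetched from the cache without calling ChatGPT again.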

import time


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']

print("Cache loading.....")

# To use GPTCache, that's all you need
# -------------------------------------------------
from gptcache import cache
from gptcache.adapter import openai

cache.init()
cache.set_openai_key()
# -------------------------------------------------

question = "what's github"
for _ in range(2):
    start_time = time.time()
    response = openai.ChatCompletion.create(
      model='gpt-3.5-turbo',
      messages=[
        {
            'role': 'user',
            'content': question
        }
      ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')

Output:

Cache loading.....
Question: what's github
Time consuming: 6.88s
Answer: GitHub is a web-based platform that allows developers to store, share, and collaborate on programming projects. It is primarily used for version control, where developers can work on different features and changes of a project simultaneously without overwriting each other's work. GitHub also provides tools for issue tracking, code review, and project management. It is widely used in the open-source community and by software development teams in organizations of all sizes.

Question: what's github
Time consuming: 0.00s
Answer: GitHub is a web-based platform that allows developers to store, share, and collaborate on programming projects. It is primarily used for version control, where developers can work on different features and changes of a project simultaneously without overwriting each other's work. GitHub also provides tools for issue tracking, code review, and project management. It is widely used in the open-source community and by software development teams in organizations of all sizes.

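4.3 OpenAI API + GPTCache, similarity-search cache

Configure the cache with embedding_func to generate text embeddings, data_manager to manage cached data, and similarity_evaluation to evaluate similarity. For more details, see "Build your cache" in the GPTCache documentation.

After ChatGPT has answered one of several similar questions, answers to the ones that follow can be retrieved from the cache without calling ChatGPT again.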

import time


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']

from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

print("Cache loading.....")

onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    )
cache.set_openai_key()

questions = [
    "what's github",
    "can you explain what GitHub is",
    "can you tell me more about GitHub",
    "what is the purpose of GitHub"
]

for question in questions:
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {
                'role': 'user',
                'content': question
            }
        ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')

Output:

Cache loading.....
Question: what's github
Time consuming: 7.11s
Answer: GitHub is a web-based platform that allows developers to store, manage, review, and collaborate on code repositories. It is a version control system that enables developers to track changes they make in code over time and collaborate on projects with other developers. GitHub is used by millions of developers worldwide to share code, collaborate on open-source projects, and contribute to projects owned by others. It's also a hub for various communities and forums related to software development.

Question: can you explain what GitHub is
Time consuming: 0.19s
Answer: GitHub is a web-based platform that allows developers to store, manage, review, and collaborate on code repositories. It is a version control system that enables developers to track changes they make in code over time and collaborate on projects with other developers. GitHub is used by millions of developers worldwide to share code, collaborate on open-source projects, and contribute to projects owned by others. It's also a hub for various communities and forums related to software development.

Question: can you tell me more about GitHub
Time consuming: 0.23s
Answer: GitHub is a web-based platform that allows developers to store, manage, review, and collaborate on code repositories. It is a version control system that enables developers to track changes they make in code over time and collaborate on projects with other developers. GitHub is used by millions of developers worldwide to share code, collaborate on open-source projects, and contribute to projects owned by others. It's also a hub for various communities and forums related to software development.

Question: what is the purpose of GitHub
Time consuming: 0.21s
Answer: GitHub is a web-based platform that allows developers to store, manage, review, and collaborate on code repositories. It is a version control system that enables developers to track changes they make in code over time and collaborate on projects with other developers. GitHub is used by millions of developers worldwide to share code, collaborate on open-source projects, and contribute to projects owned by others. It's also a hub for various communities and forums related to software development.
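
The speedup in this particular run can be computed from the timings printed above. The snippet simply repeats those four numbers; a single run is not a benchmark, so treat the result as indicative only.

```python
from statistics import mean

llm_seconds = 7.11                   # first question, answered by the OpenAI API
cached_seconds = [0.19, 0.23, 0.21]  # three paraphrases, answered from the cache

speedup = llm_seconds / mean(cached_seconds)
print(f"Average speedup on cache hits: {speedup:.1f}x")
```

Actual figures depend on the embedding model, the vector store, and network conditions.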

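5. OpenAI chat with temperature

In deep learning, temperature is a parameter commonly used to adjust the probability distribution of a model's predictions. It is also known as softmax temperature or softmax scaling. In short, it controls how confident the network's predictions are, and raising it increases the diversity of the model's output.

For temperature in OpenAI chat requests, the OpenAI documentation says that "higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic."

GPTCache supports a similar temperature parameter on requests, in the range [0.0, 2.0]. It works at two stages:

  • It controls the probability of sending the request directly to OpenAI without searching the cache.

  • It influences which of the candidate answers retrieved from the cache is chosen as the final answer.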
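
The two stages can be sketched with a temperature-scaled softmax. This is an illustration of the idea rather than GPTCache's code: the [0, 1] scores used for the skip decision and the function names skip_probability, should_skip_cache, and pick_answer are assumptions.

```python
import math
import random


def skip_probability(temperature):
    """Stage 1: probability of bypassing the cache at a given temperature."""
    if temperature <= 0:
        return 0.0  # always search the cache first
    if temperature >= 2:
        return 1.0  # always call the LLM directly
    # Softmax over assumed scores [0, 1] for the options [skip, search].
    skip_weight = math.exp(-1.0 / temperature)
    return skip_weight / (skip_weight + 1.0)


def should_skip_cache(temperature, rng=random):
    """Sample the stage 1 decision."""
    return rng.random() < skip_probability(temperature)


def pick_answer(candidates, scores, temperature, rng=random):
    """Stage 2: choose a cached candidate, more randomly as temperature rises."""
    if temperature <= 0:
        return candidates[scores.index(max(scores))]  # most confident answer
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


for t in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"temperature={t}: skip probability {skip_probability(t):.3f}")
```

In this sketch the skip probability rises with temperature and reaches 1 at the maximum value of 2, while temperature 0 always returns the highest-scoring cached answer.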

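Let's enable GPTCache's adapted OpenAI Chat API and see how temperature affects the output for the same question.

5.1 Set up the cache

Initialize GPTCache with your preferred configuration and modules.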

import time

from gptcache import cache, Config
from gptcache.manager import manager_factory
from gptcache.embedding import Onnx
from gptcache.processor.post import temperature_softmax
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
from gptcache.adapter import openai


onnx = Onnx()
data_manager = manager_factory("sqlite,faiss", vector_params={"dimension": onnx.dimension})

cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    post_process_messages_func=temperature_softmax
    )
# cache.config = Config(similarity_threshold=0.2)

5.2 Getting started

cache.set_openai_key()
question = 'what is github'

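5.2.1 Default: temperature = 0.0

If temperature is not specified in the request, the default value 0 is used. At temperature 0, GPTCache searches the cache first and returns the most confident answer it retrieves. If the cache holds no satisfactory answer, it goes on to send the request to OpenAI.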

for _ in range(3):
    # use cache without temperature (temperature=0.0)
    start = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[{
            'role': 'user',
            'content': question
        }],
    )
    print('Time elapsed:', round(time.time() - start, 3))
    print(response["choices"][0]["message"]["content"])

Output:

Time elapsed: 7.906
GitHub is a web-based platform that is used to manage, store, and share software development projects. It offers a version control system and collaboration tools for developers to work together on code and other digital assets. GitHub is popular in the open-source community, but it is also used by companies to manage their proprietary code. It allows developers to easily contribute to projects, track changes, and manage project workflows. It also provides tools for issue tracking, documentation, and continuous integration and deployment.
Time elapsed: 0.22
GitHub is a web-based platform that is used to manage, store, and share software development projects. It offers a version control system and collaboration tools for developers to work together on code and other digital assets. GitHub is popular in the open-source community, but it is also used by companies to manage their proprietary code. It allows developers to easily contribute to projects, track changes, and manage project workflows. It also provides tools for issue tracking, documentation, and continuous integration and deployment.
Time elapsed: 0.239
GitHub is a web-based platform that is used to manage, store, and share software development projects. It offers a version control system and collaboration tools for developers to work together on code and other digital assets. GitHub is popular in the open-source community, but it is also used by companies to manage their proprietary code. It allows developers to easily contribute to projects, track changes, and manage project workflows. It also provides tools for issue tracking, documentation, and continuous integration and deployment.

5.2.2 Maximum: temperature = 2.0

When temperature reaches its maximum value of 2, the cache search is skipped and the request is sent directly to OpenAI.

# use cache with temperature 2.0
for _ in range(3):
    start = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        temperature=2.0,
        max_tokens=30,
        messages=[{
            'role': 'user',
            'content': question
        }],
    )
    print('Time elapsed:', round(time.time() - start, 3))
    print(response["choices"][0]["message"]["content"])

Output:

Time elapsed: 2.675
GitHub is a web-based platform used for version control and collaboration that helps developers store and manage their code repositories online. It allows multiple developers to work collabor
Time elapsed: 2.667
GitHub is a web-based platform used for version control and collaboration in software development projects. It provides a centralized location for developers to manage and store their code
Time elapsed: 2.56
GitHub is a web-based platform where developers can store, share, and collaborate on their code projects. It is also a version control system, meaning it

5.2.3 0.0 < temperature < 2.0

When temperature is between 0 and 2, higher values increase the probability of skipping the cache search and make the output more random.

# use cache with temperature 1.0
for _ in range(3):
    start = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        temperature=1.0,
        messages=[{
            'role': 'user',
            'content': question
        }],
    )
    print('Time elapsed:', round(time.time() - start, 3))
    print(response["choices"][0]["message"]["content"])

Output:

Time elapsed: 0.197
GitHub is a web-based platform used for version control and collaboration that helps developers store and manage their code repositories online. It allows multiple developers to work collabor
Time elapsed: 6.116
GitHub is a web-based platform that hosts and manages software development projects using the Git version control system. It provides a collaborative environment for developers to work together on coding projects, including features such as task management, code review, and bug tracking. GitHub enables developers to share their code with the rest of the community, discover new projects and contribute to them, collaborate with others on open-source software, and showcase their work to potential employers.
Time elapsed: 6.757
GitHub is a web-based platform used for version control and collaboration of software development projects. It provides tools for developers to manage and store their code, as well as to collaborate with others through features such as pull requests, code reviews, and issue tracking. GitHub has become a popular platform for open-source projects and offers various features such as version control, documentation, bug tracking, task management, wikis, and more. It is widely used in the technology industry and by developers all over the world.

6. Project resources

  • Repository: https://github.com/zilliztech/GPTCache

  • Documentation: https://gptcache.readthedocs.io/en/latest/

  • Colab example: https://colab.research.google.com/drive/1m1s-iTDfLDk-UwUAQ_L8j1C-gzkcr2Sk?usp=share_link#scrollTo=6b3ba1cc

Reposted from blog.csdn.net/FrenzyTechAI/article/details/131901215