Preface
This is a detailed, Colab-ready version of the steps from the blog post Rag应用——多路召回检索 (CSDN), with the relevant code adjusted to run on Colab.
Download the resources
First, download the resources from Baidu Netdisk: https://pan.baidu.com/s/108y-BUhiNBhG22k9yGP6Wg?pwd=6666
Get a Qianfan API Key and Secret Key
Follow step 2 of this blog post: Python—使用LangChain调用千帆大模型 (CSDN)
Steps
Upload the archive bge-large-zh-v1.5.zip to your Google Drive.
Open Colab.
By default the notebook runs on a CPU, so switch the runtime type to a GPU:
Select "Connect to a hosted runtime":
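To confirm the GPU is actually available, you can run a quick check in a cell (a minimal sketch; PyTorch comes preinstalled on Colab):
import torch
print(torch.cuda.is_available())  # should print True on a GPU runtime
!nvidia-smi  # shows the attached GPU model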
Mount Google Drive
import os
from google.colab import drive
drive.mount('/content/drive')
Change into the colab folder in your Drive (create it first if it does not exist):
%cd /content/drive/MyDrive/colab
Unzip bge-large-zh-v1.5.zip:
!unzip bge-large-zh-v1.5.zip
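To verify that the model unpacked correctly, list the folder; you should see files such as config.json and the model weights (later code loads the model from ./bge-large-zh-v1.5):
!ls bge-large-zh-v1.5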
Install the required packages (note: HuggingFaceEmbeddings is a class provided by langchain_huggingface, not a separate package, so it does not need its own pip install):
!pip install langchain_community
!pip install jieba
!pip install langchain_huggingface
!pip install rank_bm25
!pip install sentence-transformers
!pip install langchain
!pip install faiss-gpu  # if this wheel fails on newer Python versions, faiss-cpu works for this tutorial as well
!pip install langchain_openai
!pip install modelscope
!pip install qianfan
Note: Colab cells run Python code, so the pip command cannot be used in them directly; pip is a shell command meant for a terminal or command prompt. To install packages from inside a Jupyter notebook, prefix the command with ! so it is executed in the shell.
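Jupyter also provides the %pip magic, which makes sure the package lands in the notebook kernel's own environment; either form works on Colab:
%pip install jieba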
Upload the 书籍.json file (remember to rename it to book.json before uploading):
from google.colab import files
uploaded = files.upload()
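A quick sanity check (not in the original steps) that the file landed in the working directory:
import os
print(os.path.exists('book.json'))  # should print True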
Run the code
from langchain_community.retrievers import BM25Retriever
from typing import List
import jieba
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain_huggingface import HuggingFaceEmbeddings
# Load the document (UTF-8 so the Chinese text reads correctly)
loader = TextLoader('book.json', encoding='utf-8')
documents = loader.load()
# Split the text; splitting on '{' so each JSON record starts a new chunk
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,
    length_function=len,
    separators=['{']
)
docs = text_splitter.split_documents(documents)
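# Quick sanity check (not in the original post): how many chunks were
# produced, and what a chunk looks like
print(len(docs))
print(docs[0].page_content[:200])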
# Text preprocessing: segment Chinese text into words with jieba,
# since BM25 needs a token list rather than raw characters
def preprocessing_func(text: str) -> List[str]:
    return list(jieba.cut(text))
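# Illustrative example of jieba segmentation (the sentence is from jieba's own docs):
# preprocessing_func("我来到北京清华大学") -> ['我', '来到', '北京', '清华大学']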
# Initialize the BM25 retriever; building it via from_documents creates the
# internal vectorizer from our jieba-tokenized chunks
retriever = BM25Retriever.from_documents(docs, k=10, preprocess_func=preprocessing_func)
print(retriever.k)
# Get the user's question
user_question = input("Please enter your question: ")
# Retrieve with BM25
retriever.invoke(user_question)
from rank_bm25 import BM25Okapi
texts = [i.page_content for i in docs]
texts_processed = [preprocessing_func(t) for t in texts]
vectorizer = BM25Okapi(texts_processed)
bm25_res = vectorizer.get_top_n(preprocessing_func(user_question), texts, n=10)
print("BM25 检索结果:", bm25_res)
# Vector retrieval: embed the chunks with bge-large-zh-v1.5 and index them
# in FAISS (FAISS was already imported above)
embeddings = HuggingFaceEmbeddings(model_name='./bge-large-zh-v1.5', model_kwargs={'device': 'cuda:0'})
db = FAISS.from_documents(docs, embeddings)
vector_res = db.similarity_search(user_question, k=10)
def rrf(vector_results: List[str], text_results: List[str], k: int = 10, m: int = 60):
    doc_scores = {}
    for rank, doc_id in enumerate(vector_results):
        doc_scores[doc_id] = doc_scores.get(doc_id, 0) + 1 / (rank + m)
    for rank, doc_id in enumerate(text_results):
        doc_scores[doc_id] = doc_scores.get(doc_id, 0) + 1 / (rank + m)
    sorted_results = [d for d, _ in sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)[:k]]
    return sorted_results
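# Reciprocal Rank Fusion (RRF): each document earns 1 / (rank + m) from every
# result list it appears in, so documents ranked well by BOTH retrievers rise
# to the top; m=60 is the constant used in the original RRF paper.
# A tiny illustrative check (hypothetical inputs, not part of the pipeline):
# rrf(["A", "B"], ["B", "C"]) -> ["B", "A", "C"], because "B" scores in both lists.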
vector_results = [i.page_content for i in vector_res]
text_results = list(bm25_res)
rrf_res = rrf(vector_results, text_results)
prompt = '''
Task: answer the user's question based on the retrieved documents.
Requirements:
1. Do not answer beyond what the retrieved documents say.
2. If the retrieved documents do not contain the answer, reply "I don't know".
User question:
{}
Retrieved documents:
{}
'''
import os
from langchain_community.chat_models import QianfanChatEndpoint
from langchain_core.messages import HumanMessage
# Set the Qianfan credentials as environment variables
os.environ["QIANFAN_AK"] = ""  # replace with your API Key
os.environ["QIANFAN_SK"] = ""  # replace with your Secret Key
# Initialize the chat endpoint
chat = QianfanChatEndpoint(streaming=True)
# RAG answer: feed the question plus the fused retrieval results to the model
res = chat.invoke(prompt.format(user_question, '\n'.join(rrf_res)))
print(res.content)
# For comparison: ask the model the same question without any retrieved context
res = chat.invoke(user_question)
print(res.content)