Natural Language Processing with Python (61): An Introduction to RAG Fusion

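RAG Fusion is a retrieval method that aims to bridge the gap between traditional search paradigms and the multidimensional nature of human queries. Inspired by Retrieval Augmented Generation (RAG), the project goes a step further: it uses multi-query generation and Reciprocal Rank Fusion (RRF) to rerank search results and improve retrieval quality.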
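For reference, the RRF score used later in this post can be written as a single formula. This is a sketch that mirrors the post's own reciprocal_rank_fusion function, where the rank is 0-based (it comes from Python's enumerate) and k defaults to 60. The symbols Q (the set of generated queries) and r_q(d) (the rank of document d in the result list for query q) are notation introduced here for illustration. Many descriptions of RRF use a 1-based rank instead, which shifts every term slightly without changing the idea.

```latex
\mathrm{score}(d) \;=\; \sum_{\substack{q \in Q \\ d \,\in\, \text{results}(q)}} \frac{1}{k + r_q(d)},
\qquad k = 60,\quad r_q(d) \in \{0, 1, 2, \dots\}
```

A document that appears near the top of several result lists accumulates several large terms, so it rises above a document that ranks first for only one query.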

This implementation is a refactoring of the original GitHub repository; all credit goes to the original author.

Environment Setup

This example uses Pinecone as the vector database and builds a small set of sample documents.

import pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# Initialize the Pinecone client (fill in your own API key and environment)
pinecone.init(api_key="...", environment="...")

# Sample documents to index
all_documents = {
    "doc1": "Climate change and economic impact.",
    "doc2": "Public health concerns due to climate change.",
    "doc3": "Climate change: A social perspective.",
    "doc4": "Technological solutions to climate change.",
    "doc5": "Policy changes needed to combat climate change.",
    "doc6": "Climate change and its impact on biodiversity.",
    "doc7": "Climate change: The science and models.",
    "doc8": "Global warming: A subset of climate change.",
    "doc9": "How climate change affects daily weather.",
    "doc10": "The history of climate change activism.",
}

# Embed the documents and store them in the "rag-fusion" index
vectorstore = Pinecone.from_texts(
    list(all_documents.values()), OpenAIEmbeddings(), index_name="rag-fusion"
)

Query Generator

Next we define a LangChain query-generation chain that takes a single query and produces several related queries.

from langchain import hub
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser

# Pull the query-generation prompt from LangChain Hub
prompt = hub.pull("langchain-ai/rag-fusion-query-generation")

# prompt -> chat model -> plain string -> list of queries (one per line)
generate_queries = (
    prompt
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
    | (lambda x: x.split("\n"))
)
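The chain above splits the model output on newlines, which can leave empty strings or list numbering in the result if the model formats its answer as a numbered list. The sketch below shows one possible cleanup step in plain Python. The function name parse_queries and the sample string are hypothetical and are not part of the original project; the exact output format of the hub prompt is an assumption here.

```python
import re
from typing import List


def parse_queries(llm_output: str) -> List[str]:
    """Split raw model output into a clean, de-duplicated list of queries."""
    queries: List[str] = []
    for line in llm_output.split("\n"):
        # Drop leading list markers such as "1. ", "2) ", "- " or "* "
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        # Skip blank lines and exact duplicates, keeping first-seen order
        if cleaned and cleaned not in queries:
            queries.append(cleaned)
    return queries


# Hypothetical model output, for illustration only
sample_output = (
    "1. impact of climate change on the economy\n"
    "\n"
    "2. climate change and public health\n"
    "3. climate change and public health\n"
)
parsed = parse_queries(sample_output)
print(parsed)
```

If this cleanup is wanted, a function like this could take the place of the lambda at the end of the chain, since LangChain accepts a plain callable in that position, as the lambda already demonstrates.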

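Defining the Full Retrieval Chain

The retrieval chain runs in three steps:

Generate multiple queries.
Run the retriever on each generated query.
Rerank the combined results with Reciprocal Rank Fusion (RRF).

Note: this chain does not perform the final generation step; it only retrieves and fuses results.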
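To make the fusion step concrete before wiring it to a real retriever, here is a minimal sketch that applies the same scoring rule to plain document ids instead of LangChain documents. The function name rrf and the three sample rankings are made up for illustration; the scoring rule, 1 / (rank + k) with a 0-based rank and k = 60, follows the function used in this post.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf(ranked_lists: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of document ids into one scored ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Each list is assumed to be sorted from most to least relevant
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (rank + k)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical retrieval results for three generated queries
rankings = [
    ["doc1", "doc2", "doc6"],
    ["doc2", "doc1", "doc9"],
    ["doc2", "doc6", "doc3"],
]
fused = rrf(rankings)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

In this toy input doc2 appears in all three lists and ends up first, ahead of doc1, which is ranked first for one query but appears in only two lists. That is the behavior RRF is designed for: consistent agreement across queries outweighs a single top placement.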

from langchain.load import dumps, loads

original_query = "impact of climate change"
vectorstore = Pinecone.from_existing_index("rag-fusion", OpenAIEmbeddings())
retriever = vectorstore.as_retriever()


def reciprocal_rank_fusion(results: list[list], k=60):
    fused_scores = {}
    for docs in results:
        # Assumes each result list is already sorted by relevance
        for rank, doc in enumerate(docs):
            # Serialize the document so it can be used as a dictionary key
            doc_str = dumps(doc)
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            fused_scores[doc_str] += 1 / (rank + k)

    # Sort by fused score, highest first, and restore the document objects
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]
    return reranked_results


# Generate queries, retrieve for each one, then fuse the result lists
chain = generate_queries | retriever.map() | reciprocal_rank_fusion

# Run the query
results = chain.invoke({"original_query": original_query})