当前位置：首页 > news >正文

构建完整的RAG生态系统并优化每个组件

news 2025/9/15 5:48:51

本篇文章Building the Entire RAG Ecosystem and Optimizing Every Component详细介绍了RAG（检索增强生成）系统的构建与优化。文章的技术亮点在于全面覆盖了RAG的各个组件，如查询转换、智能路由、索引和自我纠错流等，提供了丰富的代码示例，便于读者理解和实现。

文章目录

1 理解基础RAG系统
- 1.1 索引阶段
- 1.2 检索
- 1.3 生成
2 高级查询转换
- 2.1 多查询生成
- 2.2 RAG-Fusion
- 2.3 分解
- 2.4 回溯式提示
- 2.5 HyDE
3 路由与查询构建
- 3.1 逻辑路由
- 3.2 语义路由
- 3.3 查询结构化
4 高级索引策略
- 4.1 多表示索引
- 4.2 分层索引（RAPTOR）知识树
- 4.3 令牌级精度（ColBERT）
5 高级检索与生成
- 5.1 专用重排
- 5.2 使用AI智能体进行自校正
- 5.3 长上下文的影响
6 手动RAG评估
- 6.1 核心指标：我们应该衡量什么？
- 6.2 使用LangChain从零构建评估器
7 使用框架进行评估
- 7.1 使用`deepeval`进行快速评估
- 7.2 使用`grouse`的另一个强大替代方案
- 7.3 使用`RAGAS`进行评估
8 总结所有内容

大多数团队在基于自身数据构建生产级RAG系统时，都会经历多轮实验，并依赖多个不同的组件，每个组件都需要独立的设置、调优和细致处理。这些组件包括…

生产级RAG系统

查询转换（Query Transformations）：重写用户问题，使其更有效地进行检索。
智能路由（Intelligent Routing）：将查询导向正确的数据源或专用工具。
索引（Indexing）：创建多层知识库。
检索与重排（Retrieval and Re-ranking）：过滤噪音并优先处理最相关的上下文。
自校正智能体流程（Self-Correcting Agentic Flows）：构建能够评估和改进自身工作的系统。
端到端评估（End-to-End Evaluation）：客观衡量整个管道的性能。

还有更多…

我们将通过可视化图表，从基础到高级技术，学习并编写RAG生态系统的每个部分，以便于理解。

所有代码（理论+Jupyter Notebook）都可以在我的GitHub仓库中找到：rag-ecosystem

我的目录分为几个部分。请看。

1 理解基础RAG系统

在我们深入了解RAG基础知识之前，我们需要设置用于跟踪和其他任务的环境变量，例如我们将使用的LLM API提供商。

import osos.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = <your-api-key>os.environ['OPENAI_API_KEY'] = <your-api-key>

您可以从LangSmith官方文档获取您的LangSmith API密钥，以在此博客中跟踪我们的RAG产品。对于LLM，我们将使用OpenAI API，但您可能已经知道，LangChain也支持各种LLM提供商。

核心RAG管道是任何高级系统的基础，理解其组件至关重要。因此，在深入了解高级组件的细节之前，我们首先需要理解RAG系统的工作核心逻辑，但如果您已经了解RAG系统的工作原理，可以跳过此部分。

基础RAG系统

这个最简单的RAG可以分解为三个组件：

索引（Indexing）：以结构化格式组织和存储数据，以实现高效搜索。
检索（Retrieval）：根据查询或输入搜索并获取相关数据。
生成（Generation）：使用检索到的数据创建最终响应或输出。

让我们从头开始构建这个简单的管道，看看每个部分是如何工作的。

1.1 索引阶段

在我们的RAG系统能够回答任何问题之前，它需要知识来获取信息。为此，我们将使用WebBaseLoader直接从Lilian Weng的优秀博客文章中提取关于LLM驱动智能体的内容。
在这里插入图片描述

索引阶段

import bs4
from langchain_community.document_loaders import WebBaseLoaderloader = WebBaseLoader(web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))),
)docs = loader.load()

bs_kwargs参数帮助我们只针对相关的HTML标签（post-content、post-title等），从一开始就清理我们的数据。

现在我们有了文档，我们面临第一个挑战。由于上下文窗口限制，将一个巨大的文档直接输入到LLM中效率低下，而且通常是不可能的。

这就是为什么**分块（chunking）**是一个关键步骤。我们需要将文档分解成更小、语义上有意义的片段。

RecursiveCharacterTextSplitter是完成这项工作的推荐工具，因为它智能地尝试保持段落和句子完整。

from langchain.text_splitter import RecursiveCharacterTextSplittertext_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)splits = text_splitter.split_documents(docs)

通过chunk_size=1000，我们创建了1000个字符的块，而chunk_overlap=200确保了它们之间存在一定的连续性，这有助于保留上下文。

我们的文本现在已经分块，但它仍然只是文本。为了执行相似性搜索，我们需要将这些块转换为称为**嵌入（embeddings）的数值表示。然后，我们将这些嵌入存储在向量存储（vector store）**中，这是一个专门用于高效搜索向量的数据库。

Chroma向量存储和OpenAIEmbeddings使这一切变得异常简单。以下一行代码同时处理了嵌入和索引。

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddingsvectorstore = Chroma.from_documents(documents=splits,embedding=OpenAIEmbeddings()
)

随着我们的知识被索引，我们现在可以开始提问了。

1.2 检索

向量存储是我们的图书馆，而**检索器（retriever）**是我们的智能图书管理员。它接收用户的查询，对其进行嵌入，然后从向量存储中获取语义上最相似的块。

在这里插入图片描述

检索阶段

从我们的vectorstore创建检索器只需一行代码。

retriever = vectorstore.as_retriever()

让我们测试一下。我们将提出一个问题，看看我们的检索器能找到什么。

docs = retriever.get_relevant_documents("What is Task Decomposition?")print(docs[0].page_content)

Task decomposition can be done (1) by LLM with simple prompting ...
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple ...

正如您所看到的，检索器成功地从博客文章中提取了直接讨论“任务分解”的最相关片段。这段上下文正是LLM形成准确答案所需的。

1.3 生成

我们有了上下文，但我们需要一个LLM来阅读它并形成一个用户友好的答案。这就是RAG中的**“生成（Generation）”**步骤。

生成步骤

首先，我们需要一个好的提示模板。这会指导LLM如何表现。我们不必自己编写，可以从LangChain Hub中拉取一个预优化的模板。

from langchain import hubprompt = hub.pull("rlm/rag-prompt")print(prompt)

human
You are an assistant for question-answering tasks. Use the following pieces
of retrieved context to answer the question. If you dont know the answer,
just say that you dont know. Use three sentences maximum and keep the
answer concise.Question: {question}
Context: {context}
Answer:

接下来，我们初始化我们的LLM。我们将使用gpt-3.5-turbo。

from langchain_openai import ChatOpenAIllm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

现在是最后一步：将所有内容串联起来。使用LangChain表达式语言（LCEL），我们可以将一个组件的输出传递给下一个组件的输入。

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthroughdef format_docs(docs):return "\n\n".join(doc.page_content for doc in docs)rag_chain = ({"context": retriever | format_docs, "question": RunnablePassthrough()}| prompt| llm| StrOutputParser()
)

让我们分解这个链条：

{"context": retriever | format_docs, "question": RunnablePassthrough()}：这部分并行运行。它将用户的查询发送给retriever以获取文档，然后由format_docs格式化成单个字符串。同时，RunnablePassthrough将原始问题不变地传递下去。
| prompt：上下文和问题被输入到我们的提示模板中。
| llm：格式化后的提示被发送到LLM。
| StrOutputParser()：这会将LLM的输出清理成一个简单的字符串。

现在，让我们调用整个链条。

response = rag_chain.invoke("What is Task Decomposition?")
print(response)

Task decomposition is a technique used to break down large tasks
into smaller, more manageable subgoals. This can be achieved by using a
Large Language Model (LLM) with simple prompts, task-specific instructions,
or human inputs. For example, ...

我们成功地检索了关于**“任务分解”**的相关信息，并用它生成了一个简洁、准确的答案。这个简单的链条构成了我们构建更高级、更强大功能的基础。

2 高级查询转换

现在我们理解了RAG管道的基础知识。但生产系统常常暴露出这种基本方法的局限性。最常见的失败点之一就是用户查询本身。

在这里插入图片描述

查询转换

查询可能过于具体、过于宽泛，或者使用的词汇与我们的源文档不同，从而导致检索结果不佳。

解决方案不是责怪用户，而是让我们的系统更智能。**查询转换（Query Transformation）**是一组强大的技术，旨在重写、扩展或分解原始问题，以显著提高检索准确性。

我们将不再依赖单一查询，而是设计多个更完善的查询，以更广泛、更准确地进行搜索。

为了测试这些新技术，我们将使用与之前基础RAG管道部分相同的索引知识库。这确保我们可以直接比较结果并看到改进。

快速回顾一下，我们是如何设置检索器的：

loader = WebBaseLoader(web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))),
)
blog_docs = loader.load()text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=300,chunk_overlap=50
)
splits = text_splitter.split_documents(blog_docs)vectorstore = Chroma.from_documents(documents=splits,embedding=OpenAIEmbeddings())retriever = vectorstore.as_retriever()

现在，检索器已准备就绪，让我们探索第一个查询转换技术。

2.1 多查询生成

单个用户查询只代表一个视角。基于距离的相似性搜索可能会遗漏使用同义词或讨论相关概念的相关文档。

多查询方法通过使用LLM生成用户问题的几个不同版本来解决这个问题，从而有效地从多个角度进行搜索。

多查询优化

我们将首先创建一个提示，指示LLM生成这些替代问题。

from langchain.prompts import ChatPromptTemplatetemplate = """You are an AI language model assistant. Your task is to generate five
different versions of the given user question to retrieve relevant documents from a vector
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search.
Provide these alternative questions separated by newlines. Original question: {question}"""
prompt_perspectives = ChatPromptTemplate.from_template(template)generate_queries = (prompt_perspectives| ChatOpenAI(temperature=0)| StrOutputParser()| (lambda x: x.split("\n"))
)

让我们测试这个链条，看看它为我们的问题生成了什么样的查询。

question = "What is task decomposition for LLM agents?"
generated_queries_list = generate_queries.invoke({"question": question})for i, q in enumerate(generated_queries_list):print(f"{i+1}. {q}")

1. How can LLM agents break down complex tasks?
2. What is the process of task decomposition in the context of large language model agents?
3. What are the methods for decomposing tasks for LLM-powered agents?
4. Explain the concept of task decomposition as it applies to AI agents using LLMs.
5. In what ways do LLM agents handle task decomposition?

这非常出色。LLM使用不同的关键词，如“分解复杂任务”、“方法”和“过程”重新表述了我们的原始问题。现在，我们可以为所有这些查询检索文档并合并结果。一种简单的合并方法是获取所有检索到的文档的唯一集合。

from langchain.load import dumps, loads
def get_unique_union(documents: list[list]):""" A simple function to get the unique union of retrieved documents """flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]unique_docs = list(set(flattened_docs))return [loads(doc) for doc in unique_docs]retrieval_chain = generate_queries | retriever.map() | get_unique_uniondocs = retrieval_chain.invoke({"question": question})
print(f"Total unique documents retrieved: {len(docs)}")

Total unique documents retrieved: 6

通过使用五个不同的查询进行搜索，我们总共检索了6个独特的文档，这可能比单一查询捕获了更全面的信息集。现在我们可以将此上下文输入到我们的最终RAG链条中。

from operator import itemgettertemplate = """Answer the following question based on this context:{context}Question: {question}
"""prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(temperature=0)final_rag_chain = ({"context": retrieval_chain, "question": itemgetter("question")}| prompt| llm| StrOutputParser()
)final_rag_chain.invoke({"question": question})

Task decomposition for LLM agents involves breaking down large,
complex tasks into smaller, more manageable sub-goals. This allows
the agent to work through a problem systematically. Methods for
decomposition include using the LLM itself with simple prompts ...

这个答案更具鲁棒性，因为它基于更广泛的相关文档池。

2.2 RAG-Fusion

多查询是一个很好的开始，但简单地合并文档会将它们一视同仁。如果一个文档被我们的三个查询高度排名，而另一个文档只是一个查询的低排名结果呢？

在这里插入图片描述

RAG Fusion

第一个显然更重要。RAG-Fusion在多查询的基础上进行了改进，它不仅获取文档，还…

使用一种称为倒数排名融合（Reciprocal Rank Fusion, RRF）的技术对它们进行重排。

RRF智能地结合了多个排名列表的结果。它会提高在不同结果列表中始终排名靠前的文档的分数，从而将最相关的内容推到顶部。

代码非常相似，但我们将用RRF实现替换get_unique_union函数。

def reciprocal_rank_fusion(results: list[list], k=60):""" Reciprocal Rank Fusion that intelligently combines multiple ranked lists """fused_scores = {}for docs in results:for rank, doc in enumerate(docs):doc_str = dumps(doc)if doc_str not in fused_scores:fused_scores[doc_str] = 0fused_scores[doc_str] += 1 / (rank + k)reranked_results = [(loads(doc), score)for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)]return reranked_results

上述函数将在通过相似性搜索获取文档后对其进行重排，但我们尚未初始化它，所以现在就来做。

template = """You are a helpful assistant that generates multiple search queries based on a single input query. \n
Generate multiple search queries related to: {question} \n
Output (4 queries):"""
prompt_rag_fusion = ChatPromptTemplate.from_template(template)generate_queries = (prompt_rag_fusion| ChatOpenAI(temperature=0)| StrOutputParser()| (lambda x: x.split("\n"))
)retrieval_chain_rag_fusion = generate_queries | retriever.map() | reciprocal_rank_fusion
docs = retrieval_chain_rag_fusion.invoke({"question": question})print(f"Total re-ranked documents retrieved: {len(docs)}")

Total re-ranked documents retrieved: 7

最终的链条保持不变，但现在它接收到一个更智能排序的上下文。RAG-Fusion是一种强大、低成本的方法，可以提高检索质量。

2.3 分解

有些问题过于复杂，无法一步回答。例如，**“LLM驱动智能体的主要组件是什么，它们如何交互？”**这实际上是两个问题合二为一。

在这里插入图片描述

递归回答

分解技术使用LLM将复杂查询分解为一组更简单、独立的子问题。然后我们可以逐一回答这些子问题，并综合出最终答案。

我们将从为此目的设计的提示开始。

template = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
The goal is to break down the input into a set of sub-problems / sub-questions that can be answers in isolation. \n
Generate multiple search queries related to: {question} \n
Output (3 queries):"""
prompt_decomposition = ChatPromptTemplate.from_template(template)generate_queries_decomposition = (prompt_decomposition| ChatOpenAI(temperature=0)| StrOutputParser()| (lambda x: x.split("\n"))
)question = "What are the main components of an LLM-powered autonomous agent system?"
sub_questions = generate_queries_decomposition.invoke({"question": question})
print(sub_questions)

['1. What are the core components ... agent?','2. How is memory implemented in LLM-po ... agents?','3. What role does planning and task decomposition ... LLMs?'
]

LLM成功地分解了我们的复杂问题。现在，我们可以单独回答这些问题并结合结果。一种有效的方法是回答每个子问题，并使用生成的问答对作为上下文来综合出最终的、全面的答案。

prompt_rag = hub.pull("rlm/rag-prompt")rag_results = []
for sub_question in sub_questions:retrieved_docs = retriever.get_relevant_documents(sub_question)answer = (prompt_rag | llm | StrOutputParser()).invoke({"context": retrieved_docs, "question": sub_question})rag_results.append(answer)def format_qa_pairs(questions, answers):"""Format Q and A pairs"""formatted_string = ""for i, (question, answer) in enumerate(zip(questions, answers), start=1):formatted_string += f"Question {i}: {question}\nAnswer {i}: {answer}\n\n"return formatted_string.strip()context = format_qa_pairs(sub_questions, rag_results)template = """Here is a set of Q+A pairs:{context}Use these to synthesize an answer to the original question: {question}
"""prompt = ChatPromptTemplate.from_template(template)final_rag_chain = (prompt| llm| StrOutputParser()
)final_rag_chain.invoke({"context": context, "question": question})

An LLM-powered autonomous agent system primarily consists of three
core components: planning, memory, and tool use. ...

通过分解问题，我们构建了一个比其他方式更详细、更有条理的答案。

2.4 回溯式提示

有时，用户的查询过于具体，而我们的文档包含回答该问题所需的更通用、更基础的信息。

在这里插入图片描述

回溯式提示

例如，用户可能会问：“警察乐队的成员可以合法逮捕吗？”

直接搜索可能会失败。回溯式（Step-Back）技术使用LLM“退一步”形成一个更通用的问题，例如“警察乐队的权力与职责是什么？”然后我们为特定问题和通用问题都检索上下文，为最终答案提供更丰富的上下文。

我们可以使用少量示例来教授LLM这种模式。

from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplateexamples = [{"input": "Could the members of The Police perform lawful arrests?","output": "what can the members of The Police do?",},{"input": "Jan Sindel's was born in what country?","output": "what is Jan Sindel's personal history?",},
]example_prompt = ChatPromptTemplate.from_messages([("human", "{input}"),("ai", "{output}")
])few_shot_prompt = FewShotChatMessagePromptTemplate(example_prompt=example_prompt,examples=examples,
)prompt = ChatPromptTemplate.from_messages([("system","You are an expert at world knowledge. Your task is to step back and paraphrase a question ""to a more generic step-back question, which is easier to answer. Here are a few examples:"),few_shot_prompt,("user", "{question}"),
])

现在，我们可以简单地定义回溯方法的链条，所以我们来做。

generate_queries_step_back = prompt | ChatOpenAI(temperature=0) | StrOutputParser()question = "What is task decomposition for LLM agents?"
step_back_question = generate_queries_step_back.invoke({"question": question})print(f"Original Question: {question}")
print(f"Step-Back Question: {step_back_question}")

Original Question: What is task decomposition for LLM agents?
Step-Back Question: What are the different approaches to task decompositionin software engineering?

这是一个重要的回溯问题。它将范围扩大到通用软件工程，这可能会引入基础文档，然后可以将其与关于LLM智能体的特定上下文结合起来。现在我们可以构建一个同时使用两者的链条。

from langchain_core.runnables import RunnableLambda# Prompt for the final response
response_prompt_template = """You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.# Normal Context
{normal_context}# Step-Back Context
{step_back_context}# Original Question: {question}
# Answer:"""
response_prompt = ChatPromptTemplate.from_template(response_prompt_template)# The full chain
chain = ({# Retrieve context using the normal question"normal_context": RunnableLambda(lambda x: x["question"]) | retriever,# Retrieve context using the step-back question"step_back_context": generate_queries_step_back | retriever,# Pass on the original question"question": lambda x: x["question"],}| response_prompt| ChatOpenAI(temperature=0)| StrOutputParser()
)chain.invoke({"question": question})

当我们使用此回溯提示链运行我们的查询时，我们得到以下输出。

Task decomposition is a fundamental concept in software engineering
where a complex problem is broken down into smaller, more manageable parts.
In the context of LLM agents, this principle is applied to enable them to
handle large tasks. By decomposing a task into sub-goa ....

2.5 HyDE

这项最终技术是最巧妙的之一。检索的核心问题是用户的查询可能与文档使用不同的词汇（“词汇不匹配”问题）。

HyDE

**HyDE（假设文档嵌入）**提出了一种激进的解决方案：首先，让LLM生成一个问题的假设性答案。这个虚假文档虽然在事实上不完全正确，但将具有丰富的语义，并使用我们期望在真实答案中找到的语言。

然后，我们嵌入这个假设文档，并使用其嵌入来执行检索。结果是，我们找到了与理想答案语义上非常相似的真实文档。

让我们首先创建一个提示来生成这个假设文档。

template = """Please write a scientific paper passage to answer the question
Question: {question}
Passage:"""
prompt_hyde = ChatPromptTemplate.from_template(template)generate_docs_for_retrieval = (prompt_hyde| ChatOpenAI(temperature=0)| StrOutputParser()
)hypothetical_document = generate_docs_for_retrieval.invoke({"question": question})
print(hypothetical_document)

Task decomposition in large language model (LLM) agents refers to the
process of breaking down a complex, high-level task ...

这段文字是一个完美的教科书式答案。现在，我们使用它的嵌入来查找真实文档。

retrieval_chain = generate_docs_for_retrieval | retriever
retrieved_docs = retrieval_chain.invoke({"question": question})final_rag_chain.invoke({"context": retrieved_docs, "question": question})

Task decomposition for LLM agents involves breaking down a larger task
into smaller, more manageable subgoals. This can be done using techni ...

通过使用假设文档作为诱饵，HyDE帮助我们精确地定位知识库中最相关的片段，展示了我们RAG工具包中的另一个强大工具。

3 路由与查询构建

我们的RAG系统越来越智能，但在现实世界中，知识并非存储在单一、统一的库中。

我们通常有多个数据源：不同编程语言的文档、内部维基、公共网站或带有结构化元数据的数据库。

路由与查询转换

将每个查询发送到每个源会效率低下，并可能导致嘈杂、不相关的结果。

这就是我们的RAG系统需要从一个简单的图书管理员演变为智能交换机操作员的地方。它需要能够首先分析传入的查询，然后将其路由到正确的目的地，或者构建一个更精确、结构化的查询进行检索。本节将深入探讨实现这一点的技术。

3.1 逻辑路由

路由是一个分类问题。给定用户的提问，我们需要将其分类到几个预定义的类别之一。虽然传统的机器学习模型可以做到这一点，但我们可以利用我们已有的强大推理引擎：LLM本身。

通过向LLM提供清晰的Schema（一组可能的类别），我们可以要求它为我们做出分类决策。

我们将首先使用Pydantic模型定义LLM输出的“契约”。这个Schema明确告诉LLM查询的可能目的地。

from typing import Literal
from langchain_core.pydantic_v1 import BaseModel, Fieldclass RouteQuery(BaseModel):"""A data model to route a user query to the most relevant datasource."""datasource: Literal["python_docs", "js_docs", "golang_docs"] = Field(...,description="Given a user question, choose which datasource would be most relevant for answering their question.",)

定义好Schema后，我们现在可以构建路由链。我们将使用一个提示来向LLM提供指令，然后使用.with_structured_output()方法确保其响应与我们的RouteQuery模型完美匹配。

llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)structured_llm = llm.with_structured_output(RouteQuery)system = """You are an expert at routing a user question to the appropriate data source.Based on the programming language the question is referring to, route it to the relevant data source."""prompt = ChatPromptTemplate.from_messages([("system", system),("human", "{question}"),]
)router = prompt | structured_llm

现在，让我们测试一下我们的路由器。我们将向它传递一个明显是关于Python的问题，并检查输出。

question = """Why doesn't the following code work:from langchain_core.prompts import ChatPromptTemplateprompt = ChatPromptTemplate.from_messages(["human", "speak in {language}"])
prompt.invoke("french")
"""result = router.invoke({"question": question})print(result)

datasource='python_docs'

输出是我们的RouteQuery模型的一个实例，LLM正确地将python_docs识别为适当的数据源。这个结构化输出现在可以可靠地用于我们的代码中实现分支逻辑。

def choose_route(result):"""A function to determine the downstream logic based on the router's output."""if "python_docs" in result.datasource.lower():return "chain for python_docs"elif "js_docs" in result.datasource.lower():return "chain for js_docs"else:return "chain for golang_docs"full_chain = router | RunnableLambda(choose_route)final_destination = full_chain.invoke({"question": question})print(final_destination)

chain for python_docs

我们的交换机正确地路由了与Python相关的查询。这种方法对于构建多源RAG系统非常强大。

3.2 语义路由

逻辑路由在您有明确定义的类别时完美运行。但是，如果您想根据问题的风格或领域进行路由呢？例如，您可能希望以严肃、学术的语气回答物理问题，并以循序渐进、教学的方式回答数学问题。这就是**语义路由（Semantic Routing）**的用武之地。

语义路由

我们不是对查询进行分类，而是定义多个专家提示。

然后，我们嵌入用户的查询和每个提示模板，并使用余弦相似度来找到与查询语义最匹配的提示。

首先，让我们定义两个专家角色。

from langchain_core.prompts import PromptTemplatephysics_template = """You are a very smart physics professor. \
You are great at answering questions about physics in a concise and easy to understand manner. \
When you don't know the answer to a question you admit that you don't know.Here is a question:
{query}"""math_template = """You are a very good mathematician. You are great at answering math questions. \
You are so good because you are able to break down hard problems into their component parts, \
answer the component parts, and then put them together to answer the broader question.Here is a question:
{query}"""

现在，我们将创建执行嵌入和相似性比较的路由函数。

from langchain.utils.math import cosine_similarityembeddings = OpenAIEmbeddings()prompt_templates = [physics_template, math_template]
prompt_embeddings = embeddings.embed_documents(prompt_templates)def prompt_router(input):"""A function to route the input query to the most similar prompt template."""query_embedding = embeddings.embed_query(input["query"])similarity = cosine_similarity([query_embedding], prompt_embeddings)[0]most_similar_index = similarity.argmax()chosen_prompt = prompt_templates[most_similar_index]print(f"DEBUG: Using {'MATH' if most_similar_index == 1 else 'PHYSICS'} template.")return PromptTemplate.from_template(chosen_prompt)

有了路由逻辑，我们就可以构建动态选择正确专家的完整链。

chain = ({"query": RunnablePassthrough()}| RunnableLambda(prompt_router)| ChatOpenAI()| StrOutputParser()
)print(chain.invoke("What's a black hole"))

DEBUG: Using PHYSICS template.
A black hole is a region of spacetime where gravity is so strong
that nothing—no particles or even electromagnetic radiation such as
light—can escape from it. The boundary of no escape is called the
event horizon. Although it has a great effect on the fate
and circumstances of an object crossing it, it has no locally
detectable features. In many ways, a black hole acts as an ideal black body,
as it reflects no light.

完美。路由器正确地将问题识别为与物理相关，并使用了物理教授的提示，从而得到了一个简洁准确的答案。这项技术非常适合创建能够根据用户需求调整其角色的专业智能体。

3.3 查询结构化

到目前为止，我们一直专注于从非结构化文本中检索。但大多数真实世界的数据是半结构化的；它包含有价值的元数据，如日期、作者、浏览量或类别。简单的向量搜索无法利用这些信息。

**查询结构化（Query Structuring）**是将自然语言问题转换为结构化查询的技术，该结构化查询可以使用这些元数据过滤器进行高度精确的检索。

为了说明这一点，让我们看看YouTube视频转录本中可用的元数据。

from langchain_community.document_loaders import YoutubeLoaderdocs = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=pbAd8O1Lvm4", add_video_info=True
).load()print(docs[0].metadata)

{'source': 'pbAd8O1Lvm4','title': 'Self-reflective RAG with LangGraph: Self-RAG and CRAG','description': 'Unknown','view_count': 11922,'thumbnail_url': 'https://i.ytimg.com/vi/pbAd8O1Lvm4/hq720.jpg','publish_date': '2024-02-07 00:00:00','length': 1058,'author': 'LangChain'
}

此文档具有丰富的元数据：view_count、publish_date、length。我们希望用户能够使用自然语言筛选这些字段。为此，我们将定义另一个Pydantic Schema，这次用于结构化视频搜索查询。

import datetime
from typing import Optionalclass TutorialSearch(BaseModel):"""A data model for searching over a database of tutorial videos."""content_search: str = Field(..., description="Similarity search query applied to video transcripts.")title_search: str = Field(..., description="Alternate version of the content search query to apply to video titles.")min_view_count: Optional[int] = Field(None, description="Minimum view count filter, inclusive.")max_view_count: Optional[int] = Field(None, description="Maximum view count filter, exclusive.")earliest_publish_date: Optional[datetime.date] = Field(None, description="Earliest publish date filter, inclusive.")latest_publish_date: Optional[datetime.date] = Field(None, description="Latest publish date filter, exclusive.")min_length_sec: Optional[int] = Field(None, description="Minimum video length in seconds, inclusive.")max_length_sec: Optional[int] = Field(None, description="Maximum video length in seconds, exclusive.")def pretty_print(self) -> None:"""A helper function to print the populated fields of the model."""for field in self.__fields__:if getattr(self, field) is not None:print(f"{field}: {getattr(self, field)}")

这个Schema是我们的目标。现在我们将创建一个链，它接收用户问题并填充此模型。

system = """You are an expert at converting user questions into database queries. \
You have access to a database of tutorial videos about a software library for building LLM-powered applications. \
Given a question, return a database query optimized to retrieve the most relevant results.If there are acronyms or words you are not familiar with, do not try to rephrase them."""prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])
structured_llm = llm.with_structured_output(TutorialSearch)query_analyzer = prompt | structured_llm

让我们用几个不同的问题来测试它的强大功能。

query_analyzer.invoke({"question": "rag from scratch"}).pretty_print()

content_search: rag from scratch
title_search: rag from scratch

正如预期的那样，它填充了内容和标题搜索字段。现在是一个更复杂的查询。

query_analyzer.invoke({"question": "videos on chat langchain published in 2023"}
).pretty_print()

content_search: chat langchain
title_search: chat langchain 2023
earliest_publish_date: 2023-01-01
latest_publish_date: 2024-01-01

这太棒了。LLM正确地解释了“2023年发布”并创建了一个日期范围过滤器。让我们再尝试一个带有时间限制的。

query_analyzer.invoke({"question": "how to use multi-modal models in an agent, only videos under 5 minutes"}
).pretty_print()

content_search: multi-modal models agent
title_search: multi-modal models agent
max_length_sec: 300

它完美地将“5分钟以内”转换为max_length_sec: 300。这个结构化查询现在可以传递给支持元数据过滤的向量存储，从而实现远远超出简单语义搜索的极其精确和高效的检索。

4 高级索引策略

到目前为止，我们的索引方法一直很简单：将文档分割成块并进行嵌入。这有效，但有一个根本性的局限性。

小而集中的块对于检索准确性非常有用（它们包含的噪音较少），但它们通常缺乏LLM生成全面答案所需的更广泛上下文。

索引策略

相反，大块提供了很好的上下文，但在检索中表现不佳，因为它们的核心含义被稀释了。

这就是经典的“块大小”困境。我们如何才能两全其美？

答案在于更高级的索引策略，这些策略将用于检索的文档表示与用于生成的文档表示分开。让我们深入探讨。

4.1 多表示索引

多表示索引的核心思想简单而强大：我们不嵌入完整的文档块，而是为每个块创建一个更小、更集中的表示（例如摘要）并嵌入它。

在检索过程中，我们搜索这些简洁的摘要。一旦找到最佳摘要，我们就会使用其ID查找并检索完整的原始文档块。

这样，我们既能获得搜索小而密集摘要的精确性，又能获得用于生成的较大父文档的丰富上下文。

首先，我们需要加载一些文档来处理。我们将从Lilian Weng的博客中获取两篇文章。

from langchain_community.document_loaders import WebBaseLoaderloader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())print(f"Loaded {len(docs)} documents.")

Loaded 2 documents.

接下来，我们将创建一个链来为每个文档生成摘要。

import uuidsummary_chain = ({"doc": lambda x: x.page_content}| ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")| ChatOpenAI(model="gpt-3.5-turbo", max_retries=0)| StrOutputParser()
)summaries = summary_chain.batch(docs, {"max_concurrency": 5})print(summaries[0])

The document discusses building autonomous agents powered by Large
Language Models (LLMs). It outlines the key components of such a system, ...

现在是关键部分。我们需要一个MultiVectorRetriever，它需要两个主要组件：

一个vectorstore来存储我们摘要的嵌入。
一个docstore（一个简单的键值存储）来保存原始的完整文档。

from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_core.documents import Documentvectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())store = InMemoryByteStore()
id_key = "doc_id"retriever = MultiVectorRetriever(vectorstore=vectorstore,byte_store=store,id_key=id_key,
)doc_ids = [str(uuid.uuid4()) for _ in docs]summary_docs = [Document(page_content=s, metadata={id_key: doc_ids[i]})for i, s in enumerate(summaries)
]retriever.vectorstore.add_documents(summary_docs)retriever.docstore.mset(list(zip(doc_ids, docs)))

我们的高级索引现已构建完成。让我们测试检索过程。我们将提出一个关于“智能体中的记忆”的问题，看看会发生什么。

query = "Memory in agents"sub_docs = vectorstore.similarity_search(query, k=1)
print("--- Result from searching summaries ---")
print(sub_docs[0].page_content)
print("\n--- Metadata showing the link to the parent document ---")
print(sub_docs[0].metadata)--- Result from searching summaries ---
The document discusses the concept of building autonomous agents powered by Large Language Models (LLMs) as their core controllers. It covers components such as planning, memory, and tool use, along with case studies and proof-of-concept examples like AutoGPT and GPT-Engineer. Challenges like finite context length, planning difficulties, and reliability of natural language interfaces are also highlighted. The document provides references to related research papers and offers a comprehensive overview of LLM-powered autonomous agents.--- Metadata showing the link to the parent document ---
{'doc_id': 'cf31524b-fe6a-4b28-a980-f5687c9460ea'}

正如您所看到的，搜索找到了提及“记忆”的摘要。现在，MultiVectorRetriever将使用此摘要元数据中的doc_id自动从docstore中获取完整的父文档。

retrieved_docs = retriever.get_relevant_documents(query, n_results=1)print("\n--- The full document retrieved by the MultiVectorRetriever ---")
print(retrieved_docs[0].page_content[0:500])
#### OUTPUT ####
"\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil'Log\n\n\n\n\n\n\n\n\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\ ...."

这正是我们想要的！我们搜索了简洁的摘要，但却得到了完整的、上下文丰富的文档，解决了块大小的困境。

4.2 分层索引（RAPTOR）知识树

理论： RAPTOR（递归抽象处理用于树状组织检索）将多表示思想更进一步。RAPTOR不只是一层摘要，它构建了一个多级摘要树。它首先对小的文档块进行聚类，然后对每个聚类进行摘要。

RAPTOR（来自LangChain Docs）

然后，它将这些摘要再次聚类，并对新的聚类进行摘要。这个过程重复进行，创建了一个从细粒度细节到高级概念的知识层次结构。当您查询时，您可以在此树的不同级别进行搜索，从而实现按需特定或通用的检索。

这是一种更高级的技术，虽然我们不会在这里实现完整的算法，但您可以在RAPTOR Cookbook中找到深入的讲解和完整的代码。它代表了结构化索引的前沿技术。

4.3 令牌级精度（ColBERT）

理论： 标准嵌入模型为整个文本块创建一个单一向量（这称为“词袋”方法）。这可能会丢失很多细微差别。

在这里插入图片描述

专用嵌入

**ColBERT（基于BERT的上下文后期交互）**提供了一种更细粒度的方法。它为文档中的每个单独的令牌生成一个单独的、上下文感知的嵌入。

当您进行查询时，ColBERT也会嵌入查询中的每个令牌。然后，它不是比较一个文档向量与一个查询向量，而是找到每个查询令牌与任何文档令牌之间的最大相似度。

这种“后期交互”允许对相关性有更细致的理解，在关键词式搜索方面表现出色。

我们可以通过RAGatouille库轻松使用ColBERT。

!pip install -U ragatouille

from ragatouille import RAGPretrainedModelRAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

现在，让我们使用ColBERT独特的令牌级方法索引一个维基百科页面。

import requestsdef get_wikipedia_page(title: str):"""A helper function to retrieve content from Wikipedia."""URL = "https://en.wikipedia.org/w/api.php"params = { "action": "query", "format": "json", "titles": title, "prop": "extracts", "explaintext": True }headers = {"User-Agent": "MyRAGApp/1.0"}response = requests.get(URL, params=params, headers=headers)data = response.json()page = next(iter(data["query"]["pages"].values()))return page.get("extract")full_document = get_wikipedia_page("Hayao_Miyazaki")RAG.index(collection=[full_document],index_name="Miyazaki-ColBERT",max_document_length=180,split_documents=True,
)

索引过程更复杂，因为它为每个令牌创建嵌入，但RAGatouille无缝地处理了它。现在，让我们搜索新索引。

results = RAG.search(query="What animation studio did Miyazaki found?", k=3)
print(results)

[{'content': 'In April 1984, ...', 'score': 25.9036, 'rank': 1, ...},{'content': 'Hayao Miyazaki ...', 'score': 25.5716, 'rank': 2, ...},{'content': 'Glen Keane said ...', 'score': 24.8411, 'rank': 3, ...}]

第一个结果直接提到了吉卜力工作室的成立。我们也可以轻松地将其封装为标准的LangChain检索器。

colbert_retriever = RAG.as_langchain_retriever(k=3)retrieved_docs = colbert_retriever.invoke("What animation studio did Miyazaki found?")
print(retrieved_docs[0].page_content)

In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.=== Studio Ghibli ===
==== Early films (1985–1996) ====
In June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli's first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki's designs for the film's setting were inspired by Greek architecture and "European urbanistic templates".

ColBERT提供了一种强大、细粒度的替代传统向量搜索的方法，表明我们构建知识库的方式与我们搜索它的方式同样重要。

5 高级检索与生成

我们已经创建了一个复杂的RAG系统，具有智能路由和高级索引。现在，我们已经到达了最后阶段：检索和生成。在这里，我们确保提供给LLM的上下文具有尽可能高的质量，并且LLM的最终答案是相关的、准确的，并以该上下文为基础。

在这里插入图片描述

检索/生成

即使有最好的索引，我们最初的检索仍然可能包含噪音，即不那么相关的文档可能会漏掉。而LLM，尽管功能强大，有时也会误解上下文或产生幻觉。

本节介绍作为我们管道最终质量控制层的高级技术。

5.1 专用重排

标准检索方法为我们提供了文档的排名列表，但这种初始排名并不总是完美的。**重排（Re-ranking）**是一个关键的第二遍步骤，我们在此步骤中获取最初检索到的文档集，并使用更复杂（且通常更昂贵）的模型根据它们与查询的相关性重新排序。

专用重排

这确保了最相关的文档被放置在我们提供给LLM的上下文的最顶部。

我们已经在RAG-Fusion部分看到了一个强大的重排方法：倒数排名融合（RRF）。这是一种很好的、无模型的方法来组合结果。但为了更强大的方法，我们可以使用专用的重排模型，例如Cohere提供的模型。

让我们首先设置一个标准检索器。我们将使用我们之前示例中的同一篇博客文章。

loader = WebBaseLoader(web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",))
blog_docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=300, chunk_overlap=50)
splits = text_splitter.split_documents(blog_docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

现在，我们引入ContextualCompressionRetriever。这个特殊的检索器封装了我们的基本检索器并添加了一个“压缩器”步骤。在这里，我们的压缩器将是CohereRerank模型。

它将从我们的基本检索器中获取10个文档并重新排序，只返回最相关的文档。

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerankcompressor = CohereRerank()compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,base_retriever=retriever
)question = "What is task decomposition for LLM agents?"
compressed_docs = compression_retriever.get_relevant_documents(question)print("--- Re-ranked and Compressed Documents ---")
for doc in compressed_docs:print(f"Relevance Score: {doc.metadata['relevance_score']:.4f}")print(f"Content: {doc.page_content[:150]}...\n")

--- Re-ranked and Compressed Documents ---
Relevance Score: 0.9982
Content: Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.", "What are the subgoals for achieving XYZ?", (2) by using task...Relevance Score: 0.9851
Content: Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into mult...Relevance Score: 0.9765
Content: LLM-powered autonomous agents have been an exciting concept. They can be used for task decomposition by prompting, using task-specific instructions, or ...

输出结果非常显著。CohereRerank模型不仅重新排序了文档，还为每个文档分配了relevance_score。现在我们可以更加确信，我们传递给LLM的上下文是最高质量的，直接导致了更好、更准确的答案。

5.2 使用AI智能体进行自校正

如果我们的RAG系统在给出答案之前能够检查自己的工作呢？这就是自校正RAG架构（如CRAG（纠正性RAG）和Self-RAG）背后的思想。

在这里插入图片描述

自校正RAG（来自Langchain博客）

这些不仅仅是简单的链条，它们是动态图（通常使用LangGraph构建），能够推理检索信息的质量并决定行动方案。

CRAG： 如果检索到的文档与给定查询不相关或含糊不清，CRAG系统不会简单地将它们传递给LLM。相反，它会触发新的、更强大的网络搜索以查找更好的信息，纠正检索到的文档，然后继续生成。
Self-RAG： 这种方法更进一步。在每个步骤中，它使用LLM生成“反思令牌”来批判该过程。它评估检索到的文档的相关性。如果不相关，它会再次检索。一旦它有了好的文档，它就会生成一个答案，然后评估该答案的事实一致性，确保它以源文档为基础。

这些技术代表了构建可靠、生产级RAG的最新技术。从头开始实现它们涉及构建状态机或图。虽然完整的实现很广泛，但您可以在此处找到出色、详细的演练：

CRAG Notebook
Self-RAG Notebook

这些智能体框架是超越简单问答机器人，创建真正健壮的推理引擎的关键。

5.3 长上下文的影响

RAG中一个反复出现的主题是LLM有限的上下文窗口。但随着拥有巨大上下文窗口（128k、200k甚至100万个令牌）的模型兴起，一个问题出现了：

长上下文

**我们还需要RAG吗？**我们能把所有文档都塞进提示中吗？

答案是微妙的。虽然长上下文模型功能强大，但它们并非万能药。

研究表明，当关键信息埋藏在非常长的上下文中间时（“大海捞针”问题），它们的性能可能会下降。

RAG的优势： RAG擅长首先找到这根针，然后只将其呈现给LLM。它是一个精密工具。
长上下文的优势： 长上下文模型非常适合需要同时从文档许多不同部分综合信息的任务，这是RAG可能遗漏的。

未来可能是一种混合方法：使用RAG进行初始的、精确的检索最相关文档，然后将这种高质量、预过滤的上下文输入到长上下文模型中进行最终综合。

要深入了解这个主题，这份演示文稿是一个很好的资源：

关于长上下文的幻灯片： 长上下文对RAG的影响

6 手动RAG评估

我们已经构建了一个日益复杂的RAG管道，并叠加了高级的检索、索引和生成技术。但一个关键问题仍然存在：我们如何证明它确实有效？

在生产环境中，“看起来有效”是不够的。我们需要客观、可重复的指标来衡量性能、识别弱点并指导改进。

这就是评估的用武之地。它是衡量我们RAG系统责任的科学。在这一部分，我们将探讨如何通过从第一性原理构建我们自己的评估器来量化衡量我们系统的质量。

6.1 核心指标：我们应该衡量什么？

在我们深入研究代码之前，让我们定义一个“好的”RAG响应是什么样的。我们可以将其分解为几个核心原则：

忠实性（Faithfulness）： 答案是否严格遵守所提供的上下文？忠实性答案不会捏造信息或使用LLM预训练的知识来回答。这是防止幻觉的最重要指标。
正确性（Correctness）： 与“真实情况”或参考答案相比，答案在事实上是否正确？
上下文相关性（Contextual Relevancy）： 我们检索到的上下文是否与用户的查询实际相关？这评估的是我们检索器的性能，而不是生成器的性能。

让我们探讨如何衡量这些，从最透明的方法开始：自己构建评估器。

6.2 使用LangChain从零构建评估器

理解评估的最佳方式是构建它。使用基本的LangChain组件，我们可以创建自定义链，指示LLM充当公正的“评判员”，根据我们在提示中定义的标准对RAG系统的输出进行评分。这为我们提供了最大的控制和透明度。

让我们从**正确性（Correctness）**开始。我们的目标是创建一个链，将generated_answer与ground_truth答案进行比较，并返回一个0到1的分数。

from langchain.prompts import PromptTemplatellm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)class ResultScore(BaseModel):score: float = Field(..., description="The score of the result, ranging from 0 to 1 where 1 is the best possible score.")correctness_prompt = PromptTemplate(input_variables=["question", "ground_truth", "generated_answer"],template="""Question: {question}Ground Truth: {ground_truth}Generated Answer: {generated_answer}Evaluate the correctness of the generated answer compared to the ground truth.Score from 0 to 1, where 1 is perfectly correct and 0 is completely incorrect.Score:"""
)correctness_chain = correctness_prompt | llm.with_structured_output(ResultScore)

现在，让我们将其封装在一个简单的函数中并进行测试。如果真实情况是“巴黎和马德里”，但我们的RAG系统只部分回答了“巴黎”怎么办？

def evaluate_correctness(question, ground_truth, generated_answer):"""A helper function to run our custom correctness evaluation chain."""result = correctness_chain.invoke({"question": question,"ground_truth": ground_truth,"generated_answer": generated_answer})return result.scorequestion = "What is the capital of France and Spain?"
ground_truth = "Paris and Madrid"
generated_answer = "Paris"
score = evaluate_correctness(question, ground_truth, generated_answer)print(f"Correctness Score: {score}")

Correctness Score: 0.5

这是一个完美的结果。我们的评判LLM正确地推断出生成的答案只对了一半，并给出了0.5的适当分数。

接下来，让我们为**忠实性（Faithfulness）**构建一个评估器。对于RAG来说，这可能比正确性更重要，因为它是我们防止幻觉的主要防御。

在这里，评判LLM必须忽略答案在事实上是否正确，并且只关心答案是否可以从给定的context中推导出来。

faithfulness_prompt = PromptTemplate(input_variables=["question","context", "generated_answer"],template="""Question: {question}Context: {context}Generated Answer: {generated_answer}Evaluate if the generated answer to the question can be deduced from the context.Score of 0 or 1, where 1 is perfectly faithful *AND CAN BE DERIVED FROM THE CONTEXT* and 0 otherwise.You don't mind if the answer is correct; all you care about is if the answer can be deduced from the context.[... a few examples from the notebook to guide the LLM ...]Example:Question: What is 2+2?Context: 4.Generated Answer: 4.In this case, the context states '4', but it does not provide information to deduce the answer to 'What is 2+2?', so the score should be 0."""
)faithfulness_chain = faithfulness_prompt | llm.with_structured_output(ResultScore)

我们在提示中提供了几个示例来指导LLM的推理，特别是对于棘手的边缘情况。让我们用“2+2”的例子来测试它，这是一个经典的忠实性测试。

def evaluate_faithfulness(question, context, generated_answer):"""A helper function to run our custom faithfulness evaluation chain."""result = faithfulness_chain.invoke({"question": question,"context": context,"generated_answer": generated_answer})return result.scorequestion = "what is 3+3?"
context = "6"
generated_answer = "6"
score = evaluate_faithfulness(question, context, generated_answer)print(f"Faithfulness Score: {score}")

Faithfulness Score: 0.0

这展示了定义良好的忠实性指标的力量和精确性。即使答案6在事实上是正确的，它也无法从提供的上下文“6”中逻辑推导出来。

上下文没有说3+3等于6。我们的系统正确地将其标记为一个不忠实的答案，这很可能是LLM使用了自己的预训练知识而不是提供的上下文而产生的幻觉。

从头开始构建这些评估器可以深入了解我们正在衡量的内容。然而，这可能很耗时。在下一部分中，我们将看到如何使用专门的评估框架更有效地实现相同的结果。

7 使用框架进行评估

在上一部分中，我们从头构建了我们自己的评估链。这是理解RAG指标核心原则的绝佳方式。

然而，为了更快、更强大的测试，专用评估框架是首选。

使用框架进行评估

这些库提供了预构建的、经过微调的指标，为我们处理评估的复杂性，使我们能够专注于分析结果。

我们将探索三个流行的框架：deepeval、grouse和RAG特定的强大工具RAGAS。

7.1 使用`deepeval`进行快速评估

deepeval是一个功能强大、开源的框架，旨在使LLM评估变得简单直观。它提供了一组定义明确的指标，可以轻松应用于RAG管道的输出。

工作流程涉及创建LLMTestCase对象并使用Correctness、Faithfulness和ContextualRelevancy等预构建指标对其进行测量。

from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCasetest_case_correctness = LLMTestCase(input="What is the capital of Spain?",expected_output="Madrid is the capital of Spain.",actual_output="MadriD."
)test_case_faithfulness = LLMTestCase(input="what is 3+3?",actual_output="6",retrieval_context=["6"]
)evaluation_results = evaluate(test_cases=[test_case_correctness, test_case_faithfulness],metrics=[GEval(name="Correctness", model="gpt-4o"), FaithfulnessMetric()]
)print(evaluation_results)

✨ Evaluation Results ✨
-------------------------
Overall Score: 0.50
-------------------------
Metrics Summary:
- Correctness: 1.00
- Faithfulness: 0.00
-------------------------

deepeval的聚合视图立即为我们提供了系统性能的高级概览，使我们能够轻松发现需要改进的领域。

7.2 使用`grouse`的另一个强大替代方案

grouse是另一个优秀的开源选项，提供了一套类似的指标，但其独特之处在于允许深度定制“评判员”提示。这对于针对特定领域微调评估标准非常有用。

from grouse import EvaluationSample, GroundedQAEvaluatorevaluator = GroundedQAEvaluator()
unfaithful_sample = EvaluationSample(input="Where is the Eiffel Tower located?",actual_output="The Eiffel Tower is located at Rue Rabelais in Paris.",references=["The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France","Gustave Eiffel died in his appartment at Rue Rabelais in Paris."]
)result = evaluator.evaluate(eval_samples=[unfaithful_sample]).evaluations[0]
print(f"Grouse Faithfulness Score (0 or 1): {result.faithfulness.faithfulness}")

Grouse Faithfulness Score (0 or 1): 0

与deepeval一样，grouse有效地捕捉了细微的错误，为我们的评估工具包提供了另一个强大的工具。

7.3 使用`RAGAS`进行评估

虽然deepeval和grouse是很好的通用评估器，但**RAGAS****（检索增强生成评估）**是一个专门为评估RAG管道而构建的框架。它提供了一套全面的指标，可以衡量系统从检索器到生成器的每个组件。

要使用RAGAS，我们首先需要以特定格式准备我们的评估数据。它要求每个测试用例提供四个关键信息：

question：用户的输入查询。
answer：RAG系统生成的最终答案。
contexts：检索器检索到的文档列表。
ground_truth：正确的参考答案。

让我们准备一个示例数据集。

questions = ["What is the name of the three-headed dog guarding the Sorcerer's Stone?","Who gave Harry Potter his first broomstick?","Which house did the Sorting Hat initially consider for Harry?",
]generated_answers = ["The three-headed dog is named Fluffy.","Professor McGonagall gave Harry his first broomstick, a Nimbus 2000.","The Sorting Hat strongly considered putting Harry in Slytherin.",
]ground_truth_answers = ["Fluffy","Professor McGonagall","Slytherin",
]retrieved_documents = [["A massive, three-headed dog was guarding a trapdoor. Hagrid mentioned its name was Fluffy."],["First years are not allowed brooms, but Professor McGonagall, head of Gryffindor, made an exception for Harry."],["The Sorting Hat muttered in Harry's ear, 'You could be great, you know, it's all here in your head, and Slytherin will help you on the way to greatness...'"],
]

接下来，我们使用Hugging Face的datasets库来构建这些数据，RAGAS可以无缝地与它集成。

from datasets import Datasetdata_samples = {'question': questions,'answer': generated_answers,'contexts': retrieved_documents,'ground_truth': ground_truth_answers
}dataset = Dataset.from_dict(data_samples)

现在，我们可以定义我们的指标并运行评估。RAGAS开箱即用提供了几个强大的、RAG特定的指标。

from ragas import evaluate
from ragas.metrics import (faithfulness,answer_relevancy,context_recall,answer_correctness,
)metrics = [faithfulness,answer_relevancy,context_recall,answer_correctness,
]result = evaluate(dataset=dataset,metrics=metrics
)results_df = result.to_pandas()
print(results_df)

RAGAS评估表

我们可以看到，我们的系统具有高度的忠实性，并且很好地检索了相关上下文（faithfulness和context_recall都非常完美）。答案也高度相关且正确，只有轻微的偏差。

RAGAS使得运行这种全面的端到端评估变得异常容易，为我们提供了自信地部署和改进RAG应用程序所需的数据。

8 总结所有内容

那么，让我们总结一下我们在构建生产级RAG系统方面所做的工作。

在第一部分中，我们从头开始构建了一个基础RAG系统，涵盖了三个核心组件：索引我们的数据、检索相关上下文和生成最终答案。
在第二部分中，我们转向了高级查询转换，使用RAG-Fusion、分解和HyDE等技术重写和扩展用户问题，以实现更准确的检索。
在第三部分中，我们将管道转换为智能交换机，添加了路由以将查询定向到正确的数据源，并结构化查询以利用强大的元数据过滤器。
在第四部分中，我们专注于高级索引，探索了多表示索引和令牌级ColBERT等策略，以创建更智能、更高效的知识库。
在第五部分中，我们使用高级检索技术（如重排）来优先处理最佳上下文，并引入了智能体、自校正概念（如CRAG和Self-RAG）来完善最终输出。
最后，在第六和第七部分中，我们解决了关键的评估步骤。我们学习了如何通过从头构建评估器以及使用deepeval、grouse和RAGAS等强大框架，通过忠实性和正确性等关键指标来衡量系统性能。