Using Chroma in LangChain (Document Question Answering)
Complete Code
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader

loader = TextLoader('state_of_the_union.txt')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

from langchain_chroma import Chroma
from langchain_community.embeddings import ZhipuAIEmbeddings

# Replace with your own API key
embeddings = ZhipuAIEmbeddings(
    model="Embedding-3",
    api_key="your api key",
)
vectordb = Chroma.from_documents(texts, embeddings)

from langchain_core.vectorstores import VectorStoreRetriever
retriever = VectorStoreRetriever(vectorstore=vectordb)

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    temperature=0,
    model="GLM-4-Plus",
    openai_api_key="your api key",
    openai_api_base="https://open.bigmodel.cn/api/paas/v4/"
)
retrievalQA = RetrievalQA.from_llm(llm=llm, retriever=retriever)

query = "What did the president say about Ketanji Brown Jackson"
retrievalQA.invoke(query)
{'query': 'What did the president say about Ketanji Brown Jackson',
'result': "The president spoke highly of Ketanji Brown Jackson, highlighting her qualifications and the broad support she has received. Specifically, he mentioned the following points about her:\n\n1. Professional Background:\n - She is a former top litigator in private practice.\n - She has served as a federal public defender.\n\n2. Family Background:\n - She comes from a family of public school educators and police officers.\n\n3. Character Traits:\n - She is described as a consensus builder.\n\n4. Support for Her Nomination:\n - Since her nomination, she has received support from a diverse group, including the Fraternal Order of Police and former judges appointed by both Democrats and Republicans.\n\n5. Her Nomination:\n - The president nominated her to serve on the United States Supreme Court, emphasizing that she is one of the nation's top legal minds.\n - He expressed confidence that she would continue the legacy of excellence set by Justice Stephen Breyer.\n\nThese points illustrate the president's strong endorsement of Ketanji Brown Jackson and his belief in her suitability for the Supreme Court."}
Code Explanation
The following is a step-by-step walkthrough of the code, explaining how each part works.
Overall Flow
This example shows how to build a document question-answering system with the Chroma vector database and the LangChain framework. The pipeline has five key steps: document loading → text splitting → vectorization and storage → QA chain construction → query execution.
1. Document Loading
from langchain.document_loaders import TextLoader
loader = TextLoader('state_of_the_union.txt')
documents = loader.load()
- Purpose: load the raw document data from a local text file
- Implementation:
  - The TextLoader reads the text file; state_of_the_union.txt is a sample file, so replace it with the path to your own document
  - Other document formats (PDF, HTML, etc.) are supported by switching to the corresponding Loader
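To make the result of this step concrete, here is a minimal, framework-free sketch of what a text loader produces: a list of "documents", each pairing the file's content with its metadata. The `load_text` helper is hypothetical and only illustrative; LangChain's TextLoader returns Document objects carrying the same two fields (`page_content` and `metadata`).

```python
def load_text(path):
    # Read the whole file and wrap it with source metadata,
    # mimicking the shape of a loaded LangChain Document.
    with open(path, encoding="utf-8") as f:
        content = f.read()
    return [{"page_content": content, "metadata": {"source": path}}]

# Usage: write a small sample file, then load it.
with open("sample.txt", "w", encoding="utf-8") as f:
    f.write("Hello, world.")

documents = load_text("sample.txt")
print(documents[0]["metadata"])  # {'source': 'sample.txt'}
```

The metadata travels with each chunk through the rest of the pipeline, which is what lets a QA system cite its sources later.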
2. Text Splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, # maximum characters per chunk
    chunk_overlap=0  # characters of overlap between chunks
)
texts = text_splitter.split_documents(documents)
- Why it is needed:
  - Large language models have a context-length limit
  - Long documents must be split into chunks for vector retrieval
- Parameters:
  - chunk_size: controls information density; too large loses detail, too small breaks up context
  - chunk_overlap: preserves continuity between chunks; a common setting is 10-20% of chunk_size
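The effect of chunk_overlap can be seen in a toy character-based splitter. This is a simplified sketch: RecursiveCharacterTextSplitter additionally tries to split on paragraph and sentence boundaries before falling back to raw characters.

```python
def split_text(text, chunk_size, chunk_overlap):
    # Advance by (chunk_size - chunk_overlap) so consecutive
    # chunks share their boundary characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghij" * 3  # 30 characters
chunks = split_text(text, chunk_size=10, chunk_overlap=2)

# Each chunk starts with the last 2 characters of the previous one,
# so text near a chunk boundary appears with context on both sides.
print(chunks[0][-2:], chunks[1][:2])  # ij ij
```

With chunk_overlap=0, as in the example code, chunks are disjoint; a sentence cut at a boundary then appears in neither chunk in full, which is why a small overlap is often preferred.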
3. Vectorization and Storage
from langchain_chroma import Chroma
from langchain_community.embeddings import ZhipuAIEmbeddings
# Initialize the Zhipu AI embedding model
embeddings = ZhipuAIEmbeddings(
    model="Embedding-3",   # third-generation embedding model
    api_key="your_api_key" # replace with your actual API key
)
# Create the vector database
vectordb = Chroma.from_documents(
    documents=texts,      # the split document chunks
    embedding=embeddings  # the chosen embedding model
)
# Build the retriever
from langchain_core.vectorstores import VectorStoreRetriever
retriever = VectorStoreRetriever(vectorstore=vectordb)
- Core components:
  - Embedding model: converts text into numeric vectors; this example uses Zhipu AI's Embedding-3 model
  - Vector database: Chroma DB stores the vectors and supports efficient similarity search
- Notes:
  - The API key must be obtained from the Zhipu AI platform
  - Other embedding models (OpenAI, HuggingFace, etc.) can be used by changing the embedding parameter
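What the embedding model plus vector store accomplish together can be sketched without any framework: embed texts as vectors, then return the chunk whose vector is most similar (by cosine similarity) to the query vector. The toy bag-of-words "embedding" below is only a stand-in for a real model such as Embedding-3, which produces dense learned vectors instead.

```python
import math
from collections import Counter

def embed(text, vocab):
    # Toy embedding: a vector of word counts over a fixed vocabulary.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

chunks = ["the president praised the nominee",
          "the bill cuts energy costs"]
vocab = sorted({w for c in chunks for w in c.lower().split()})

# "Vector store": each chunk kept alongside its vector.
index = [(c, embed(c, vocab)) for c in chunks]

query_vec = embed("what did the president say", vocab)
best = max(index, key=lambda item: cosine(query_vec, item[1]))
print(best[0])  # the president praised the nominee
```

Chroma does the same nearest-neighbor lookup, but over high-dimensional dense vectors with an index optimized for large collections.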
4. QA Chain Construction
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
# Initialize the large language model
llm = ChatOpenAI(
    temperature=0,        # controls generation randomness (0 = most deterministic)
    model="GLM-4-Plus",   # Zhipu's GLM-4-Plus model
    openai_api_key="your_api_key",
    openai_api_base="https://open.bigmodel.cn/api/paas/v4/"
)
# Build the retrieval-augmented QA chain
retrievalQA = RetrievalQA.from_llm(
    llm=llm,            # the language model
    retriever=retriever # the document retriever
)
- How it works:
  - When a question arrives, the retriever first fetches the relevant document chunks from the vector store
  - The question and the retrieved context are combined into a prompt for the LLM
  - The LLM generates the final answer based on that context
- Model choice:
  - The example uses GLM-4-Plus; any LangChain-compatible model (GPT-4, Claude, etc.) can be substituted
  - The temperature parameter controls output diversity (range 0-1; higher values give more random results)
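The three-step flow above can be sketched as plain Python. The `retrieve` and `llm` functions here are hypothetical stand-ins; in the real chain they are the Chroma retriever and the GLM-4-Plus chat model, and the actual prompt template used by RetrievalQA differs in wording.

```python
def retrieve(query):
    # Stand-in for vector search: return the most relevant chunks.
    return ["The president nominated Ketanji Brown Jackson to the Supreme Court."]

def llm(prompt):
    # Stand-in for the chat model call.
    return "He nominated her to the Supreme Court."

def retrieval_qa(query):
    # Step 1: retrieve context; Step 2: build the prompt;
    # Step 3: generate the answer from that prompt.
    context = "\n".join(retrieve(query))
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\n"
              f"Question: {query}\nAnswer:")
    return {"query": query, "result": llm(prompt)}

response = retrieval_qa("What did the president say about Ketanji Brown Jackson")
print(response["result"])  # He nominated her to the Supreme Court.
```

This also explains the shape of the chain's return value seen earlier: a dict with the original query and the generated result.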
5. Query Execution
query = "What did the president say about Ketanji Brown Jackson"
response = retrievalQA.invoke(query)
print(response['result'])
- Usage:
  - Construct the query and call the invoke() method to get an answer
  - The returned result contains the original query and the model-generated answer
- Sample result:
The president nominated Ketanji Brown Jackson to the Supreme Court, highlighting her experience as a former public defender and consensus builder...
With this pipeline you can quickly build an intelligent question-answering system over your own knowledge base, suitable for scenarios such as customer-service assistants and knowledge management.
Reference: https://github.com/hwchase17/chroma-langchain/tree/master