Getting Started with LangChain: Building a PDF Ingestion and Question-Answering System
You will need the pypdf package installed.
Document used in this example: https://s1.q4cdn.com/806093406/files/doc_downloads/2023/414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf
from langchain_community.document_loaders import PyPDFLoader
file_path = "./414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()
print(len(docs))
print(docs[0].page_content[:100])
print(docs[0].metadata)
- The loader reads the PDF at the specified path into memory.
- It then extracts the text data using the pypdf package.
- Finally, it creates one LangChain Document per page of the PDF, containing that page's content plus some metadata about where in the source document the text came from.
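To make the shape of the loader's output concrete, here is a minimal sketch (a plain dataclass standing in for the real LangChain `Document` class, with hypothetical page contents) of the per-page structure described above:

```python
# Minimal stand-in for LangChain's Document class: one object per PDF page,
# holding the page text and metadata about its origin.
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

# Hypothetical result of loading a two-page PDF:
docs = [
    Document("NIKE, Inc. Annual Report ...",
             {"source": "./414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf", "page": 0}),
    Document("Item 1. Business ...",
             {"source": "./414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf", "page": 1}),
]

for doc in docs:
    # Each page keeps track of which file and page it came from.
    print(doc.metadata["page"], doc.page_content[:20])
```

The real `PyPDFLoader` returns objects with exactly this interface, which is why `docs[0].page_content` and `docs[0].metadata` work in the snippet above.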
Question Answering with RAG
Using a text splitter, you will split the loaded documents into smaller documents that fit more easily into an LLM's context window, then load them into a vector store. You can then create a retriever from the vector store for use in the RAG chain.
import os

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vector_store = Chroma.from_documents(
    splits,
    embedding=OpenAIEmbeddings(
        openai_api_base="https://api.siliconflow.cn/v1/",
        openai_api_key=os.environ["siliconFlow"],
        model="Qwen/Qwen3-Embedding-8B",
    ),
)
retriever = vector_store.as_retriever()
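To see why `chunk_overlap=200` matters, here is a simplified sketch of the chunking idea: fixed-size windows that overlap, so context at chunk boundaries is not lost. (The real `RecursiveCharacterTextSplitter` is smarter: it prefers to break on paragraph, sentence, and word boundaries before falling back to raw character counts.)

```python
# Naive sliding-window splitter illustrating chunk_size / chunk_overlap.
def naive_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap  # advance by size minus overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 2500-character text with chunk_size=1000 and chunk_overlap=200
# yields chunks starting at offsets 0, 800, 1600, 2400.
chunks = naive_split("a" * 2500, chunk_size=1000, chunk_overlap=200)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk shares its first 200 characters with the tail of the previous chunk, so a sentence cut in half by one boundary still appears whole in a neighboring chunk.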
Finally, you will build the final rag_chain using a few built-in helpers:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
# llm is assumed to be a chat model instantiated earlier,
# e.g. with langchain_openai.ChatOpenAI
question_answer_chain = create_stuff_documents_chain(llm, prompt=prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": "What was Nike's revenue in 2023?"})
results
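The dict returned by `rag_chain.invoke` carries the question, the retrieved documents, and the generated answer. A sketch of its shape, with hypothetical placeholder values:

```python
# Shape of the result returned by a retrieval chain (placeholder values):
results = {
    "input": "What was Nike's revenue in 2023?",  # the original question
    "context": [],   # the retrieved Documents that were stuffed into the prompt
    "answer": "...", # the LLM's generated answer
}
print(sorted(results))
```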
Inspecting the values under context further, you can see that each one is a Document containing a chunk of the ingested page content. Notably, these documents also preserve the original metadata from when you first loaded them:
print(results["context"][0].page_content)
print(results["context"][0].metadata)