
LLM Engineer Study Diary (9): Building Vector Storage and Queries with LangChain: Weaviate

news 2025/9/24 1:58:41

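How to get started with the Weaviate vector store in LangChain using the langchain-weaviate package.

Weaviate is an open-source vector database. It lets you store data objects and vector embeddings from your favorite ML models, and it scales seamlessly to billions of data objects.

Official documentation: Home | Weaviate

To use this integration, you need a running Weaviate database instance.

Minimum version

This module requires Weaviate 1.23.7 or later. However, we recommend using the latest version of Weaviate.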
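The version requirement can be checked in code before any data is imported. The following is a minimal sketch: the helper names parse_version and meets_minimum are hypothetical and not part of any Weaviate or LangChain API. It compares versions as integer tuples, because a plain string comparison would rank "1.9.0" above "1.23.7".

```python
import re

MINIMUM_WEAVIATE = (1, 23, 7)  # minimum server version stated in this article


def parse_version(text: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from strings such as '1.25.4' or 'v1.24.0-rc.1'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def meets_minimum(server_version: str,
                  minimum: tuple[int, int, int] = MINIMUM_WEAVIATE) -> bool:
    """Return True when the server version is at least the required minimum."""
    # Tuples compare element by element, so (1, 9, 0) < (1, 23, 7) as intended.
    return parse_version(server_version) >= minimum
```

With a connected v4 client, the version string should be available from the server metadata (as far as we know, weaviate_client.get_meta()["version"]); treat that call as an assumption and confirm it against the client documentation.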

Connecting to Weaviate

In this article, we assume a local Weaviate instance is running at http://localhost:8080, with port 50051 used for gRPC traffic. We therefore connect to Weaviate with the following code:

import weaviate

weaviate_client = weaviate.connect_to_local()  # defaults: localhost, HTTP port 8080, gRPC port 50051

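Other deployment options

Weaviate can be deployed in many different ways, for example with Weaviate Cloud Services (WCS), Docker, or Kubernetes.

If your Weaviate instance is deployed some other way, see the Weaviate documentation on the different ways to connect. You can use one of the other helper functions or create a custom instance.

Note that you need the v4 client API, which creates a weaviate.WeaviateClient object.

Authentication

Some Weaviate instances, such as those running on WCS, have authentication enabled, for example API key and/or username + password authentication.

Read the client authentication guide for more information, as well as the in-depth authentication configuration page.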

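Installation

# Install the packages (the imports used below also need langchain-openai and langchain-community)
# %pip install -Uqq langchain-weaviate
# %pip install openai tiktoken langchain langchain-openai langchain-community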

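Environment setup
This article uses OpenAIEmbeddings, which calls the OpenAI API. We recommend obtaining an OpenAI API key and exporting it as an environment variable named OPENAI_API_KEY.
Once that is done, your OpenAI API key is read automatically. If you are not familiar with environment variables, consult an introductory guide on them first.
Set the Weaviate variables WCD_DEMO_URL and WCD_DEMO_RO_KEY (the setx commands below are for Windows; put your own values between the quotes):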

setx WCD_DEMO_URL ""
setx WCD_DEMO_RO_KEY ""
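Because the scripts in this article read all three variables from the environment, it can help to validate them up front rather than fail later with a bare KeyError. This is a minimal sketch; load_settings and REQUIRED_VARS are hypothetical names introduced here for illustration.

```python
import os
from typing import Mapping

# The three variables this article relies on.
REQUIRED_VARS = ("OPENAI_API_KEY", "WCD_DEMO_URL", "WCD_DEMO_RO_KEY")


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the required variables, or raise one error naming every missing or empty one."""
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name].strip() for name in REQUIRED_VARS}
```

Note that setx only affects terminals opened after the command runs, so a script started from the same window will still see the old values.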

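Usage

Finding objects by similarity

The following example shows how to find objects similar to a query, from importing the data through to querying the Weaviate instance.

Step 1: Data import

First, we create the data to add to Weaviate by loading a long text file and splitting its contents into chunks.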

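from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

documents = TextLoader("../../resource/knowledge.txt", encoding="UTF-8").load()  # load the text file
docs = CharacterTextSplitter(chunk_size=1500, chunk_overlap=0).split_documents(documents)  # split into chunks
embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment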
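To make the chunking step concrete, here is a simplified, standard-library sketch of character-based splitting. It is not LangChain's implementation: split_text is a hypothetical helper that splits on a blank-line separator and greedily packs the pieces into chunks of at most chunk_size characters, with no overlap between chunks.

```python
def split_text(text: str, chunk_size: int = 1500, separator: str = "\n\n") -> list[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate  # still fits, or it is the first piece of a new chunk
        else:
            chunks.append(current)  # close the current chunk and start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

A smaller chunk_size yields more, shorter chunks, which tends to make each stored embedding more specific at the cost of less context per result.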

Now we can import the data. To do so, connect to a Weaviate instance and use the resulting weaviate_client object. For example, the documents can be imported as follows:

# Example: weaviate_client.py
# wcd_url, wcd_api_key and openai_api_key are read from the environment in the full script in Step 2
weaviate_client = weaviate.connect_to_weaviate_cloud(
    cluster_url=wcd_url,  # Replace with your Weaviate Cloud URL
    auth_credentials=Auth.api_key(wcd_api_key),  # Replace with your Weaviate Cloud key
    headers={'X-OpenAI-Api-key': openai_api_key}  # Replace with your OpenAI API key
)
db = WeaviateVectorStore.from_documents(docs, embeddings, client=weaviate_client)

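Step 2: Run the search

Now we can run a similarity search. It returns the documents most similar to the query text, based on the embeddings stored in Weaviate and an equivalent embedding generated from the query text.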

# Example: weaviate_search.py
# pip install -Uqq langchain-weaviate
# pip install openai tiktoken langchain langchain-openai langchain-community
import os
import weaviate
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_weaviate.vectorstores import WeaviateVectorStore
from weaviate.classes.init import Auth

embeddings = OpenAIEmbeddings()
# Load the document
loader = TextLoader("../../resource/knowledge.txt", encoding="UTF-8")
documents = loader.load()
# Split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Best practice: store your credentials in environment variables
wcd_url = os.environ["WCD_DEMO_URL"]
wcd_api_key = os.environ["WCD_DEMO_RO_KEY"]
openai_api_key = os.environ["OPENAI_API_KEY"]

weaviate_client = weaviate.connect_to_weaviate_cloud(
    cluster_url=wcd_url,  # Replace with your Weaviate Cloud URL
    auth_credentials=Auth.api_key(wcd_api_key),  # Replace with your Weaviate Cloud key
    headers={'X-OpenAI-Api-key': openai_api_key}  # Replace with your OpenAI API key
)
try:
    # Embed the chunks and store them in Weaviate
    db = WeaviateVectorStore.from_documents(docs, embeddings, client=weaviate_client)
    query = "Pixar公司是做什么的?"  # "What does Pixar do?"
    results = db.similarity_search(query)
    print(results[0].page_content)  # best-matching chunk
finally:
    # Close the connection even if the import or the search fails
    weaviate_client.close()
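What "most similar" means can be illustrated without a database. The sketch below assumes cosine similarity (to our knowledge Weaviate's default distance metric) and ranks a few stored vectors against a query vector by brute force. The three-dimensional vectors and the helper names are made up for illustration; real embeddings have hundreds or thousands of dimensions, and a vector database uses an approximate index instead of scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          stored: list[tuple[str, list[float]]],
          k: int = 2) -> list[tuple[str, float]]:
    """Return the k stored texts whose vectors are most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy "embeddings", invented for this example.
stored = [
    ("Pixar is an animation studio", [0.9, 0.1, 0.0]),
    ("Apple started in a garage", [0.1, 0.9, 0.0]),
    ("The Whole Earth Catalog", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded query text

for text, score in top_k(query_vec, stored):
    print(f"{score:.3f} : {text}")
```

The query text is embedded with the same model as the stored chunks, which is why the two sets of vectors are comparable at all.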

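Quantifying result similarity

You can optionally retrieve a relevance "score". This is a relative score that indicates how good a particular search result is among the pool of search results.

Note that because the score is relative, it should not be used to set a relevance threshold. It can, however, be used to compare the relevance of different results within the same result set.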

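# Example: weaviate_similarity.py
# Assumes db from weaviate_search.py, with the client connection still open
query = "Pixar公司是做什么的?"  # "What does Pixar do?"
results = db.similarity_search_with_score(query, k=5)
for doc, score in results:  # each result is a (document, score) pair
    print(f"{score:.3f}", ":", doc.page_content[:100] + "...")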

Output

0.700 : During the next five years, I started a company named NeXT, another company named Pixar, and fell in...
0.337 : I was lucky – I found what I loved to do early in life. Woz and I started Apple in my parents garage...
0.271 : I really didn't know what to do for a few months. I felt that I had let the previous generation of e...
0.256 : I'm pretty sure none of this would have happened if I hadn't been fired from Apple. It was awful tas...
0.191 : Stewart and his team put out several issues of The Whole Earth Catalog, and then when it had run its...
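Since the scores are only meaningful relative to each other, one way to read them is to compare each result with the best one and to look at the gaps between neighbors, rather than applying a fixed cutoff. The sketch below uses the five scores printed above; relative_to_best and gap_to_next are hypothetical helpers, not part of LangChain or Weaviate.

```python
scores = [0.700, 0.337, 0.271, 0.256, 0.191]  # copied from the output above


def relative_to_best(values: list[float]) -> list[float]:
    """Express each score as a fraction of the best score in the same result set."""
    best = max(values)
    if best <= 0:
        return [0.0 for _ in values]
    return [value / best for value in values]


def gap_to_next(values: list[float]) -> list[float]:
    """Difference between each score and the next-best one, in descending order."""
    ordered = sorted(values, reverse=True)
    return [round(a - b, 3) for a, b in zip(ordered, ordered[1:])]


for score, share in zip(scores, relative_to_best(scores)):
    print(f"{score:.3f} -> {share:.0%} of the top score")
print("gaps:", gap_to_next(scores))
```

Here the first result stands well apart from the rest (the second scores less than half as high), while results two to five are close together, which matches the output: only the first chunk mentions Pixar.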
