如何基于DSL脚本进行elasticsearch向量检索示例
向量检索是目前RAG库主要依赖功能,这里示例如何在elasticsearch中构建dense_vector向量库,并示例基于DSL脚本的向量相似度检索过程
1 es向量检索
1.1 DSL检索
es访问dense_vector的DSL方法有cosinessimilarity, dotProduct, 1norm或l2norm函数。
需要注意每个DSL脚本只能调用这些函数一次。不要在循环中使用这些函数来计算文档向量和多个其他向量之间的相似性。
如果需要该功能,可以通过直接访问向量值来重新实现这些函数。
1.2 建立连接
假设elasticsearch环境已经构建
以下时连接elasticsearch的代码示例
from elasticsearch import Elasticsearch
es = Elasticsearch(hosts=["http://localhost:9200"])
es.info()
输出如下所示
ObjectApiResponse({'name': '87eb7cfd07bb', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'HE1OJj1aQYmnnRL2nwVNgw', 'version': {'number': '8.11.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '64cf052f3b56b1fd4449f5454cb88aca7e739d9a', 'build_date': '2023-12-08T11:33:53.634979452Z', 'build_snapshot': False, 'lucene_version': '9.8.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})
2 虚拟简单示例
这里通过虚构简单的三维向量,示例向量入库和检索过程。
2.1 建立索引
构建文本字段和向量字段,采用虚拟三维向量。
index_name = "vec_index_0"
es.indices.create(index=index_name,mappings = {"properties": {"my_vector": {"type": "dense_vector","dims": 3},"my_text" : {"type" : "keyword"}}})
输出如下
ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'vec_index_0'})
2.2 插入数据
数据插入代码如下所示,并虚构3个三维向量。
from elasticsearch import Elasticsearch, helperstexts = ["Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.","Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.","Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.",
]
vectors = [[0.5, 10, 6],[-0.5, 10, 10],[-1, 11, 3]
]
print(len(texts), len(vectors))
docs = [ {"my_text": text, "my_vector": vec} for text, vec in zip(texts, vectors)]
bulk_response = helpers.bulk(es, docs, index=index_name)
print(bulk_response)
输出如下
3 3
(3, [])
2.3 向量检索
cosinessimilarity函数计算给定查询向量和文档向量之间的余弦相似性度量。
这里通过DSL脚本应用cosine相似度计算,并加1方式相似度为负。
检索代码如下所示。
vec = [-0.5, 10, 6]
print(len(vec))query = {"query": {"script_score": {"query": {"match_all": {}},"script": {"source": "cosineSimilarity(params.queryVector, 'my_vector')+1.0","params": {"queryVector": [-0.5, 10, 6]}}}}
}res = es.search(index=index_name, body=query)
print(res)
输出如下
{'took': 1, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3, 'relation': 'eq'}, 'max_score': 1.9963301, 'hits': [{'_index': 'vec_query_test_v4', '_id': 'erQPgpoBzVHvZ0eub0Bb', '_score': 1.9963301, '_source': {'my_text': 'Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.', 'my_vector': [0.5, 10, 6]}}, {'_index': 'vec_query_test_v4', '_id': 'e7QPgpoBzVHvZ0eub0Bb', '_score': 1.9701604, '_source': {'my_text': 'Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.', 'my_vector': [-0.5, 10, 10]}}, {'_index': 'vec_query_test_v4', '_id': 'fLQPgpoBzVHvZ0eub0Bb', '_score': 1.9618319, '_source': {'my_text': 'Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.', 'my_vector': [-1, 11, 3]}}]}}
3 实际向量示例
3.1 创建向量引擎
这里基于langchain的OllamaEmbeddings创建向量计算引起,并假设ollama和langchain已安装。
创建代码示例如下
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="bge-m3")
3.2 建立索引
构建文本字段和向量字段,采用bge-m3的1024维向量。
index_name = "vec_index_ollama"
es.indices.create(index=index_name,mappings = {"properties": {"my_vector": {"type": "dense_vector","dims": 1024},"my_text" : {"type" : "keyword"}}})
输出如下
ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'vec_index_ollama'})
3.3 插入数据
数据插入代码如下所示,并虚构3个bge-m3向量。
from elasticsearch import Elasticsearch, helperstexts = ["Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.","Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.","Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.",
]
vectors = embeddings.embed_documents(texts)print(len(texts), len(vectors))
docs = [ {"my_text": text, "my_vector": vec} for text, vec in zip(texts, vectors)]
bulk_response = helpers.bulk(es, docs, index=index_name)
print(bulk_response)
输出如下
3 3
(3, [])
3.4 向量查询
检索代码如下所示,在计算相似度之前,这里需要先用ollama embedding计算查询向量,。
程序代码示例如下。
vec = embeddings.embed_query("Rocky Mountain National Park")
print(len(vec))query = {"query": {"script_score": {"query": {"match_all": {}},"script": {"source": "cosineSimilarity(params.queryVector, 'my_vector')+1.0","params": {"queryVector": vec}}}}
}res = es.search(index=index_name, body=query)
print(res)
输出如下
1024
{'took': 9, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3, 'relation': 'eq'}, 'max_score': 1.5386453, 'hits': [{'_index': 'vec_index_ollama', '_id': 'grQbgpoBzVHvZ0eubkCf', '_score': 1.5386453, '_source': {'my_text': 'Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.', 'my_vector': [0.030506486, -0.008855448, -0.0542615, 0.032204177, -0.03877517, -0.061205674, ..., 0.012325888, 0.018495426, 0.018682847]}}, {'_index': 'vec_index_ollama', '_id': 'gbQbgpoBzVHvZ0eubkCf', '_score': 1.4467602, '_source': {'my_text': 'Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.', 'my_vector': [0.019682772, 0.0011470582, -0.055675734, 0.019353507, 0.0018442876, -0.04981726, -0.0319025, 0.003043613, 0.0074340384, 0.004517093, 0.0070879683, -0.0009796194, ..., 0.036385845, -0.03322887, 0.032527614, 0.047814477]}}, {'_index': 'vec_index_ollama', '_id': 'gLQbgpoBzVHvZ0eubkCf', '_score': 1.4366195, '_source': {'my_text': 'Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.', 'my_vector': [0.04341205, 0.010377382, -0.053659115, 0.034629174, -0.040680032, -0.05273915, -0.021637093, -0.014424741, 0.003394217, 0.005586104, 0.010616635, 0.039135892, -0.021042265, -0.0010980334, ..., -0.0015615255, 0.02610852, 0.015156581, 0.010001613, 0.034023464]}}]}}
reference
---
向量数据库:使用Elasticsearch实现向量数据存储与搜索
https://zhuanlan.zhihu.com/p/634015951
