
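How to Build a RAG System with ElasticsearchRetriever

Elasticsearch is known for fast full-text search, which makes it a common first choice for building document knowledge bases.

This article uses ElasticsearchRetriever together with langchain to build a RAG knowledge base system.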

1 elasticsearch

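1.1 elasticsearch

elasticsearch is a distributed, RESTful search and analytics engine. It provides a multi-tenant distributed full-text search engine with an HTTP interface and schema-free JSON document storage, and it supports keyword search, vector search, hybrid search, and complex filtering.

Here elasticsearch is installed with docker, using the commands below.

Pull the ES image and create a docker network for ES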

docker pull docker.elastic.co/elasticsearch/elasticsearch-wolfi:9.1.3

docker network create elastic

docker pull docker.elastic.co/elasticsearch/elasticsearch:9.1.3

Start ES

docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB docker.elastic.co/elasticsearch/elasticsearch:9.1.3

For the detailed procedure, see

https://blog.csdn.net/liliang199/article/details/151581138

1.2 ElasticsearchRetriever

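ElasticsearchRetriever is a generic langchain wrapper that gives flexible access to all Elasticsearch features through the Query DSL. For most use cases the other classes (such as ElasticsearchStore and ElasticsearchEmbeddings) are sufficient, but when you have requirements they cannot cover, ElasticsearchRetriever is the option to reach for.

For a detailed description of all ElasticsearchRetriever features and configuration options, see the API reference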

https://python.langchain.com/docs/integrations/retrievers/elasticsearch_retriever/

2 langchain

2.1 langchain

elasticsearch is integrated through langchain here, so a langchain environment is required. The install commands are shown below.

pip install -qU langchain-community langchain-elasticsearch -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install -qU langchain-openai  -i https://pypi.tuna.tsinghua.edu.cn/simple

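2.2 LLM configuration

Following the usual langchain conventions, this article uses the deepseek-r1 model through an OpenAI-compatible interface. Specifically, calls to Deepseek R1 are relayed through OneAPI.

For the setup procedure, see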

https://blog.csdn.net/liliang199/article/details/151393128

Example code:

import os
from langchain_openai import ChatOpenAI

os.environ['OPENAI_API_KEY'] = "sk-xxxxxxxx" # LLM API token
os.environ['OPENAI_BASE_URL'] = "http://llm_model_provider_url" # LLM endpoint URL
...
llm = ChatOpenAI(model="deepseek-r1")

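3 Testing and verification

We first verify the ES connection, data import, and queries, and then integrate ES into langchain through ElasticsearchRetriever to build a working RAG system.

3.1 ES connection

es_url is the address where ES is deployed, and passwd is the password of the elastic user in ES.

ssl_context holds the certificate that ES serves over HTTPS. For how to obtain it, see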

https://blog.csdn.net/liliang199/article/details/151586083

Because an LLM with an OpenAI-compatible interface is used, the token (api_key) and the model endpoint (base_url) also need to be configured.

Example code:

import os
os.environ['OPENAI_API_KEY'] = "sk-xxxxxxxx" # LLM API token
os.environ['OPENAI_BASE_URL'] = "http://llm_model_provider_url" # LLM endpoint URL

import ssl
from typing import Any, Dict, Iterable

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from langchain_community.embeddings import DeterministicFakeEmbedding
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_elasticsearch import ElasticsearchRetriever

ssl_context = ssl.create_default_context(cafile='./http_ca.crt') # ES CA certificate
es_url = "https://localhost:9200" # ES address
passwd = "the passwd of the user elastic" # password of the elastic user

es_client = Elasticsearch(hosts=[es_url], basic_auth=('elastic', passwd), ssl_context=ssl_context)
es_client.info()

The output is as follows, which shows that the ES connection succeeded.

ObjectApiResponse({'name': 'xxxx', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'xxxxxx', ..., 'tagline': 'You Know, for Search'})

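3.2 Data import

Here we make up 7 test records: texts = ["foo",  "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]

and build a vector index for them. The vectors are pseudo-random, produced by DeterministicFakeEmbedding; in a real environment this can be replaced with, for example, OllamaEmbeddings.

The implementation is shown below. The procedure is the usual one for batch vectorization and bulk import into ES.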

embeddings = DeterministicFakeEmbedding(size=3)
index_name = "test-langchain-retriever_v0"
text_field = "text"
dense_vector_field = "fake_embedding"
num_characters_field = "num_characters"
texts = [
    "foo",
    "bar",
    "world",
    "hello world",
    "hello",
    "foo bar",
    "bla bla foo",
]


def create_index(
    es_client: Elasticsearch,
    index_name: str,
    text_field: str,
    dense_vector_field: str,
    num_characters_field: str,
):
    # Create the index with a text field, a vector field and an integer field
    es_client.indices.create(
        index=index_name,
        mappings={
            "properties": {
                text_field: {"type": "text"},
                dense_vector_field: {"type": "dense_vector"},
                num_characters_field: {"type": "integer"},
            }
        },
    )


def index_data(
    es_client: Elasticsearch,
    index_name: str,
    text_field: str,
    dense_vector_field: str,
    embeddings: Embeddings,
    texts: Iterable[str],
    refresh: bool = True,
) -> int:
    # num_characters_field is read from the module-level variable defined above
    create_index(es_client, index_name, text_field, dense_vector_field, num_characters_field)

    vectors = embeddings.embed_documents(list(texts))
    requests = [
        {
            "_op_type": "index",
            "_index": index_name,
            "_id": i,
            text_field: text,
            dense_vector_field: vector,
            num_characters_field: len(text),
        }
        for i, (text, vector) in enumerate(zip(texts, vectors))
    ]

    bulk(es_client, requests)

    if refresh:
        es_client.indices.refresh(index=index_name)

    # Number of documents submitted
    return len(requests)


index_data(es_client, index_name, text_field, dense_vector_field, embeddings, texts)

Under normal conditions, the output is as follows.

7
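To make the shape of the bulk payload concrete, here is a minimal sketch that builds the same kind of actions without a running cluster. The fake_embed helper is a hypothetical stand-in for DeterministicFakeEmbedding (it is not the langchain implementation), and the field names are simply the ones used above.

```python
import hashlib
from typing import Dict, Iterable, List


def fake_embed(text: str, size: int = 3) -> List[float]:
    """Hypothetical stand-in for an embedding model: hash bytes scaled to [-1, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 127.5 - 1.0 for i in range(size)]


def build_actions(index_name: str, texts: Iterable[str]) -> List[Dict]:
    """Build one bulk 'index' action per text, mirroring the fields used above."""
    return [
        {
            "_op_type": "index",
            "_index": index_name,
            "_id": i,
            "text": text,
            "fake_embedding": fake_embed(text),
            "num_characters": len(text),
        }
        for i, text in enumerate(texts)
    ]


texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]
actions = build_actions("test-langchain-retriever_v0", texts)
print(len(actions))                  # 7 actions, one per text
print(actions[3]["num_characters"])  # 11 for "hello world"
```

Each dictionary is what elasticsearch.helpers.bulk receives for one document: the underscore-prefixed keys are bulk metadata, and the remaining keys become the document source.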

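3.3 Data queries

ES supports several ways of querying an index, such as vector queries, BM25 queries, and hybrid queries. Each kind of query needs its own ElasticsearchRetriever.

1) Vector query

The vector query code is shown below. It defines the query function vector_query and builds vector_retriever from it.

Inside the query function, the question search_query is embedded, and the resulting vector is passed in as query_vector.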

def vector_query(search_query: str) -> Dict:
    vector = embeddings.embed_query(search_query)  # same embeddings as for indexing
    return {
        "knn": {
            "field": dense_vector_field,
            "query_vector": vector,
            "k": 5,
            "num_candidates": 10,
        }
    }


vector_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=vector_query,
    content_field=text_field,
    es_client=es_client,
)

vector_retriever.invoke("foo")

The output is as follows

[Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9987202, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'),
 ...
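As a rough illustration of what the knn clause computes, the sketch below does a brute-force nearest-neighbour search in plain Python. It assumes the field uses cosine similarity (which appears to be the default for indexed dense_vector fields) and the documented score mapping (1 + cosine) / 2. The toy vectors are made up. ES itself runs an approximate search, so real scores can deviate slightly from the exact value.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def knn(query_vector: List[float], vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Exact top-k search; the score maps cosine from [-1, 1] to [0, 1]."""
    scored = [
        (doc_id, (1.0 + cosine(query_vector, vec)) / 2.0)
        for doc_id, vec in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional vectors, only for illustration
toy_vectors = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [-1.0, 0.0, 0.0]}
top = knn([1.0, 0.0, 0.0], toy_vectors, k=2)
print(top)
```

In the real query, k is the number of hits returned and num_candidates is how many candidates each shard examines before the top k are chosen.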

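2) BM25 query

A BM25 query is traditional keyword (full-text) matching, with hits ranked by the BM25 scoring function. The code is shown below.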

def bm25_query(search_query: str) -> Dict:
    return {
        "query": {
            "match": {
                text_field: search_query,
            },
        },
    }


bm25_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=bm25_query,
    content_field=text_field,
    es_client=es_client,
)

bm25_retriever.invoke("foo")

The output is as follows

[Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9711467, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'),
 Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'),
 Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.6025789, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')]
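The scores above can be reproduced by hand. The following sketch implements the classic BM25 formula in plain Python, assuming the common defaults k1 = 1.2 and b = 0.75, a single-shard index, and lowercase whitespace tokenization as an approximation of the standard analyzer for this toy data.

```python
import math
from typing import Dict, List

K1 = 1.2   # term-frequency saturation (assumed default)
B = 0.75   # length normalization (assumed default)


def tokenize(text: str) -> List[str]:
    """Rough approximation of the standard analyzer for this toy data."""
    return text.lower().split()


def bm25_scores(query: str, docs: List[str]) -> Dict[str, float]:
    """Score every document that contains at least one query term."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores: Dict[str, float] = {}
    for term in tokenize(query):
        doc_freq = sum(1 for tokens in tokenized if term in tokens)
        if doc_freq == 0:
            continue
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        for doc, tokens in zip(docs, tokenized):
            freq = tokens.count(term)
            if freq == 0:
                continue
            norm = K1 * (1 - B + B * len(tokens) / avg_len)
            scores[doc] = scores.get(doc, 0.0) + idf * freq * (K1 + 1) / (freq + norm)
    return scores


texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]
scores = bm25_scores("foo", texts)
for doc, score in sorted(scores.items(), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.4f}  {doc}")
```

Shorter documents score higher for the same term frequency, which is why "foo" ranks above "foo bar" and "bla bla foo".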

3) Hybrid query

A hybrid query defines both a standard text-match query and a knn vector query inside the query function, and fuses their results with Reciprocal Rank Fusion (RRF).

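def hybrid_query(search_query: str) -> Dict:
    vector = embeddings.embed_query(search_query)  # same embeddings as for indexing
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {
                        "standard": {
                            "query": {
                                "match": {
                                    text_field: search_query,
                                }
                            }
                        }
                    },
                    {
                        "knn": {
                            "field": dense_vector_field,
                            "query_vector": vector,
                            "k": 5,
                            "num_candidates": 10,
                        }
                    },
                ]
            }
        }
    }


hybrid_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=hybrid_query,  # must be hybrid_query, not bm25_query
    content_field=text_field,
    es_client=es_client,
)

hybrid_retriever.invoke("foo")

Sample output is shown below. Note that it is identical to the BM25 result above, which is what you get when body_func is set to bm25_query; with hybrid_query the scores are RRF scores and the ranking may differ.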

[Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9711467, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'),
 Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'),
 Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.6025789, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')]
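For intuition about how RRF merges the two result lists, here is a minimal pure-Python sketch. It assumes the usual formulation, where each document receives 1 / (rank_constant + rank) from every list it appears in, with rank_constant = 60 and ranks starting at 1. The knn ranking in the example is hypothetical.

```python
from typing import Dict, List, Tuple


def rrf(rankings: List[List[str]], rank_constant: int = 60) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


bm25_ranking = ["foo", "foo bar", "bla bla foo"]  # order taken from the BM25 result above
knn_ranking = ["foo", "hello", "foo bar"]         # hypothetical order, for illustration only
fused = rrf([bm25_ranking, knn_ranking])
for doc_id, score in fused:
    print(f"{score:.5f}  {doc_id}")
```

Because only ranks are used, RRF needs no score normalization between BM25 and vector similarity, and documents found by both retrievers rise to the top.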

4) Fuzzy query

Example code is shown below. This is string matching with typo tolerance.

def fuzzy_query(search_query: str) -> Dict:
    return {
        "query": {
            "match": {
                text_field: {
                    "query": search_query,
                    "fuzziness": "AUTO",
                }
            },
        },
    }


fuzzy_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=fuzzy_query,
    content_field=text_field,
    es_client=es_client,
)

fuzzy_retriever.invoke("fox")  # note the character tolerance

The output is as follows

[Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.6474311, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'),
 Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.49580228, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'),
 Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.40171927, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')]
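The sketch below illustrates why "fox" matches documents containing "foo". It uses a plain Levenshtein distance and the documented AUTO thresholds (terms of 0 to 2 characters must match exactly, 3 to 5 allow one edit, longer terms allow two). This is a simplification: ES also counts a transposition of adjacent characters as a single edit, which plain Levenshtein does not.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def auto_fuzziness(term: str) -> int:
    """Maximum number of edits allowed for a term under fuzziness AUTO."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(query_term: str, indexed_term: str) -> bool:
    """True when the indexed term is within the allowed edit distance."""
    return edit_distance(query_term, indexed_term) <= auto_fuzziness(query_term)


print(fuzzy_match("fox", "foo"))  # one substitution away, so it matches
print(fuzzy_match("fox", "bar"))  # three substitutions away, so it does not
```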
 

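5) Complex filtering

Several clause types, such as must, must_not, and should, can be combined to narrow the result set and improve query efficiency.

Example code: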

def filter_query_func(search_query: str) -> Dict:
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {num_characters_field: {"gte": 5}}},
                ],
                "must_not": [
                    {"prefix": {text_field: "bla"}},
                ],
                "should": [
                    {"match": {text_field: search_query}},
                ],
            }
        }
    }


filtering_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=filter_query_func,
    content_field=text_field,
    es_client=es_client,
)

filtering_retriever.invoke("foo")

The output is as follows

[Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 1.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'),
 Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '2', '_score': 1.0, '_source': {'fake_embedding': [-0.7041151202179595, -1.4652961969276497, -0.25786766898672847], 'num_characters': 5}}, page_content='world'),
 Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '3', '_score': 1.0, '_source': {'fake_embedding': [0.42728413221815387, -1.1889908285425348, -1.445433230084671], 'num_characters': 11}}, page_content='hello world'),
 Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '4', '_score': 1.0, '_source': {'fake_embedding': [-0.28560441330564046, 0.9958894823084921, 1.5489829880195058], 'num_characters': 5}}, page_content='hello')]
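To see why exactly these four documents come back, here is a small in-memory emulation of the three clauses. It is only a sketch of the bool semantics under simplifying assumptions (whitespace tokenization, should treated as a ranking boost rather than a real BM25 score), not how ES executes the query.

```python
from typing import Dict, List


def bool_filter(docs: List[Dict], query: str) -> List[Dict]:
    """Emulate must (range), must_not (prefix) and should (match) on in-memory docs."""
    query_terms = query.lower().split()
    hits = []
    for doc in docs:
        terms = doc["text"].lower().split()
        if doc["num_characters"] < 5:                    # must: num_characters >= 5
            continue
        if any(term.startswith("bla") for term in terms):  # must_not: prefix "bla"
            continue
        boosted = any(q in terms for q in query_terms)     # should: optional, raises the rank
        hits.append((boosted, doc))
    hits.sort(key=lambda pair: pair[0], reverse=True)      # stable: ties keep index order
    return [doc for _, doc in hits]


texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]
docs = [{"text": text, "num_characters": len(text)} for text in texts]
result = bool_filter(docs, "foo")
print([doc["text"] for doc in result])
```

This also explains the scores in the output above: every hit gets 1.0 from the range clause, and "foo bar" gets its BM25 score of 0.7437035 added on top because it also satisfies the should clause.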

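6) Document mapping

ES query results can be mapped to langchain Document objects. The query function is still the complex-filter function filter_query_func. The mapping function num_characters_mapper is shown below and can be customized as needed.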

def num_characters_mapper(hit: Dict[str, Any]) -> Document:
    num_chars = hit["_source"][num_characters_field]
    content = hit["_source"][text_field]
    return Document(
        page_content=f"This document has {num_chars} characters",
        metadata={"text_content": content},
    )


custom_mapped_retriever = ElasticsearchRetriever.from_es_params(
    index_name=index_name,
    body_func=filter_query_func,
    document_mapper=num_characters_mapper,
    url=es_url,
)

custom_mapped_retriever.invoke("foo")

Sample output is shown below

[Document(metadata={'text_content': 'foo bar'}, page_content='This document has 7 characters'),
 Document(metadata={'text_content': 'world'}, page_content='This document has 5 characters'),
 Document(metadata={'text_content': 'hello world'}, page_content='This document has 11 characters'),
 Document(metadata={'text_content': 'hello'}, page_content='This document has 5 characters')]

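3.4 langchain

Building on the ElasticsearchRetriever verified above, together with a ChatOpenAI instance pointed at the custom LLM, we assemble a complete langchain RAG system. The chain is defined as follows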

chain = (
    {"context": vector_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

The full code example is as follows

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)

llm = ChatOpenAI(model="deepseek-r1")


def format_docs(docs):
    # Join the retrieved documents into a single context string
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": vector_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Then invoke the chain to run a complete RAG knowledge-base retrieval end to end.

chain.invoke("what is foo?")

The output is shown below. The chain not only retrieves documents but also has the LLM turn the retrieved results into a usable answer.

'Based on the provided context, "foo" appears in two instances:\n1. In the line: "bla bla foo"\n2. In the line: "foo bar"\n\nThe context does not explicitly define what "foo" is, but it is used as part of example text alongside other placeholder terms like "bla," "bar," "hello," and "world." No further explanation or meaning is given for "foo" in the context.'
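The data flow through the chain can be sketched with plain functions. In this illustration stub_retrieve and stub_llm are hypothetical stand-ins for vector_retriever and the ChatOpenAI model; the stub LLM simply echoes the prompt so that the text the model would receive is visible.

```python
from typing import Callable, List

PROMPT_TEMPLATE = (
    "Answer the question based only on the context provided.\n\n"
    "Context: {context}\n\n"
    "Question: {question}"
)


def format_docs(docs: List[str]) -> str:
    """Join retrieved passages into one context string."""
    return "\n\n".join(docs)


def run_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> str:
    """Retrieve, build the prompt, then call the model: the same steps as the chain."""
    context = format_docs(retrieve(question))
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return llm(prompt)


def stub_retrieve(question: str) -> List[str]:
    """Hypothetical retriever returning fixed passages."""
    return ["foo bar", "bla bla foo"]


def stub_llm(prompt: str) -> str:
    """Hypothetical model that echoes the prompt it receives."""
    return prompt


answer = run_rag("what is foo?", stub_retrieve, stub_llm)
print(answer)
```

In the langchain version, the dictionary at the head of the chain plays the role of the first two lines of run_rag: it sends the question to the retriever for the context and passes it through unchanged for the question slot.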

reference

---

ElasticsearchRetriever

https://python.langchain.com/docs/integrations/retrievers/elasticsearch_retriever/

Installing Kibana + ElasticSearch locally on a Mac with docker

https://blog.csdn.net/liliang199/article/details/151581138

Accessing a docker-based elasticsearch from python

https://blog.csdn.net/liliang199/article/details/151586083

ElasticsearchRetriever constructor parameters

https://python.langchain.com/api_reference/elasticsearch/retrievers/langchain_elasticsearch.retrievers.ElasticsearchRetriever.html

Reciprocal rank fusion

https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion

OneAPI: accessing all LLMs through the OpenAI API

https://blog.csdn.net/liliang199/article/details/151393128
