Rerankers from a Milvus Perspective
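In information retrieval and generative AI, a reranker is an important tool for refining the order of initial search results. Unlike a traditional embedding model, a reranker takes a query and a document as input and directly returns a similarity score rather than an embedding. The score indicates how relevant the document is to the query.
Rerankers are typically applied after first-stage retrieval, which is usually done with vector approximate nearest neighbor (ANN) search. ANN search efficiently fetches a large set of potentially relevant results, but it does not always order them by how semantically close they actually are to the query. The reranker then refines the order through deeper contextual analysis, often using BERT or other Transformer-based models. In this way, a reranker can improve the accuracy and relevance of the final results presented to the user.
The PyMilvus model library integrates rerank functions for refining the order of results returned by an initial search. After retrieving the nearest embeddings from Milvus, you can use these reranking tools to refine the results and improve their precision.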
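The two-stage flow described above can be sketched in plain Python. This is a minimal illustration, not the PyMilvus API: `first_stage`, `second_stage`, and `toy_pair_score` are hypothetical names, the vectors are made up, and the word-overlap scorer is only a stand-in for a real cross-encoder model.

```python
from typing import Callable, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def first_stage(query_vec: Sequence[float],
                corpus: List[Tuple[Sequence[float], str]],
                limit: int) -> List[str]:
    """Cheap candidate retrieval: rank by vector similarity, keep `limit` texts."""
    ranked = sorted(corpus,
                    key=lambda item: inner_product(query_vec, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:limit]]


def toy_pair_score(query: str, document: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query words in the document."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    return len(q & d) / len(q) if q else 0.0


def second_stage(query: str,
                 candidates: List[str],
                 scorer: Callable[[str, str], float],
                 top_k: int) -> List[Tuple[float, str]]:
    """Rerank: score every (query, document) pair and sort by that score."""
    scored = [(scorer(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up 2-dimensional vectors, for illustration only.
corpus = [
    ([0.9, 0.1], "Alan Turing proposed the Turing Test in 1950."),
    ([0.8, 0.3], "The Dartmouth Conference in 1956 founded artificial intelligence."),
    ([0.1, 0.9], "Milvus is a vector database."),
]
query_text = "What happened at the Dartmouth Conference in 1956"
query_vec = [1.0, 0.2]

candidates = first_stage(query_vec, corpus, limit=2)
reranked = second_stage(query_text, candidates, toy_pair_score, top_k=2)
for score, doc in reranked:
    print(f"{score:.3f}  {doc}")
```

In this toy data the vector stage places the Turing passage first, while the pairwise scorer, which looks at the query and document together, moves the Dartmouth passage to the top.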
Rerank Function | API or Open-sourced |
---|---|
BGE | Open-sourced |
Cross Encoder | Open-sourced |
Voyage | API |
Cohere | API |
Jina AI | API |
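Example 1: Use the BGE rerank function to rerank documents according to a query
In this example, we demonstrate how to rerank search results with the BGE reranker based on a specific query.
To use a reranker with the PyMilvus model library, first install the PyMilvus model library along with the model subpackage, which contains all the necessary reranking utilities: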
pip install pymilvus[model]
# or pip install "pymilvus[model]" for zsh.
To use the BGE reranker, first import the BGERerankFunction class:
from pymilvus.model.reranker import BGERerankFunction
Then, create a BGERerankFunction instance for reranking:
bge_rf = BGERerankFunction(
    model_name="BAAI/bge-reranker-v2-m3",  # Specify the model name. Defaults to `BAAI/bge-reranker-v2-m3`.
    device="cpu"  # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)
To rerank documents according to a query, use the following code:
query = "What event in 1956 marked the official birth of artificial intelligence as a discipline?"

documents = [
    "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
    "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
    "In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
    "The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."
]

bge_rf(query, documents)
The expected output is similar to the following:
[RerankResult(text="The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.", score=0.9911615761470803, index=1),
 RerankResult(text="In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.", score=0.0326971950177779, index=0),
 RerankResult(text='The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.', score=0.006514905766152258, index=3),
 RerankResult(text='In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.', score=0.0042116724917325935, index=2)]
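In the output above, each result carries an `index` that matches the position of the document in the input list (the Dartmouth passage was second in the input and has `index=1`). That makes it possible to map a reranked result back to whatever metadata you keep alongside the documents. The sketch below assumes nothing beyond the three fields visible in the output; `StandInResult`, `attach_ids`, the ID strings, and the `min_score` threshold are hypothetical and exist only for illustration, and the passage texts are shortened.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class StandInResult:
    """Hypothetical stand-in exposing the fields shown above: text, score, index."""
    text: str
    score: float
    index: int


def attach_ids(results: Sequence[StandInResult],
               doc_ids: Sequence[str],
               min_score: float) -> List[Tuple[str, float]]:
    """Map each result back to its source ID via `index`, dropping low scores."""
    return [(doc_ids[r.index], r.score) for r in results if r.score >= min_score]


# Made-up IDs, listed in the same order as the input documents.
doc_ids = ["turing-1950", "dartmouth-1956", "chess-1951", "logic-theorist-1955"]

# Scores and indices copied from the output above; texts shortened.
results = [
    StandInResult("Dartmouth Conference passage", 0.9911615761470803, 1),
    StandInResult("Turing Test passage", 0.0326971950177779, 0),
    StandInResult("Logic Theorist passage", 0.006514905766152258, 3),
    StandInResult("Chess program passage", 0.0042116724917325935, 2),
]

kept = attach_ids(results, doc_ids, min_score=0.01)
print(kept)
```

The threshold value is arbitrary here; a suitable cutoff depends on the model and the data.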
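Example 2: Use a reranker to enhance the relevance of search results
In this guide, we explore how to use the search() method in PyMilvus for similarity search and how to enhance the relevance of the search results with a reranker. Our demonstration uses the following dataset: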
entities = [
    {'doc_id': 0, 'doc_vector': [-0.0372721,0.0101959,...,-0.114994], 'doc_text': "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence."},
    {'doc_id': 1, 'doc_vector': [-0.00308882,-0.0219905,...,-0.00795811], 'doc_text': "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals."},
    {'doc_id': 2, 'doc_vector': [0.00945078,0.00397605,...,-0.0286199], 'doc_text': 'In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.'},
    {'doc_id': 3, 'doc_vector': [-0.0391119,-0.00880096,...,-0.0109257], 'doc_text': 'The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.'}
]
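Dataset components:
doc_id: Unique identifier for each document.
doc_vector: Vector embedding representing the document. For guidance on generating embeddings, refer to Embeddings.
doc_text: Text content of the document.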
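Preparations
Before starting a similarity search, you need to establish a connection to Milvus, create a collection, and prepare and insert data into that collection. The following code snippet demonstrates these preparation steps.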
from pymilvus import MilvusClient, DataType

client = MilvusClient(
    uri="http://10.102.6.214:19530"  # replace with your own Milvus server address
)

client.drop_collection('test_collection')

# define schema
schema = client.create_schema(auto_id=False, enable_dynamic_field=True)
schema.add_field(field_name="doc_id", datatype=DataType.INT64, is_primary=True, description="document id")
schema.add_field(field_name="doc_vector", datatype=DataType.FLOAT_VECTOR, dim=384, description="document vector")
schema.add_field(field_name="doc_text", datatype=DataType.VARCHAR, max_length=65535, description="document text")

# define index params
index_params = client.prepare_index_params()
index_params.add_index(field_name="doc_vector", index_type="IVF_FLAT", metric_type="IP", params={"nlist": 128})

# create collection
client.create_collection(collection_name="test_collection", schema=schema, index_params=index_params)

# insert data into collection
client.insert(collection_name="test_collection", data=entities)

# Output:
# {'insert_count': 4, 'ids': [0, 1, 2, 3]}
Conduct a similarity search
After inserting the data, run a similarity search with the search method.
# search results based on our query
res = client.search(
    collection_name="test_collection",
    data=[[-0.045217834, 0.035171617, ..., -0.025117004]],  # replace with your query vector
    limit=3,
    output_fields=["doc_id", "doc_text"]
)

for i in res[0]:
    print(f'distance: {i["distance"]}')
    print(f'doc_text: {i["entity"]["doc_text"]}')
The expected output is similar to the following:
distance: 0.7235960960388184
doc_text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.
distance: 0.6269873976707458
doc_text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.
distance: 0.5340118408203125
doc_text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.
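Each hit printed above exposes a `distance` and an `entity` holding the requested output fields, so the candidate texts for a later reranking step can be collected directly from the hits instead of being typed out by hand. The sketch below uses hand-written stand-in hits that mirror only the keys accessed in the loop above; `candidates_from_hits` is a hypothetical helper, and the passage texts are shortened.

```python
from typing import Any, Dict, List, Tuple


def candidates_from_hits(hits: List[Dict[str, Any]],
                         text_field: str = "doc_text",
                         id_field: str = "doc_id") -> Tuple[List[int], List[str]]:
    """Collect IDs and texts from search hits, preserving the search order."""
    doc_ids = [hit["entity"][id_field] for hit in hits]
    documents = [hit["entity"][text_field] for hit in hits]
    return doc_ids, documents


# Stand-in for res[0]; distances copied from the output above, texts shortened.
hits = [
    {"distance": 0.7235960960388184,
     "entity": {"doc_id": 1, "doc_text": "Dartmouth Conference passage"}},
    {"distance": 0.6269873976707458,
     "entity": {"doc_id": 0, "doc_text": "Turing Test passage"}},
    {"distance": 0.5340118408203125,
     "entity": {"doc_id": 3, "doc_text": "Logic Theorist passage"}},
]

doc_ids, documents = candidates_from_hits(hits)
print(doc_ids)
print(documents)
```

Keeping `doc_ids` in the same order as `documents` means a position reported by a reranker can be translated back to the stored document ID.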
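Use a reranker to enhance the search results
Then, improve the relevance of the search results with a reranking step. In this example, we use the CrossEncoderRerankFunction built into PyMilvus to rerank the results for better accuracy.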
# use reranker to rerank search results
from pymilvus.model.reranker import CrossEncoderRerankFunction

ce_rf = CrossEncoderRerankFunction(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",  # Specify the model name.
    device="cpu"  # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)

reranked_results = ce_rf(
    query='What event in 1956 marked the official birth of artificial intelligence as a discipline?',
    documents=[
        "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
        "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
        "In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
        "The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."
    ],
    top_k=3
)

# print the reranked results
for result in reranked_results:
    print(f'score: {result.score}')
    print(f'doc_text: {result.text}')
The expected output is similar to the following:
score: 6.250532627105713
doc_text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.
score: -2.9546022415161133
doc_text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.
score: -4.771512031555176
doc_text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.
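The scores printed above are not confined to a fixed range (one is above 6 and two are negative), unlike the 0-to-1 scores in Example 1. If a bounded scale is more convenient for display or thresholding, one common option is to pass the raw scores through a logistic sigmoid. This is an illustrative post-processing step and not part of the example above; because the sigmoid is monotonic, it does not change the ranking.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping any real score into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Raw scores copied from the output above.
raw_scores: List[float] = [6.250532627105713, -2.9546022415161133, -4.771512031555176]

bounded = [sigmoid(s) for s in raw_scores]
for raw, b in zip(raw_scores, bounded):
    print(f"raw: {raw:.4f}  bounded: {b:.4f}")
```

Values transformed this way are still model-specific, so they should not be compared directly with scores produced by a different reranker.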