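Hands-on validation of RAG first-stage retrieval: several BM25 implementations
Besides embedding-based approaches such as bge + milvus, BM25 remains an effective tool for first-stage (coarse) retrieval in RAG: it is cheap to run and fast.
This post tries two typical, widely used BM25 implementations, rank_bm25 and bm25s. The test code is adapted from material found online.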
1 rank_bm25
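Rank-BM25 is an open-source Python text retrieval library that implements several BM25 variants, including Okapi BM25, BM25L, and BM25+.
Rank-BM25 supports indexing, scoring, and ranking documents; text preprocessing is left to the user.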
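For reference, the textbook Okapi BM25 score of a document D for a query Q can be written as below. This is the standard formulation, not a transcription of the rank_bm25 source; implementations differ in details such as IDF smoothing and the handling of negative IDF values. The symbols are assumptions introduced here: f(q, D) is the frequency of term q in D, |D| is the length of D in tokens, avgdl is the average document length, N is the number of documents, n(q) is the number of documents containing q, and k_1 and b are free parameters.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q)\,
  \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q) = \ln\frac{N - n(q) + 0.5}{n(q) + 0.5}
```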
1) Installation
pip install rank_bm25 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple
2) Test
For Chinese text, rank_bm25 needs a Chinese word segmenter such as jieba, plus a stop-word list.
from rank_bm25 import BM25Okapi
import jieba.posseg as pseg

# Create corpus of kg documents
corpus = [
    "BM25(Best Matching 25)是一种经典的信息检索算法",
    "两行代码实现的搜索引擎",
    "How is the weather today?",
]

# Tokenize each document
stop_flag = []
stopwords = ["是", "的", "?"]

def get_doc_tokens(doc):
    words = pseg.cut(doc)
    doc_tokens = []
    for word, flag in words:
        if flag not in stop_flag and word not in stopwords:
            doc_tokens.append(word)
    return doc_tokens

tokenized_corpus = []
for doc in corpus:
    doc_tokens = get_doc_tokens(doc)
    tokenized_corpus.append(doc_tokens)
print(tokenized_corpus)

bm25 = BM25Okapi(tokenized_corpus)

# Query the BM25 index
query = "BM25"
tokenized_query = get_doc_tokens(query)
doc_scores = bm25.get_scores(tokenized_query)
print(doc_scores)
doc = bm25.get_top_n(tokenized_query, corpus, n=3)
print(doc)
Sample output:
[0.42639868 0. 0. ]
['BM25(Best Matching 25)是一种经典的信息检索算法', 'How is the weather today?', '两行代码实现的搜索引擎']
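Note that the sample output still lists documents whose score is 0, because only one document matches the query while n=3 results were requested. For first-stage RAG retrieval it may be preferable to drop such documents. A minimal sketch in plain Python, where `top_n_nonzero` is a hypothetical helper (not part of rank_bm25) and the scores are copied from the sample output above:

```python
def top_n_nonzero(scores, documents, n=3):
    """Return up to n (document, score) pairs with score > 0, best first."""
    # Pair each document with its score and sort by score, highest first.
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    # Keep only the top n, then discard anything that did not match at all.
    return [(doc, score) for doc, score in ranked[:n] if score > 0]


corpus = [
    "BM25(Best Matching 25)是一种经典的信息检索算法",
    "两行代码实现的搜索引擎",
    "How is the weather today?",
]
doc_scores = [0.42639868, 0.0, 0.0]  # values taken from the sample output

hits = top_n_nonzero(doc_scores, corpus)
print(hits)
```

Whether to use a strict `> 0` cut-off or a small positive threshold is a design choice that depends on the corpus.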
2 bm25s
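bm25s is written in Python and depends only on numpy and scipy.
Compared with rank-bm25, bm25s eagerly computes BM25 scores at indexing time and stores them in sparse matrices, which its authors report yields speedups of up to 500x.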
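The idea of eager scoring can be illustrated in plain Python. The sketch below is not the bm25s implementation: it uses a dict of dicts in place of scipy sparse matrices, assumes the common defaults k1=1.5 and b=0.75, and picks a non-negative IDF variant for simplicity. The names `build_eager_index` and `score_query` are hypothetical.

```python
import math
from collections import Counter, defaultdict

K1, B = 1.5, 0.75  # common BM25 defaults, assumed for this sketch


def build_eager_index(tokenized_corpus):
    """Precompute the BM25 contribution of every (token, doc) pair at index time."""
    n_docs = len(tokenized_corpus)
    avgdl = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(tok for doc in tokenized_corpus for tok in set(doc))
    index = defaultdict(dict)  # token -> {doc_id: precomputed score}
    for doc_id, doc in enumerate(tokenized_corpus):
        norm = K1 * (1 - B + B * len(doc) / avgdl)
        for tok, tf in Counter(doc).items():
            # Non-negative IDF variant, chosen here to avoid negative scores.
            idf = math.log(1 + (n_docs - doc_freq[tok] + 0.5) / (doc_freq[tok] + 0.5))
            index[tok][doc_id] = idf * tf * (K1 + 1) / (tf + norm)
    return index, n_docs


def score_query(index, n_docs, query_tokens):
    """At query time only look up and sum the precomputed values."""
    scores = [0.0] * n_docs
    for tok in query_tokens:
        for doc_id, value in index.get(tok, {}).items():
            scores[doc_id] += value
    return scores


docs = [
    ["bm25", "is", "a", "ranking", "function"],
    ["sparse", "matrix", "storage"],
    ["bm25", "scores", "stored", "in", "a", "sparse", "matrix"],
]
index, n_docs = build_eager_index(docs)
scores = score_query(index, n_docs, ["bm25", "sparse"])
print(scores)
```

All the floating-point work happens once in `build_eager_index`; a query then costs only lookups and additions over the documents that actually contain the query tokens.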
1) Installation
pip install bm25s -i https://pypi.tuna.tsinghua.edu.cn/simple
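2) Test
Out of the box, bm25s only supports English and other alphabetic languages. Here the tokenize function is replaced with a jieba-based version so that it can handle Chinese.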
from bm25s.tokenization import Tokenized
import jieba
from typing import List, Union
from tqdm.auto import tqdm

def tokenize(
    texts,
    return_ids: bool = True,
    show_progress: bool = True,
    leave: bool = False,
) -> Union[List[List[str]], Tokenized]:
    if isinstance(texts, str):
        texts = [texts]

    corpus_ids = []
    token_to_index = {}

    for text in tqdm(texts, desc="Split strings", leave=leave, disable=not show_progress):
        splitted = jieba.lcut(text)
        doc_ids = []
        for token in splitted:
            if token not in token_to_index:
                token_to_index[token] = len(token_to_index)
            token_id = token_to_index[token]
            doc_ids.append(token_id)
        corpus_ids.append(doc_ids)

    # Create a list of unique tokens that we will use to create the vocabulary
    unique_tokens = list(token_to_index.keys())
    vocab_dict = token_to_index

    # Return the tokenized IDs and the vocab dictionary or the tokenized strings
    if return_ids:
        return Tokenized(ids=corpus_ids, vocab=vocab_dict)
    else:
        # We need a reverse dictionary to convert the token IDs back to tokens
        reverse_dict = unique_tokens
        # We convert the token IDs back to tokens in-place
        for i, token_ids in enumerate(
            tqdm(
                corpus_ids,
                desc="Reconstructing token strings",
                leave=leave,
                disable=not show_progress,
            )
        ):
            corpus_ids[i] = [reverse_dict[token_id] for token_id in token_ids]
        return corpus_ids
Sample bm25s test code:
import bm25s

bm25s.tokenize = tokenize

# Create your corpus here
corpus = [
    "今天天气晴朗,我的心情美美哒",
    "小明和小红一起上学",
    "我们来试一试吧",
    "我们一起学猫叫",
    "我和Faker五五开",
    "明天预计下雨,不能出去玩了",
]
# corpus = load_corpus("data")

# Tokenize the corpus and index it
corpus_tokens = bm25s.tokenize(corpus)
print(corpus_tokens)

retriever = bm25s.BM25(corpus=corpus)
retriever.index(corpus_tokens)

query = "明天天气怎么样"
query_tokens = bm25s.tokenize(query)
docs, scores = retriever.retrieve(query_tokens, k=3)
print(f"Best result (score: {scores[0, 0]:.2f}): {docs[0, 0]}")
print(docs, scores)

# Happy with your index? Save it for later...
retriever.save("bm25s_index_animals")

# ...and load it when needed
ret_loaded = bm25s.BM25.load("bm25s_index_animals", load_corpus=True)
Results:
Best result (score: 0.53): 明天预计下雨,不能出去玩了
[['明天预计下雨,不能出去玩了' '我和Faker五五开' '我们一起学猫叫']] [[0.5313357 0. 0. ]]
You may hit the following error while testing; see the appendix for a fix.
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
3 Other BM25 implementations
Source code of the BM25 implementation from the gensim library:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

import math
from six import iteritems
from six.moves import xrange

# BM25 parameters.
PARAM_K1 = 1.5
PARAM_B = 0.75
EPSILON = 0.25

class BM25(object):

    def __init__(self, corpus):
        self.corpus_size = len(corpus)
        self.avgdl = sum(map(lambda x: float(len(x)), corpus)) / self.corpus_size
        self.corpus = corpus
        self.f = []
        self.df = {}
        self.idf = {}
        self.initialize()

    def initialize(self):
        for document in self.corpus:
            frequencies = {}
            for word in document:
                if word not in frequencies:
                    frequencies[word] = 0
                frequencies[word] += 1
            self.f.append(frequencies)

            for word, freq in iteritems(frequencies):
                if word not in self.df:
                    self.df[word] = 0
                self.df[word] += 1

        for word, freq in iteritems(self.df):
            self.idf[word] = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)

    def get_score(self, document, index, average_idf):
        score = 0
        for word in document:
            if word not in self.f[index]:
                continue
            idf = self.idf[word] if self.idf[word] >= 0 else EPSILON * average_idf
            score += (idf * self.f[index][word] * (PARAM_K1 + 1)
                      / (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * self.corpus_size / self.avgdl)))
        return score

    def get_scores(self, document, average_idf):
        scores = []
        for index in xrange(self.corpus_size):
            score = self.get_score(document, index, average_idf)
            scores.append(score)
        return scores


def get_bm25_weights(corpus):
    bm25 = BM25(corpus)
    average_idf = sum(map(lambda k: float(bm25.idf[k]), bm25.idf.keys())) / len(bm25.idf.keys())

    weights = []
    for doc in corpus:
        scores = bm25.get_scores(doc, average_idf)
        weights.append(scores)

    return weights
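One detail worth noting in the snippet above: the length-normalization term uses `self.corpus_size / self.avgdl`, whereas textbook BM25 normalizes by the length of the document being scored. The sketch below shows the textbook per-term score in Python 3 and how document length affects it. `bm25_term_score` is a hypothetical helper written for this post, not part of gensim, and the input numbers are made up for illustration.

```python
def bm25_term_score(tf, doc_len, avgdl, idf, k1=1.5, b=0.75):
    """BM25 contribution of one query term to one document (textbook form)."""
    # Documents longer than average are penalized, shorter ones are boosted.
    length_norm = 1 - b + b * doc_len / avgdl
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Same term frequency and IDF, three different document lengths.
short = bm25_term_score(tf=2, doc_len=50, avgdl=100, idf=1.0)
average = bm25_term_score(tf=2, doc_len=100, avgdl=100, idf=1.0)
long_ = bm25_term_score(tf=2, doc_len=200, avgdl=100, idf=1.0)
print(short, average, long_)
```

With b=0 the length term drops out and all three calls would return the same value.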
Appendix
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
This is caused by a version mismatch between numpy and scipy. The fix is as follows:
pip uninstall numpy scipy pandas
pip install numpy scipy pandas
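Before and after reinstalling, it can help to record which versions are actually installed. A minimal sketch using only the standard library (Python 3.8+), so it works even when importing numpy itself fails; `installed_versions` is a hypothetical helper name:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("numpy", "scipy", "pandas")):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions()
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```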
reference
---
rank_bm25
https://github.com/dorianbrown/rank_bm25
BM25 search example
https://github.com/ev2900/BM25_Search_Example
[Paper notes] BM25S: a Python implementation that surpasses RANK-BM25
https://blog.csdn.net/yjw123456/article/details/141023327
bm25s
https://github.com/xhluca/bm25s
BM25S: Orders of magnitude faster lexical search via eager sparse scoring
https://arxiv.org/pdf/2407.03618
Resolving ValueError: numpy.dtype size changed, may indicate binary incompatibility.
https://blog.csdn.net/2301_79648997/article/details/144145206
Open-source embedding LLM - BGE (BAAI General Embedding)
https://blog.csdn.net/liliang199/article/details/149773775
Open-source embedding LLM - Qwen3-Embedding
https://blog.csdn.net/liliang199/article/details/149743526