
RAG First-Stage Retrieval in Practice: Several BM25 Implementations

news 2025/8/23 13:14:12

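Besides embedding-based approaches such as bge + milvus, BM25 remains an effective tool for first-stage retrieval (initial candidate screening) in RAG: it is cheap to run and fast.

This post tries out several typical and popular BM25 implementations, mainly rank_bm25 and bm25s. The test code is adapted from material found online.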

1 rank_bm25

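Rank-BM25 is an open-source Python text retrieval library that implements several BM25 variants, including Okapi BM25, BM25L, and BM25+.

Rank-BM25 supports document indexing, scoring, and ranking; text preprocessing is left to the user.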
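
The three variants differ mainly in how they saturate term frequency. The sketch below compares the per-term weight of each variant with the IDF factor left out. The formulas follow the usual published definitions as I understand them; the function names and the default values of k1, b, and delta are assumptions of this sketch, not rank_bm25's API.

```python
def okapi_tf(tf, dl, avgdl, k1=1.5, b=0.75):
    """Okapi BM25 term-frequency component (IDF factor omitted)."""
    if tf == 0:
        return 0.0
    norm = 1 - b + b * dl / avgdl  # document-length normalization
    return tf * (k1 + 1) / (tf + k1 * norm)


def bm25_plus_tf(tf, dl, avgdl, k1=1.5, b=0.75, delta=1.0):
    """BM25+: adds a constant lower bound `delta` for every matching term."""
    if tf == 0:
        return 0.0
    return okapi_tf(tf, dl, avgdl, k1, b) + delta


def bm25l_tf(tf, dl, avgdl, k1=1.5, b=0.75, delta=0.5):
    """BM25L: shifts the length-normalized tf by `delta` before saturation."""
    if tf == 0:
        return 0.0
    ctd = tf / (1 - b + b * dl / avgdl)
    return (k1 + 1) * (ctd + delta) / (k1 + ctd + delta)


if __name__ == "__main__":
    # One occurrence of a term in a short (10 tokens) and a long (1000 tokens) document
    for dl in (10, 1000):
        print(dl, okapi_tf(1, dl, 100), bm25_plus_tf(1, dl, 100), bm25l_tf(1, dl, 100))
```

In the long document the Okapi weight shrinks toward zero, while BM25+ and BM25L keep a floor for matching terms, which is the motivation usually given for those two variants.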

1) Installation

pip install rank_bm25 -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple

2) Test

In a Chinese-language setting, rank_bm25 needs a Chinese word segmentation tool such as jieba, plus a stop-word list.
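
Besides a literal stop-word list, tokens can also be dropped by part-of-speech flag, which is what the `stop_flag` list in the test code is for. Below is a minimal sketch of such a filter, written against plain `(word, flag)` pairs so that it runs without jieba. The helper name `filter_tokens` is hypothetical, and the flags `x` (punctuation) and `uj` (the particle 的) are assumed from jieba's tag set; they should be checked against real `pseg.cut` output.

```python
def filter_tokens(pairs, stop_flags=("x", "uj"), stopwords=("是",)):
    """Keep words whose POS flag and surface form are both allowed."""
    return [
        word
        for word, flag in pairs
        if flag not in stop_flags and word not in stopwords and word.strip()
    ]


# Hand-written (word, flag) pairs standing in for jieba.posseg output (assumed tags)
pairs = [("BM25", "eng"), ("是", "v"), ("经典", "n"), ("的", "uj"), ("算法", "n"), ("?", "x")]
print(filter_tokens(pairs))  # ['BM25', '经典', '算法']
```

With jieba installed, `filter_tokens(pseg.cut(doc))` should behave the same way, since the items yielded by `pseg.cut` unpack as `word, flag`.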

from rank_bm25 import BM25Okapi
import jieba.posseg as pseg

# Create corpus of kg documents
corpus = [
    "BM25（Best Matching 25）是一种经典的信息检索算法",
    "两行代码实现的搜索引擎",
    "How is the weather today?",
]

# Tokenize each document
stop_flag = []
stopwords = ["是", "的", "?"]


def get_doc_tokens(doc):
    words = pseg.cut(doc)
    doc_tokens = []
    for word, flag in words:
        if flag not in stop_flag and word not in stopwords:
            doc_tokens.append(word)
    return doc_tokens


tokenized_corpus = []
for doc in corpus:
    doc_tokens = get_doc_tokens(doc)
    tokenized_corpus.append(doc_tokens)
print(tokenized_corpus)

bm25 = BM25Okapi(tokenized_corpus)

# Query the BM25 index
query = "BM25"
tokenized_query = get_doc_tokens(query)
doc_scores = bm25.get_scores(tokenized_query)
print(doc_scores)
doc = bm25.get_top_n(tokenized_query, corpus, n=3)
print(doc)

Sample output:

[0.42639868 0. 0. ]
['BM25（Best Matching 25）是一种经典的信息检索算法', 'How is the weather today?', '两行代码实现的搜索引擎']
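
Note that `get_top_n` also returns the two zero-score documents, because it ranks without applying any threshold. For first-stage retrieval in RAG it is usually preferable to drop candidates that share no terms with the query. A minimal sketch follows; `top_n_with_scores` is a hypothetical helper, not part of rank_bm25, and the `min_score` cutoff is an assumption that may need tuning.

```python
def top_n_with_scores(scores, docs, n=3, min_score=0.0):
    """Return up to n (doc, score) pairs with score > min_score, best first."""
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked if score > min_score][:n]


docs = ["doc A", "doc B", "doc C"]
scores = [0.42639868, 0.0, 0.0]  # same score pattern as the sample output
print(top_n_with_scores(scores, docs))  # [('doc A', 0.42639868)]
```

The helper accepts any iterable of scores, so the array returned by `bm25.get_scores` can be passed in directly together with `corpus`.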

2 bm25s

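bm25s is written in Python and depends only on numpy and scipy.

Compared with rank-bm25, bm25s computes BM25 scores eagerly at indexing time and stores them in a sparse matrix, which yields a speedup of up to 500x.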
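
The idea of eager scoring can be illustrated with a small sketch: the contribution of every (token, document) pair is computed once at indexing time, so a query only has to look up and sum stored values. This is a simplified illustration that uses nested dicts in place of a scipy sparse matrix; the class name `EagerBM25`, the Lucene-style IDF, and the parameter defaults are assumptions of this sketch, not bm25s internals.

```python
import math
from collections import Counter, defaultdict


class EagerBM25:
    """Toy index that stores per-(token, doc) BM25 scores at indexing time."""

    def __init__(self, tokenized_corpus, k1=1.5, b=0.75):
        n_docs = len(tokenized_corpus)
        avgdl = sum(len(doc) for doc in tokenized_corpus) / n_docs
        # Document frequency: number of documents containing each token
        df = Counter(token for doc in tokenized_corpus for token in set(doc))
        # token -> {doc index -> precomputed score}; only non-zero entries are stored
        self.scores = defaultdict(dict)
        self.n_docs = n_docs
        for i, doc in enumerate(tokenized_corpus):
            norm = 1 - b + b * len(doc) / avgdl
            for token, tf in Counter(doc).items():
                idf = math.log(1 + (n_docs - df[token] + 0.5) / (df[token] + 0.5))
                self.scores[token][i] = idf * tf * (k1 + 1) / (tf + k1 * norm)

    def get_scores(self, query_tokens):
        """Sum the stored scores of the query tokens; no BM25 math at query time."""
        totals = [0.0] * self.n_docs
        for token in query_tokens:
            for i, score in self.scores.get(token, {}).items():
                totals[i] += score
        return totals


index = EagerBM25([["明天", "下雨"], ["今天", "晴朗"], ["明天", "上学"]])
print(index.get_scores(["明天", "天气"]))
```

The trade-off is more work and memory at indexing time in exchange for query-time cost that depends only on the number of stored entries for the query tokens.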

1) Installation

pip install bm25s -i https://pypi.tuna.tsinghua.edu.cn/simple

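2) Test

bm25s natively supports only English and other alphabetic languages, so here its tokenize function is replaced with a jieba-based version to support Chinese.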

from bm25s.tokenization import Tokenized
import jieba
from typing import List, Union
from tqdm.auto import tqdm


def tokenize(
    texts,
    return_ids: bool = True,
    show_progress: bool = True,
    leave: bool = False,
) -> Union[List[List[str]], Tokenized]:
    if isinstance(texts, str):
        texts = [texts]

    corpus_ids = []
    token_to_index = {}

    for text in tqdm(texts, desc="Split strings", leave=leave, disable=not show_progress):
        splitted = jieba.lcut(text)
        doc_ids = []
        for token in splitted:
            if token not in token_to_index:
                token_to_index[token] = len(token_to_index)
            token_id = token_to_index[token]
            doc_ids.append(token_id)
        corpus_ids.append(doc_ids)

    # Create a list of unique tokens that we will use to create the vocabulary
    unique_tokens = list(token_to_index.keys())
    vocab_dict = token_to_index

    # Return the tokenized IDs and the vocab dictionary or the tokenized strings
    if return_ids:
        return Tokenized(ids=corpus_ids, vocab=vocab_dict)
    else:
        # We need a reverse dictionary to convert the token IDs back to tokens
        reverse_dict = unique_tokens
        # We convert the token IDs back to tokens in-place
        for i, token_ids in enumerate(
            tqdm(
                corpus_ids,
                desc="Reconstructing token strings",
                leave=leave,
                disable=not show_progress,
            )
        ):
            corpus_ids[i] = [reverse_dict[token_id] for token_id in token_ids]
        return corpus_ids

bm25s test code:

import bm25s

bm25s.tokenize = tokenize

# Create your corpus here
corpus = [
    "今天天气晴朗，我的心情美美哒",
    "小明和小红一起上学",
    "我们来试一试吧",
    "我们一起学猫叫",
    "我和Faker五五开",
    "明天预计下雨，不能出去玩了",
]
# corpus = load_corpus("data")

# Tokenize the corpus and index it
corpus_tokens = bm25s.tokenize(corpus)
print(corpus_tokens)

retriever = bm25s.BM25(corpus=corpus)
retriever.index(corpus_tokens)

query = "明天天气怎么样"
query_tokens = bm25s.tokenize(query)
docs, scores = retriever.retrieve(query_tokens, k=3)
print(f"Best result (score: {scores[0, 0]:.2f}): {docs[0, 0]}")
print(docs, scores)

# Happy with your index? Save it for later...
retriever.save("bm25s_index_animals")

# ...and load it when needed
ret_loaded = bm25s.BM25.load("bm25s_index_animals", load_corpus=True)

The result is as follows:

Best result (score: 0.53): 明天预计下雨，不能出去玩了
[['明天预计下雨，不能出去玩了' '我和Faker五五开' '我们一起学猫叫']] [[0.5313357 0. 0. ]]

The following error may occur during testing; see the appendix for the fix.

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

3 Other BM25 implementations

The source code of the BM25 implementation in the gensim library:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

import math
from six import iteritems
from six.moves import xrange

# BM25 parameters.
PARAM_K1 = 1.5
PARAM_B = 0.75
EPSILON = 0.25


class BM25(object):

    def __init__(self, corpus):
        self.corpus_size = len(corpus)
        self.avgdl = sum(map(lambda x: float(len(x)), corpus)) / self.corpus_size
        self.corpus = corpus
        self.f = []
        self.df = {}
        self.idf = {}
        self.initialize()

    def initialize(self):
        for document in self.corpus:
            frequencies = {}
            for word in document:
                if word not in frequencies:
                    frequencies[word] = 0
                frequencies[word] += 1
            self.f.append(frequencies)

            for word, freq in iteritems(frequencies):
                if word not in self.df:
                    self.df[word] = 0
                self.df[word] += 1

        for word, freq in iteritems(self.df):
            self.idf[word] = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)

    def get_score(self, document, index, average_idf):
        score = 0
        for word in document:
            if word not in self.f[index]:
                continue
            idf = self.idf[word] if self.idf[word] >= 0 else EPSILON * average_idf
            # NOTE: textbook BM25 normalizes by the length of the current document,
            # i.e. len(self.corpus[index]) / self.avgdl; this snapshot uses
            # self.corpus_size / self.avgdl, reproduced here as quoted.
            score += (idf * self.f[index][word] * (PARAM_K1 + 1)
                      / (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * self.corpus_size / self.avgdl)))
        return score

    def get_scores(self, document, average_idf):
        scores = []
        for index in xrange(self.corpus_size):
            score = self.get_score(document, index, average_idf)
            scores.append(score)
        return scores


def get_bm25_weights(corpus):
    bm25 = BM25(corpus)
    average_idf = sum(map(lambda k: float(bm25.idf[k]), bm25.idf.keys())) / len(bm25.idf.keys())

    weights = []
    for doc in corpus:
        scores = bm25.get_scores(doc, average_idf)
        weights.append(scores)

    return weights

Appendix

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

This is caused by a version mismatch between numpy and scipy. The solution is as follows:
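
Whatever fix is applied, a useful first step is to confirm which versions are actually installed in the active environment. Below is a minimal sketch using only the standard library (Python 3.8+); `installed_versions` is a hypothetical helper name, and the default package list is an assumption about what is relevant here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("numpy", "scipy", "bm25s")):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

Comparing these versions before and after reinstalling packages makes it easier to see whether numpy and scipy ended up as a compatible pair.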