
RAG First-Stage Retrieval in Practice: Several BM25 Implementations

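Besides embedding-based pipelines such as bge + milvus, BM25 remains a strong tool for first-stage (candidate) retrieval in RAG: it is cheap to run and fast.

This post tries several typical, widely used BM25 implementations, namely rank_bm25 and bm25s. The test code is adapted from material found online.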

1 rank_bm25

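Rank-BM25 is an open-source Python text retrieval library that implements several BM25 variants, including Okapi BM25, BM25L, and BM25+.

It supports indexing, scoring, and ranking documents; text preprocessing (tokenization, stop-word removal) is left to the user.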
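To make the index, score, and rank steps concrete, here is a minimal standard-library sketch of textbook Okapi BM25 scoring. It is not rank_bm25's code: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant are assumptions made for illustration, and libraries differ on exactly these details.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook Okapi BM25 formula."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for tok in query_tokens:
            if tok not in tf:
                continue
            # Non-negative IDF variant (an assumption; implementations differ here).
            idf = math.log((n_docs - df[tok] + 0.5) / (df[tok] + 0.5) + 1.0)
            # Length normalization uses this document's length relative to the average.
            norm = tf[tok] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[tok] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy pre-tokenized corpus (hypothetical data).
docs = [
    ["bm25", "ranking", "function"],
    ["weather", "today"],
    ["search", "engine", "ranking"],
]
scores = bm25_scores(["bm25"], docs)
# Rank document indices from best to worst score.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(scores, ranked)
```

Only the first document contains the query token, so it is the only one with a non-zero score and it ranks first.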

1) Installation

pip install rank_bm25 -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple

2) Test

For Chinese text, rank_bm25 needs a Chinese word segmenter such as jieba, plus a stop-word list.

from rank_bm25 import BM25Okapi
import jieba.posseg as pseg

# Create corpus of kg documents
corpus = [
    "BM25(Best Matching 25)是一种经典的信息检索算法",
    "两行代码实现的搜索引擎",
    "How is the weather today?",
]

# Tokenize each document
stop_flag = []
stopwords = ["是", "的", "?"]

def get_doc_tokens(doc):
    words = pseg.cut(doc)
    doc_tokens = []
    for word, flag in words:
        if flag not in stop_flag and word not in stopwords:
            doc_tokens.append(word)
    return doc_tokens

tokenized_corpus = []
for doc in corpus:
    doc_tokens = get_doc_tokens(doc)
    tokenized_corpus.append(doc_tokens)
print(tokenized_corpus)

bm25 = BM25Okapi(tokenized_corpus)

# Query the BM25 index
query = "BM25"
tokenized_query = get_doc_tokens(query)
doc_scores = bm25.get_scores(tokenized_query)
print(doc_scores)
doc = bm25.get_top_n(tokenized_query, corpus, n=3)
print(doc)

Sample output:

[0.42639868 0.         0.        ]
['BM25(Best Matching 25)是一种经典的信息检索算法', 'How is the weather today?', '两行代码实现的搜索引擎']
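Note that the top-3 list above still contains two documents whose score is 0, because a top-n call returns n items regardless of relevance. For first-stage retrieval in RAG it may be worth dropping such hits before passing candidates downstream. The helper below is a hypothetical sketch (the name `top_n_nonzero` and the `min_score` threshold are assumptions, not part of rank_bm25); it reuses the scores from the sample output.

```python
def top_n_nonzero(scores, docs, n=3, min_score=0.0):
    """Return up to n (doc, score) pairs, best first, dropping scores <= min_score."""
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:n] if score > min_score]


# Scores and documents copied from the sample output above.
sample_scores = [0.42639868, 0.0, 0.0]
sample_docs = [
    "BM25(Best Matching 25)是一种经典的信息检索算法",
    "两行代码实现的搜索引擎",
    "How is the weather today?",
]
hits = top_n_nonzero(sample_scores, sample_docs)
print(hits)
```

With these inputs only the first document survives the filter. The same function accepts the array returned by `get_scores`, since it only iterates over the scores.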

2 bm25s

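bm25s is written in Python and depends only on numpy and scipy.

Compared with rank-bm25, bm25s computes BM25 scores eagerly at indexing time and stores them in a sparse matrix, which yields speedups of up to 500x.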
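The idea of eager scoring can be sketched with the standard library alone. This is an illustration under stated assumptions, not bm25s's implementation: a dict of dicts stands in for the sparse matrix, and the function names, the defaults `k1=1.5` and `b=0.75`, and the IDF variant are all hypothetical choices.

```python
import math
from collections import Counter, defaultdict


def build_eager_index(tokenized_docs, k1=1.5, b=0.75):
    """Precompute the BM25 contribution of every (token, doc) pair at index time."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / n_docs
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    index = defaultdict(dict)  # token -> {doc_id: precomputed score}
    for doc_id, doc in enumerate(tokenized_docs):
        for tok, tf in Counter(doc).items():
            idf = math.log((n_docs - df[tok] + 0.5) / (df[tok] + 0.5) + 1.0)
            norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
            index[tok][doc_id] = idf * tf * (k1 + 1) / norm
    return index, n_docs


def query_eager(index, n_docs, query_tokens):
    """At query time, only sum the stored values; no IDF or length math is repeated."""
    scores = [0.0] * n_docs
    for tok in query_tokens:
        for doc_id, value in index.get(tok, {}).items():
            scores[doc_id] += value
    return scores


# Toy pre-tokenized corpus (hypothetical data).
docs = [
    ["bm25", "ranking", "function"],
    ["weather", "today"],
    ["search", "engine", "ranking"],
]
index, n_docs = build_eager_index(docs)
eager = query_eager(index, n_docs, ["ranking", "bm25"])
print(eager)
```

All per-term arithmetic happens once in `build_eager_index`, so each query costs only a few lookups and additions. That is the trade-off the sentence above describes: more work and storage at indexing time in exchange for cheaper queries.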

1) Installation

pip install bm25s -i https://pypi.tuna.tsinghua.edu.cn/simple

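2) Test

bm25s natively supports only English and other alphabetic languages, so the tokenize function is reimplemented here with jieba to handle Chinese.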

from bm25s.tokenization import Tokenized
import jieba
from typing import List, Union
from tqdm.auto import tqdm

def tokenize(
    texts,
    return_ids: bool = True,
    show_progress: bool = True,
    leave: bool = False,
) -> Union[List[List[str]], Tokenized]:
    if isinstance(texts, str):
        texts = [texts]

    corpus_ids = []
    token_to_index = {}

    for text in tqdm(texts, desc="Split strings", leave=leave, disable=not show_progress):
        splitted = jieba.lcut(text)
        doc_ids = []
        for token in splitted:
            if token not in token_to_index:
                token_to_index[token] = len(token_to_index)
            token_id = token_to_index[token]
            doc_ids.append(token_id)
        corpus_ids.append(doc_ids)

    # Create a list of unique tokens that we will use to create the vocabulary
    unique_tokens = list(token_to_index.keys())
    vocab_dict = token_to_index

    # Return the tokenized IDs and the vocab dictionary or the tokenized strings
    if return_ids:
        return Tokenized(ids=corpus_ids, vocab=vocab_dict)
    else:
        # We need a reverse dictionary to convert the token IDs back to tokens
        reverse_dict = unique_tokens
        # We convert the token IDs back to tokens in-place
        for i, token_ids in enumerate(
            tqdm(
                corpus_ids,
                desc="Reconstructing token strings",
                leave=leave,
                disable=not show_progress,
            )
        ):
            corpus_ids[i] = [reverse_dict[token_id] for token_id in token_ids]
        return corpus_ids

bm25s test code example

import bm25s

bm25s.tokenize = tokenize

# Create your corpus here
corpus = [
    "今天天气晴朗,我的心情美美哒",
    "小明和小红一起上学",
    "我们来试一试吧",
    "我们一起学猫叫",
    "我和Faker五五开",
    "明天预计下雨,不能出去玩了",
]
# corpus = load_corpus("data")

# Tokenize the corpus and index it
corpus_tokens = bm25s.tokenize(corpus)
print(corpus_tokens)

retriever = bm25s.BM25(corpus=corpus)
retriever.index(corpus_tokens)

query = "明天天气怎么样"
query_tokens = bm25s.tokenize(query)
docs, scores = retriever.retrieve(query_tokens, k=3)
print(f"Best result (score: {scores[0, 0]:.2f}): {docs[0, 0]}")
print(docs, scores)

# Happy with your index? Save it for later...
retriever.save("bm25s_index_animals")

# ...and load it when needed
ret_loaded = bm25s.BM25.load("bm25s_index_animals", load_corpus=True)

The result is as follows:

Best result (score: 0.53): 明天预计下雨,不能出去玩了
[['明天预计下雨,不能出去玩了' '我和Faker五五开' '我们一起学猫叫']] [[0.5313357 0.        0.   ]]

You may hit the following error during testing; see the appendix for a fix.

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

3 Other BM25 implementations

Source code of the BM25 implementation in the gensim library:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

import math
from six import iteritems
from six.moves import xrange

# BM25 parameters.
PARAM_K1 = 1.5
PARAM_B = 0.75
EPSILON = 0.25


class BM25(object):

    def __init__(self, corpus):
        self.corpus_size = len(corpus)
        self.avgdl = sum(map(lambda x: float(len(x)), corpus)) / self.corpus_size
        self.corpus = corpus
        self.f = []
        self.df = {}
        self.idf = {}
        self.initialize()

    def initialize(self):
        for document in self.corpus:
            frequencies = {}
            for word in document:
                if word not in frequencies:
                    frequencies[word] = 0
                frequencies[word] += 1
            self.f.append(frequencies)

            for word, freq in iteritems(frequencies):
                if word not in self.df:
                    self.df[word] = 0
                self.df[word] += 1

        for word, freq in iteritems(self.df):
            self.idf[word] = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)

    def get_score(self, document, index, average_idf):
        score = 0
        for word in document:
            if word not in self.f[index]:
                continue
            idf = self.idf[word] if self.idf[word] >= 0 else EPSILON * average_idf
            # Note: textbook BM25 normalizes by the scored document's length over avgdl;
            # this listing uses corpus_size in that position and is kept as quoted.
            score += (idf * self.f[index][word] * (PARAM_K1 + 1)
                      / (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * self.corpus_size / self.avgdl)))
        return score

    def get_scores(self, document, average_idf):
        scores = []
        for index in xrange(self.corpus_size):
            score = self.get_score(document, index, average_idf)
            scores.append(score)
        return scores


def get_bm25_weights(corpus):
    bm25 = BM25(corpus)
    average_idf = sum(map(lambda k: float(bm25.idf[k]), bm25.idf.keys())) / len(bm25.idf.keys())

    weights = []
    for doc in corpus:
        scores = bm25.get_scores(doc, average_idf)
        weights.append(scores)

    return weights

Appendix

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

This error is caused by a version mismatch between numpy and scipy. The fix is to reinstall them:

pip uninstall numpy scipy pandas

pip install numpy scipy pandas
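Before and after reinstalling, it can help to record which versions are actually installed. The snippet below is a small diagnostic sketch using only the standard library (`importlib.metadata`, available since Python 3.8); the function name `installed_versions` is hypothetical, and it only reports versions without judging whether they are compatible.

```python
from importlib import metadata


def installed_versions(packages=("numpy", "scipy", "pandas")):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions()
print(report)
```

Comparing the report from before and after the reinstall shows whether the packages really changed.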

reference

---

rank_bm25

https://github.com/dorianbrown/rank_bm25

BM25 search example

https://github.com/ev2900/BM25_Search_Example

[Paper notes] BM25S: a Python implementation that outperforms Rank-BM25

https://blog.csdn.net/yjw123456/article/details/141023327

bm25s

https://github.com/xhluca/bm25s

BM25S: Orders of magnitude faster lexical search via eager sparse scoring

https://arxiv.org/pdf/2407.03618

Fixing ValueError: numpy.dtype size changed, may indicate binary incompatibility.

https://blog.csdn.net/2301_79648997/article/details/144145206

Open-source embedding LLM - BGE (BAAI General Embedding)

https://blog.csdn.net/liliang199/article/details/149773775

Open-source embedding LLM - Qwen3-Embedding

https://blog.csdn.net/liliang199/article/details/149743526
