
MiniRAG Retrieval Flow: A Detailed Walkthrough

news 2025/11/17 5:01:34


Based on the code, MiniRAG's retrieval flow breaks down into the following key steps:

1. Query preprocessing and keyword extraction

kw_prompt_temp = PROMPTS["minirag_query2kwd"]
TYPE_POOL, TYPE_POOL_w_CASE = await knowledge_graph_inst.get_types()
kw_prompt = kw_prompt_temp.format(query=query, TYPE_POOL=TYPE_POOL)
result = await use_model_func(kw_prompt)

keywords_data = json_repair.loads(result)
type_keywords = keywords_data.get("answer_type_keywords", [])
entities_from_query = keywords_data.get("entities_from_query", [])[:5]

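The system uses an LLM to extract two kinds of key information:

Entity keywords found in the query (entities_from_query), capped at the first five
Likely answer types (answer_type_keywords)

2. Entity lookup and candidate path construction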
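
As a rough illustration of the parsing step, the sketch below pulls the two keyword lists out of a model reply. It uses the standard json module instead of json_repair, so it assumes the reply is already valid JSON; the helper name parse_keywords and the sample reply are made up for this example.

```python
import json


def parse_keywords(raw_reply, max_entities=5):
    """Extract answer types and query entities from an LLM reply (hypothetical helper)."""
    data = json.loads(raw_reply)  # assumption: the reply is well-formed JSON
    type_keywords = data.get("answer_type_keywords", [])
    # Mirror the [:5] slice in the snippet above: keep at most five entities.
    entities = data.get("entities_from_query", [])[:max_entities]
    return type_keywords, entities


sample_reply = json.dumps(
    {
        "answer_type_keywords": ["PERSON"],
        "entities_from_query": ["a", "b", "c", "d", "e", "f"],
    }
)
types, entities = parse_keywords(sample_reply)
print(types)     # ['PERSON']
print(entities)  # ['a', 'b', 'c', 'd', 'e']
```

Falling back to empty lists when a key is missing means a malformed but parseable reply degrades to "no keywords" instead of raising.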

for ent in ent_from_query:
    results_node = await entity_name_vdb.query(ent, top_k=query_param.top_k)
    nodes_from_query_list.append(results_node)
    ent_from_query_dict[ent] = [e["entity_name"] for e in results_node]

candidate_reasoning_path = {}
for results_node_list in nodes_from_query_list:
    candidate_reasoning_path_new = {
        key["entity_name"]: {"Score": key["distance"], "Path": []}
        for key in results_node_list
    }
    candidate_reasoning_path = {**candidate_reasoning_path, **candidate_reasoning_path_new}

for key in candidate_reasoning_path.keys():
    candidate_reasoning_path[key]["Path"] = await knowledge_graph_inst.get_neighbors_within_k_hops(key, 2)
    imp_ents.append(key)

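For each entity extracted from the query (ent_from_query, the entity list from step 1), the system:

Looks up similar entities in entity_name_vdb
Records each match's score (the distance field returned by the query)
Fetches the paths within 2 hops of each matched entity as its candidate paths

3. Path classification and filtering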
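
The lookup and path-building steps can be sketched on an in-memory toy graph. The real return shape of get_neighbors_within_k_hops is not shown here; this sketch assumes a list of node tuples, which is consistent with how later steps use each path as a dictionary key. The function paths_within_k_hops, the adjacency dict, and the hits list are all hypothetical stand-ins.

```python
def paths_within_k_hops(adjacency, source, k):
    """Enumerate simple paths of 1..k hops starting at source (hypothetical stand-in)."""
    found = []
    frontier = [(source,)]
    for _ in range(k):
        next_frontier = []
        for path in frontier:
            for neighbor in adjacency.get(path[-1], []):
                if neighbor not in path:  # do not revisit a node on the same path
                    next_frontier.append(path + (neighbor,))
        found.extend(next_frontier)
        frontier = next_frontier
    return found


# Toy undirected graph: A - B - C, plus an isolated node D.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}

# Toy vector-search hits, shaped like the results used in the snippet above.
hits = [{"entity_name": "A", "distance": 0.9}, {"entity_name": "D", "distance": 0.4}]

candidates = {h["entity_name"]: {"Score": h["distance"], "Path": []} for h in hits}
for name in candidates:
    candidates[name]["Path"] = paths_within_k_hops(adjacency, name, 2)

print(candidates["A"]["Path"])  # [('A', 'B'), ('A', 'B', 'C')]
print(candidates["D"]["Path"])  # []
```

The isolated node D ends up with an empty Path list, which is the "short path" case handled in the next step.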

short_path_entries = {name: entry for name, entry in candidate_reasoning_path.items() if len(entry["Path"]) < 1}
sorted_short_path_entries = sorted(short_path_entries.items(), key=lambda x: x[1]["Score"], reverse=True)
save_p = max(1, int(len(sorted_short_path_entries) * 0.2))
top_short_path_entries = sorted_short_path_entries[:save_p]
top_short_path_dict = {name: entry for name, entry in top_short_path_entries}
long_path_entries = {name: entry for name, entry in candidate_reasoning_path.items() if len(entry["Path"]) >= 1}
candidate_reasoning_path = {**long_path_entries, **top_short_path_dict}

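The system splits the candidates into short-path and long-path entries:

Short path: the entity has no neighbors, so its Path list is empty
Long path: the entity has at least one neighbor path
All long-path entries are kept; of the short-path entries, only the top 20% by Score are kept (at least one whenever any exist)

4. Answer type prediction and path scoring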
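
A small worked example makes the 20% rule concrete. The function below repackages the filtering snippet above so it can be run on toy data; the name filter_candidates and the sample entries are illustrative only.

```python
def filter_candidates(candidates, keep_ratio=0.2):
    """Keep every long-path entry plus the best-scoring share of short-path entries."""
    short = {n: e for n, e in candidates.items() if len(e["Path"]) < 1}
    long_ = {n: e for n, e in candidates.items() if len(e["Path"]) >= 1}
    ranked = sorted(short.items(), key=lambda item: item[1]["Score"], reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))  # never fewer than one slot
    return {**long_, **dict(ranked[:keep])}


candidates = {
    "A": {"Score": 0.9, "Path": [("A", "B")]},
    "X": {"Score": 0.7, "Path": []},
    "Y": {"Score": 0.8, "Path": []},
    "Z": {"Score": 0.3, "Path": []},
}
kept = filter_candidates(candidates)
print(sorted(kept))  # ['A', 'Y']
```

With three short-path entries, int(3 * 0.2) is 0, so the max(1, ...) floor is what keeps the single best one (Y).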

node_datas_from_type = await knowledge_graph_inst.get_node_from_types(type_keywords)
maybe_answer_list = [n["entity_name"] for n in node_datas_from_type]
imp_ents = imp_ents + maybe_answer_list
scored_reasoning_path = cal_path_score_list(candidate_reasoning_path, maybe_answer_list)

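Based on the predicted answer types, the system:

Fetches all nodes of those types
Builds a list of potential answers
Counts how many potential answers each path contains
Produces a scored set of paths

The cal_path_score_list function counts how many potential-answer entities each path contains: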

def cal_path_score_list(candidate_reasoning_path, maybe_answer_list):
    scored_reasoning_path = {}
    for k, v in candidate_reasoning_path.items():
        score = v["Score"]
        paths = v["Path"]
        scores = {}
        for p in paths:
            scores[p] = [count_elements_in_tuple(p, maybe_answer_list)]
        scored_reasoning_path[k] = {"Score": score, "Path": scores}
    return scored_reasoning_path
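
The helper count_elements_in_tuple is not shown above. The sketch below assumes it simply counts the nodes of a path that appear in the potential-answer list; that behaviour is a guess, not a reading of the library. score_paths is a compact, hypothetical restatement of cal_path_score_list so the example can run by itself.

```python
def count_elements_in_tuple(path, candidates):
    """Assumed behaviour: number of nodes in path that are potential answers."""
    candidate_set = set(candidates)
    return sum(1 for node in path if node in candidate_set)


def score_paths(candidate_reasoning_path, maybe_answer_list):
    """Same output shape as cal_path_score_list above, written as comprehensions."""
    return {
        name: {
            "Score": entry["Score"],
            "Path": {
                p: [count_elements_in_tuple(p, maybe_answer_list)]
                for p in entry["Path"]
            },
        }
        for name, entry in candidate_reasoning_path.items()
    }


candidates = {"A": {"Score": 0.9, "Path": [("A", "B"), ("A", "B", "C")]}}
scored = score_paths(candidates, ["B", "C"])
print(scored["A"]["Path"])  # {('A', 'B'): [1], ('A', 'B', 'C'): [2]}
```

Each path maps to a one-element list so that the edge-vote count from the next step can be appended to it.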

5. Relationship query and edge-vote scoring

results_edge = await relationships_vdb.query(originalquery, top_k=len(ent_from_query) * query_param.top_k)
goodedge = []
badedge = []
for item in results_edge:
    if item["src_id"] in imp_ents or item["tgt_id"] in imp_ents:
        goodedge.append(item)
    else:
        badedge.append(item)
scored_edged_reasoning_path, pairs_append = edge_vote_path(scored_reasoning_path, goodedge)

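The system then queries and scores relationships:

Queries relationships_vdb for edges related to the original query
Splits the edges into "good edges" that touch an important entity and "bad edges" that do not
Uses the good edges to vote on the paths

For each path, the edge_vote_path function counts how many good edges occur in it and appends that count to the path's score list: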

def edge_vote_path(path_dict, edge_list):
    return_dict = copy.deepcopy(path_dict)
    EDGELIST = []
    pairs_append = {}
    for i in edge_list:
        EDGELIST.append((i["src_id"], i["tgt_id"]))
    for i in return_dict.items():
        for j in i[1]["Path"].items():
            if j[1]:
                count = 0
                for pairs in EDGELIST:
                    if is_continuous_subsequence(pairs, j[0]):
                        count = count + 1
                        if j[0] not in pairs_append:
                            pairs_append[j[0]] = [pairs]
                        else:
                            pairs_append[j[0]].append(pairs)
                j[1].append(count)
    return return_dict, pairs_append
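
The helper is_continuous_subsequence is not shown above. The sketch below assumes it tests whether an edge's (src_id, tgt_id) pair appears as two adjacent nodes of the path, in that order; both that assumption and the names is_adjacent_pair and vote are hypothetical.

```python
def is_adjacent_pair(pair, path):
    """True if pair[0] is immediately followed by pair[1] somewhere in path."""
    src, tgt = pair
    return any(a == src and b == tgt for a, b in zip(path, path[1:]))


def vote(scored_paths, good_edges):
    """Append to each path's score list the number of good edges it contains."""
    pairs = [(e["src_id"], e["tgt_id"]) for e in good_edges]
    voted = {}
    for path, scores in scored_paths.items():
        hits = [p for p in pairs if is_adjacent_pair(p, path)]
        voted[path] = scores + [len(hits)]
    return voted


scored_paths = {("A", "B"): [1], ("A", "B", "C"): [2]}
good_edges = [{"src_id": "B", "tgt_id": "C"}, {"src_id": "C", "tgt_id": "D"}]
print(vote(scored_paths, good_edges))
# {('A', 'B'): [1, 0], ('A', 'B', 'C'): [2, 1]}
```

After voting, every path carries two numbers: the potential-answer count from step 4 and the edge-vote count from this step.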

6. Converting paths to text chunks

scored_edged_reasoning_path = await path2chunk(
    scored_edged_reasoning_path,
    knowledge_graph_inst,
    pairs_append,
    originalquery,
    max_chunks_per_entity=3,
)

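The path2chunk function converts paths into text chunk IDs:

Collects the text chunks attached to edges on the path
Collects the text chunks attached to nodes on the path
Selects chunks by similarity to the query
Computes each chunk's weight: chunk_weight = occurrence_count * (answer_score + edge_score + 1)
Keeps the highest-scoring chunks for each entity (the call above passes max_chunks_per_entity=3)

7. Final chunk scoring and selection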
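
The weight formula can be checked with a few numbers. This sketch assumes occurrence_count is the number of times a chunk ID is collected for a path, and that answer_score and edge_score are the two counts produced in steps 4 and 5; the function name chunk_weights and the sample IDs are made up.

```python
from collections import Counter


def chunk_weights(chunk_ids, answer_score, edge_score):
    """Apply chunk_weight = occurrence_count * (answer_score + edge_score + 1)."""
    occurrences = Counter(chunk_ids)  # assumed meaning of occurrence_count
    return {
        cid: count * (answer_score + edge_score + 1)
        for cid, count in occurrences.items()
    }


weights = chunk_weights(["c1", "c2", "c1"], answer_score=2, edge_score=1)
print(weights)  # {'c1': 8, 'c2': 4}

# Keep only the best chunk, analogous to capping chunks per entity.
top = sorted(weights, key=weights.get, reverse=True)[:1]
print(top)  # ['c1']
```

The + 1 term keeps a chunk's weight above zero even when its path has no potential answers and no good edges.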

scorednode2chunk(ent_from_query_dict, scored_edged_reasoning_path)
results = await chunks_vdb.query(originalquery, top_k=int(query_param.top_k / 2))
chunks_ids = [r["id"] for r in results]
final_chunk_id = kwd2chunk(ent_from_query_dict, chunks_ids, chunk_nums=int(query_param.top_k / 2))

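The system combines entity-based evidence with vector retrieval results. scorednode2chunk runs first; since kwd2chunk later reads d["Score"] and d["Path"] from ent_from_query_dict, it presumably replaces each matched entity name there with its scored entry, in place. The system then:

Runs a vector search to get the IDs of chunks related to the query
Uses kwd2chunk to compute the final score of each chunk
Selects the highest-scoring chunks

The kwd2chunk function implements a custom weighting scheme: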

def kwd2chunk(ent_from_query_dict, chunks_ids, chunk_nums):
    final_chunk = Counter()
    for key, list_of_dicts in ent_from_query_dict.items():
        id_scores = {}
        for d in list_of_dicts:
            # The first matched entity's score is doubled
            if d == list_of_dicts[0]:
                score = d["Score"] * 2
            else:
                score = d["Score"]
            path = d["Path"]
            for id in path:
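                # x10 if this is the path's first chunk and it is also in the vector results
                # (score is not reset, so the path's later chunks inherit the boost)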
                if id == path[0] and id in chunks_ids:
                    score = score * 10
                if id in id_scores:
                    id_scores[id] += score
                else:
                    id_scores[id] = score
        final_chunk.update(id_scores)
    return [i[0] for i in final_chunk.most_common(chunk_nums)]
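
To see how the two multipliers interact, the sketch below isolates the inner loop of kwd2chunk for a single scored entry. The function name score_one_entry and the sample data are hypothetical; the arithmetic follows the listing above.

```python
def score_one_entry(entry, chunks_ids, is_first):
    """Score the chunks of one entry the way the inner loop of kwd2chunk does."""
    score = entry["Score"] * 2 if is_first else entry["Score"]
    id_scores = {}
    for chunk_id in entry["Path"]:
        if chunk_id == entry["Path"][0] and chunk_id in chunks_ids:
            score = score * 10  # not reset afterwards, so later chunks inherit it
        id_scores[chunk_id] = id_scores.get(chunk_id, 0) + score
    return id_scores


entry = {"Score": 0.5, "Path": ["c1", "c2"]}
print(score_one_entry(entry, chunks_ids=["c1"], is_first=True))   # {'c1': 10.0, 'c2': 10.0}
print(score_one_entry(entry, chunks_ids=[], is_first=True))       # {'c1': 1.0, 'c2': 1.0}
print(score_one_entry(entry, chunks_ids=["c1"], is_first=False))  # {'c1': 5.0, 'c2': 5.0}
```

In this reading, a vector-search hit on a path's first chunk lifts every chunk on that path, not only the first one.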

8. Context assembly and answer generation

The system assembles the entity information and chunk contents into a structured context:

entities_context = list_of_list_to_csv(entites_section_list)
text_units_context = list_of_list_to_csv(text_units_section_list)
context = f"""
-----Entities-----
```csv
{entities_context}
```
-----Sources-----
```csv
{text_units_context}
```
"""

Finally, the LLM generates the final answer from the query and the assembled context:

sys_prompt = sys_prompt_temp.format(context_data=context, response_type=query_param.response_type)
response = await use_model_func(query, system_prompt=sys_prompt)
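
Going back to the context assembly at the start of this step, list_of_list_to_csv is not shown in this walkthrough. A minimal sketch of such a helper, assuming it just serializes rows with the standard csv module, could look like this; the name rows_to_csv and the sample rows are illustrative.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows to CSV text (hypothetical stand-in for list_of_list_to_csv)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


entities_context = rows_to_csv([["entity", "score"], ["MiniRAG", 0.9]])
print(entities_context)
# entity,score
# MiniRAG,0.9

# Fields containing commas are quoted, so chunk text stays in one column.
print(rows_to_csv([["a, b", "c"]]))  # "a, b",c
```

Using the csv module instead of joining strings by hand matters here because chunk text often contains commas and quotes.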

This flow combines entity similarity, graph structure, and vector similarity when selecting the chunks that are passed to answer generation.
