
【LLM in Practice | llamaIndex】Introduction to llamaIndex and RAG

every blog every motto: You can do more than you think.
https://blog.csdn.net/weixin_39190382?type=blog

0. Preface


!pip install --upgrade llama-index
!pip install llama-index-llms-dashscope
!pip install llama-index-llms-openai-like
!pip install llama-index-embeddings-dashscope

1. A first taste of llama_index

import os
from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike
from llama_index.llms.dashscope import DashScope, DashScopeGenerationModels
from llama_index.embeddings.dashscope import DashScopeEmbedding, DashScopeTextEmbeddingModels

# Replace LlamaIndex's default LLM with the Bailian (DashScope) model
# Settings.llm = OpenAILike(
#     model="qwen-max",
#     api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
#     api_key=os.getenv("DASHSCOPE_API_KEY"),
#     is_chat_model=True
# )
Settings.llm = DashScope(
    model_name=DashScopeGenerationModels.QWEN_MAX,
    api_key=os.getenv("DASHSCOPE_API_KEY")
)

# Replace LlamaIndex's default embedding model with Bailian's embedding model
Settings.embed_model = DashScopeEmbedding(
    # model_name="text-embedding-v1"
    model_name=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V1,
    # api_key=os.getenv("DASHSCOPE_API_KEY")
)
import os
print("API Key exists:", os.getenv("DASHSCOPE_API_KEY") is not None)
API Key exists: True
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("deepseek v3有多少参数?")
print(response)
DeepSeek-V3 拥有总计671亿个参数,每个令牌激活37亿个参数。

2. LlamaIndex

2.1 Data loading

2.1.1 Loading local data

import json
from pydantic.v1 import BaseModel

def show_json(data):
    """Pretty-print JSON data."""
    if isinstance(data, str):
        obj = json.loads(data)
        print(json.dumps(obj, indent=4, ensure_ascii=False))
    elif isinstance(data, dict) or isinstance(data, list):
        print(json.dumps(data, indent=4, ensure_ascii=False))
    elif issubclass(type(data), BaseModel):
        print(json.dumps(data.dict(), indent=4, ensure_ascii=False))

def show_list_obj(data):
    """Pretty-print a list of objects."""
    if isinstance(data, list):
        for item in data:
            show_json(item)
    else:
        raise ValueError("Input is not a list")
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    input_dir="./data",      # target directory
    recursive=False,         # whether to traverse subdirectories recursively
    required_exts=[".pdf"]   # (optional) only read files with the given extensions
)
documents = reader.load_data()
print(documents[0].text)
show_json(documents[0].json())
DeepSeek-V3 Technical Report
DeepSeek-AI
research@deepseek.com
Abstract
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total
parameters with 37B activated for each token. To achieve efficient inference and cost-effective
training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architec-
tures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers
an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training
objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and
high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to
fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms
other open-source models and achieves performance comparable to leading closed-source
models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours
for its full training. In addition, its training process is remarkably stable. Throughout the entire
training process, we did not experience any irrecoverable loss spikes or perform any rollbacks.
The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
MMLU-Pro (EM) | GPQA-Diamond (Pass@1) | MATH 500 (EM) | AIME 2024 (Pass@1) | Codeforces (Percentile) | SWE-bench Verified (Resolved)
[bar-chart axis ticks and per-model accuracy/percentile values omitted]
Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts.
arXiv:2412.19437v2  [cs.CL]  18 Feb 2025
{
    "id_": "bda2d985-93a0-46be-8d46-b70b663685ca",
    "embedding": null,
    "metadata": {
        "page_label": "1",
        "file_name": "deepseek-v3-1-4.pdf",
        "file_path": "c:\\Users\\13010\\Desktop\\202505\\juke\\l0\\04llamaindex\\data\\deepseek-v3-1-4.pdf",
        "file_type": "application/pdf",
        "file_size": 192218,
        "creation_date": "2025-08-05",
        "last_modified_date": "2025-03-12"
    },
    "excluded_embed_metadata_keys": [
        "file_name",
        "file_type",
        "file_size",
        "creation_date",
        "last_modified_date",
        "last_accessed_date"
    ],
    "excluded_llm_metadata_keys": [
        "file_name",
        "file_type",
        "file_size",
        "creation_date",
        "last_modified_date",
        "last_accessed_date"
    ],
    "relationships": {},
    "metadata_template": "{key}: {value}",
    "metadata_separator": "\n",
    "text_resource": {
        "embeddings": null,
        "text": "DeepSeek-V3 Technical Report\nDeepSeek-AI\nresearch@deepseek.com\nAbstract\nWe present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total\nparameters with 37B activated for each token. ... (full page text as printed above) ...",
        "path": null,
        "url": null,
        "mimetype": null
    },
    "image_resource": null,
    "audio_resource": null,
    "video_resource": null,
    "text_template": "{metadata_str}\n\n{content}",
    "class_name": "Document",
    "text": "DeepSeek-V3 Technical Report\n... (same page text, repeated) ..."
}
# Configure LLAMA_CLOUD_API_KEY=XXX in the system environment variables
from llama_cloud_services import LlamaParse
from llama_index.core import SimpleDirectoryReader
import nest_asyncio

nest_asyncio.apply()  # only needed in a Jupyter notebook environment, otherwise an error occurs

# set up parser
parser = LlamaParse(
    result_type="markdown"  # "markdown" and "text" are available
)

file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    input_dir="./data", required_exts=[".pdf"], file_extractor=file_extractor
).load_data()
print(documents[0].text)
Started parsing the file under job_id 48c6dac4-2a8a-4b18-8877-c558599ebce7
arXiv:2412.19437v2 [cs.CL] 18 Feb 2025

# DeepSeek-V3 Technical Report

# DeepSeek-AI

research@deepseek.com

# Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

|          | DeepSeek-V3 | DeepSeek-V2.5 | Qwen2.5-72B-Inst | Llama-3.1-405B-Inst | GPT-4o-0513 | Claude-3.5-Sonnet-1022 |   |   |
| -------- | ----------- | ------------- | ---------------- | ------------------- | ----------- | ---------------------- | - | - |
| 100      | 90.2        |               |                  |                     |             |                        |   |   |
| 80       | 75.9        | 71.6          | 73.3             | 72.6                | 78.0        |                        |   |   |
| (%)      | 66.2        | 65.0          | 59.1             |                     |             |                        |   |   |
| 60       |             | 49.0          | 51.1             | 49.9                | 51.6        | 50.8                   |   |   |
| Accuracy |             | 41.3          | 39.2             | ~~38.8~~            | 35.6        |                        |   |   |
| 20       |             |               | 16.7             | 16.0                | 20.3        |                        |   |   |
|          |             |               |                  | 9.3                 |             |                        |   |   |
| 0        | MMLU-Pro    | GPQA-Diamond  | MATH 500         | AIME 2024           | Codeforces  | SWE-bench Verified     |   |   |

Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts.
2.1.2 Data Connectors

Data connectors handle richer data sources and read them in as Document objects.

For example, reading a web page directly:

# !pip install llama-index-readers-web
from llama_index.readers.web import SimpleWebPageReader

documents = SimpleWebPageReader(html_to_text=True).load_data(
    # ["https://blog.csdn.net/"]
    # ["https://baidu.com/"]
    # ["https://docs.llamaindex.org.cn/en/stable/module_guides/loading/connector/llama_parse/"]
    ["https://cn.bing.com/"]
)
print(documents)
print(documents[0].text)
[Document(id_='cd7ba3a6-955e-455f-a66f-0432ea2497d4', embedding=None, metadata={'url': 'https://cn.bing.com/'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='[](javascript:void\\(0\\))\n\n(C) 2025 Microsoft\n\n  * [增值电信业务经营许可证:合字B2-20090007](https://dxzhgl.miit.gov.cn/dxxzsp/xkz/xkzgl/resource/qiyereport.jsp?num=caf04fa4-bd8a-4d9e-80b6-2aa1b86c1509&type=yreport)\n  * [京ICP备10036305号-7](https://beian.miit.gov.cn)\n  * [京公网安备11010802022657号](http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11010802022657)\n  * [隐私与 Cookie](//go.microsoft.com/fwlink/?LinkId=521839)\n  * [法律声明](//go.microsoft.com/fwlink/?LinkID=246338)\n  * [广告](//go.microsoft.com/fwlink/?linkid=868923)\n  * [关于我们的广告](//go.microsoft.com/fwlink/?LinkID=286759)\n  * [帮助](//support.microsoft.com/topic/82d20721-2d6f-4012-a13d-d1910ccf203f)\n  * 反馈\n\n', path=None, url=None, mimetype=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}')]
[](javascript:void\(0\))(C) 2025 Microsoft* [增值电信业务经营许可证:合字B2-20090007](https://dxzhgl.miit.gov.cn/dxxzsp/xkz/xkzgl/resource/qiyereport.jsp?num=caf04fa4-bd8a-4d9e-80b6-2aa1b86c1509&type=yreport)* [京ICP备10036305号-7](https://beian.miit.gov.cn)* [京公网安备11010802022657号](http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11010802022657)* [隐私与 Cookie](//go.microsoft.com/fwlink/?LinkId=521839)* [法律声明](//go.microsoft.com/fwlink/?LinkID=246338)* [广告](//go.microsoft.com/fwlink/?linkid=868923)* [关于我们的广告](//go.microsoft.com/fwlink/?LinkID=286759)* [帮助](//support.microsoft.com/topic/82d20721-2d6f-4012-a13d-d1910ccf203f)* 反馈

2.2 Splitting the data

2.2.1 Text splitting with TextSplitters

from llama_index.core import Document
from llama_index.core.node_parser import TokenTextSplitter

node_parser = TokenTextSplitter(
    chunk_size=512,     # maximum length of each chunk
    chunk_overlap=200   # overlap between adjacent chunks
)

nodes = node_parser.get_nodes_from_documents(
    documents, show_progress=False
)
show_json(nodes[0].json())
# show_json(nodes[2].json())
{
    "id_": "ba21399f-d604-4825-bd5c-c87fc84d7b6d",
    "embedding": null,
    "metadata": {
        "url": "https://baidu.com/"
    },
    "excluded_embed_metadata_keys": [],
    "excluded_llm_metadata_keys": [],
    "relationships": {
        "1": {
            "node_id": "3954f3ae-64d8-4b4f-a936-ed97e4c70132",
            "node_type": "4",
            "metadata": {
                "url": "https://baidu.com/"
            },
            "hash": "cb3e9d6c6fc892abce4b33bf77c0c91f02dd7132b84113acbe9c28e4eb80f98d",
            "class_name": "RelatedNodeInfo"
        }
    },
    "metadata_template": "{key}: {value}",
    "metadata_separator": "\n",
    "text": "![](//www.baidu.com/img/bd_logo1.png)\n\n[新闻](http://news.baidu.com) [hao123](http://www.hao123.com)\n[地图](http://map.baidu.com) [视频](http://v.baidu.com)\n[贴吧](http://tieba.baidu.com)\n[登录](http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1)\n[更多产品](//www.baidu.com/more/)\n\n[关于百度](http://home.baidu.com) [About Baidu](http://ir.baidu.com)\n\n(C)2017 Baidu [使用百度前必读](http://www.baidu.com/duty/)\n[意见反馈](http://jianyi.baidu.com/) 京ICP证030173号\n![](//www.baidu.com/img/gs.gif)",
    "mimetype": "text/plain",
    "start_char_idx": 0,
    "end_char_idx": 575,
    "metadata_seperator": "\n",
    "text_template": "{metadata_str}\n\n{content}",
    "class_name": "TextNode"
}

2.2.2 Parsing structured documents with NodeParsers

from llama_index.core.node_parser import HTMLNodeParser
from llama_index.readers.web import SimpleWebPageReader

documents = SimpleWebPageReader(html_to_text=False).load_data(
    # ["https://edu.guangjuke.com/tx/"]
    ["https://cn.bing.com/"]
)

# By default it parses ["p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "b", "i", "u", "section"]
parser = HTMLNodeParser(tags=["span"])  # you can customize which tags to parse
nodes = parser.get_nodes_from_documents(documents)

for node in nodes:
    print(node.text + "\n")
© 2025 Microsoft

2.3 Indexing and Retrieving

2.3.1 Vector indexes

  1. VectorStoreIndex builds a Vector Store in memory and indexes it directly
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import TokenTextSplitter, SentenceSplitter

# Load the PDF documents
documents = SimpleDirectoryReader(
    "./data",
    required_exts=[".pdf"],
).load_data()
print(type(documents))
print(type(documents[0]))
# print(documents[0].text)
print('-'*10)

# Define the Node Parser
node_parser = TokenTextSplitter(chunk_size=512, chunk_overlap=200)

# Split the documents into nodes
nodes = node_parser.get_nodes_from_documents(documents)

# Build the index (in memory by default)
index = VectorStoreIndex(nodes)

# An alternative way to build the index
# index = VectorStoreIndex.from_documents(documents=documents, transformations=[SentenceSplitter(chunk_size=512)])

# Persist to a local directory
# index.storage_context.persist(persist_dir="./doc_emb")

# Get a retriever
vector_retriever = index.as_retriever(
    similarity_top_k=2  # return 2 results
)

# Retrieve
results = vector_retriever.retrieve("deepseek v3数学能力怎么样?")
print(results[0].text)
<class 'list'>
<class 'llama_index.core.schema.Document'>
----------
verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its
reasoning performance. Meanwhile, we also maintain control over the output style and
length of DeepSeek-V3.
Summary of Core Evaluation Results
• Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA,
DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9
on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source
models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source
and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3
demonstrates superior performance among open-source models on both SimpleQA and
Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual
knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese
SimpleQA), highlighting its strength in Chinese factual knowledge.
• Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on
math-related benchmarks among all non-long-CoT open-source and closed-source models.
Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500,
demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks,
DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks,
such as LiveCodeBench, solidifying its position as the leading model in this domain. For
engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5,
it still outpaces all other models by a significant margin, demonstrating its competitiveness
across diverse technical benchmarks.
In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3
model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing
our compute clusters, the training framework, the support for FP8 training, the inference
deployment strategy, and our suggestions on future hardware design. Next, we
  2. Using a custom Vector Store, with Qdrant as an example:
# !pip install llama-index-vector-stores-qdrant
from llama_index.core.indices.vector_store.base import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import StorageContext
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

client = QdrantClient(location=":memory:")
collection_name = "demo"
collection = client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

vector_store = QdrantVectorStore(client=client, collection_name=collection_name)

# storage: specify the storage space
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the index: linked to the custom Vector Store through the Storage Context
index = VectorStoreIndex(nodes, storage_context=storage_context)

# Get a retriever
vector_retriever = index.as_retriever(similarity_top_k=1)

# Retrieve
results = vector_retriever.retrieve("deepseek v3数学能力怎么样")
print(results[0])
Node ID: 06218c00-c6ac-42d0-baa2-6824db6a69ec
Text: verification and reflection patterns of R1 into DeepSeek-V3 and
notably improves its reasoning performance. Meanwhile, we also
maintain control over the output style and length of DeepSeek-V3.
Summary of Core Evaluation Results • Knowledge: (1) On educational
benchmarks such as MMLU, MMLU-Pro, and GPQA, DeepSeek-V3 outperforms
all other open-sou...
Score:  0.693

2.3.2 Post-processing retrieved results

# Get a retriever
vector_retriever = index.as_retriever(similarity_top_k=5)

# Retrieve
nodes = vector_retriever.retrieve("deepseek v3有多少参数?")

for i, node in enumerate(nodes):
    print(f"[{i}] {node.text}\n")
[0] and discussions (Section 5). Lastly,
we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential
directions for future research (Section 6).
2. Architecture
We first introduce the basic architecture of DeepSeek-V3, featured by Multi-head Latent Atten-
tion (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024)
for economical training. Then, we present a Multi-Token Prediction (MTP) training objective,
which we have observed to enhance the overall performance on evaluation benchmarks. For
other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-
V2 (DeepSeek-AI, 2024c).
2.1. Basic Architecture
The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017)
framework. For efficient inference and economical training, DeepSeek-V3 also adopts MLA
and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with
DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing
6[1] DeepSeek-V3 Technical Report
DeepSeek-AI
research@deepseek.com
Abstract
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total
parameters with 37B activated for each token. To achieve efficient inference and cost-effective
training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architec-
tures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers
an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training
objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and
high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to
fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms
other open-source models and achieves performance comparable to leading closed-source
models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours
for its full training. In addition, its training process is remarkably stable. Throughout the entire
training process, we did not experience any irrecoverable loss spikes or perform any rollbacks.
The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
MMLU-Pro (EM) | GPQA-Diamond (Pass@1) | MATH 500 (EM) | AIME 2024 (Pass@1) | Codeforces (Percentile) | SWE-bench Verified (Resolved)
[bar-chart axis ticks and per-model values omitted]

[2] verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its
reasoning performance. Meanwhile, we also maintain control over the output style and
length of DeepSeek-V3.
Summary of Core Evaluation Results
• Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA,
DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9
on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source
models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source
and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3
demonstrates superior performance among open-source models on both SimpleQA and
Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual
knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese
SimpleQA), highlighting its strength in Chinese factual knowledge.
• Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on
math-related benchmarks among all non-long-CoT open-source and closed-source models.
Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500,
demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks,
DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks,
such as LiveCodeBench, solidifying its position as the leading model in this domain. For
engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5,
it still outpaces all other models by a significant margin, demonstrating its competitiveness
across diverse technical benchmarks.
In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3
model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing
our compute clusters, the training framework, the support for FP8 training, the inference
deployment strategy, and our suggestions on future hardware design. Next, we[3] open-source and closed-source models.
Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500,
demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks,
DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks,
such as LiveCodeBench, solidifying its position as the leading model in this domain. For
engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5,
it still outpaces all other models by a significant margin, demonstrating its competitiveness
across diverse technical benchmarks.
In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3
model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing
our compute clusters, the training framework, the support for FP8 training, the inference
deployment strategy, and our suggestions on future hardware design. Next, we describe our
pre-training process, including the construction of training data, hyper-parameter settings, long-
context extension techniques, the associated evaluations, as well as some discussions (Section 4).
Thereafter, we discuss our efforts on post-training, which include Supervised Fine-Tuning (SFT),
Reinforcement Learning (RL), the corresponding evaluations, and discussions (Section 5). Lastly,
we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential
directions for future research (Section 6).
2. Architecture
We first introduce the basic architecture of DeepSeek-V3, featured by Multi-head Latent Atten-
tion (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024)
for economical training. Then, we present a Multi-Token Prediction (MTP) training objective,
which we have observed to enhance the overall performance on evaluation benchmarks. For
other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-
V2 (DeepSeek-AI, 2024c).
2.1. Basic Architecture
The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017)
framework. For efficient[4] et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this
work, we introduce an FP8 mixed precision training framework and, for the first time, validate
its effectiveness on an extremely large-scale model. Through the support for FP8 computation
and storage, we achieve both accelerated training and reduced GPU memory usage. As for
the training framework, we design the DualPipe algorithm for efficient pipeline parallelism,
which has fewer pipeline bubbles and hides most of the communication during training through
computation-communication overlap. This overlap ensures that, as the model further scales up,
as long as we maintain a constant computation-to-communication ratio, we can still employ
fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead.
In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize
InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory
footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism.
Combining these efforts, we achieve high training efficiency.
During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The
pre-training process is remarkably stable. Throughout the entire training process, we did not
encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage
context length extension for DeepSeek-V3. In the first stage, the maximum context length is
extended to 32K, and in the second stage, it is further extended to 128K. Following this, we
conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)
on the base model of DeepSeek-V3, to align it with human preferences and further unlock its
potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-
R1 series of models, and meanwhile carefully maintain the balance between model accuracy
4
from llama_index.core.postprocessor import LLMRerank

postprocessor = LLMRerank(top_n=2)

nodes = postprocessor.postprocess_nodes(nodes, query_str="deepseek v3有多少参数?")

for i, node in enumerate(nodes):
    print(f"[{i}] {node.text}")
[0] DeepSeek-V3 Technical Report
DeepSeek-AI
research@deepseek.com
Abstract
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total
parameters with 37B activated for each token. To achieve efficient inference and cost-effective
training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architec-
tures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers
an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training
objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and
high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to
fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms
other open-source models and achieves performance comparable to leading closed-source
models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours
for its full training. In addition, its training process is remarkably stable. Throughout the entire
training process, we did not experience any irrecoverable loss spikes or perform any rollbacks.
The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
MMLU-Pro (EM) | GPQA-Diamond (Pass@1) | MATH 500 (EM) | AIME 2024 (Pass@1) | Codeforces (Percentile) | SWE-bench Verified (Resolved)
[bar-chart axis ticks and per-model values omitted]
[1] and discussions (Section 5). Lastly,
we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential
directions for future research (Section 6).
2. Architecture
We first introduce the basic architecture of DeepSeek-V3, featured by Multi-head Latent Atten-
tion (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024)
for economical training. Then, we present a Multi-Token Prediction (MTP) training objective,
which we have observed to enhance the overall performance on evaluation benchmarks. For
other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-
V2 (DeepSeek-AI, 2024c).
2.1. Basic Architecture
The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017)
framework. For efficient inference and economical training, DeepSeek-V3 also adopts MLA
and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with
DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing
6

2.4 Generation

2.4.1 Single-turn

qa_engine = index.as_query_engine()
response = qa_engine.query("deepseek v3数学能力怎么样?")
print(response)
DeepSeek-V3在数学相关基准测试中表现出色,特别是在非长链思维链(non-long-CoT)的开源和闭源模型中达到了最先进的性能。它在特定的基准测试如MATH-500上甚至超过了o1-preview,这表明它具有强大的数学推理能力。

Streaming output

qa_engine = index.as_query_engine(streaming=True)
response = qa_engine.query("deepseek v3数学能力怎么样?")
response.print_response_stream()
DeepSeek-V3在数学相关基准测试中表现出色,特别是在非长链思维链(non-long-CoT)的开源和闭源模型中达到了最先进的性能。它甚至在某些特定的基准测试,如MATH-500上超越了o1-preview,这表明它具有强大的数学推理能力。
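Besides print_response_stream(), the streaming response also exposes the underlying token generator if you want to handle the chunks yourself. A minimal sketch, assuming the same qa_engine created above (response_gen is the streaming generator attribute of the returned response):

# Consume the token stream manually instead of calling print_response_stream()
response = qa_engine.query("deepseek v3数学能力怎么样?")
for text in response.response_gen:
    print(text, end="", flush=True)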

2.4.2 Multi-turn

chat_engine = index.as_chat_engine()
response = chat_engine.chat("deepseek v3数学能力怎么样?")
print(response)
DeepSeek-V3在数学能力方面表现出色。根据文档中的信息,DeepSeek-V3在与数学相关的基准测试中,在所有非长链思维链(non-long-CoT)的开源和闭源模型中达到了最先进的性能。特别值得注意的是,它甚至在某些特定的基准测试上超过了o1-preview,例如MATH-500,这展示了其强大的数学推理能力。因此,可以说DeepSeek-V3在处理数学问题时具有很强的能力。
response = chat_engine.chat("代码能力呢?")
print(response)
Empty Response

2.5 Low-level interfaces: Prompt, LLM, Embedding

2.5.1 Prompt

from llama_index.core import PromptTemplate

prompt = PromptTemplate("写一个关于{topic}的笑话")
prompt.format(topic="小明")
'写一个关于小明的笑话'
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core import ChatPromptTemplate

chat_text_qa_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content="你叫{name},你必须根据用户提供的上下文回答问题。",
    ),
    ChatMessage(
        role=MessageRole.USER,
        content=(
            "已知上下文:\n"
            "{context}\n\n"
            "问题:{question}"
        )
    ),
]
text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)

print(text_qa_template.format(
    name="小明",
    context="这是一个测试",
    question="这是什么"
))
system: 你叫小明,你必须根据用户提供的上下文回答问题。
user: 已知上下文:
这是一个测试

问题:这是什么
assistant: 

2.5.2 Language models

from llama_index.llms.openai import OpenAI

llm = OpenAI(temperature=0, model="gpt-4o")
response = llm.complete(prompt.format(topic="小明"))
print(response.text)

response = llm.complete(
    text_qa_template.format(
        name="小明",
        context="这是一个测试",
        question="你是谁,我们在干嘛"
    )
)
print(response.text)

Connecting to DeepSeek

# !pip install llama-index-llms-deepseek
import os
from llama_index.llms.deepseek import DeepSeek

llm = DeepSeek(model="deepseek-chat", api_key=os.getenv("DEEPSEEK_API_KEY"), temperature=1.5)

response = llm.complete("写个笑话")
print(response)
from llama_index.core import Settings
# Set the global language model
Settings.llm = DeepSeek(model="deepseek-chat", api_key=os.getenv("DEEPSEEK_API_KEY"), temperature=1.5)

2.5.3 Embedding

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

# Global setting
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small", dimensions=512)

2.6 Building a RAG system with LlamaIndex

Requirements:

  • Load files from a specified directory
  • Support RAG-Fusion (a sketch of the fusion idea follows this list)
  • Use the Qdrant vector database and persist it locally
  • Support post-retrieval reranking
  • Support multi-turn conversation
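RAG-Fusion rewrites the user question into several query variants, retrieves with each variant, and fuses the per-variant rankings into one list; in the implementation below that job is handled by QueryFusionRetriever. A minimal sketch of the fusion idea only, using plain reciprocal rank fusion over hypothetical document ids (not the library's internal code):

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc ids into one ranking (higher score = better)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Three hypothetical query variants, each with its own ranked retrieval results:
ranked_lists = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_a", "doc_d", "doc_b"],
]
print(reciprocal_rank_fusion(ranked_lists))  # doc_a and doc_b rise to the top

The full implementation below wires a fusion retriever of this kind into the query and chat engines.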
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

EMBEDDING_DIM = 1536
COLLECTION_NAME = "full_demo"
PATH = "./qdrant_db"

client = QdrantClient(path=PATH)
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, get_response_synthesizer
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.response_synthesizers import ResponseMode
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core import Settings
from llama_index.core import StorageContext
from llama_index.core.postprocessor import LLMRerank, SimilarityPostprocessor
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.chat_engine import CondenseQuestionChatEngine
from llama_index.llms.dashscope import DashScope, DashScopeGenerationModels
from llama_index.embeddings.dashscope import DashScopeEmbedding, DashScopeTextEmbeddingModels

# 1. Set the global LLM and embedding model
Settings.llm = DashScope(model_name=DashScopeGenerationModels.QWEN_MAX, api_key=os.getenv("DASHSCOPE_API_KEY"))
Settings.embed_model = DashScopeEmbedding(model_name=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V1)

# 2. Set the global document-processing (Ingestion) pipeline
Settings.transformations = [SentenceSplitter(chunk_size=512, chunk_overlap=200)]

# 3. Load the local documents
documents = SimpleDirectoryReader("./data").load_data()

if client.collection_exists(collection_name=COLLECTION_NAME):
    client.delete_collection(collection_name=COLLECTION_NAME)

# 4. Create the collection
client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=EMBEDDING_DIM, distance=Distance.COSINE)
)

# 5. Create the Vector Store
vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)

# 6. Point the index's storage at the Vector Store
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

# 7. Define the post-retrieval reranking model
reranker = LLMRerank(top_n=2)
# Documents whose final score is below 0.6 are filtered out
sp = SimilarityPostprocessor(similarity_cutoff=0.6)

# 8. Define the RAG-Fusion retriever
fusion_retriever = QueryFusionRetriever(
    [index.as_retriever()],
    similarity_top_k=5,  # number of results to recall
    num_queries=3,       # number of generated queries
    use_async=False,
    # query_gen_prompt="",  # you can customize the query-generation prompt template
)

# 9. Build the single-turn query engine
query_engine = RetrieverQueryEngine.from_args(
    fusion_retriever,
    node_postprocessors=[reranker],
    response_synthesizer=get_response_synthesizer(response_mode=ResponseMode.REFINE)
)

# 10. Chat engine for multi-turn dialogue
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    # condense_question_prompt=""  # you can customize the condense-question prompt template
)

# Test multi-turn conversation
# User: deepseek v3有多少参数
# User: 每次激活多少
while True:
    question = input("User:")
    if question.strip() == "":
        break
    response = chat_engine.chat(question)
    print(f"AI: {response}")
AI: DeepSeek-V3 模型总共有 6710 亿参数,每个 token 激活 370 亿参数。
AI: DeepSeek-V3 模型每次为每个令牌激活 370 亿个参数。

2.7 Workflow

A workflow, as the name suggests, is an abstraction over a series of work steps.

LlamaIndex workflows are event-driven (a minimal sketch follows the list below):

  • A workflow is composed of steps
  • Each step handles a specific type of event
  • A step can also produce new events (handled by subsequent steps)
  • The workflow ends when a StopEvent is produced
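As a minimal sketch of these ideas (the workflow, event, and field names below are made up for illustration; only the imported classes come from LlamaIndex, and the step style mirrors the example in 2.7.3):

from llama_index.core.workflow import (
    Workflow,
    StartEvent,
    StopEvent,
    Event,
    Context,
    step,
)

# Custom event passed from the first step to the second
class JokeTopicEvent(Event):
    topic: str

class JokeWorkflow(Workflow):
    @step
    def pick_topic(self, ctx: Context, ev: StartEvent) -> JokeTopicEvent:
        # StartEvent carries the keyword arguments passed to .run()
        return JokeTopicEvent(topic=ev.topic)

    @step
    def tell_joke(self, ctx: Context, ev: JokeTopicEvent) -> StopEvent:
        # Returning a StopEvent ends the workflow
        return StopEvent(result=f"(a joke about {ev.topic})")

# In a Jupyter notebook, top-level await works:
# result = await JokeWorkflow().run(topic="小明")
# print(result)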

A concrete example: querying a database that contains multiple tables using natural language.

The workflow:
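StartEvent(query) → retrieve_tables → TableRetrieveEvent → generate_sql → TextToSQLEvent → generate_response → StopEvent(answer) (event and step names taken from the implementation in 2.7.3)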

Step by step:

  1. The user enters a natural-language query
  2. The system first retrieves the tables relevant to the query
  3. Based on the schemas of those tables, the LLM generates SQL
  4. The generated SQL is run against the database
  5. Based on the query results, the LLM generates a natural-language reply

2.7.1 Data preparation

# Download WikiTableQuestions
# WikiTableQuestions is a dataset designed for table question answering. It contains 2,108 HTML tables extracted from Wikipedia.

# !wget "https://github.com/ppasupat/WikiTableQuestions/releases/download/v1.0.2/WikiTableQuestions-1.0.2-compact.zip" -O wiki_data.zip
# !unzip wiki_data.zip
  1. Walk the directory and load the tables
import pandas as pd
from pathlib import Path

data_dir = Path("./WikiTableQuestions/csv/200-csv")
csv_files = sorted([f for f in data_dir.glob("*.csv")])
dfs = []
for csv_file in csv_files:
    print(f"processing file: {csv_file}")
    try:
        df = pd.read_csv(csv_file)
        dfs.append(df)
    except Exception as e:
        print(f"Error parsing {csv_file}: {str(e)}")
processing file: WikiTableQuestions\csv\200-csv\0.csv
processing file: WikiTableQuestions\csv\200-csv\1.csv
processing file: WikiTableQuestions\csv\200-csv\10.csv
processing file: WikiTableQuestions\csv\200-csv\11.csv
processing file: WikiTableQuestions\csv\200-csv\12.csv
processing file: WikiTableQuestions\csv\200-csv\14.csv
processing file: WikiTableQuestions\csv\200-csv\15.csv
Error parsing WikiTableQuestions\csv\200-csv\15.csv: Error tokenizing data. C error: Expected 4 fields in line 16, saw 5
processing file: WikiTableQuestions\csv\200-csv\17.csv
Error parsing WikiTableQuestions\csv\200-csv\17.csv: Error tokenizing data. C error: Expected 6 fields in line 5, saw 7
processing file: WikiTableQuestions\csv\200-csv\18.csv
processing file: WikiTableQuestions\csv\200-csv\20.csv
processing file: WikiTableQuestions\csv\200-csv\22.csv
processing file: WikiTableQuestions\csv\200-csv\24.csv
processing file: WikiTableQuestions\csv\200-csv\25.csv
processing file: WikiTableQuestions\csv\200-csv\26.csv
processing file: WikiTableQuestions\csv\200-csv\28.csv
processing file: WikiTableQuestions\csv\200-csv\29.csv
processing file: WikiTableQuestions\csv\200-csv\3.csv
processing file: WikiTableQuestions\csv\200-csv\30.csv
processing file: WikiTableQuestions\csv\200-csv\31.csv
processing file: WikiTableQuestions\csv\200-csv\32.csv
processing file: WikiTableQuestions\csv\200-csv\33.csv
processing file: WikiTableQuestions\csv\200-csv\34.csv
Error parsing WikiTableQuestions\csv\200-csv\34.csv: Error tokenizing data. C error: Expected 4 fields in line 6, saw 13
processing file: WikiTableQuestions\csv\200-csv\35.csv
processing file: WikiTableQuestions\csv\200-csv\36.csv
processing file: WikiTableQuestions\csv\200-csv\37.csv
processing file: WikiTableQuestions\csv\200-csv\38.csv
processing file: WikiTableQuestions\csv\200-csv\4.csv
processing file: WikiTableQuestions\csv\200-csv\41.csv
processing file: WikiTableQuestions\csv\200-csv\42.csv
processing file: WikiTableQuestions\csv\200-csv\44.csv
processing file: WikiTableQuestions\csv\200-csv\45.csv
processing file: WikiTableQuestions\csv\200-csv\46.csv
processing file: WikiTableQuestions\csv\200-csv\47.csv
processing file: WikiTableQuestions\csv\200-csv\48.csv
processing file: WikiTableQuestions\csv\200-csv\7.csv
processing file: WikiTableQuestions\csv\200-csv\8.csv
processing file: WikiTableQuestions\csv\200-csv\9.csv
  2. Generate a textual description for each table (used for retrieval) and save it in the WikiTableQuestions_TableInfo directory
from llama_index.core.prompts import ChatPromptTemplate
from llama_index.core.bridge.pydantic import BaseModel, Field
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage

class TableInfo(BaseModel):
    """Information regarding a structured table."""

    table_name: str = Field(
        ..., description="table name (must be underscores and NO spaces)"
    )
    table_summary: str = Field(
        ..., description="short, concise summary/caption of the table"
    )

prompt_str = """\
Give me a summary of the table with the following JSON format.

- The table name must be unique to the table and describe it while being concise.
- Do NOT output a generic table name (e.g. table, my_table).

Do NOT make the table name one of the following: {exclude_table_name_list}

Table:
{table_str}

Summary: """

prompt_tmpl = ChatPromptTemplate(
    message_templates=[ChatMessage.from_str(prompt_str, role="user")]
)

llm = DashScope(model_name=DashScopeGenerationModels.QWEN_MAX, api_key=os.getenv("DASHSCOPE_API_KEY"))
tableinfo_dir = "WikiTableQuestions_TableInfo"
# !mkdir {tableinfo_dir}
import json

def _get_tableinfo_with_index(idx: int) -> str:
    results_gen = Path(tableinfo_dir).glob(f"{idx}_*")
    results_list = list(results_gen)
    if len(results_list) == 0:
        return None
    elif len(results_list) == 1:
        path = results_list[0]
        with open(path, 'r') as file:
            data = json.load(file)
        return TableInfo.model_validate(data)
    else:
        raise ValueError(f"More than one file matching index: {list(results_gen)}")

table_names = set()
table_infos = []
for idx, df in enumerate(dfs):
    table_info = _get_tableinfo_with_index(idx)
    if table_info:
        table_infos.append(table_info)
    else:
        while True:
            df_str = df.head(10).to_csv()
            table_info = llm.structured_predict(
                TableInfo,
                prompt_tmpl,
                table_str=df_str,
                exclude_table_name_list=str(list(table_names)),
            )
            table_name = table_info.table_name
            print(f"Processed table: {table_name}")
            if table_name not in table_names:
                table_names.add(table_name)
                break
            else:
                # try again
                print(f"Table name {table_name} already exists, trying again.")
                pass
        out_file = f"{tableinfo_dir}/{idx}_{table_name}.json"
        json.dump(table_info.dict(), open(out_file, "w"))
        table_infos.append(table_info)
  3. Store the tables above in a SQLite database
# put data into sqlite db
from sqlalchemy import (
    create_engine,
    MetaData,
    Table,
    Column,
    String,
    Integer,
)
import re

# Function to create a sanitized column name
def sanitize_column_name(col_name):
    # Remove special characters and replace spaces with underscores
    return re.sub(r"\W+", "_", col_name)

# Function to create a table from a DataFrame using SQLAlchemy
def create_table_from_dataframe(
    df: pd.DataFrame, table_name: str, engine, metadata_obj
):
    # Sanitize column names
    sanitized_columns = {col: sanitize_column_name(col) for col in df.columns}
    df = df.rename(columns=sanitized_columns)

    # Dynamically create columns based on DataFrame columns and data types
    columns = [
        Column(col, String if dtype == "object" else Integer)
        for col, dtype in zip(df.columns, df.dtypes)
    ]

    # Create a table with the defined columns
    table = Table(table_name, metadata_obj, *columns)

    # Create the table in the database
    metadata_obj.create_all(engine)

    # Insert data from DataFrame into the table
    with engine.connect() as conn:
        for _, row in df.iterrows():
            insert_stmt = table.insert().values(**row.to_dict())
            conn.execute(insert_stmt)
        conn.commit()

# engine = create_engine("sqlite:///:memory:")
engine = create_engine("sqlite:///wiki_table_questions.db")
metadata_obj = MetaData()

for idx, df in enumerate(dfs):
    tableinfo = _get_tableinfo_with_index(idx)
    print(f"Creating table: {tableinfo.table_name}")
    create_table_from_dataframe(df, tableinfo.table_name, engine, metadata_obj)
Creating table: progressive_rock_album_chart_positions
Creating table: filmography_of_diane
Creating table: annual_fatalities_and_accidents_statistics
Creating table: academy_awards_1972_results
Creating table: theatrical_award_nominations_and_wins
Creating table: bad_boy_artists_album_release_summary
Creating table: south_dakota_radio_stations
Creating table: missing_persons_case_summary_1982
Creating table: chart_performance_of_singles
Creating table: kodachrome_film_types_and_dates
Creating table: bbc_radio_service_costs_2012_2013
Creating table: french_airports_usage_summary
Creating table: voter_registration_summary_by_party
Creating table: norwegian_club_performance_statistics
Creating table: triple_crown_winners_history
Creating table: grammy_awards_summary_for_artist
Creating table: boxing_fight_results_history
Creating table: historical_sports_team_performance
Creating table: yamato_population_density_summary
Creating table: voter_registration_summary_by_party_distribution
Creating table: best_actress_award_nominations_and_wins
Creating table: uk_ministerial_positions_and_titles_history
Creating table: municipality_merger_summary
Creating table: euro_2020_group_stage_results
Creating table: binary_encoding_probabilities
Creating table: monthly_climate_statistics
Creating table: italian_government_term_history
Creating table: new_mexico_government_officials
Creating table: monthly_climate_statistics_summary
Creating table: historical_rainfall_experiment_drops
Creating table: monthly_weather_statistics
Creating table: multilingual_greetings_and_phrases
Creating table: ohio_private_schools_summary
Creating table: cancer_related_genetic_factors

2.7.2 Building the basic tools

  1. Create a vector index over the table descriptions
  • ObjectIndex is a built-in LlamaIndex module for retrieving arbitrary Python objects through an Index
  • Here we use VectorStoreIndex, i.e. vector retrieval, and SQLTableNodeMapping maps the description nodes to the database tables
  • Docs: https://docs.llamaindex.ai/en/stable/examples/objects/object_index/#the-objectindex-class
import os
from llama_index.core import Settings
from llama_index.llms.dashscope import DashScope, DashScopeGenerationModels
from llama_index.embeddings.dashscope import DashScopeEmbedding, DashScopeTextEmbeddingModels
from llama_index.core.objects import (
    SQLTableNodeMapping,
    ObjectIndex,
    SQLTableSchema,
)
from llama_index.core import SQLDatabase, VectorStoreIndex

# Set the global models
Settings.llm = DashScope(model_name=DashScopeGenerationModels.QWEN_MAX, api_key=os.getenv("DASHSCOPE_API_KEY"))
Settings.embed_model = DashScopeEmbedding(model_name=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V1)

sql_database = SQLDatabase(engine)

table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
    SQLTableSchema(table_name=t.table_name, context_str=t.table_summary)
    for t in table_infos
]  # add a SQLTableSchema for each table

obj_index = ObjectIndex.from_objects(
    table_schema_objs,
    table_node_mapping,
    VectorStoreIndex,
)
obj_retriever = obj_index.as_retriever(similarity_top_k=3)
  2. Create the SQL retriever
from llama_index.core.retrievers import SQLRetriever
from typing import List

sql_retriever = SQLRetriever(sql_database)

def get_table_context_str(table_schema_objs: List[SQLTableSchema]):
    """Get table context string."""
    context_strs = []
    for table_schema_obj in table_schema_objs:
        table_info = sql_database.get_single_table_info(table_schema_obj.table_name)
        if table_schema_obj.context_str:
            table_opt_context = " The table description is: "
            table_opt_context += table_schema_obj.context_str
            table_info += table_opt_context
        context_strs.append(table_info)
    return "\n\n".join(context_strs)
  3. Create the Text2SQL prompt (the system default template) and an output parser that extracts the SQL from the generated text
from llama_index.core.prompts.default_prompts import DEFAULT_TEXT_TO_SQL_PROMPT
from llama_index.core import PromptTemplate
from llama_index.core.llms import ChatResponse

def parse_response_to_sql(chat_response: ChatResponse) -> str:
    """Parse response to SQL."""
    response = chat_response.message.content
    sql_query_start = response.find("SQLQuery:")
    if sql_query_start != -1:
        response = response[sql_query_start:]
        # TODO: move to removeprefix after Python 3.9+
        if response.startswith("SQLQuery:"):
            response = response[len("SQLQuery:") :]
    sql_result_start = response.find("SQLResult:")
    if sql_result_start != -1:
        response = response[:sql_result_start]
    return response.strip().strip("```").strip()

text2sql_prompt = DEFAULT_TEXT_TO_SQL_PROMPT.partial_format(
    dialect=engine.dialect.name
)
print(text2sql_prompt.template)
Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer. You can order the results by a relevant column to return the most interesting examples in the database.

Never query for all the columns from a specific table, only ask for a few relevant columns given the question.

Pay attention to use only the column names that you can see in the schema description. Be careful to not query for columns that do not exist. Pay attention to which column is in which table. Also, qualify column names with the table name when needed. You are required to use the following format, each taking one line:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use tables listed below.
{schema}

Question: {query_str}
SQLQuery: 
  4. Create the natural-language response synthesis template
response_synthesis_prompt_str = (
    "Given an input question, synthesize a response from the query results.\n"
    "Query: {query_str}\n"
    "SQL: {sql_query}\n"
    "SQL Response: {context_str}\n"
    "Response: "
)
response_synthesis_prompt = PromptTemplate(response_synthesis_prompt_str,
)
llm = DashScope(model_name=DashScopeGenerationModels.QWEN_MAX, api_key=os.getenv("DASHSCOPE_API_KEY"))

2.7.3 Defining the workflow

from llama_index.core.workflow import (
    Workflow,
    StartEvent,
    StopEvent,
    step,
    Context,
    Event,
)

# Event: the relevant tables in the database have been found
class TableRetrieveEvent(Event):
    """Result of running table retrieval."""
    table_context_str: str
    query: str

# Event: the text has been converted to SQL
class TextToSQLEvent(Event):
    """Text-to-SQL event."""
    sql: str
    query: str

class TextToSQLWorkflow1(Workflow):
    """Text-to-SQL Workflow that does query-time table retrieval."""

    def __init__(
        self,
        obj_retriever,
        text2sql_prompt,
        sql_retriever,
        response_synthesis_prompt,
        llm,
        *args,
        **kwargs
    ) -> None:
        """Init params."""
        super().__init__(*args, **kwargs)
        self.obj_retriever = obj_retriever
        self.text2sql_prompt = text2sql_prompt
        self.sql_retriever = sql_retriever
        self.response_synthesis_prompt = response_synthesis_prompt
        self.llm = llm

    @step
    def retrieve_tables(self, ctx: Context, ev: StartEvent) -> TableRetrieveEvent:
        """Retrieve tables."""
        table_schema_objs = self.obj_retriever.retrieve(ev.query)
        table_context_str = get_table_context_str(table_schema_objs)
        print("====\n" + table_context_str + "\n====")
        return TableRetrieveEvent(table_context_str=table_context_str, query=ev.query)

    @step
    def generate_sql(self, ctx: Context, ev: TableRetrieveEvent) -> TextToSQLEvent:
        """Generate SQL statement."""
        fmt_messages = self.text2sql_prompt.format_messages(
            query_str=ev.query, schema=ev.table_context_str
        )
        chat_response = self.llm.chat(fmt_messages)
        sql = parse_response_to_sql(chat_response)
        print("====\n" + sql + "\n====")
        return TextToSQLEvent(sql=sql, query=ev.query)

    @step
    def generate_response(self, ctx: Context, ev: TextToSQLEvent) -> StopEvent:
        """Run SQL retrieval and generate response."""
        retrieved_rows = self.sql_retriever.retrieve(ev.sql)
        print("====\n" + str(retrieved_rows) + "\n====")
        fmt_messages = self.response_synthesis_prompt.format_messages(
            sql_query=ev.sql,
            context_str=str(retrieved_rows),
            query_str=ev.query,
        )
        chat_response = llm.chat(fmt_messages)
        return StopEvent(result=chat_response)

workflow = TextToSQLWorkflow1(
    obj_retriever,
    text2sql_prompt,
    sql_retriever,
    response_synthesis_prompt,
    llm,
    verbose=True,
)

response = await workflow.run(
    query="What was the year that The Notorious B.I.G was signed to Bad Boy?"
)
print(str(response))
Running step retrieve_tables
====
Table 'bad_boy_artists_album_release_summary' has columns: Act (VARCHAR), Year_signed (INTEGER), _Albums_released_under_Bad_Boy (VARCHAR), . The table description is: Summary of artists signed to Bad Boy Records and their album releases.

Table 'best_actress_award_nominations_and_wins' has columns: Year (INTEGER), Award (VARCHAR), Film (VARCHAR), Result (VARCHAR), . The table description is: Summary of Best Actress award nominations and wins from 1976 to 2006.

Table 'grammy_awards_summary_for_artist' has columns: Year (INTEGER), Award (VARCHAR), Work_Artist (VARCHAR), Result (VARCHAR), . The table description is: Summary of Grammy Award nominations and wins for a specific artist over the years.
====
Step retrieve_tables produced event TableRetrieveEvent
Running step generate_sql
====
SELECT Year_signed FROM bad_boy_artists_album_release_summary WHERE Act = 'The Notorious B.I.G'
====
Step generate_sql produced event TextToSQLEvent
Running step generate_response
====
[NodeWithScore(node=TextNode(id_='31adaec7-f9f9-4493-b56e-79575bfd2991', embedding=None, metadata={'sql_query': "SELECT Year_signed FROM bad_boy_artists_album_release_summary WHERE Act = 'The Notorious B.I.G'", 'result': [(1993,), (1993,)], 'col_keys': ['Year_signed']}, excluded_embed_metadata_keys=['sql_query', 'result', 'col_keys'], excluded_llm_metadata_keys=['sql_query', 'result', 'col_keys'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text='[(1993,), (1993,)]', mimetype='text/plain', start_char_idx=None, end_char_idx=None, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}'), score=None)]
====
Step generate_response produced event StopEvent
assistant: The Notorious B.I.G. was signed to Bad Boy in the year 1993.

2.7.4 Visualizing the workflow

# !pip install llama-index-utils-workflow
from llama_index.utils.workflow import draw_all_possible_flows

draw_all_possible_flows(
    TextToSQLWorkflow1, filename="text_to_sql_table_retrieval.html"
)
text_to_sql_table_retrieval.html
