LLMs-from-scratch: Comparing Byte Pair Encoding (BPE) Implementations
Code link: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/02_bonus_bytepair-encoder
Supplementary code for the book Build a Large Language Model From Scratch by Sebastian Raschka. Code repository: https://github.com/rasbt/LLMs-from-scratch
- To install the additional package requirements for this bonus notebook, uncomment and run the following cell:
# pip install -r requirements-extra.txt
Comparing Various Byte Pair Encoding (BPE) Implementations
Using BPE from tiktoken
from importlib.metadata import version

print("tiktoken version:", version("tiktoken"))
tiktoken version: 0.12.0
import tiktoken

tik_tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, world. Is this-- a test?"
integers = tik_tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
strings = tik_tokenizer.decode(integers)

print(strings)
Hello, world. Is this-- a test?
print(tik_tokenizer.n_vocab)
50257
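As a small aside, decoding each token ID on its own makes the individual BPE subword pieces visible. The following is a minimal sketch, reusing the tik_tokenizer object and the integers list from the cells above:
# Minimal sketch: decode each token ID separately to inspect the subword pieces
for token_id in integers:
    print(f"{token_id} -> {tik_tokenizer.decode([token_id])!r}")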
Using the original BPE implementation used in GPT-2
from bpe_openai_gpt2 import get_encoder, download_vocab
download_vocab()
Fetching encoder.json: 1.04Mit [00:01, 587kit/s]
Fetching vocab.bpe: 457kit [00:01, 303kit/s]
orig_tokenizer = get_encoder(model_name="gpt2_model", models_dir=".")
integers = orig_tokenizer.encode(text)

print(integers)
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
strings = orig_tokenizer.decode(integers)

print(strings)
Hello, world. Is this-- a test?
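Both implementations produce identical token IDs for this text; a quick sanity check (a small sketch reusing the tokenizer objects defined above) is to compare the outputs directly:
# Quick sanity check: the original GPT-2 BPE code and tiktoken should agree on this text
assert orig_tokenizer.encode(text) == tik_tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print("Both tokenizers produce the same token IDs")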
Using BPE via Hugging Face transformers
import transformers

transformers.__version__
'4.57.1'
from transformers import GPT2Tokenizer

hf_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
hf_tokenizer(strings)["input_ids"]
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
from transformers import GPT2TokenizerFast

hf_tokenizer_fast = GPT2TokenizerFast.from_pretrained("gpt2")
hf_tokenizer_fast(strings)["input_ids"]
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
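The Hugging Face tokenizers can also reverse the mapping; the following is a minimal round-trip sketch using the fast tokenizer defined above:
# Minimal round-trip sketch with the fast Hugging Face tokenizer
ids = hf_tokenizer_fast(text)["input_ids"]
print(hf_tokenizer_fast.decode(ids))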
Using my own from-scratch BPE tokenizer
import os
import sys
import io
import nbformat
import types

def import_from_notebook():
    def import_definitions_from_notebook(fullname, names):
        current_dir = os.getcwd()
        path = os.path.join(current_dir, "..", "05_bpe-from-scratch", fullname + ".ipynb")
        path = os.path.normpath(path)

        # Load the notebook
        if not os.path.exists(path):
            raise FileNotFoundError(f"Notebook file not found at: {path}")

        with io.open(path, "r", encoding="utf-8") as f:
            nb = nbformat.read(f, as_version=4)

        # Create a module to store the imported functions and classes
        mod = types.ModuleType(fullname)
        sys.modules[fullname] = mod

        # Go through the notebook's code cells and only execute function or class definitions
        for cell in nb.cells:
            if cell.cell_type == "code":
                cell_code = cell.source
                for name in names:
                    # Check for function or class definitions
                    if f"def {name}" in cell_code or f"class {name}" in cell_code:
                        exec(cell_code, mod.__dict__)
        return mod

    fullname = "bpe-from-scratch"
    names = ["BPETokenizerSimple"]

    return import_definitions_from_notebook(fullname, names)
imported_module = import_from_notebook()
BPETokenizerSimple = getattr(imported_module, "BPETokenizerSimple", None)

tokenizer_gpt2 = BPETokenizerSimple()
tokenizer_gpt2.load_vocab_and_merges_from_openai(
    vocab_path=os.path.join("gpt2_model", "encoder.json"),
    bpe_merges_path=os.path.join("gpt2_model", "vocab.bpe")
)
integers = tokenizer_gpt2.encode(text)

print(integers)
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
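For completeness, a round trip through the from-scratch tokenizer recovers the original text. This is a small sketch that assumes BPETokenizerSimple exposes a decode method, as in the 05_bpe-from-scratch notebook:
# Round trip; assumes BPETokenizerSimple provides a decode method (see the 05_bpe-from-scratch notebook)
print(tokenizer_gpt2.decode(integers))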
A quick performance benchmark
with open("../01_main-chapter-code/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
Original OpenAI GPT-2 tokenizer
%timeit orig_tokenizer.encode(raw_text)
4.98 ms ± 56.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Tiktoken OpenAI GPT-2 tokenizer
%timeit tik_tokenizer.encode(raw_text)
1.15 ms ± 6.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Hugging Face OpenAI GPT-2 tokenizer
%timeit hf_tokenizer(raw_text)["input_ids"]
Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors
11.8 ms ± 154 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit hf_tokenizer(raw_text, max_length=5145, truncation=True)["input_ids"]
11.9 ms ± 105 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit hf_tokenizer_fast(raw_text)["input_ids"]
Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors
4.9 ms ± 26.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit hf_tokenizer_fast(raw_text, max_length=5145, truncation=True)["input_ids"]
4.91 ms ± 130 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
My own GPT-2 tokenizer (for educational purposes)
%timeit tokenizer_gpt2.encode(raw_text)
17.3 ms ± 173 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
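Note that the %timeit magic above only works inside IPython/Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard-library timeit module, reusing the tokenizer objects and raw_text defined above:
import timeit

# Rough sketch of the same benchmark outside a notebook, using the standard-library timeit module
runs = 100
candidates = [
    ("orig_tokenizer", lambda: orig_tokenizer.encode(raw_text)),
    ("tik_tokenizer", lambda: tik_tokenizer.encode(raw_text)),
    ("hf_tokenizer_fast", lambda: hf_tokenizer_fast(raw_text, max_length=5145, truncation=True)["input_ids"]),
    ("tokenizer_gpt2", lambda: tokenizer_gpt2.encode(raw_text)),
]
for name, fn in candidates:
    seconds = timeit.timeit(fn, number=runs)
    print(f"{name}: {seconds / runs * 1e3:.2f} ms per call")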