
LLMs-from-scratch: Comparing Byte Pair Encoding (BPE) Implementations

Code link: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/02_bonus_bytepair-encoder

Supplementary code for the book Build a Large Language Model From Scratch by Sebastian Raschka

Code repository: https://github.com/rasbt/LLMs-from-scratch
  • To install the additional package requirements for this bonus notebook, uncomment and run the following cell:
# pip install -r requirements-extra.txt

Comparing Several Byte Pair Encoding (BPE) Implementations


 

Using BPE from tiktoken

from importlib.metadata import version

print("tiktoken version:", version("tiktoken"))
tiktoken version: 0.12.0
import tiktoken

tik_tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, world. Is this-- a test?"
integers = tik_tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
strings = tik_tokenizer.decode(integers)
print(strings)
Hello, world. Is this-- a test?
print(tik_tokenizer.n_vocab)
50257
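A side note (a minimal sketch, not part of the original notebook): the allowed_special argument only matters when the special token actually appears in the input, and GPT-2 reserves the last vocabulary ID, 50256, for <|endoftext|>:

sample = "Hello<|endoftext|>world"  # hypothetical input containing the special token
ids = tik_tokenizer.encode(sample, allowed_special={"<|endoftext|>"})
print(ids)  # <|endoftext|> shows up as 50256; omitting allowed_special makes encode() raise an error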

 

Using the original BPE implementation from OpenAI's GPT-2

from bpe_openai_gpt2 import get_encoder, download_vocab
download_vocab()
Fetching encoder.json: 1.04Mit [00:01, 587kit/s]                                                    
Fetching vocab.bpe: 457kit [00:01, 303kit/s]                                                        
orig_tokenizer = get_encoder(model_name="gpt2_model", models_dir=".")
integers = orig_tokenizer.encode(text)
print(integers)
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
strings = orig_tokenizer.decode(integers)
print(strings)
Hello, world. Is this-- a test?
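As the outputs show, the original implementation and tiktoken produce identical token IDs for this text, since both load the same GPT-2 vocabulary and merge rules. A quick programmatic check (assuming the tokenizers defined in the cells above are in scope):

# Both tokenizers should agree token for token on this input
assert tik_tokenizer.encode(text) == orig_tokenizer.encode(text)
print("tiktoken and the original OpenAI implementation agree")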

 

Using BPE via Hugging Face transformers

import transformers

transformers.__version__
'4.57.1'
from transformers import GPT2Tokenizer

hf_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
hf_tokenizer(strings)["input_ids"]
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
from transformers import GPT2TokenizerFast

hf_tokenizer_fast = GPT2TokenizerFast.from_pretrained("gpt2")
hf_tokenizer_fast(strings)["input_ids"]
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
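Unlike the other tokenizers, the Hugging Face tokenizers return a dict-like BatchEncoding rather than a plain list, which is why the calls above index into input_ids. A minimal round-trip sketch (assuming hf_tokenizer from above):

enc = hf_tokenizer(strings)
print(enc["input_ids"])                       # same IDs as the other implementations
print(hf_tokenizer.decode(enc["input_ids"]))  # back to: Hello, world. Is this-- a test?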

 

Using my BPE tokenizer implemented from scratch

import os
import sys
import io
import nbformat
import types

def import_from_notebook():
    def import_definitions_from_notebook(fullname, names):
        current_dir = os.getcwd()
        path = os.path.join(current_dir, "..", "05_bpe-from-scratch", fullname + ".ipynb")
        path = os.path.normpath(path)

        # Load the notebook
        if not os.path.exists(path):
            raise FileNotFoundError(f"Notebook file not found at: {path}")

        with io.open(path, "r", encoding="utf-8") as f:
            nb = nbformat.read(f, as_version=4)

        # Create a module to store the imported functions and classes
        mod = types.ModuleType(fullname)
        sys.modules[fullname] = mod

        # Go through the notebook's code cells and only execute function or class definitions
        for cell in nb.cells:
            if cell.cell_type == "code":
                cell_code = cell.source
                for name in names:
                    # Check for function or class definitions
                    if f"def {name}" in cell_code or f"class {name}" in cell_code:
                        exec(cell_code, mod.__dict__)
        return mod

    fullname = "bpe-from-scratch"
    names = ["BPETokenizerSimple"]

    return import_definitions_from_notebook(fullname, names)
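This helper executes only the cells that define the requested names (here, BPETokenizerSimple), so example or demo cells in the source notebook are not run during import.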
imported_module = import_from_notebook()
BPETokenizerSimple = getattr(imported_module, "BPETokenizerSimple", None)

tokenizer_gpt2 = BPETokenizerSimple()
tokenizer_gpt2.load_vocab_and_merges_from_openai(
    vocab_path=os.path.join("gpt2_model", "encoder.json"),
    bpe_merges_path=os.path.join("gpt2_model", "vocab.bpe")
)
integers = tokenizer_gpt2.encode(text)
print(integers)
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
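A round-trip check (a minimal sketch; this assumes the from-scratch class also implements a decode() method mirroring encode(), as the 05_bpe-from-scratch notebook does):

# decode() is assumed here; see the 05_bpe-from-scratch notebook for its definition
print(tokenizer_gpt2.decode(integers))  # expected: Hello, world. Is this-- a test?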

 

A quick performance benchmark

with open("../01_main-chapter-code/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

Original OpenAI GPT-2 tokenizer

%timeit orig_tokenizer.encode(raw_text)
4.98 ms ± 56.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Tiktoken OpenAI GPT-2 tokenizer

%timeit tik_tokenizer.encode(raw_text)
1.15 ms ± 6.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Hugging Face OpenAI GPT-2 tokenizer

%timeit hf_tokenizer(raw_text)["input_ids"]
Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors
11.8 ms ± 154 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit hf_tokenizer(raw_text, max_length=5145, truncation=True)["input_ids"]
11.9 ms ± 105 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit hf_tokenizer_fast(raw_text)["input_ids"]
Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors
4.9 ms ± 26.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit hf_tokenizer_fast(raw_text, max_length=5145, truncation=True)["input_ids"]
4.91 ms ± 130 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

My own GPT-2 tokenizer (for educational purposes)

%timeit tokenizer_gpt2.encode(raw_text)
17.3 ms ± 173 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
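For convenience, the timings above in one place (values copied from the runs in this notebook; absolute numbers depend on hardware):

Tokenizer                                Time per encode of the-verdict.txt
Original OpenAI GPT-2 implementation     4.98 ms
tiktoken                                 1.15 ms
Hugging Face GPT2Tokenizer (slow)        ~11.8 ms
Hugging Face GPT2TokenizerFast           ~4.9 ms
From-scratch implementation              17.3 ms

tiktoken is the fastest by a wide margin, while the educational from-scratch implementation trades speed for readability.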
