LLMs-from-scratch: Comparing Byte Pair Encoding (BPE) Implementations
Code link: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/02_bonus_bytepair-encoder
Supplementary code for the book Build a Large Language Model From Scratch by Sebastian Raschka. Code repository: https://github.com/rasbt/LLMs-from-scratch
- To install the additional package requirements for this bonus notebook, uncomment and run the following cell:
# pip install -r requirements-extra.txt
Comparing Various Byte Pair Encoding (BPE) Implementations
Using BPE from tiktoken
from importlib.metadata import version

print("tiktoken version:", version("tiktoken"))
tiktoken version: 0.12.0
import tiktoken

tik_tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, world. Is this-- a test?"
integers = tik_tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
strings = tik_tokenizer.decode(integers)

print(strings)
Hello, world. Is this-- a test?
print(tik_tokenizer.n_vocab)
50257
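As a small aside, decoding each token ID on its own makes the individual BPE subword pieces visible. The following is a minimal sketch, reusing the tik_tokenizer object and the integers list from the cells above:
# Minimal sketch: decode each token ID separately to inspect the subword pieces
for token_id in integers:
    print(f"{token_id} -> {tik_tokenizer.decode([token_id])!r}")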
Using the original BPE implementation used in GPT-2
from bpe_openai_gpt2 import get_encoder, download_vocab
download_vocab()
Fetching encoder.json: 1.04Mit [00:01, 587kit/s]
Fetching vocab.bpe: 457kit [00:01, 303kit/s]
orig_tokenizer = get_encoder(model_name="gpt2_model", models_dir=".")
integers = orig_tokenizer.encode(text)

print(integers)
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
strings = orig_tokenizer.decode(integers)

print(strings)
Hello, world. Is this-- a test?
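Both implementations produce identical token IDs for this text; a quick sanity check (a small sketch reusing the tokenizer objects defined above) is to compare the outputs directly:
# Quick sanity check: the original GPT-2 BPE code and tiktoken should agree on this text
assert orig_tokenizer.encode(text) == tik_tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print("Both tokenizers produce the same token IDs")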
Using BPE via Hugging Face transformers
import transformers

transformers.__version__
'4.57.1'
from transformers import GPT2Tokenizer

hf_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
hf_tokenizer(strings)["input_ids"]
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
from transformers import GPT2TokenizerFast

hf_tokenizer_fast = GPT2TokenizerFast.from_pretrained("gpt2")
hf_tokenizer_fast(strings)["input_ids"]
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
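The Hugging Face tokenizers can also reverse the mapping; the following is a minimal round-trip sketch using the fast tokenizer defined above:
# Minimal round-trip sketch with the fast Hugging Face tokenizer
ids = hf_tokenizer_fast(text)["input_ids"]
print(hf_tokenizer_fast.decode(ids))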
Using my own from-scratch BPE tokenizer
import os
import sys
import io
import nbformat
import types

def import_from_notebook():
    def import_definitions_from_notebook(fullname, names):
        current_dir = os.getcwd()
        path = os.path.join(current_dir, "..", "05_bpe-from-scratch", fullname + ".ipynb")
        path = os.path.normpath(path)

        # Load the notebook
        if not os.path.exists(path):
            raise FileNotFoundError(f"Notebook file not found at: {path}")

        with io.open(path, "r", encoding="utf-8") as f:
            nb = nbformat.read(f, as_version=4)

        # Create a module to store the imported functions and classes
        mod = types.ModuleType(fullname)
        sys.modules[fullname] = mod

        # Go through the notebook's code cells and only execute function or class definitions
        for cell in nb.cells:
            if cell.cell_type == "code":
                cell_code = cell.source
                for name in names:
                    # Check for function or class definitions
                    if f"def {name}" in cell_code or f"class {name}" in cell_code:
                        exec(cell_code, mod.__dict__)
        return mod

    fullname = "bpe-from-scratch"
    names = ["BPETokenizerSimple"]

    return import_definitions_from_notebook(fullname, names)
imported_module = import_from_notebook()
BPETokenizerSimple = getattr(imported_module, "BPETokenizerSimple", None)

tokenizer_gpt2 = BPETokenizerSimple()
tokenizer_gpt2.load_vocab_and_merges_from_openai(
    vocab_path=os.path.join("gpt2_model", "encoder.json"),
    bpe_merges_path=os.path.join("gpt2_model", "vocab.bpe")
)
integers = tokenizer_gpt2.encode(text)

print(integers)
[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
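For completeness, a round trip through the from-scratch tokenizer recovers the original text. This is a small sketch that assumes BPETokenizerSimple exposes a decode method, as in the 05_bpe-from-scratch notebook:
# Round trip; assumes BPETokenizerSimple provides a decode method (see the 05_bpe-from-scratch notebook)
print(tokenizer_gpt2.decode(integers))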
A quick performance benchmark
with open("../01_main-chapter-code/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
Original OpenAI GPT-2 tokenizer
%timeit orig_tokenizer.encode(raw_text)
4.98 ms ± 56.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Tiktoken OpenAI GPT-2 tokenizer
%timeit tik_tokenizer.encode(raw_text)
1.15 ms ± 6.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Hugging Face OpenAI GPT-2 tokenizer
%timeit hf_tokenizer(raw_text)["input_ids"]
Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors
11.8 ms ± 154 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit hf_tokenizer(raw_text, max_length=5145, truncation=True)["input_ids"]
11.9 ms ± 105 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit hf_tokenizer_fast(raw_text)["input_ids"]
Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors
4.9 ms ± 26.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit hf_tokenizer_fast(raw_text, max_length=5145, truncation=True)["input_ids"]
4.91 ms ± 130 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
My own GPT-2 tokenizer (for educational purposes)
%timeit tokenizer_gpt2.encode(raw_text)
17.3 ms ± 173 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
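Note that the %timeit magic above only works inside IPython/Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard-library timeit module, reusing the tokenizer objects and raw_text defined above:
import timeit

# Rough sketch of the same benchmark outside a notebook, using the standard-library timeit module
runs = 100
candidates = [
    ("orig_tokenizer", lambda: orig_tokenizer.encode(raw_text)),
    ("tik_tokenizer", lambda: tik_tokenizer.encode(raw_text)),
    ("hf_tokenizer_fast", lambda: hf_tokenizer_fast(raw_text, max_length=5145, truncation=True)["input_ids"]),
    ("tokenizer_gpt2", lambda: tokenizer_gpt2.encode(raw_text)),
]
for name, fn in candidates:
    seconds = timeit.timeit(fn, number=runs)
    print(f"{name}: {seconds / runs * 1e3:.2f} ms per call")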