当前位置：首页 > news >正文

二、transformers基础组件之Tokenizer

news 2025/7/2 19:23:26

在使用神经网络处理自然语言处理任务时，我们首先需要对数据进行预处理，将数据从字符串转换为神经网络可以接受的格式，一般会分为如下几步:


- Step1 分词:使用分词器对文本数据进行分词(字、字词); - Step2 构建词典:根据数据集分词的结果,构建词典映射(这一步并不绝对,如果采用预训练词向量,词典映射要根据词向量文件进行处理); - Step3 数据转换:根据构建好的词典,将分词处理后的数据做映射,将文本序列转换为数字序列; - Step4 数据填充与截断:在以batch输入到模型的方式中,需要对过短的数据进行填充,过长的数据进行截断, 保证数据长度符合模型能接受的范围,同时batch内的数据维度大小一致。

- Step1 分词:使用分词器对文本数据进行分词(字、字词);
- Step2 构建词典:根据数据集分词的结果,构建词典映射(这一步并不绝对,如果采用预训练词向量,词典映射要根据词向量文件进行处理);
- Step3 数据转换:根据构建好的词典,将分词处理后的数据做映射,将文本序列转换为数字序列;
- Step4 数据填充与截断:在以batch输入到模型的方式中,需要对过短的数据进行填充,过长的数据进行截断, 保证数据长度符合模型能接受的范围,同时batch内的数据维度大小一致。

在transformers工具包中，只需要借助Tokenizer模块便可以快速的实现上述全部工作，它的功能就是将文本转换为神经网络可以处理的数据。Tokenizer工具包无需额外安装，会随着transformers一起安装。

1 Tokenizer 基本使用（对单条数据进行处理）

Tokenizer 基本使用:
(1) 加载保存(from_pretrained / save_pretrained)
(2) 句子分词(tokenize)
(3) 查看词典 (vocab)
(4) 索引转换(convert_tokens_to_ids / convert_ids_to_tokens)
(5) 填充截断(padding / truncation)
(6) 其他输入(attention_mask / token_type_ids)

不同的模型会对应不同的tokenizer。transformers中需要导入AutoTokenizer，AutoTokenizer会根据不同的模型导入对应的tokenizer

1.1. Tokenizer的加载与保存

from transformers import AutoTokenizer
# 从HuggingFace加载，输入模型名称，即可加载对于的分词器
tokenizer = AutoTokenizer.from_pretrained("../models/roberta-base-finetuned-dianping-chinese")

tokenizer打印结果如下：

BertTokenizerFast(name_or_path='/root/autodl-fs/models/roberta-base-finetuned-dianping-chinese', vocab_size=21128, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', 
special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

将tokenizer 保存到本地后可以直接从本地加载

# tokenizer 保存到本地
tokenizer.save_pretrained("./roberta_tokenizer")
# 从本地加载tokenizer
tokenizer = AutoTokenizer.from_pretrained("./roberta_tokenizer/")

1.2. 句子分词

sen = "弱小的我也有大梦想!"
tokens = tokenizer.tokenize(sen)
#['弱', '小', '的', '我', '也', '有', '大', '梦', '想', '!']

1.3. 查看字典

tokenizer.vocab
tokenizer.vocab_size  #21128

{'##椪': 16551,'##瀅': 17157,'##蝕': 19128,'##魅': 20848,'##隱': 20460,'儆': 1024,'嫣': 2073,'簧': 5082,'镁': 7250,'maggie': 12423,'768': 12472,'1921': 10033,'焜': 4191,'渎': 3932,'鎂': 7110,'谥': 6471,'app': 8172,'噱': 1694,'goo': 9271,'345': 11434,'##rin': 13250,'ᆷ': 324,'redis': 12599,'／': 8027,'莅': 5797,
...'##轰': 19821,'387': 12716,'##齣': 21031,'##ント': 10002,...}
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...

1.4. 索引转换

# 将词序列转换为id序列
ids = tokenizer.convert_tokens_to_ids(tokens)
#[2483, 2207, 4638, 2769, 738, 3300, 1920, 3457, 2682, 106]# 将id序列转换为token序列
tokens = tokenizer.convert_ids_to_tokens(ids)
#['弱', '小', '的', '我', '也', '有', '大', '梦', '想', '!']# 将token序列转换为string
str_sen = tokenizer.convert_tokens_to_string(tokens)
#'弱 小 的 我 也 有 大 梦 想!'

上面的把各个步骤都分开了，简洁的方法：

# 将字符串转换为id序列，又称之为编码
ids = tokenizer.encode(sen, add_special_tokens=True)
# [101, 2483, 2207, 4638, 2769, 738, 3300, 1920, 3457, 2682, 106, 102]# 将id序列转换为字符串，又称之为解码
str_sen = tokenizer.decode(ids, skip_special_tokens=False)
# '[CLS] 弱 小 的 我 也 有 大 梦 想! [SEP]'

下面的方法比上面多了两个special_tokens：[CLS]和[SEP]，这是特定tokenizer给定的，句子开头和句子结尾

1.5. 填充与截断

以上是针对单条数据，如果是针对多条数据会涉及到填充和截断。把短的数据补齐，长的数据截断。

# 填充
ids = tokenizer.encode(sen, padding="max_length", max_length=15)
# [101, 2483, 2207, 4638, 2769, 738, 3300, 1920, 3457, 2682, 106, 102, 0, 0, 0]
# 截断
ids = tokenizer.encode(sen, max_length=5, truncation=True)
# [101, 2483, 2207, 4638, 102]

填充和阶段的长度包含了两个special_tokens：[CLS]和[SEP]

1.6. 其他输入部分

数据处理里面还有一个地方要特别注意：既然数据中存在着填充，就要告诉模型，哪些是填充，哪些是有效的数据。这个时候需要attention_mask。也就是说，对于上面的例子，ids为0的元素对应的attention_mask的元素应为0，不为0的元素位置attention_mask的元素应为0。
另外香BERT这样的模型需要token_type_ids标定是第几个句子。

1.6.1 手动生成方式（体现定义）

如果是手动设定attention_mask和token_type_ids如下：

attention_mask = [1 if idx != 0 else 0 for idx in ids]
token_type_ids = [0] * len(ids)
#ids:   ([101, 2483, 2207, 4638, 2769, 738, 3300, 1920, 3457, 2682, 106, 102, 0, 0, 0],
#attention_mask:  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
#token_type_ids:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

1.6.2 快速调用方式

transformers肯定不会让手动生成，有自动的方法。就是encode_plus

inputs = tokenizer.encode_plus(sen, padding="max_length", max_length=15)

使用encode_plus生成的是一个字典如下：

{
'input_ids': [101, 2483, 2207, 4638, 2769, 738, 3300, 1920, 3457, 2682, 106, 102, 0, 0, 0], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
}

‘input_ids’：分此后转换的id
‘token_type_ids’：属于第几个句子
‘attention_mask’：1对应的’input_ids’元素是非填充元素，是真实有效的

调用encode_plus方法名字太长了，不好记，等效的方法就是可以直接用 tokenizer而不使用任何方法。

inputs = tokenizer(sen, padding="max_length", max_length=15)

得到和上面一样的结果：

{
'input_ids': [101, 2483, 2207, 4638, 2769, 738, 3300, 1920, 3457, 2682, 106, 102, 0, 0, 0], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
}

2 处理batch数据

上面是处理单条数据，现在看看怎么处理batch数据

sens = ["弱小的我也有大梦想","有梦想谁都了不起","追逐梦想的心，比梦想本身，更可贵"]
res = tokenizer(sens)
#{'input_ids': [[101, 2483, 2207, 4638, 2769, 738, 3300, 1920, 3457, 2682, 102], [101, 3300, 3457, 2682, 6443, 6963, 749, 679, 6629, 102], [101, 6841, 6852, 3457, 2682, 4638, 2552, 8024, 3683, 3457, 2682, 3315, 6716, 8024, 3291, 1377, 6586, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

sens是个列表有3个句子，得到的res是一个字典，包含：‘input_ids’；‘token_type_ids’；‘attention_mask’，每个都是一个列表，列表的每个元素对应每个句子’input_ids’；‘token_type_ids’；‘attention_mask’

以batch的方式去处理比一个一个的处理速度更快。

3 Fast/Slow Tokenizer

Tokenizer有快的版本和慢的版本

3.1 FastTokenizer

基于Rust实现,速度快
offsets_mapping、 word_ids SlowTokenizer

sen = "弱小的我也有大Dreaming!"
fast_tokenizer = AutoTokenizer.from_pretrained(model_path)

请添加图片描述

FastTokenizer会有一些特殊的返回。比如return_offsets_mapping，用来指示每个token对应到原文的起始位置。

inputs = fast_tokenizer(sen, return_offsets_mapping=True)

{'input_ids': [101, 2483, 2207, 4638, 2769, 738, 3300, 1920, 10252, 8221, 106, 102], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], # "弱小的我也有大Dreaming!"
# Dreaming被分为两个词(7, 12), (12, 15)
# 该字段中保存着每个token对应到原文中的起始与结束位置
'offset_mapping': [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 12), (12, 15), (15, 16), (0, 0)]
}

# word_ids方法，该方法会返回分词后token序列对应原始实际词的索引，特殊标记的值为None。
# Dreaming被分为两个词【7, 7】
inputs.word_ids()
# [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

3.2 SlowTokenizer

基于Python实现,速度慢

slow_tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

请添加图片描述两种方式的时间对比如下：

请添加图片描述

4 特殊Tokenizer的加载

一些非官方实现的模型，自己实现了tokenizer，如果想用他们自己实现的tokenizer，需要指定trust_remote_code=True。

from transformers import AutoTokenizer# 需要设置trust_remote_code=True
# 表示信任远程代码
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
tokenizer

即使保存到本地，然后从本地加载也要设置trust_remote_code=True

tokenizer.save_pretrained("chatglm_tokenizer")# 需要设置trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained("chatglm_tokenizer", trust_remote_code=True)