Transformers库用法示例:解锁预训练模型的强大能力
在自然语言处理(NLP)领域,预训练模型如BERT、GPT、T5等已成为主流工具,它们通过海量文本训练获得的语言理解能力,能显著提升各类NLP任务的效果。Hugging Face推出的Transformers库,将这些强大的预训练模型进行了统一封装,提供了简单易用的API,让开发者无需深入了解模型细节就能快速应用。本文将通过具体示例,从基础用法到实战应用,带你掌握Transformers库的核心技能,轻松驾驭前沿NLP模型。
一、Transformers库简介与环境准备
1. 为什么选择Transformers?
模型丰富:集成了上百种预训练模型架构(BERT、GPT-2、T5、RoBERTa等),支持文本分类、翻译、摘要、问答等几十种任务
多框架支持:同时兼容PyTorch和TensorFlow,模型可在两个框架间无缝转换
多语言支持:覆盖100多种语言的预训练模型,包括中文专用模型(如bert-base-chinese)
易用性强:统一的API设计,加载模型和分词器仅需2行代码,极大降低使用门槛
社区活跃:持续更新前沿模型,拥有丰富的文档和示例,问题解决效率高
2. 环境搭建
Transformers库需要与深度学习框架(PyTorch或TensorFlow)配合使用,推荐使用PyTorch:
# 安装Transformers库
pip install transformers

# 安装PyTorch(根据系统选择合适的版本,参考官网:https://pytorch.org/)
# 示例:Linux/MacOS CPU版
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# 安装额外依赖(可选)
pip install sentencepiece  # 用于T5等模型的分词
pip install accelerate     # 用于模型并行和混合精度训练
验证安装是否成功:
import transformers
import torch
print(f"Transformers版本:{transformers.__version__}")
print(f"PyTorch版本:{torch.__version__}")
二、核心概念:模型(Model)与分词器(Tokenizer)
Transformers库的核心是**模型**和**分词器**的组合:
分词器(Tokenizer):负责将文本转换为模型可理解的输入格式(如token ID、注意力掩码),不同模型有专属的分词器
模型(Model):预训练模型本体,接收分词器的输出,进行特征提取或预测
1. 加载预训练模型和分词器
以BERT模型为例,展示基础用法:
from transformers import BertTokenizer, BertModel

# 加载预训练模型和对应的分词器
# 模型名称可从Hugging Face Hub获取:https://huggingface.co/models
model_name = "bert-base-uncased"  # 基础版BERT,不区分大小写
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# 查看模型结构(可选)
print("模型结构:")
print(model)
常用中文模型示例:
# 中文BERT模型
chinese_model_name = "bert-base-chinese"
chinese_tokenizer = BertTokenizer.from_pretrained(chinese_model_name)
chinese_model = BertModel.from_pretrained(chinese_model_name)
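顺带一提,如果不想为每种模型记忆对应的类,也可以用 AutoTokenizer 和 AutoModel 按模型名称自动匹配结构,下面是等价的简单写法:
from transformers import AutoTokenizer, AutoModel

# Auto类会根据模型名称自动选择对应的分词器和模型结构
auto_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
auto_model = AutoModel.from_pretrained("bert-base-chinese")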
2. 文本预处理(分词器的使用)
分词器的主要功能是将文本转换为模型输入的张量(tensor):
# 示例文本
text = "Hugging Face Transformers is an amazing library for NLP."# 分词并转换为输入格式
inputs = tokenizer(text,padding=True, # 填充到最长序列长度truncation=True, # 截断超过模型最大长度的文本return_tensors="pt" # 返回PyTorch张量("tf"表示TensorFlow)
)# 查看分词结果
print("分词器输出:")
print(inputs)
# 输出包含:
# - input_ids:词对应的ID
# - token_type_ids:句子类型(用于区分上下句)
# - attention_mask:注意力掩码(0表示填充,1表示有效词)# 查看分词后的词语
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("\n分词结果:")
print(tokens)
输出结果(部分):
分词器输出:
{'input_ids': tensor([[  101, 17662,  2227, 19081,  2003,  2019,  3403,  2005,  1037,  4563,  1012,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

分词结果:
['[CLS]', 'hugging', 'face', 'transformers', 'is', 'an', 'amazing', 'library', 'for', 'nlp', '.', '[SEP]']
注意:BERT等模型有特殊标记,如 [CLS](句首)和 [SEP](句尾/分隔),分词器会自动添加。
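如果想直观确认分词器添加了哪些特殊标记,可以对比开启和关闭 add_special_tokens 的结果(沿用上面加载的英文分词器,仅作示意):
# 对比是否添加特殊标记(示意)
with_special = tokenizer("Hello world")["input_ids"]
without_special = tokenizer("Hello world", add_special_tokens=False)["input_ids"]
print(tokenizer.convert_ids_to_tokens(with_special))     # 含 [CLS] 和 [SEP]
print(tokenizer.convert_ids_to_tokens(without_special))  # 不含特殊标记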
三、基础任务:使用Pipeline快速上手
Transformers提供了 pipeline API,将模型、分词器和后处理逻辑封装在一起,一行代码即可完成常见NLP任务:
from transformers import pipeline

# 1. 情感分析(默认使用distilbert-base-uncased-finetuned-sst-2-english)
sentiment_analyzer = pipeline("sentiment-analysis")
result = sentiment_analyzer("I love using Transformers! It makes NLP so easy.")
print("情感分析结果:", result)

# 2. 文本分类(可自定义模型)
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
print("文本分类结果:", classifier("This movie is terrible."))

# 3. 命名实体识别(NER)
ner_pipeline = pipeline("ner", grouped_entities=True)  # grouped_entities=True合并同一实体
text = "Albert Einstein was born in Ulm, Germany in 1879."
print("命名实体识别结果:", ner_pipeline(text))

# 4. 问答系统(需要上下文和问题)
question_answerer = pipeline("question-answering")
context = "Hugging Face is a company based in New York City. Its headquarters are in DUMBO, a neighborhood in Brooklyn."
question = "Where is Hugging Face based?"
print("问答结果:", question_answerer(question=question, context=context))

# 5. 文本摘要
summarizer = pipeline("summarization", model="t5-small")
text = """
Hugging Face is a company that specializes in NLP. They have created a library called Transformers
that provides access to many pre-trained models. These models can be used for various tasks
like text classification, translation, and summarization.
"""
print("文本摘要结果:", summarizer(text, max_length=50, min_length=20, do_sample=False))

# 6. 翻译(英语到法语,T5系列模型通过任务名指定翻译方向)
translator = pipeline("translation_en_to_fr", model="t5-small")
print("翻译结果:", translator("Hello, how are you?", max_length=40))
中文任务示例:
# 中文情感分析(使用中文模型)
chinese_sentiment = pipeline("sentiment-analysis", model="uer/roberta-base-finetuned-dianping-chinese")
print("中文情感分析:", chinese_sentiment("这家餐厅的食物非常美味,服务也很周到!"))# 中文文本生成(使用GPT-2中文模型)
generator = pipeline("text-generation", model="uer/gpt2-chinese-cluecorpussmall")
print("中文文本生成:", generator("人工智能的发展趋势是", max_length=30, num_return_sequences=1))
四、进阶用法:自定义模型推理与微调
1. 手动进行模型推理
对于更精细的控制,可手动调用模型和分词器进行推理:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 加载用于情感分析的模型(该检查点基于DistilBERT,用Auto类可自动匹配对应结构)
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# 输入文本
text = "This course is very informative."

# 预处理
inputs = tokenizer(text, return_tensors="pt")

# 模型推理(不计算梯度,提高速度)
with torch.no_grad():
    outputs = model(**inputs)

# 解析结果
logits = outputs.logits
predicted_class_id = logits.argmax().item()

# 映射类别ID到标签
result = model.config.id2label[predicted_class_id]
print(f"文本:{text}")
print(f"预测类别:{result}")
2. 模型微调(Fine-tuning)
预训练模型在特定任务上的效果可能有限,需要通过微调适配具体场景。以下是文本分类任务的微调示例:
from transformers import BertTokenizer, BertForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset # 需要安装datasets库:pip install datasets
import numpy as np
import evaluate  # 需要安装evaluate库:pip install evaluate

# 1. 加载数据集(使用IMDb电影评论数据集)
dataset = load_dataset("imdb")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# 2. 预处理函数
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

# 应用预处理
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# 3. 加载模型(指定分类类别数)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# 4. 定义评估指标
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return metric.compute(predictions=predictions, references=labels)

# 5. 设置训练参数
training_args = TrainingArguments(
    output_dir="./bert-imdb",           # 输出目录
    learning_rate=2e-5,                 # 学习率
    per_device_train_batch_size=8,      # 训练批次大小
    per_device_eval_batch_size=8,       # 评估批次大小
    num_train_epochs=3,                 # 训练轮数
    logging_dir="./logs",               # 日志目录
    logging_steps=100,
    evaluation_strategy="epoch",        # 每轮评估一次
    save_strategy="epoch",              # 每轮保存一次模型
    load_best_model_at_end=True,        # 训练结束后加载最佳模型
)

# 6. 初始化Trainer(传入tokenizer以便按批次动态填充)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# 7. 开始训练
trainer.train()

# 保存最终模型,便于后续从该目录直接加载
trainer.save_model("./bert-imdb")

# 8. 评估模型
eval_results = trainer.evaluate()
print("评估结果:", eval_results)
微调完成后,可加载保存的模型进行推理:
# 加载微调后的模型
fine_tuned_model = BertForSequenceClassification.from_pretrained("./bert-imdb")

# 推理示例
text = "This movie was fantastic! The acting was incredible."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = fine_tuned_model(**inputs)
predicted_class = outputs.logits.argmax().item()
print("预测类别(0=负面,1=正面):", predicted_class)
五、实战案例:中文文本分类系统
下面实现一个基于BERT中文模型的新闻分类系统,对新闻文本进行类别预测(如体育、娱乐、科技等):
import torch
from transformers import BertTokenizer, BertForSequenceClassification, pipeline
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# 1. 数据准备(示例数据,实际可从文件加载)
# 新闻类别:0=体育,1=娱乐,2=科技,3=财经
data = {
    "text": [
        "国足3-0战胜对手,取得世预赛开门红",
        "某知名演员宣布结婚,粉丝纷纷送上祝福",
        "新一代人工智能芯片发布,性能提升50%",
        "央行下调利率,释放流动性支持实体经济",
        "奥运会乒乓球男单决赛,马龙成功卫冕",
        "热门电视剧收视率破纪录,成为年度爆款",
        "量子计算研究取得重大突破,有望加速AI发展",
        "股市今日大涨,科技板块领涨市场"
    ],
    "label": [0, 1, 2, 3, 0, 1, 2, 3]
}
df = pd.DataFrame(data)

# 划分训练集和测试集
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["text"].tolist(), df["label"].tolist(), test_size=0.2, random_state=42
)

# 2. 加载中文BERT模型和分词器
model_name = "bert-base-chinese"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=4)  # 4个类别

# 3. 文本预处理
def preprocess(texts):
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt"
    )

train_encodings = preprocess(train_texts)
test_encodings = preprocess(test_texts)

# 4. 构建数据集类
class NewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # encodings中的值已是张量,直接按索引取出即可
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NewsDataset(train_encodings, train_labels)
test_dataset = NewsDataset(test_encodings, test_labels)
# 5. 配置训练参数
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./chinese-news-classifier",
    num_train_epochs=5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# 6. 定义评估指标
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    accuracy = (predictions == labels).mean()
    return {"accuracy": accuracy}

# 7. 训练模型
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()

# 保存最终模型,供后续推理直接加载
trainer.save_model("./chinese-news-classifier")

# 8. 模型推理
def predict_news_category(text):
    # 加载保存的模型
    best_model = BertForSequenceClassification.from_pretrained("./chinese-news-classifier")
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = best_model(**inputs)
    pred_label = outputs.logits.argmax().item()
    # 类别映射
    categories = ["体育", "娱乐", "科技", "财经"]
    return categories[pred_label]
# 测试预测
test_news = [
    "中国男篮击败日本队,晋级亚洲杯四强",
    "新上映的科幻电影票房突破10亿",
    "5G技术在工业领域的应用取得新进展"
]
for news in test_news:
    print(f"新闻:{news}")
    print(f"预测类别:{predict_news_category(news)}")
    print("---")
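需要说明的是,上面的 predict_news_category 每次调用都会重新加载模型,仅适合演示;实际使用时可以只加载一次再批量预测。下面是一个简单的改写示意(predict_batch 为示例函数名):
# 只加载一次模型,避免每条新闻都重新加载
loaded_model = BertForSequenceClassification.from_pretrained("./chinese-news-classifier")
loaded_model.eval()
categories = ["体育", "娱乐", "科技", "财经"]

def predict_batch(texts):
    # 一次性对多条文本分词并推理
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = loaded_model(**inputs).logits
    return [categories[i] for i in logits.argmax(dim=-1).tolist()]

print(predict_batch(test_news))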
六、模型部署:将Transformers模型集成到应用中
训练好的模型可通过以下方式部署到实际应用:
1. 简单API部署(使用Flask)
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)

# 加载情感分析模型
sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

@app.route("/analyze", methods=["POST"])
def analyze_sentiment():
    data = request.json
    text = data.get("text", "")
    if not text:
        return jsonify({"error": "请提供文本"}), 400
    result = sentiment_analyzer(text)[0]
    return jsonify({
        "text": text,
        "label": result["label"],
        "score": float(result["score"])
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
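服务启动后,可以用一个简单的Python客户端验证接口是否正常(需要额外安装 requests 库,以下仅为演示):
import requests

# 调用本地运行的情感分析接口(假设服务已在5000端口启动)
resp = requests.post(
    "http://localhost:5000/analyze",
    json={"text": "I love this product!"},
)
print(resp.json())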
2. 高效部署(使用ONNX或TorchScript)
对于生产环境,可将模型转换为ONNX格式或TorchScript,提高推理速度:
# 转换为TorchScript
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# 生成示例输入
inputs = tokenizer("Hello, world!", return_tensors="pt")
traced_model = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]))

# 保存模型
traced_model.save("bert_traced.pt")

# 加载并使用
loaded_model = torch.jit.load("bert_traced.pt")
outputs = loaded_model(inputs["input_ids"], inputs["attention_mask"])
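标题中提到的ONNX格式此处未展开,下面给出一个用 torch.onnx.export 导出BERT的简单示意(文件名、输入/输出名和动态维度设置均为示例,实际可按需调整;Hugging Face 的 optimum 库也提供了更完善的ONNX导出工具):
# 用torch.onnx.export将BERT导出为ONNX(示意)
torch.onnx.export(
    model,                                            # 上面加载的BERT模型
    (inputs["input_ids"], inputs["attention_mask"]),  # 示例输入
    "bert_model.onnx",                                # 输出文件名(示例)
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
    opset_version=14,
)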
七、常用模型推荐与资源
1. 英文模型
文本分类:distilbert-base-uncased-finetuned-sst-2-english(轻量高效)
问答:bert-large-uncased-whole-word-masking-finetuned-squad(高精度)
文本生成:gpt2(基础版)、facebook/opt-1.3b(GPT-3的开源替代)
翻译:t5-small(支持英语到法语、德语、罗马尼亚语等)
2. 中文模型
通用:bert-base-chinese(基础中文BERT)、hfl/chinese-roberta-wwm-ext(增强版)
文本分类:uer/roberta-base-finetuned-dianping-chinese(情感分析)
文本生成:uer/gpt2-chinese-cluecorpussmall(中文GPT-2)
命名实体识别:hfl/chinese-bert-wwm-ext(需微调)
3. 学习资源
Hugging Face官网:https://huggingface.co/(模型库和文档)
Transformers文档:https://huggingface.co/docs/transformers/(详细教程)
Hugging Face课程:https://huggingface.co/learn(免费交互式课程)
八、总结
Transformers库彻底改变了NLP的开发方式,让普通开发者也能轻松使用最先进的预训练模型。本文从基础的模型加载、文本预处理,到进阶的模型微调、部署应用,覆盖了Transformers库的核心用法。通过 pipeline API 可快速实现常见任务,而手动调用模型和分词器则提供了更灵活的控制。
在实际应用中,建议根据任务类型和资源限制选择合适的模型(小型模型如DistilBERT适合部署,大型模型如GPT-2适合需要高精度的场景)。同时,微调是提升特定任务效果的关键,结合 datasets 库可简化数据处理流程。
随着NLP领域的快速发展,Transformers库也在不断更新,新模型和功能层出不穷。掌握这个强大的工具,将为你的NLP项目带来质的飞跃,无论是科研探索还是商业应用,都能抢占技术先机。