Different generation strategies in Transformers
- Basic decoding methods
- Greedy search
- Sampling
- Beam search
- Advanced decoding methods
- Speculative decoding
Basic decoding methods
Greedy search
Greedy search is the default decoding method. At every step it simply picks the token with the highest probability in the next-token distribution. It is a reasonable choice for short generations where creativity is not a priority, but as the output gets longer it tends to start repeating itself.
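Since greedy search corresponds to the default settings of `generate()` (`do_sample=False`, `num_beams=1`), a minimal sketch only needs to make those defaults explicit. The checkpoint and prompt below are simply the ones reused throughout this post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import Accelerator

device = Accelerator().device
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", dtype=torch.float16).to(device)
# greedy search: always take the most likely next token (these are the default settings)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False, num_beams=1)
tokenizer.batch_decode(outputs, skip_special_tokens=True)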
Sampling
Sampling draws the next token from the full next-token distribution over the vocabulary, so every token with non-zero probability has a chance of being selected. Compared with greedy search, sampling reduces repetition and makes the output more creative and varied. In transformers it is enabled with `do_sample=True`:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import Accelerator

device = Accelerator().device
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", dtype=torch.float16).to(device)
# explicitly cap max_new_tokens; otherwise Llama 2 can keep generating up to its 4096-token limit
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, num_beams=1)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
'Hugging Face is an open-source company 🤗\nWe are open-source and believe that open-source is the best way to build technology. Our mission is to make AI accessible to everyone, and we believe that open-source is the best way to achieve that.'
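How random the sampled output is can be tuned with the standard `generate()` arguments `temperature`, `top_k`, and `top_p`. The sketch below reuses the model, tokenizer, and inputs from the example above; the specific values are only illustrative, not recommendations.
# reusing model, tokenizer and inputs from the sampling example above
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,  # <1.0 sharpens the distribution, >1.0 flattens it
    top_k=50,         # sample only from the 50 most likely tokens
    top_p=0.9,        # nucleus sampling: keep the smallest token set covering 90% of the probability mass
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)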
Beam search
Beam search keeps track of several candidate sequences (beams) at each time step.
- After generating up to a certain length, it picks the sequence with the highest overall probability as the final output. Unlike greedy search, beam search has some look-ahead: even if an early token has a low probability, the method can still end up choosing a sequence whose overall probability is higher.
- In the transformers library, beam search is enabled with the `num_beams` parameter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import Accelerator

device = Accelerator().device
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", dtype=torch.float16).to(device)
# explicitly cap max_new_tokens; otherwise Llama 2 can keep generating up to its 4096-token limit
outputs = model.generate(**inputs, max_new_tokens=50, num_beams=2)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
"['Hugging Face is an open-source company that develops and maintains the Hugging Face platform, which is a collection of tools and libraries for building and deploying natural language processing (NLP) models. Hugging Face was founded in 2018 by Thomas Wolf']"
Advanced decoding methods
Speculative decoding
Speculative decoding is not a search or sampling strategy by itself. Instead, an additional small model drafts candidate tokens, and the main model verifies them in a single forward pass, which speeds up generation overall.
Note that speculative decoding in the transformers library does not support batched inputs.
It is enabled by passing the draft model via the `assistant_model` parameter:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
assistant_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt")outputs = model.generate(**inputs, assistant_model=assistant_model)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
'Hugging Face is an open-source company that provides a platform for developers to build and deploy machine'
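Speculative decoding can also be combined with sampling, in which case a fairly low temperature tends to help more of the assistant's drafted tokens get accepted. The sketch below reuses the models and inputs from the example above; the temperature value is only an illustration.
# reusing model, assistant_model and inputs from the example above
outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.5)
tokenizer.batch_decode(outputs, skip_special_tokens=True)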