# Apple open-sources DiffuCoder: a masked diffusion model for code generation
This software project accompanies the research paper *DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation*.
## Motivation
By scaling up masked denoising models (MDMs), diffusion LLMs (dLLMs) such as LLaDA and Dream have reached performance on par with similarly sized autoregressive (AR) LLMs across many benchmarks. Recent commercial-scale dLLMs such as Mercury and Gemini further show that diffusion-based code generators can rival top AR code models on programming tasks while offering faster text generation.
However, the generation pattern of dLLMs and their post-training strategies remain under-explored. In this work, we investigate the following questions:

- How does the generation pattern of dLLMs fundamentally differ from that of AR models?
- What differences arise when modeling different data types, such as code versus math?
- What are the boundaries of dLLM generation diversity, and how should the post-training pipeline be designed?
We train DiffuCoder using the adaptation approach of DiffuLLaMA and introduce a new metric, the autoregressive-ness (AR-ness) score, to quantify the causal pattern of dLLM generation; our main findings are summarized below.
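To make the idea of such a metric concrete, here is a minimal illustrative sketch of a left-to-right-ness measure over a recorded unmasking order. This is not the paper's exact AR-ness definition, and `unmask_order` (token positions in the order they were committed) is a hypothetical input you would need to extract from the decoding history yourself.

```python
def left_to_right_ness(unmask_order):
    """Fraction of consecutive decoding steps that commit the position
    immediately to the right of the previous one (1.0 = strictly AR-like).
    Illustrative proxy only, not the paper's AR-ness score."""
    if len(unmask_order) < 2:
        return 1.0
    hits = sum(cur == prev + 1 for prev, cur in zip(unmask_order, unmask_order[1:]))
    return hits / (len(unmask_order) - 1)

print(left_to_right_ness([0, 1, 2, 3]))  # 1.0: strictly left-to-right
print(left_to_right_ness([3, 0, 2, 1]))  # 0.0: fully out of order
```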
## Findings
- Due to the nature of text, dLLMs still exhibit a left-to-right bias, yet, unlike AR models, they can also break this strict order.
- After pre-training, we find that code tasks induce less global AR-ness than math tasks.
- In dLLMs, changing the sampling temperature affects not only which tokens are sampled (as in AR models) but also the order in which tokens are generated.
For more interesting findings, please refer to our original paper!

We propose Coupled Group Relative Policy Optimization (Coupled-GRPO), a post-training method that improves DiffuCoder's performance.
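A core ingredient, as described in the paper, is coupled sampling: each completion is scored under a pair of complementary masks, so every token position is masked, and therefore receives a learning signal, in exactly one of the two passes. Below is a minimal sketch of that masking step assuming a 50% mask ratio; this is a simplification for intuition, not the released training code.

```python
import torch

def coupled_masks(seq_len, generator=None):
    """Sample a random mask over completion positions together with its
    complement, so the pair covers every position exactly once.
    Simplified sketch of coupled sampling, not the actual training code."""
    mask_a = torch.rand(seq_len, generator=generator) < 0.5  # ~50% ratio (assumption)
    mask_b = ~mask_a  # complement: exactly the remaining positions
    return mask_a, mask_b

a, b = coupled_masks(8)
assert bool((a ^ b).all())  # every position is masked in exactly one pass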
## Quick Start
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "apple/DiffuCoder-7B-cpGRPO"
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = model.eval()

query = "Write a function to find the shared elements from the given two lists."
prompt = f"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{query.strip()}
<|im_end|>
<|im_start|>assistant
"""  ## following the template of qwen; you can also use apply_chat_template function

TOKEN_PER_STEP = 1  # diffusion timesteps * TOKEN_PER_STEP = total new tokens

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids.to(device="cuda")
attention_mask = inputs.attention_mask.to(device="cuda")

output = model.diffusion_generate(
    input_ids, attention_mask=attention_mask,
    max_new_tokens=256, output_history=True, return_dict_in_generate=True,
    steps=256 // TOKEN_PER_STEP, temperature=0.4, top_p=0.95,
    alg="entropy", alg_temp=0.,
)
generations = [
    tokenizer.decode(g[len(p):].tolist())
    for p, g in zip(input_ids, output.sequences)
]
print(generations[0].split('<|dlm_pad|>')[0])
```
Here is the code to solve this problem:
```python
def shared_elements(list1, list2):
    return [value for value in list1 if value in list2]
```
<|im_end|>
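The `steps` argument trades speed for quality: with `TOKEN_PER_STEP = 1`, one token is committed per diffusion step, while larger values commit several tokens per step and cut the number of forward passes, typically at some cost in output quality. Reusing the objects defined above:

```python
# Faster decoding: commit 2 tokens per diffusion step (128 steps instead of 256).
TOKEN_PER_STEP = 2
output = model.diffusion_generate(
    input_ids, attention_mask=attention_mask,
    max_new_tokens=256, output_history=True, return_dict_in_generate=True,
    steps=256 // TOKEN_PER_STEP, temperature=0.4, top_p=0.95,
    alg="entropy", alg_temp=0.,
)
```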
The same pipeline works with the instruction-tuned checkpoint; only the model path and the sampling temperature change:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "apple/DiffuCoder-7B-Instruct"
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = model.eval()

query = "Write a function to find the shared elements from the given two lists."
prompt = f"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{query.strip()}
<|im_end|>
<|im_start|>assistant
"""  ## following the template of qwen; you can also use apply_chat_template function

TOKEN_PER_STEP = 1  # diffusion timesteps * TOKEN_PER_STEP = total new tokens

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids.to(device="cuda")
attention_mask = inputs.attention_mask.to(device="cuda")

output = model.diffusion_generate(
    input_ids, attention_mask=attention_mask,
    max_new_tokens=256, output_history=True, return_dict_in_generate=True,
    steps=256 // TOKEN_PER_STEP, temperature=0.3, top_p=0.95,
    alg="entropy", alg_temp=0.,
)
generations = [
    tokenizer.decode(g[len(p):].tolist())
    for p, g in zip(input_ids, output.sequences)
]
print(generations[0].split('<|dlm_pad|>')[0])
```
Here is the code to solve this problem:
```python
def shared_elements(list1, list2):
    result = []
    for i in list1:
        if i in list2:
            result.append(i)
    return result
```
<|im_end|>
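As the inline comment in both examples notes, the hand-written Qwen-style template can also be produced with the tokenizer's `apply_chat_template`; a sketch, assuming the checkpoint's remote code ships a chat template:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": query},
]
# tokenize=False returns the formatted prompt string; add_generation_prompt
# appends the assistant header so generation starts at the right place.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```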
## Code
https://github.com/apple/ml-diffucoder