# Apple open-sources DiffuCoder: a masked diffusion model for code generation
This software project accompanies the research paper *DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation*.
## Motivation
By scaling up masked denoising models (MDMs), diffusion LLMs (dLLMs) such as LLaDA and Dream have reached performance on par with similarly sized autoregressive (AR) LLMs across many benchmarks. Recent commercial-scale dLLMs such as Mercury and Gemini further show that diffusion-based code generators can rival top AR code models on programming tasks while offering faster text generation.
However, the generation pattern of dLLMs and their post-training strategies remain under-explored. In this work, we investigate the following questions:

- How does the generation pattern of dLLMs fundamentally differ from that of AR models?
- What differences arise when modeling different data types, such as code versus math?
- What are the boundaries of dLLM generation diversity, and how should the post-training pipeline be designed?
We train DiffuCoder using the adaptation approach of DiffuLLaMA and introduce a new metric, the autoregressive-ness (AR-ness) score, to quantify the causal pattern of dLLM generation; our main findings are summarized below.
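To make the idea of such a metric concrete, here is a minimal illustrative sketch of a left-to-right-ness measure over a recorded unmasking order. This is not the paper's exact AR-ness definition, and `unmask_order` (token positions in the order they were committed) is a hypothetical input you would need to extract from the decoding history yourself.

```python
def left_to_right_ness(unmask_order):
    """Fraction of consecutive decoding steps that commit the position
    immediately to the right of the previous one (1.0 = strictly AR-like).
    Illustrative proxy only, not the paper's AR-ness score."""
    if len(unmask_order) < 2:
        return 1.0
    hits = sum(cur == prev + 1 for prev, cur in zip(unmask_order, unmask_order[1:]))
    return hits / (len(unmask_order) - 1)

print(left_to_right_ness([0, 1, 2, 3]))  # 1.0: strictly left-to-right
print(left_to_right_ness([3, 0, 2, 1]))  # 0.0: fully out of order
```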
## Findings
- Due to the nature of text, dLLMs still exhibit a left-to-right bias, yet, unlike AR models, they can also break this strict order.
- After pre-training, we find that code tasks induce less global AR-ness than math tasks.
- In dLLMs, changing the sampling temperature affects not only which tokens are sampled (as in AR models) but also the order in which tokens are generated.
For more interesting findings, please refer to our original paper!

We propose Coupled Group Relative Policy Optimization (Coupled-GRPO), a post-training method that improves DiffuCoder's performance.
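A core ingredient, as described in the paper, is coupled sampling: each completion is scored under a pair of complementary masks, so every token position is masked, and therefore receives a learning signal, in exactly one of the two passes. Below is a minimal sketch of that masking step assuming a 50% mask ratio; this is a simplification for intuition, not the released training code.

```python
import torch

def coupled_masks(seq_len, generator=None):
    """Sample a random mask over completion positions together with its
    complement, so the pair covers every position exactly once.
    Simplified sketch of coupled sampling, not the actual training code."""
    mask_a = torch.rand(seq_len, generator=generator) < 0.5  # ~50% ratio (assumption)
    mask_b = ~mask_a  # complement: exactly the remaining positions
    return mask_a, mask_b

a, b = coupled_masks(8)
assert bool((a ^ b).all())  # every position is masked in exactly one pass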
## Quick Start
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "apple/DiffuCoder-7B-cpGRPO"
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = model.eval()

query = "Write a function to find the shared elements from the given two lists."
prompt = f"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{query.strip()}
<|im_end|>
<|im_start|>assistant
"""  ## following the template of qwen; you can also use apply_chat_template function

TOKEN_PER_STEP = 1  # diffusion timesteps * TOKEN_PER_STEP = total new tokens

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids.to(device="cuda")
attention_mask = inputs.attention_mask.to(device="cuda")

output = model.diffusion_generate(
    input_ids, attention_mask=attention_mask,
    max_new_tokens=256, output_history=True, return_dict_in_generate=True,
    steps=256 // TOKEN_PER_STEP, temperature=0.4, top_p=0.95,
    alg="entropy", alg_temp=0.,
)
generations = [
    tokenizer.decode(g[len(p):].tolist())
    for p, g in zip(input_ids, output.sequences)
]
print(generations[0].split('<|dlm_pad|>')[0])
```
Here is the code to solve this problem:
```python
def shared_elements(list1, list2):
    return [value for value in list1 if value in list2]
```
<|im_end|>
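The `steps` argument trades speed for quality: with `TOKEN_PER_STEP = 1`, one token is committed per diffusion step, while larger values commit several tokens per step and cut the number of forward passes, typically at some cost in output quality. Reusing the objects defined above:

```python
# Faster decoding: commit 2 tokens per diffusion step (128 steps instead of 256).
TOKEN_PER_STEP = 2
output = model.diffusion_generate(
    input_ids, attention_mask=attention_mask,
    max_new_tokens=256, output_history=True, return_dict_in_generate=True,
    steps=256 // TOKEN_PER_STEP, temperature=0.4, top_p=0.95,
    alg="entropy", alg_temp=0.,
)
```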
The same pipeline works with the instruction-tuned checkpoint; only the model path and the sampling temperature change:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "apple/DiffuCoder-7B-Instruct"
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = model.eval()

query = "Write a function to find the shared elements from the given two lists."
prompt = f"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{query.strip()}
<|im_end|>
<|im_start|>assistant
"""  ## following the template of qwen; you can also use apply_chat_template function

TOKEN_PER_STEP = 1  # diffusion timesteps * TOKEN_PER_STEP = total new tokens

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids.to(device="cuda")
attention_mask = inputs.attention_mask.to(device="cuda")

output = model.diffusion_generate(
    input_ids, attention_mask=attention_mask,
    max_new_tokens=256, output_history=True, return_dict_in_generate=True,
    steps=256 // TOKEN_PER_STEP, temperature=0.3, top_p=0.95,
    alg="entropy", alg_temp=0.,
)
generations = [
    tokenizer.decode(g[len(p):].tolist())
    for p, g in zip(input_ids, output.sequences)
]
print(generations[0].split('<|dlm_pad|>')[0])
```
Here is the code to solve this problem:
```python
def shared_elements(list1, list2):
    result = []
    for i in list1:
        if i in list2:
            result.append(i)
    return result
```
<|im_end|>
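As the inline comment in both examples notes, the hand-written Qwen-style template can also be produced with the tokenizer's `apply_chat_template`; a sketch, assuming the checkpoint's remote code ships a chat template:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": query},
]
# tokenize=False returns the formatted prompt string; add_generation_prompt
# appends the assistant header so generation starts at the right place.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```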
## Code
https://github.com/apple/ml-diffucoder