
LLM - Environment Setup and Training Parameters for OpenR1, an Open-Source Reinforcement Learning Framework

news 2025/10/22 2:45:32

Follow me on CSDN: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/146838740

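Disclaimer: This article is based on personal knowledge and publicly available material. It is intended for academic exchange only; discussion is welcome, and reposting is not permitted.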

OpenR1

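OpenR1 is an open-source reinforcement learning framework that reproduces the DeepSeek-R1 training pipeline, giving researchers and developers a complete toolchain for training reasoning models. The project was started by Hugging Face and openly documents the full process, from knowledge distillation through reinforcement learning to multi-stage training. OpenR1 includes scripts for training and evaluating models and for generating synthetic data, and it supports training methods such as GRPO and supervised fine-tuning (SFT). It also builds on other open-source libraries, such as TRL and distilabel, which makes it quick to get started. By releasing code, models, and datasets, OpenR1 lays a foundation for the open-source reasoning community.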

Install the dependencies trl, lighteval, and flash-attn, then set up the open-r1 environment:

pip install setuptools

# Install TRL (pinned to the commit below)
# trl @ git+https://github.com/huggingface/trl.git@69ad852e5654a77f1695eb4c608906fe0c7e8624
git clone https://github.com/huggingface/trl.git
cd trl
git checkout 69ad852e5654a77f1695eb4c608906fe0c7e8624
pip install --no-build-isolation -e "."
pip show trl
# Version: 0.16.0.dev0

cd ..  # leave the trl checkout before cloning the next repository
# Install lighteval (pinned to the commit below)
# lighteval @ git+https://github.com/huggingface/lighteval.git@ed084813e0bd12d82a06d9f913291fdbee774905
git clone https://github.com/huggingface/lighteval.git
cd lighteval
git checkout ed084813e0bd12d82a06d9f913291fdbee774905
pip install --no-build-isolation -e "."
pip show lighteval
# Version: 0.6.0.dev0

cd ..  # leave the lighteval checkout
# Install flash-attn
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
python setup.py install
pip show flash-attn
# Version: 2.7.4.post1

cd [your path]/llm/open-r1
pip install --no-build-isolation -e "."

The installation leaves accelerate at 1.4.0, which conflicts with mergekit. Pinning accelerate to 1.3.0 resolves it:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mergekit 0.0.6 requires accelerate~=1.3.0, but you have accelerate 1.4.0 which is incompatible.

pip install accelerate==1.3.0
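
The `~=1.3.0` in the error message is a "compatible release" specifier: it accepts any 1.3.x release at or above 1.3.0 and rejects 1.4.0. The following is a minimal sketch of that rule for plain numeric versions only. The helper names `parse_version` and `is_compatible_release` are made up for illustration; real tooling should rely on the `packaging` library instead.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Parse a plain numeric version such as '1.3.0' into (1, 3, 0)."""
    return tuple(int(part) for part in text.split("."))


def is_compatible_release(installed: str, base: str) -> bool:
    """Return True if `installed` satisfies `~=base` (numeric versions only).

    `~=X.Y.Z` means: at least X.Y.Z, and every component except the last
    must match (X.Y.*).
    """
    inst, req = parse_version(installed), parse_version(base)
    if len(req) < 2:
        raise ValueError("~= needs at least two version components")
    # Pad a shorter installed version with zeros, so '1.3' is treated as '1.3.0'.
    inst = inst + (0,) * (len(req) - len(inst))
    prefix = req[:-1]
    return inst >= req and inst[: len(prefix)] == prefix


# The conflicting install from the error message, then the pinned version.
print(is_compatible_release("1.4.0", "1.3.0"))
print(is_compatible_release("1.3.0", "1.3.0"))
```

The first call reports the conflict and the second passes, which is why `pip install accelerate==1.3.0` clears the error.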

Download the model Qwen/Qwen2.5-1.5B-Instruct and the training sets OpenR1-Math-220k and Bespoke-Stratos-17k:

# HF_TOKEN: your own Hugging Face access token (export it beforehand)
huggingface-cli download --token "$HF_TOKEN" Qwen/Qwen2.5-1.5B-Instruct --local-dir Qwen/Qwen2.5-1.5B-Instruct
huggingface-cli download --token "$HF_TOKEN" Qwen/Qwen2.5-VL-7B-Instruct --local-dir Qwen/Qwen2.5-VL-7B-Instruct
huggingface-cli download --repo-type dataset --token "$HF_TOKEN" open-r1/OpenR1-Math-220k --local-dir open-r1/OpenR1-Math-220k
huggingface-cli download --repo-type dataset --token "$HF_TOKEN" HuggingFaceH4/Bespoke-Stratos-17k --local-dir HuggingFaceH4/Bespoke-Stratos-17k

The model and dataset paths used below:

[your path]/huggingface/Qwen/Qwen2.5-1.5B-Instruct/
[your path]/llm/openr1_datasets/open-r1/OpenR1-Math-220k/
[your path]/llm/openr1_datasets/HuggingFaceH4/Bespoke-Stratos-17k/

Train Open-R1 with SFT:

accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path "[your path]/huggingface/Qwen/Qwen2.5-1.5B-Instruct" \
    --dataset_name "[your path]/llm/openr1_datasets/HuggingFaceH4/Bespoke-Stratos-17k" \
    --learning_rate 1.0e-5 \
    --num_train_epochs 1 \
    --packing \
    --max_seq_length 8096 \
    --per_device_train_batch_size 1 \
    --gradient_checkpointing \
    --bf16 \
    --output_dir data/Qwen2.5-1.5B-Open-R1-Distill
    
# [your path]/llm/openr1_datasets/HuggingFaceH4/Bespoke-Stratos-17k  # trains faster
# [your path]/llm/openr1_datasets/open-r1/OpenR1-Math-220k

Note: per_device_train_batch_size directly affects GPU memory usage.
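
To make the batch-size note concrete, here is a small arithmetic sketch of how many sequences one optimizer step covers in the SFT command above. It assumes 8 processes (the `num_processes` value in zero3.yaml) and a gradient accumulation of 1, since the command does not set `--gradient_accumulation_steps`. The function name `effective_batch_size` is illustrative only.

```python
def effective_batch_size(per_device: int, num_processes: int, grad_accum_steps: int = 1) -> int:
    """Sequences consumed per optimizer step across all GPUs."""
    return per_device * num_processes * grad_accum_steps


# SFT command above: --per_device_train_batch_size 1, 8 processes (assumed from zero3.yaml).
sft_sequences = effective_batch_size(per_device=1, num_processes=8)

# With --packing, each sequence holds up to --max_seq_length tokens.
max_tokens_per_step = sft_sequences * 8096

print(sft_sequences, max_tokens_per_step)
```

Memory pressure on each GPU follows the per-device value and the sequence length, not the global total, so raising `per_device_train_batch_size` costs memory while adding processes or accumulation steps does not.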

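The GRPO reinforcement learning training command (on an 8-GPU node, --num_processes=7 leaves one GPU free for vLLM generation):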

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
    --num_processes=7 src/open_r1/grpo.py \
    --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml
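
GRPO samples a group of completions per prompt (`num_generations` in the config listed later in this article), combines the reward functions into one score per completion, and normalizes the scores within the group. The following is a minimal plain-Python sketch of that step. The reward values are made up, and the epsilon and the use of the population standard deviation are assumptions, not necessarily what TRL does internally.

```python
from statistics import mean, pstdev


def combine_rewards(per_func_rewards, weights):
    """Weighted sum over reward functions, one total per completion."""
    return [sum(w * r for w, r in zip(weights, rewards)) for rewards in per_func_rewards]


def group_relative_advantages(rewards, eps=1e-4):
    """Center and scale rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 completions for one prompt (the config uses 16).
# Columns: accuracy, format, tag_count.
group = [
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 0.25],
    [1.0, 1.0, 0.75],
]
totals = combine_rewards(group, weights=[1.0, 1.0, 1.0])
advantages = group_relative_advantages(totals)

for total, adv in zip(totals, advantages):
    print(f"reward={total:.2f} advantage={adv:+.3f}")
```

Completions that score above the group mean get a positive advantage and are reinforced; those below get a negative one. If every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.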

The zero3.yaml configuration file is as follows:

compute_environment: LOCAL_MACHINE  # run on the local machine
debug: false  # debug mode (false = off)
deepspeed_config:  # DeepSpeed settings
  deepspeed_multinode_launcher: standard  # DeepSpeed multi-node launcher (standard)
  offload_optimizer_device: none  # optimizer-state offload device (none = no offloading)
  offload_param_device: none  # parameter offload device (none = no offloading)
  zero3_init_flag: true  # build the model under deepspeed.zero.Init (ZeRO-3 only)
  zero3_save_16bit_model: true  # save 16-bit model weights when using ZeRO-3
  zero_stage: 3  # ZeRO stage (stage 3)
distributed_type: DEEPSPEED  # distributed training type (DeepSpeed)
downcast_bf16: 'no'  # TPU-related bf16 downcasting ('no' = disabled)
machine_rank: 0  # rank of this machine (0 is the main node)
main_training_function: main  # name of the main training function
mixed_precision: bf16  # mixed-precision type (bfloat16)
num_machines: 1  # number of machines (1 = single node)
num_processes: 8  # total number of processes, one per GPU (8 on this single node)
rdzv_backend: static  # rendezvous backend (static)
same_network: true  # whether all nodes are on the same network
tpu_env: []  # TPU environment variables (TPU not used)
tpu_use_cluster: false  # whether to use a TPU cluster
tpu_use_sudo: false  # whether to run TPU commands with sudo
use_cpu: false  # whether to train on CPU (false = use accelerators)
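
As a rough illustration of what `zero_stage: 3` buys, the sketch below estimates per-GPU memory for model states under mixed-precision Adam, using the accounting from the ZeRO paper: 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for fp32 optimizer states. It assumes roughly 1.5 billion parameters for Qwen2.5-1.5B and ignores activations, buffers, and fragmentation, so real usage will be higher. The function name `model_state_gb` is illustrative.

```python
def model_state_gb(num_params: float, num_gpus: int, zero_stage: int) -> float:
    """Approximate per-GPU memory (GB) for weights, gradients, and Adam states."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if zero_stage >= 1:
        optim /= num_gpus    # stage 1 shards optimizer states
    if zero_stage >= 2:
        grads /= num_gpus    # stage 2 also shards gradients
    if zero_stage >= 3:
        weights /= num_gpus  # stage 3 also shards parameters
    return num_params * (weights + grads + optim) / 1e9


ASSUMED_PARAMS = 1.5e9  # nominal size, an assumption for this estimate

for stage in (0, 2, 3):
    gb = model_state_gb(ASSUMED_PARAMS, num_gpus=8, zero_stage=stage)
    print(f"ZeRO stage {stage}: ~{gb:.2f} GB of model states per GPU")
```

Under these assumptions the model states shrink from about 24 GB per GPU without ZeRO to about 3 GB with stage 3 on 8 GPUs, at the cost of extra communication to gather parameters during the forward and backward passes.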

During training, GPU 0 processes the data first, and the other 7 GPUs process it afterwards:

Applying chat template to train dataset		# 01:02
Tokenizing train dataset		# 16:45
Packing train dataset  			# 27:59

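The accelerate launch approach relies on the Accelerator class, which simplifies the setup of distributed and mixed-precision training so that attention can stay on model development instead of hardware and distributed-environment details.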

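Accelerator configures the distributed environment automatically: pass the model, optimizer, and data loaders to accelerator.prepare(), and the same script runs in any distributed setup (single GPU, multi-GPU, TPU).
Accelerator manages device placement: prepared data loaders send each batch to the correct device, so manual .to(device) calls are unnecessary.
Mixed-precision training is enabled through a setting such as --mixed_precision=fp16, with details like gradient scaling handled automatically.
Gradient accumulation, via accelerator.accumulate(), and gradient clipping, via accelerator.clip_grad_norm_(), are built in.
The accelerate config tool sets up the training environment quickly, and command-line arguments such as --num_processes and --mixed_precision allow flexible adjustment.
The same code runs on different hardware without attention to low-level details, which reduces code complexity and maintenance cost.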
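
The points above can be sketched as a tiny training loop. This is a minimal illustration, not code from open-r1: the linear model, random data, and the function name `train_demo` are placeholders, and it needs `torch` and `accelerate` installed. The imports sit inside the function so the file can be loaded without those libraries.

```python
def train_demo(num_batches: int = 8) -> float:
    """Run a toy regression loop with Accelerate and return the last loss."""
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator

    # Accumulate gradients over 4 micro-batches before each optimizer update.
    accelerator = Accelerator(gradient_accumulation_steps=4)

    model = torch.nn.Linear(16, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    dataset = TensorDataset(
        torch.randn(num_batches * 2, 16), torch.randn(num_batches * 2, 1)
    )
    loader = DataLoader(dataset, batch_size=2)

    # prepare() wraps the objects for the current setup and handles device placement.
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    model.train()
    for inputs, targets in loader:  # batches arrive on the right device
        with accelerator.accumulate(model):
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
            accelerator.backward(loss)  # replaces loss.backward()
            if accelerator.sync_gradients:
                # Clip only on steps where gradients are actually synchronized.
                accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            optimizer.zero_grad()
    return loss.item()


if __name__ == "__main__":
    print(f"final loss: {train_demo():.4f}")
```

Saved as a script, the same file runs with plain `python` on one device or under `accelerate launch` on several, without changes to the loop.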

Each training example has a system field and a conversations field; within conversations, the user turn is the question and the assistant turn is the answer:

{
	'system': "Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. ...",
	'conversations': 
	array([{
				'from': 'user',
				'value': 'Return your final response within \\boxed{}. The operation $\\otimes$ is defined for all nonzero numbers by $a\\otimes b =\\frac{a^{2}}{b}$. ...'
			},
			{
				'from': 'assistant',
				'value': "<|begin_of_thought|>\n\nOkay, let me try to figure out this problem. So, we have this operation defined as a⊗b = a²/b. And we need to compute [(1⊗2)⊗3] - [1⊗(2⊗3)]. ...<|end_of_thought|>\n\n<|begin_of_solution|>\n\nTo determine the value of \\([(1 \\otimes 2) \\otimes 3] - [1 \\otimes (2 \\otimes 3)]\\) where the operation \\(\\otimes\\) is defined by \\(a \\otimes b = \\frac{a^2}{b}\\)..., the answer is \\(\\boxed{A}\\).\n\n<|end_of_solution|>"
			}
		],
		dtype = object)
}
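
Chat templates, including the one in the GRPO config later in this article, read messages as `role`/`content` pairs, while this dataset stores turns as `from`/`value`. The sketch below shows one way to map between the two. The function name `to_messages` and the shortened example are illustrative, and whether a given training script already performs this conversion is not covered here.

```python
def to_messages(example: dict) -> list[dict]:
    """Convert a system/conversations record into role/content messages."""
    messages = []
    if example.get("system"):
        messages.append({"role": "system", "content": example["system"]})
    for turn in example["conversations"]:
        # 'from' is 'user' or 'assistant' in this dataset, which matches the role names.
        messages.append({"role": turn["from"], "content": turn["value"]})
    return messages


# Shortened stand-in for the record shown above.
example = {
    "system": "Your role as an assistant involves ...",
    "conversations": [
        {"from": "user", "value": "Return your final response within \\boxed{}. ..."},
        {"from": "assistant", "value": "<|begin_of_thought|> ... <|end_of_solution|>"},
    ],
}

for message in to_messages(example):
    print(message["role"], "->", message["content"][:40])
```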

Parameters of the GRPO configuration (config_demo.yaml):

# Model arguments
model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B  # model name or path (DeepSeek-R1-Distill-Qwen-1.5B)
model_revision: main  # model revision (branch) to load
torch_dtype: bfloat16  # compute precision; bfloat16 improves efficiency
attn_implementation: flash_attention_2  # attention implementation; flash_attention_2 speeds up attention

# Data and training arguments
# Edit the DeepSeek chat template so that the <think> reasoning is included in the completion and the format reward works correctly
chat_template: "
{% if not add_generation_prompt is defined %}
		{% set add_generation_prompt = false %}
{% endif %}
{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}
{%- for message in messages %}
		{%- if message['role'] == 'system' %}
				{% set ns.system_prompt = message['content'] %}
		{%- endif %}
{%- endfor %}
{{bos_token}}{{ns.system_prompt}}
{%- for message in messages %}
			{%- if message['role'] == 'user' %}
					{%- set ns.is_tool = false -%}
					{{'<｜User｜>' + message['content']}}
			{%- endif %}
			{%- if message['role'] == 'assistant' and message['content'] is none %}
					{%- set ns.is_tool = false -%}
					{%- for tool in message['tool_calls'] %}
							{%- if not ns.is_first %}
									{{'<｜Assistant｜><｜tool▁calls▁begin｜><｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<｜tool▁call▁end｜>'}}
									{%- set ns.is_first = true -%}
							{%- else %}
									{{'\\n' + '<｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<｜tool▁call▁end｜>'}}
									{{'<｜tool▁calls▁end｜><｜end▁of▁sentence｜>'}}
							{%- endif %}
					{%- endfor %}
			{%- endif %}
{%- if message['role'] == 'assistant' and message['content'] is not none %}
		{%- if ns.is_tool %}
				{{'<｜tool▁outputs▁end｜>' + message['content'] + '<｜end▁of▁sentence｜>'}}
				{%- set ns.is_tool = false -%}
		{%- else %}
				{% set content = message['content'] %}
				{{'<｜Assistant｜>' + content + '<｜end▁of▁sentence｜>'}}
		{%- endif %}
{%- endif %}
{%- if message['role'] == 'tool' %}
		{%- set ns.is_tool = true -%}
		{%- if ns.is_output_first %}
				{{'<｜tool▁outputs▁begin｜><｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}
				{%- set ns.is_output_first = false %}
		{%- else %}
				{{'\\n<｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}
		{%- endif %}
{%- endif %}
{%- endfor -%}
{% if ns.is_tool %}
{{'<｜tool▁outputs▁end｜>'}}
{% endif %}
{% if add_generation_prompt and not ns.is_tool %}
{{'<｜Assistant｜>'}}
{% endif %}
"
dataset_name: open-r1/OpenR1-Math-220k  # dataset name (OpenR1-Math-220k)
system_prompt: "You are a helpful AI Assistant that provides well-reasoned and detailed responses. You first think about the reasoning process as an internal monologue and then provide the user with the answer. Respond in the following format: <think>\n...\n</think>\n<answer>\n...\n</answer>"  # system prompt that defines the response style and format

# GRPO trainer config
bf16: true  # train in bfloat16 precision
use_vllm: true  # use the vLLM engine for fast generation
vllm_device: auto  # choose the vLLM device automatically
vllm_gpu_memory_utilization: 0.7  # fraction of GPU memory vLLM may use (0.7)
do_eval: false  # whether to run evaluation (disabled here)
gradient_accumulation_steps: 4  # number of micro-batches accumulated before each optimizer step
gradient_checkpointing: true  # enable gradient checkpointing to reduce GPU memory usage
gradient_checkpointing_kwargs:
  use_reentrant: false  # use the non-reentrant checkpointing implementation
hub_model_id: DeepSeek-R1-Distill-Qwen-1.5B-GRPO  # model ID on the Hugging Face Hub, used when saving and pushing
hub_strategy: every_save  # push to the Hub on every save
learning_rate: 1.0e-06  # learning rate (1e-6), which controls the size of parameter updates
log_completions: true  # log generated completions for debugging and analysis
log_level: info  # log level (info)
logging_first_step: true  # log the first step to observe early training behavior
logging_steps: 1  # log every step
logging_strategy: steps  # log by step
lr_scheduler_type: cosine_with_min_lr  # cosine decay scheduler with a minimum learning rate
lr_scheduler_kwargs:  # scheduler arguments
  min_lr_rate: 0.1  # minimum learning rate as a fraction of the peak (0.1)
max_prompt_length: 512  # maximum prompt length in tokens
max_completion_length: 2048  # maximum completion length in tokens
max_steps: -1  # -1 means no step limit; training length follows num_train_epochs
num_generations: 16  # number of completions sampled per prompt (the GRPO group size)
num_train_epochs: 1  # number of training epochs
output_dir: data/DeepSeek-R1-Distill-Qwen-1.5B-GRPO  # output directory for training results
overwrite_output_dir: true  # overwrite the output directory, convenient for reruns
per_device_eval_batch_size: 16  # evaluation batch size per device
per_device_train_batch_size: 16  # training batch size per device
push_to_hub: true  # push the model to the Hugging Face Hub
report_to:  # experiment tracking tools
- wandb  # Weights & Biases for logging and visualization
reward_funcs:  # reward functions used to score completions
- accuracy  # accuracy reward
- format  # format reward
- tag_count  # tag-count reward
reward_weights:  # reward weights; each reward function is weighted 1.0
- 1.0
- 1.0
- 1.0
save_strategy: "epoch"  # save a checkpoint every epoch
save_total_limit: 1  # keep at most 1 checkpoint
seed: 42  # random seed for reproducibility
temperature: 0.7  # sampling temperature, which controls the diversity of completions
warmup_ratio: 0.1  # fraction of training steps used for learning-rate warmup
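
To show how `learning_rate`, `warmup_ratio`, `lr_scheduler_type: cosine_with_min_lr`, and `min_lr_rate` interact, here is a small sketch of the resulting curve: linear warmup to the peak, then cosine decay to a floor of `min_lr_rate` times the peak. It is a plain-Python approximation written for illustration, assuming a half-cosine cycle, and the total of 100 steps is made up. It is not the scheduler code from transformers.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 1.0e-6,
               warmup_ratio: float = 0.1, min_lr_rate: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay to a floor."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    # Rescale so the curve ends at min_lr_rate * base_lr instead of 0.
    return base_lr * (min_lr_rate + (1.0 - min_lr_rate) * cosine)


TOTAL_STEPS = 100  # hypothetical run length
for step in (0, 5, 10, 55, 100):
    print(f"step {step:3d}: lr = {lr_at_step(step, TOTAL_STEPS):.2e}")
```

With these settings the rate peaks at 1e-6 after the first 10% of steps and never drops below 1e-7, so late training steps still make small updates.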