
Training the Reward Model in RLHF

news 2025/7/26 10:07:21

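RLHF training relies on a large amount of human preference data, so it is impractical to ask human annotators for preference feedback in real time while training runs. Instead, we train a model that stands in for the annotators and supplies this feedback during RLHF training. This model is called the reward model.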

🔸 1. The Objective Function Explained

The objective is:

$-\mathbb{E}_{(x, y^+, y^-) \sim D} \left[ \log \sigma(r_\theta(x, y^+) - r_\theta(x, y^-)) \right] - \beta \mathbb{E}_{(x, y^+)\sim D} \left[ \sum_{t=1}^{T} \log p(y^+_t \mid x, y^+_{<t}) \right]$

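Notation:

x: the input (for example, a question or prompt)
y+: the positive (chosen) response, i.e. the answer written or preferred by human annotators
y-: the negative (rejected) response, i.e. the less preferred answer
r_θ(x, y): the score the reward model assigns to the pair (x, y), usually obtained by passing the output at the last token through the reward head
σ: the sigmoid function
β: a weighting hyperparameter that controls how much the imitation learning term (the second term) contributes to the total loss
D: the preference dataset the triples (x, y+, y-) are drawn from
T: the number of tokens in y+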

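The two terms of the objective:

Contrastive loss (ranking loss):

$$
-\log \sigma(r(x, y^+) - r(x, y^-))
$$
- The goal is to make the positive score higher than the negative score.
- When r(x, y+) ≫ r(x, y-), the sigmoid approaches 1 and its log approaches 0, so the loss is small: the model ranks the pair correctly. When the two scores are equal the loss is log 2 ≈ 0.693, and it keeps growing as the negative response outscores the positive one.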
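
To make the behaviour of the ranking term concrete, here is a minimal sketch that evaluates -log σ(r⁺ - r⁻) for a few score margins using only the standard library. The function name `pairwise_ranking_loss` and the sample scores are illustrative assumptions and do not come from the original code.

```python
import math

def pairwise_ranking_loss(r_pos: float, r_neg: float) -> float:
    """Return -log(sigmoid(r_pos - r_neg)), computed in a numerically stable way."""
    margin = r_pos - r_neg
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical score pairs: (reward of chosen response, reward of rejected response)
for r_pos, r_neg in [(3.0, -2.0), (0.5, 0.5), (-2.0, 3.0)]:
    loss = pairwise_ranking_loss(r_pos, r_neg)
    print(f"margin={r_pos - r_neg:+.1f}  loss={loss:.4f}")
```

A large positive margin gives a loss near 0, a tie gives log 2, and a large negative margin gives a loss close to the size of the margin itself.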
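
Imitation learning loss (language modeling loss):

$$
-\sum_{t=1}^{T} \log p(y^+_t \mid x, y^+_{<t})
$$
- This is the cross-entropy loss of the language model predicting the next token, given the input x and the prefix y^+_{<t}.
- It acts as a regularizer: it keeps the reward model from overfitting to the scoring objective and losing its language generation ability.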
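
As a worked example of how the two terms combine, the sketch below computes the imitation term from a handful of per-token probabilities and adds it to a ranking loss with weight β. The helper names `lm_loss` and `total_loss` and all numbers are hypothetical and chosen only for illustration.

```python
import math

def lm_loss(token_probs):
    """Sum of -log p(y_t | x, y_<t) over the tokens of the chosen response."""
    return -sum(math.log(p) for p in token_probs)

def total_loss(ranking_loss, token_probs, beta):
    """Ranking term plus beta times the imitation (language modeling) term."""
    return ranking_loss + beta * lm_loss(token_probs)

# Hypothetical probabilities the model assigns to each token of y+
probs = [0.9, 0.5, 0.8]
# Ranking loss for a pair whose two responses received the same score
ranking = math.log(2)

for beta in (0.0, 0.1, 1.0):
    print(f"beta={beta}  total={total_loss(ranking, probs, beta):.4f}")
```

With β = 0 only the ranking term remains. Larger values of β make the model pay more attention to reproducing the chosen response token by token.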

🔸 2. Code Walkthrough

A LLaMA-based reward model implementation, explained step by step with the PyTorch source

📦 Imports

import torch
import torch.nn as nn
import torch.nn.functional as F

from transformers import LlamaForCausalLM

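torch: the core PyTorch package
nn: building blocks for neural network modules (such as Linear)
F: the functional interface (such as loss functions)
LlamaForCausalLM: the LLaMA causal language model class from Transformers, which supports autoregressive text generation and serves as the base class here

🧠 Model definition: the reward model class

class LlamaRewardModel(LlamaForCausalLM):
    def __init__(self, config):
        super().__init__(config)

        # Linear layer that maps a hidden state to a scalar, used as the final reward
        self.reward_head = nn.Linear(config.hidden_size, 1, bias=False)

LlamaRewardModel inherits from Hugging Face's LlamaForCausalLM.
It adds a reward_head linear layer that maps the model's output hidden state to a scalar reward value.

🧾 Scoring function for positive and negative examples: `_forward_rmloss`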

def _forward_rmloss(self, input_ids, attention_mask, **kargs):
    # Encode the [x, y] tokens with the LLaMA backbone
    output = self.model.forward(
        input_ids=input_ids,
        attention_mask=attention_mask,
        return_dict=True,
        use_cache=False
    )
    # Map every hidden state to a scalar: [batch, seq_len, 1] -> [batch, seq_len]
    logits = self.reward_head(output.last_hidden_state).squeeze(-1)
    return logits

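Input: the concatenated [x, y] token sequence
self.model.forward(...): runs the LLaMA backbone and returns its hidden states
self.reward_head(...): applies the linear map to the last layer's hidden states, giving one reward value per token
squeeze(-1): drops the trailing dimension of size 1, [batch, seq_len, 1] -> [batch, seq_len]

Note that the result holds one score per token position, not one per sequence. To obtain the single scalar r_θ(x, y) used in the objective, the score at the last non-padding token is typically selected; the snippet as shown leaves that step out.

squeeze(-1) removes the last dimension of a tensor, provided that dimension has size 1.
Suppose logits is a [3, 1] tensor:
logits = tensor([[0.73], [0.24], [0.91]]) # shape: [3, 1]
Running:
logits = logits.squeeze(-1)
gives:
tensor([0.73, 0.24, 0.91]) # shape: [3]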
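
The shapes involved can be checked with a small sketch. It applies a reward head to random stand-in hidden states and then picks the score at the last non-padding token of each sequence. The helper `last_token_reward` is a hypothetical addition (it is not part of the original class) and assumes right-padded batches with at least one real token per sequence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, hidden_size = 2, 3, 4

reward_head = nn.Linear(hidden_size, 1, bias=False)
# Random stand-in for output.last_hidden_state: [batch, seq_len, hidden_size]
hidden_states = torch.randn(batch, seq_len, hidden_size)

# One score per token: [batch, seq_len, 1] -> [batch, seq_len]
token_rewards = reward_head(hidden_states).squeeze(-1)
print(token_rewards.shape)  # torch.Size([2, 3])

def last_token_reward(token_rewards, attention_mask):
    """Select the score at the last real (non-padding) token of each sequence.

    token_rewards:  [batch, seq_len] per-token scores
    attention_mask: [batch, seq_len], 1 for real tokens and 0 for padding
    Assumes right padding and at least one real token per sequence.
    """
    last_idx = attention_mask.long().sum(dim=1) - 1                    # [batch]
    return token_rewards.gather(1, last_idx.unsqueeze(1)).squeeze(1)   # [batch]

# The first sequence has one padding token, the second has none
attention_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])
sequence_rewards = last_token_reward(token_rewards, attention_mask)
print(sequence_rewards.shape)  # torch.Size([2])
```

With one scalar per sequence, the difference between the positive and negative rewards is well defined even when the two sequences have different lengths.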

✍️ Imitation learning loss function: `_forward_lmloss`

def _forward_lmloss(self, prompt_ids, lm_attn_mask, response_ids):
    # Encode the [x, y+] tokens with the LLaMA backbone
    outputs = self.model.forward(
        input_ids=prompt_ids,
        attention_mask=lm_attn_mask,
        return_dict=True,
        use_cache=False,
    )
    hidden_states = outputs.last_hidden_state
    # Project hidden states to vocabulary logits: [batch, seq_len, vocab_size]
    logits = self.lm_head(hidden_states)
    loss_fct = nn.CrossEntropyLoss()
    # Flatten so that every position is one classification example
    logits = logits.view(-1, self.config.vocab_size)
    response_ids = response_ids.view(-1)
    loss = loss_fct(logits, response_ids)
    return loss

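prompt_ids holds the token IDs of the concatenated [x, y⁺] sequence
The logits have shape [batch_size, seq_len, vocab_size]
Cross-entropy loss: the prediction at each position is compared with the corresponding entry of response_ids. Because the code applies no shift, response_ids must already be aligned with the logits, so that entry t is the token that follows position t. nn.CrossEntropyLoss ignores targets equal to -100 by default, so positions that should not contribute (such as prompt tokens or padding) can be set to -100.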
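
One way to prepare targets with that alignment is sketched here. The helper `build_lm_targets` is hypothetical (the original code receives `response_ids` ready-made), and it assumes a single prompt length for the whole batch and a padding ID that never appears as a real response token.

```python
import math
import torch
import torch.nn as nn

def build_lm_targets(input_ids, prompt_len, pad_id):
    """Next-token targets aligned with the logits computed from `input_ids`.

    Position t is trained to predict token t + 1. Prompt positions, padding,
    and the final position (which has no next token) are set to -100 so that
    nn.CrossEntropyLoss skips them.
    """
    targets = torch.full_like(input_ids, -100)
    targets[:, :-1] = input_ids[:, 1:]        # shift left by one position
    targets[:, : prompt_len - 1] = -100       # do not train on prompt tokens
    targets[targets == pad_id] = -100         # do not train on padding
    return targets

vocab_size = 10
# Prompt = [5, 6], response = [7, 8], and 0 is the padding ID
input_ids = torch.tensor([[5, 6, 7, 8, 0]])
targets = build_lm_targets(input_ids, prompt_len=2, pad_id=0)
print(targets.tolist())  # [[-100, 7, 8, -100, -100]]

# All-zero logits stand in for the lm_head output (a uniform prediction)
logits = torch.zeros(1, 5, vocab_size)
loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), targets.view(-1))
print(loss.item(), math.log(vocab_size))
```

Only the two response tokens contribute to the loss. With uniform predictions each of them costs log(vocab_size), so the averaged loss equals that value as well.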

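🚀 Forward pass: combining the losses

def forward(self, sent1_idx, attention_mask_1, sent2_idx, attention_mask_2, labels, prompt_ids, lm_attn_mask, response_ids):

Parameters:

sent1_idx: the concatenated [x, y⁺] input (positive example)
attention_mask_1: attention mask for sent1_idx
sent2_idx: the concatenated [x, y⁻] input (negative example)
attention_mask_2: attention mask for sent2_idx
labels: targets for the contrastive loss. With logits = r⁺ - r⁻, the target must be 1 for every pair so that the loss equals -log σ(r⁺ - r⁻). An all-zero target would instead give -log σ(r⁻ - r⁺) and push the negative score up, so all-zero labels are only correct if the difference is taken as r⁻ - r⁺.
prompt_ids: token IDs of the [x, y⁺] sequence (input for the LM loss)
lm_attn_mask: attention mask for prompt_ids
response_ids: target tokens of the positive example (for the LM loss)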

Computing the contrastive loss (reward loss)

# Reward of the positive example
reward0 = self._forward_rmloss(sent1_idx, attention_mask_1)
# Reward of the negative example
reward1 = self._forward_rmloss(sent2_idx, attention_mask_2)
# Pairwise ranking loss on the reward difference
logits = reward0 - reward1
rm_loss = F.binary_cross_entropy_with_logits(logits, labels.to(logits.dtype), reduction="mean")

Compute r(x, y⁺) and r(x, y⁻) separately
Form logits = r⁺ - r⁻
Compute the reward loss with binary cross-entropy on these logits

With a target of 1 this matches the formula:
$-\log \sigma(r(x, y^+) - r(x, y^-))$
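
The relationship between the binary cross-entropy target and the ranking formula can be checked numerically. This sketch uses made-up reward values and compares `F.binary_cross_entropy_with_logits` against the two closed forms: a target of 1 corresponds to -log σ(r⁺ - r⁻), while a target of 0 corresponds to -log σ(r⁻ - r⁺).

```python
import torch
import torch.nn.functional as F

# Hypothetical per-sequence rewards for three preference pairs
reward_pos = torch.tensor([2.0, 0.5, -1.0])   # r(x, y+)
reward_neg = torch.tensor([0.0, 1.5, -1.0])   # r(x, y-)
logits = reward_pos - reward_neg

loss_target_1 = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
loss_target_0 = F.binary_cross_entropy_with_logits(logits, torch.zeros_like(logits))

# Closed forms, averaged over the batch like reduction="mean"
ranking_loss = -F.logsigmoid(logits).mean()     # -log sigma(r+ - r-)
reversed_loss = -F.logsigmoid(-logits).mean()   # -log sigma(r- - r+)

print(torch.allclose(loss_target_1, ranking_loss))
print(torch.allclose(loss_target_0, reversed_loss))
```

Both comparisons hold, so the choice of target decides which response the loss favors.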

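Computing the language modeling loss

lm_loss = self._forward_lmloss(prompt_ids, lm_attn_mask, response_ids)

This is the same as standard language model training and uses CrossEntropyLoss. Its default reduction averages over the non-ignored target tokens, rather than summing over them as in the formula.

Returning the total loss

loss = rm_loss + lm_loss
return loss

The two losses are added directly, which corresponds to β = 1. To weight the second term as in the objective, add a β parameter and use rm_loss + β * lm_loss.
The model is thus optimized jointly for scoring ability and text generation ability.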
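
To show the β-weighted joint objective end to end without loading LLaMA, the following toy sketch replaces the backbone with a small GRU. The class `TinyRewardLM`, the function `joint_loss`, and all sizes are hypothetical. The toy also trains the LM term on the whole chosen sequence and uses unpadded inputs, which is a simplification of the setup described in this article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardLM(nn.Module):
    """Toy stand-in for LlamaRewardModel: a GRU replaces the LLaMA backbone."""

    def __init__(self, vocab_size=32, hidden_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.backbone = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.reward_head = nn.Linear(hidden_size, 1, bias=False)

    def hidden(self, ids):
        out, _ = self.backbone(self.embed(ids))   # [batch, seq_len, hidden]
        return out

    def reward(self, ids):
        # Score at the final position (toy sequences carry no padding)
        return self.reward_head(self.hidden(ids)[:, -1]).squeeze(-1)   # [batch]

    def lm_loss(self, ids):
        # Position t predicts token t + 1, so drop the last logit and first token
        logits = self.lm_head(self.hidden(ids))[:, :-1]
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               ids[:, 1:].reshape(-1))

def joint_loss(model, chosen_ids, rejected_ids, beta):
    """Ranking loss plus beta times the LM loss on the chosen sequence."""
    margin = model.reward(chosen_ids) - model.reward(rejected_ids)
    rm_loss = -F.logsigmoid(margin).mean()
    return rm_loss + beta * model.lm_loss(chosen_ids)

torch.manual_seed(0)
model = TinyRewardLM()
chosen = torch.randint(0, 32, (4, 10))     # random token IDs, illustration only
rejected = torch.randint(0, 32, (4, 10))

loss = joint_loss(model, chosen, rejected, beta=0.5)
loss.backward()   # gradients reach both heads and the shared backbone
print(loss.item())
```

Setting `beta=1.0` reproduces the unweighted sum `rm_loss + lm_loss`, and `beta=0.0` trains the scoring head alone.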

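🔸 3. Summary

| Item | Description |
| --- | --- |
| Core idea | Learn the reward model `r_θ` while preserving generation fluency |
| Advantages | 1. Learns a preference score that can serve as the reward signal for reinforcement learning 2. Keeps the semantics and fluency of the language model |
| Use case | The reward model training stage of RLHF, such as `Step 2: Train Reward Model` in OpenAI's GPT training pipeline |
| Tunable parameter | `β` controls the trade-off between generation quality and preference scoring |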