RM-R1: Reward Modeling as Reasoning
Paper: [2505.02387] RM-R1: Reward Modeling as Reasoning, https://arxiv.org/abs/2505.02387
1. Overview
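Reward models (RMs) play a key role in the post-training of large language models (LLMs), particularly in reinforcement learning from human feedback (RLHF), where they serve as a scalable proxy for human evaluators. Existing work on reward modeling falls broadly into two categories: (1) scalar-based reward models (ScalarRM) and (2) generative reward models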