[Paper Digest] 2025 Week 19 (May 04-10) (Robotics/Embodied AI/LLM)
Table of Contents
- Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data
- Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
- Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
- Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
- Flow-GRPO: Training Flow Matching Models via Online RL
- On Path to Multimodal Generalist: General-Level and General-Bench
- Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
- RM-R1: Reward Modeling as Reasoning
- ZeroSearch: Incentivize the Search Capability of LLMs without Searching
- PixelHacker: Image Inpainting with Structural and Semantic Consistency
- Llama-Nemotron: Efficient Reasoning Models
- Practical Efficiency of Muon for Pretraining
- A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency
- Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
- HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
- RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
- FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models
- Improving Editability in Image Generation with Layer-wise Memory
- Generating Physically Stable and Buildable LEGO Designs from Text
- OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
- FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
- RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
- R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
- Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
- PrimitiveAnything: Human-Crafted 3D Primitive Assembly Generation with Auto-Regressive Transformer
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
- Title: Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
- Authors: Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang
- Date: 2025-05-08
- ArXiv page: https://arxiv.org/abs/2505.04921
- Paper link: https://arxiv.org/pdf/2505.04921
- Project link: https://github.com/HITsz-TMG/Awesome-Large-Multimodal-Reasoning-Models
- GitHub repo: https://github.com/HITsz-TMG/Awesome-Large-Multimodal-Reasoning-Models
Abstract
Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field’s shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
- Title: Absolute Zero: Reinforced Self-play Reasoning with Zero Data
- Authors: Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang
- Date: 2025-05-06
- ArXiv page: https://arxiv.org/abs/2505.03335
- Paper link: https://arxiv.org/pdf/2505.03335
- Project link: https://andrewzh112.github.io/absolute-zero-reasoner/
- GitHub repo: https://github.com/LeapLabTHU/Absolute-Zero-Reasoner
Abstract
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as an unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
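As a toy illustration of the Absolute Zero loop (propose a task, validate it with a code executor, reward the solver on the verified answer), here is a minimal sketch. The `propose` and `solve` stubs stand in for the single LLM's two roles; the task family and all names are hypothetical, and the paper additionally sandboxes execution and shapes the proposer's reward for learnability.

```python
import random

def execute(program: str, x: int):
    """Code executor: run the proposed program to obtain a verifiable ground truth."""
    env: dict = {}
    exec(program, env)  # toy setting; AZR runs proposed code in a sandboxed executor
    return env["f"](x)

def propose(rng) -> tuple[str, int]:
    """Stub proposer: emit a (program, input) task. In AZR the LLM plays this role."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return f"def f(x):\n    return {a} * x + {b}", rng.randint(0, 10)

def solve(program: str, x: int, rng) -> int:
    """Stub solver: predict the program's output. In AZR the same LLM plays this role."""
    return rng.randint(0, 100)  # placeholder policy

rng = random.Random(0)
for step in range(5):
    prog, x = propose(rng)       # task proposal
    target = execute(prog, x)    # executor validates the task and derives the answer
    reward = 1.0 if solve(prog, x, rng) == target else 0.0  # verifiable outcome reward
    print(f"step {step}: target={target} reward={reward}")
```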
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
- Title: Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
- Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
- Date: 2025-05-06
- ArXiv page: https://arxiv.org/abs/2505.03318
- Paper link: https://arxiv.org/pdf/2505.03318
- Project link: https://codegoat24.github.io/UnifiedReward/think
- GitHub repo: https://github.com/CodeGoat24/UnifiedReward
Abstract
Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model’s latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model’s cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model’s prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model’s reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model (3) while incorrect predicted samples are finally used for Group Relative Policy Optimization (GRPO) based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.
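A small sketch of the data routing implied by stages (2)-(3): sampled CoT judgments whose final verdict matches the preference label go to the rejection-sampling fine-tuning set, while the rest feed the GRPO pool. The record fields below are hypothetical.

```python
def route_samples(samples):
    """Stage (2)/(3) routing: CoT judgments whose verdict matches the preference
    label are retained for rejection-sampling fine-tuning; the rest feed GRPO."""
    sft_set, grpo_pool = [], []
    for s in samples:  # each s: {"cot": ..., "verdict": ..., "label": ...}
        (sft_set if s["verdict"] == s["label"] else grpo_pool).append(s)
    return sft_set, grpo_pool

samples = [
    {"cot": "A follows the prompt more closely ...", "verdict": "A", "label": "A"},
    {"cot": "B has fewer artifacts ...", "verdict": "B", "label": "A"},
]
sft, grpo = route_samples(samples)
print(len(sft), len(grpo))  # 1 1
```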
Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
- Title: Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
- Authors: Roman Abramov, Felix Steinbauer, Gjergji Kasneci
- Date: 2025-04-29
- ArXiv page: https://arxiv.org/abs/2504.20752
- Paper link: https://arxiv.org/pdf/2504.20752
Abstract
Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns - yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio phi_r of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA - substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing phi_r drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.
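To make the key quantity concrete: phi_r is the ratio of inferred (multi-hop) facts to atomic facts in the training graph, and augmentation adds synthetic atomic facts whose composition yields disproportionately many inferred facts. A toy 2-hop computation, with made-up entities and relations:

```python
def phi_r(atomic: set, two_hop=(("born_in", "located_in"),)) -> float:
    """phi_r = (# inferred multi-hop facts) / (# atomic facts); here only 2-hop
    composition of the listed relation pairs is counted."""
    inferred = set()
    for r1, r2 in two_hop:
        second = {(h, t) for (h, r, t) in atomic if r == r2}
        for (h, r, m) in atomic:
            if r == r1:
                inferred |= {(h, "compose", t) for (m2, t) in second if m2 == m}
    return len(inferred) / len(atomic)

# sparse, realistic-looking graph -> low phi_r
kg = {("alice", "born_in", "lyon"), ("lyon", "located_in", "france")}
print(round(phi_r(kg), 3))  # 0.5

# synthetic augmentation: inferred facts grow faster than atomic facts
kg |= {(f"person_{i}", "born_in", "lyon") for i in range(20)}
print(round(phi_r(kg), 3))  # ~0.955
```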
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
- Title: Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
- Authors: Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu
- Date: 2025-05-05
- ArXiv page: https://arxiv.org/abs/2505.02707
- Paper link: https://arxiv.org/pdf/2505.02707
- Project link: https://voila.maitrix.org
- GitHub repo: https://github.com/maitrix-org/Voila
Abstract
A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that make a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation – where users can simply write text instructions to define the speaker’s identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.
Flow-GRPO: Training Flow Matching Models via Online RL
- Title: Flow-GRPO: Training Flow Matching Models via Online RL
- Authors: Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang
- Date: 2025-05-08
- ArXiv page: https://arxiv.org/abs/2505.05470
- Paper link: https://arxiv.org/pdf/2505.05470
- GitHub repo: https://github.com/leffff/Diffusion-Reward-Modeling-for-Text-Rendering
Abstract
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model’s marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy improves from 59% to 92%, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, little to no reward hacking occurred, meaning rewards did not increase at the cost of image quality or diversity, and both remained stable in our experiments.
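Flow-GRPO inherits GRPO's critic-free advantage estimate: each prompt's group of sampled images is scored by the reward model, and advantages are computed relative to the group. A minimal sketch of that computation (reward values hypothetical):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style critic-free advantages: standardize each reward within the
    group of samples generated for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# hypothetical 0/1 GenEval-style rewards for 8 images sampled from one prompt
rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0])
print(group_relative_advantages(rewards).round(3))
```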
On Path to Multimodal Generalist: General-Level and General-Bench
- Title: On Path to Multimodal Generalist: General-Level and General-Bench
- Authors: Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Weiming Wu, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Tianjie Ju, Zixiang Meng, Shilin Xu, Liyu Jia, Wentao Hu, Meng Luo, Jiebo Luo, Tat-Seng Chua, Shuicheng Yan, Hanwang Zhang
- Date: 2025-05-07
- ArXiv page: https://arxiv.org/abs/2505.04620
- Paper link: https://arxiv.org/pdf/2505.04620
- Project link: https://generalist.top/
- GitHub repo: https://github.com/path2generalist/General-Level
Abstract
The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting limited modalities to arbitrary ones. While many benchmarks exist to assess MLLMs, a critical question arises: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. This project introduces General-Level, an evaluation framework that defines 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation, and across multiple modalities. To support this evaluation, we present General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. The evaluation results that involve over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project page: https://generalist.top/
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
- Title: Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
- Authors: Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
- Date: 2025-05-05
- ArXiv page: https://arxiv.org/abs/2505.02567
- Paper link: https://arxiv.org/pdf/2505.02567
- GitHub repo: https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models
Abstract
Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o’s new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey are available on GitHub (https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models).
RM-R1: Reward Modeling as Reasoning
- Title: RM-R1: Reward Modeling as Reasoning
- Authors: Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji
- Date: 2025-05-05
- ArXiv page: https://arxiv.org/abs/2505.02387
- Paper link: https://arxiv.org/pdf/2505.02387
- GitHub repo: https://github.com/RM-R1-UIUC/RM-R1
Abstract
Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly generate the prediction of a preferred answer, making them struggle to integrate natural language critiques, thus lacking interpretability. Inspired by recent advances of long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM’s interpretability and performance. In this work, we introduce a new class of generative reward models – Reasoning Reward Models (ReasRMs) – which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance of generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.
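An illustrative sketch of the ReasRM judging format: the model is prompted to first produce a rubric and reasoning trace and only then a verdict. The template wording below is invented for this sketch, not the paper's actual prompt:

```python
def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Illustrative ReasRM-style prompt: ask for a rubric and step-by-step
    reasoning before the final verdict token."""
    return (
        "You are a reward model. First write an evaluation rubric for this "
        "question, then reason step by step about each candidate against the "
        "rubric, and finally output your verdict as [[A]] or [[B]].\n\n"
        f"Question: {question}\n\n"
        f"Candidate A: {answer_a}\n\n"
        f"Candidate B: {answer_b}"
    )

print(build_judge_prompt("What is 2 + 2?", "4", "5"))
```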
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
- Title: ZeroSearch: Incentivize the Search Capability of LLMs without Searching
- Authors: Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Fei Huang, Yan Zhang
- Date: 2025-05-07
- ArXiv page: https://arxiv.org/abs/2505.04588
- Paper link: https://arxiv.org/pdf/2505.04588
- Project link: https://alibaba-nlp.github.io/ZeroSearch/
- GitHub repo: https://github.com/Alibaba-nlp/ZeroSearch
Abstract
Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs’ search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a reinforcement learning framework that incentivizes the search capabilities of LLMs without interacting with real search engines. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both relevant and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model’s reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
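A toy sketch of the curriculum-based rollout: a simulated search module returns noisy documents with a probability that grows over training, so retrieval gets progressively harder. The schedule shape, endpoints, and document strings are all assumptions:

```python
import random

def noisy_doc_prob(step: int, total_steps: int, p_start=0.0, p_end=0.8) -> float:
    """Curriculum schedule (illustrative): the share of noisy documents returned
    by the simulation LLM grows as training proceeds."""
    return p_start + (p_end - p_start) * step / max(total_steps - 1, 1)

def simulated_search(query: str, step: int, total_steps: int, rng) -> str:
    """Stand-in for the fine-tuned simulation LLM: emits a useful or a noisy
    document according to the curriculum."""
    if rng.random() < noisy_doc_prob(step, total_steps):
        return f"[noisy doc] unrelated text for '{query}'"
    return f"[useful doc] evidence answering '{query}'"

rng = random.Random(0)
for step in (0, 500, 999):
    print(step, simulated_search("capital of France", step, 1000, rng))
```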
PixelHacker: Image Inpainting with Structural and Semantic Consistency
- Title: PixelHacker: Image Inpainting with Structural and Semantic Consistency
- Authors: Ziyang Xu, Kangsheng Duan, Xiaolei Shen, Zhifeng Ding, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang
- Date: 2025-04-29
- ArXiv page: https://arxiv.org/abs/2504.20438
- Paper link: https://arxiv.org/pdf/2504.20438
- Project link: https://hustvl.github.io/PixelHacker
- GitHub repo: https://github.com/hustvl/PixelHacker
Abstract
Image inpainting is a fundamental research area between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demonstrating impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, and logical correctness), leading to artifacts and inappropriate generation. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance, and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset containing 14 million image-mask pairs by annotating foreground and background (potential 116 and 21 categories, respectively). Then, we encode potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA on a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics. Project page at https://hustvl.github.io/PixelHacker.
Llama-Nemotron: Efficient Reasoning Models
- Title: Llama-Nemotron: Efficient Reasoning Models
- Authors: Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Zijia Chen, Zhilin Wang, David Mosallanezhad, Adi Renduchintala, Haifeng Qian, Dima Rekesh, Fei Jia, Somshubra Majumdar, Vahid Noroozi, Wasi Uddin Ahmad, Sean Narenthiran, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Igor Gitman, Ivan Moshkov, Wei Du, Shubham Toshniwal, George Armstrong, Branislav Kisacanin, Matvei Novikov, Daria Gitman, Evelina Bakhturina, Jane Polak Scowcroft, John Kamalu, Dan Su, Kezhi Kong, Markus Kliegl, Rabeeh Karimi, Ying Lin, Sanjeev Satheesh, Jupinder Parmar, Pritam Gundecha, Brandon Norick, Joseph Jennings, Shrimai Prabhumoye, Syeda Nahida Akter, Mostofa Patwary, Abhinav Khattar, Deepak Narayanan, Roger Waleffe, Jimmy Zhang, Bor-Yiing Su, Guyue Huang, Terry Kong, Parth Chadha, Sahil Jain, Christine Harvey, Elad Segal, Jining Huang, Sergey Kashirsky, Robert McQueen, Izzy Putterman, George Lam, Arun Venkatesan, Sherry Wu, Vinh Nguyen, Manoj Kilaru, Andrew Wang, Anna Warno, Abhilash Somasamudramath, Sandip Bhaskar, Maka Dong, Nave Assaf, Shahar Mor, Omer Ullman Argov, Scot Junkin, Oleksandr Romanenko, Pedro Larroy, Monika Katariya, Marco Rovinelli, Viji Balas, Nicholas Edelman, Anahita Bhiwandiwalla, Muthu Subramaniam, Smita Ithape, Karthik Ramamoorthy, Yuting Wu, Suguna Varshini Velury, Omri Almog, Joyjit Daw, Denys Fridman, Erick Galinkin, Michael Evans, Katherine Luna, Leon Derczynski, Nikki Pope, Eileen Long, Seth Schneider, Guillermo Siman, Tomasz Grzegorzek, Pablo Ribalta, Monika Katariya, Joey Conway, Trisha Saar, Ann Guan, Krzysztof Pawelec, Shyamala Prayaga, Oleksii Kuchaiev, Boris Ginsburg, Oluwatobi Olabiyi, Kari Briski, Jonathan Cohen, Bryan Catanzaro, Jonah Alben, Yonatan Geifman, Eric Chung
- Date: 2025-05-02
- ArXiv page: https://arxiv.org/abs/2505.00949
- Paper link: https://arxiv.org/pdf/2505.00949
Abstract
We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. The family comes in three sizes – Nano (8B), Super (49B), and Ultra (253B) – and performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency. In this report, we discuss the training procedure for these models, which entails using neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, and continued pretraining, followed by a reasoning-focused post-training stage consisting of two main parts: supervised fine-tuning and large scale reinforcement learning. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference. To further support open research and facilitate model development, we provide the following resources: 1. We release the Llama-Nemotron reasoning models – LN-Nano, LN-Super, and LN-Ultra – under the commercially permissive NVIDIA Open Model License Agreement. 2. We release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset. 3. We also release our training codebases: NeMo, NeMo-Aligner, and Megatron-LM.
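A sketch of using the dynamic reasoning toggle, assuming (per the public model cards) that it is driven by the system prompt strings "detailed thinking on" / "detailed thinking off"; the helper below is hypothetical:

```python
def make_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    """Build a chat request with the LN reasoning toggle in the system prompt.
    The exact toggle strings are an assumption taken from the model cards."""
    mode = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": mode},
        {"role": "user", "content": user_prompt},
    ]

print(make_messages("Prove that sqrt(2) is irrational.", reasoning=True))
```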
Practical Efficiency of Muon for Pretraining
- Title: Practical Efficiency of Muon for Pretraining
- Authors: Essential AI, Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani
- Date: 2025-05-04
- ArXiv page: https://arxiv.org/abs/2505.02222
- Paper link: https://arxiv.org/pdf/2505.02222
- Project link: https://www.essential.ai/blog/optimizer
Abstract
We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
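The abstract does not restate Muon itself; for context, a sketch of the core update as found in public Muon implementations: take the momentum of the gradient, approximately orthogonalize it with a Newton-Schulz iteration, and apply a shape-scaled step. Hyperparameter values here are illustrative:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration
    used in public Muon implementations (coefficients from those codebases)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon update for a 2D weight: momentum, orthogonalize, shape-scaled step."""
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz(momentum_buf)
    update = update * max(1.0, weight.shape[0] / weight.shape[1]) ** 0.5
    weight.add_(update, alpha=-lr)

W = torch.randn(256, 128)
buf = torch.zeros_like(W)
muon_step(W, torch.randn_like(W), buf)
print(W.shape)
```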
A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency
- Title: A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency
- Authors: Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, Jemin Lee
- Date: 2025-05-03
- ArXiv page: https://arxiv.org/abs/2505.01658
- Paper link: https://arxiv.org/pdf/2505.01658
- GitHub repo: https://github.com/sihyeong/Awesome-LLM-Inference-Engine
Abstract
Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating the optimization methods into service-oriented infrastructures. However, a systematic study on inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open source inference engines and handle the performance and cost policy of commercial solutions. We outline future research directions that include support for complex LLM-based services, support of various hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/sihyeong/Awesome-LLM-Inference-Engine
Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
- Title: Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
- Authors: Joykirat Singh, Raghav Magazine, Yash Pandya, Akshay Nambi
- Date: 2025-04-28
- ArXiv page: https://arxiv.org/abs/2505.01441
- Paper link: https://arxiv.org/pdf/2505.01441
- Project link: https://www.microsoft.com/en-us/research/people/akshayn/unlocking-agentic-reasoning-in-llms/
Abstract
Large language models (LLMs) have achieved remarkable progress in complex reasoning tasks, yet they remain fundamentally limited by their reliance on static internal knowledge and text-only reasoning. Real-world problem solving often demands dynamic, multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. In this work, we introduce ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers), a unified framework that tightly couples agentic reasoning, reinforcement learning, and tool integration for LLMs. ARTIST enables models to autonomously decide when, how, and which tools to invoke within multi-turn reasoning chains, leveraging outcome-based RL to learn robust strategies for tool use and environment interaction without requiring step-level supervision. Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies and metric analyses reveal that agentic RL training leads to deeper reasoning, more effective tool use, and higher-quality solutions. Our results establish agentic RL with tool integration as a powerful new frontier for robust, interpretable, and generalizable problem-solving in LLMs.
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
- Title: HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
- Authors: Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, Qinglin Lu
- Date: 2025-05-07
- ArXiv page: https://arxiv.org/abs/2505.04512
- Paper link: https://arxiv.org/pdf/2505.04512
- Project link: https://hunyuancustom.github.io/
- GitHub repo: https://github.com/Tencent/HunyuanCustom
Abstract
Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All the code and models are available at https://hunyuancustom.github.io.
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
- Title: RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
- Authors: Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah
- Date: 2025-05-05
- ArXiv page: https://arxiv.org/abs/2505.03005
- Paper link: https://arxiv.org/pdf/2505.03005
Abstract
We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today’s prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper
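RADLADS' full recipe is multi-stage; the sketch below shows only the basic ingredient of attention-to-linear-attention conversion: a per-layer loss pulling the student sequence mixer's outputs toward the frozen teacher's softmax-attention outputs on the same inputs. The shapes and the plain MSE choice are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def alignment_loss(teacher_attn_out: torch.Tensor,
                   student_mixer_out: torch.Tensor) -> torch.Tensor:
    """Per-layer matching: push the student's linear-attention (RWKV-style)
    mixer output toward the frozen teacher attention output on the same inputs."""
    return F.mse_loss(student_mixer_out, teacher_attn_out)

# toy shapes: (batch, seq_len, d_model)
teacher = torch.randn(2, 16, 64)
student = torch.randn(2, 16, 64, requires_grad=True)
loss = alignment_loss(teacher, student)
loss.backward()
print(float(loss))
```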
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models
- Title: FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models
- Authors: Zhouliang Yu, Ruotian Peng, Keyi Ding, Yizhe Li, Zhongyuan Peng, Minghao Liu, Yifan Zhang, Zheng Yuan, Huajian Xin, Wenhao Huang, Yandong Wen, Ge Zhang, Weiyang Liu
- Date: 2025-05-05
- ArXiv page: https://arxiv.org/abs/2505.02735
- Paper link: https://arxiv.org/pdf/2505.02735
- Project link: https://sphere-ai-lab.github.io/FormalMATH/
- GitHub repo: https://github.com/Sphere-AI-Lab/FormalMATH-Bench
Abstract
Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in the formal reasoning settings. We believe that FormalMATH provides a robust benchmark for benchmarking formal mathematical reasoning.
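A sketch of the negation-based disproof filter: a candidate autoformalization is discarded if an off-the-shelf prover can prove its negation, since a provable negation means the statement is false and hence mis-formalized. The prover stub and statements below are toys:

```python
def try_prove(statement: str) -> bool:
    """Stub for an off-the-shelf prover call (e.g. a Lean tactic search);
    this toy version can only refute '0 < 0'."""
    return statement == "¬ (0 < 0)"

def negation_disproof_filter(candidates: list[str]) -> list[str]:
    """Keep a candidate autoformalization only if its negation is NOT provable."""
    return [s for s in candidates if not try_prove(f"¬ ({s})")]

print(negation_disproof_filter(["0 < 0", "0 < 1"]))  # ['0 < 1']
```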
Improving Editability in Image Generation with Layer-wise Memory
- Title: Improving Editability in Image Generation with Layer-wise Memory
- Authors: Daneul Kim, Jaeah Lee, Jaesik Park
- Date: 2025-05-02
- ArXiv page: https://arxiv.org/abs/2505.01079
- Paper link: https://arxiv.org/pdf/2505.01079
- Project link: https://carpedkm.github.io/projects/improving_edit/index.html
- GitHub repo: https://github.com/carpedkm/improving-editability
Abstract
Most real-world image editing tasks require multiple sequential edits to achieve desired results. Current editing approaches, primarily designed for single-object modifications, struggle with sequential editing: especially with maintaining previous edits along with adapting new objects naturally into the existing content. These limitations significantly hinder complex editing scenarios where multiple objects need to be modified while preserving their contextual relationships. We address this fundamental challenge through two key proposals: enabling rough mask inputs that preserve existing content while naturally integrating new elements and supporting consistent editing across multiple modifications. Our framework achieves this through layer-wise memory, which stores latent representations and prompt embeddings from previous edits. We propose Background Consistency Guidance that leverages memorized latents to maintain scene coherence and Multi-Query Disentanglement in cross-attention that ensures natural adaptation to existing content. To evaluate our method, we present a new benchmark dataset incorporating semantic alignment metrics and interactive editing scenarios. Through comprehensive experiments, we demonstrate superior performance in iterative image editing tasks with minimal user effort, requiring only rough masks while maintaining high-quality results throughout multiple editing steps.
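An illustrative data structure for the layer-wise memory idea: each sequential edit appends its latents, prompt embeddings, and rough mask so later denoising steps can re-read them for background consistency guidance and multi-query disentanglement. All field and method names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EditRecord:
    latent: Any            # latent representation after this edit
    prompt_embedding: Any  # prompt embedding that conditioned this edit
    mask: Any              # rough mask marking the newly added object

@dataclass
class LayerWiseMemory:
    """Each sequential edit appends a record; later steps re-read the stored
    latents/embeddings to keep earlier edits intact."""
    records: list = field(default_factory=list)

    def push(self, latent, prompt_embedding, mask):
        self.records.append(EditRecord(latent, prompt_embedding, mask))

    def background_latents(self) -> list:
        return [r.latent for r in self.records]

memory = LayerWiseMemory()
memory.push("z0", "e0", "m0")
memory.push("z1", "e1", "m1")
print(len(memory.background_latents()))  # 2
```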
Generating Physically Stable and Buildable LEGO Designs from Text
- Title: Generating Physically Stable and Buildable LEGO Designs from Text
- Authors: Ava Pun, Kangle Deng, Ruixuan Liu, Deva Ramanan, Changliu Liu, Jun-Yan Zhu
- Date: 2025-05-08
- ArXiv page: https://arxiv.org/abs/2505.05469
- Paper link: https://arxiv.org/pdf/2505.05469
- Project link: https://avalovelace1.github.io/LegoGPT/
Abstract
We introduce LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of LEGO designs, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during autoregressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts. We also develop a text-based LEGO texturing method to generate colored and textured designs. We show that our designs can be assembled manually by humans and automatically by robotic arms. We also release our new dataset, StableText2Lego, containing over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models at the project website: https://avalovelace1.github.io/LegoGPT/.
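A toy sketch of validity-checked autoregressive decoding with rollback: proposed brick tokens that fail a feasibility check (here a crude stand-in for collision and support constraints, not the paper's physics analysis) are rejected and resampled, and persistent failure rolls back the most recent brick:

```python
import random

def is_feasible(structure: list[tuple], brick: tuple) -> bool:
    """Stand-in feasibility check: the cell must be free and the brick must be
    on the ground or rest on another brick."""
    x, y, z = brick
    occupied = set(structure)
    return brick not in occupied and (z == 0 or (x, y, z - 1) in occupied)

def decode_with_rollback(propose, max_bricks=10, max_rejects=20):
    """Autoregressive decoding with validity pruning and rollback: infeasible
    brick tokens are rejected and resampled; repeated failure pops a brick."""
    structure, rejects = [], 0
    while len(structure) < max_bricks and rejects < max_rejects:
        brick = propose(structure)
        if is_feasible(structure, brick):
            structure.append(brick)
        else:
            rejects += 1
            if rejects % 5 == 0 and structure:
                structure.pop()  # roll back the most recent brick
    return structure

rng = random.Random(0)

def propose(structure):
    # stand-in for sampling the next brick token from the LLM
    return (rng.randint(0, 2), rng.randint(0, 2), rng.randint(0, 2))

print(decode_with_rollback(propose))
```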
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
- Title: OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
- Authors: Xianhang Li, Yanqing Liu, Haoqin Tu, Hongru Zhu, Cihang Xie
- Date: 2025-05-07
- ArXiv page: https://arxiv.org/abs/2505.04601
- Paper link: https://arxiv.org/pdf/2505.04601
- Project link: https://ucsc-vlaa.github.io/OpenVision/
- GitHub repo: https://github.com/UCSC-VLAA/OpenVision
Abstract
OpenAI’s CLIP, released in early 2021, have long been the go-to choice of vision encoder for building multimodal foundation models. Although recent alternatives such as SigLIP have begun to challenge this status quo, to our knowledge none are fully open: their training data remains proprietary and/or their training recipes are not released. This paper fills this gap with OpenVision, a fully-open, cost-effective family of vision encoders that match or surpass the performance of OpenAI’s CLIP when integrated into multimodal frameworks like LLaVA. OpenVision builds on existing works – e.g., CLIPS for training framework and Recap-DataComp-1B for training data – while revealing multiple key insights in enhancing encoder quality and showcasing practical benefits in advancing multimodal models. By releasing vision encoders spanning from 5.9M to 632.1M parameters, OpenVision offers practitioners a flexible trade-off between capacity and efficiency in building multimodal models: larger models deliver enhanced multimodal performance, while smaller versions enable lightweight, edge-ready multimodal deployments.
FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
- Title: FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
- Authors: Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang
- Date: 2025-05-06
- ArXiv page: https://arxiv.org/abs/2505.03730
- Paper link: https://arxiv.org/pdf/2505.03730
- Project link: https://shiyi-zh0408.github.io/projectpages/FlexiAct/
- GitHub repo: https://github.com/shiyi-zh0408/FlexiAct
Abstract
Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, based on our observations, the denoising process exhibits varying levels of attention to motion (low frequency) and appearance details (high frequency) at different timesteps. So we propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, directly achieves action extraction during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at https://shiyi-zh0408.github.io/projectpages/FlexiAct/
RetroInfer:面向可扩展长上下文 LLM 推理的向量存储方法
- 标题: RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
- 作者: Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Jing Liu, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang
- 日期: 2025-05-05
- ArXiv主页: https://arxiv.org/abs/2505.02922
- 论文链接: https://arxiv.org/pdf/2505.02922
- 项目链接: https://baotonglu.github.io/RetroInfer_page/RetroInfer.html
- gitHub仓库: https://github.com/microsoft/RetrievalAttention
英文摘要
The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints. We present RetroInfer, a novel system that reconceptualizes the key-value (KV) cache as a vector storage system which exploits the inherent attention sparsity to accelerate long-context LLM inference. At its core is the wave index, an Attention-aWare VEctor index that enables efficient and accurate retrieval of critical tokens through techniques such as tripartite attention approximation, accuracy-bounded attention estimation, and segmented clustering. Complementing this is the wave buffer, which coordinates KV cache placement and overlaps computation and data transfer across GPU and CPU to sustain high throughput. Unlike prior sparsity-based methods that struggle with token selection and hardware coordination, RetroInfer delivers robust performance without compromising model accuracy. Experiments on long-context benchmarks show up to 4.5X speedup over full attention within GPU memory limits and up to 10.5X over sparse attention baselines when KV cache is extended to CPU memory, all while preserving full-attention-level accuracy.
中文摘要
大型语言模型(LLM)不断增长的上下文长度给高效推理带来了重大挑战,这主要源于 GPU 显存和带宽的限制。我们提出 RetroInfer,一个将键值(KV)缓存重新概念化为向量存储系统的新型系统,它利用注意力固有的稀疏性来加速长上下文 LLM 推理。其核心是 wave index,一种注意力感知的向量索引(Attention-aWare VEctor index),通过三方注意力近似、精度有界的注意力估计和分段聚类等技术,实现对关键 token 的高效且准确的检索。与之配合的是 wave buffer,它协调 KV 缓存的放置,并使 GPU 与 CPU 之间的计算和数据传输相互重叠,以维持高吞吐量。与以往在 token 选择和硬件协同上存在困难的稀疏方法不同,RetroInfer 在不牺牲模型精度的前提下提供了稳健的性能。在长上下文基准上的实验表明,在 GPU 显存限制内,相比全量注意力最高可加速 4.5 倍;当 KV 缓存扩展到 CPU 内存时,相比稀疏注意力基线最高可加速 10.5 倍,同时保持与全量注意力相当的精度。
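wave index 的完整实现涉及三方注意力近似、精度有界的注意力估计和分段聚类;下面只给出其核心思想的极简示意——把 KV 缓存当作向量库,对每个 query 仅检索最关键的少量 token 参与注意力计算。示例用暴力 top-k 代替真实系统中的近似索引,函数名为示例假设。

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, K, V, k=64):
    """利用注意力稀疏性的检索式稀疏注意力示意。

    q: [dim];K, V: [seq, dim]。只在与 q 最相关的 k 个 KV 上
    计算注意力,以近似全量注意力的输出。
    """
    scores = K @ q                        # 与全部 key 的相似度(演示用暴力扫描)
    k = min(k, K.size(0))
    idx = torch.topk(scores, k).indices   # 选出最关键的 k 个 token
    attn = F.softmax(K[idx] @ q / K.size(-1) ** 0.5, dim=-1)
    return attn @ V[idx]                  # 仅聚合被检索到的 value
```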
R1-Reward:通过稳定的强化学习训练多模态奖励模型
- 标题: R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
- 作者: Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang
- 日期: 2025-05-05
- ArXiv主页: https://arxiv.org/abs/2505.02835
- 论文链接: https://arxiv.org/pdf/2505.02835
- 项目链接: https://github.com/yfzhang114/r1_reward
- gitHub仓库: https://github.com/yfzhang114/r1_reward
英文摘要
Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an 8.4% improvement on the VL Reward-Bench and a 14.3% improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward’s performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.
中文摘要
多模态奖励模型(MRM)在提升多模态大语言模型(MLLM)性能方面起着至关重要的作用。尽管近期进展主要集中在改进 MRM 的模型结构和训练数据上,但对于长程推理能力在奖励建模中的有效性、以及如何在 MRM 中激活这些能力,探索仍然有限。在本文中,我们探讨了如何利用强化学习(RL)来改进奖励建模。具体而言,我们将奖励建模问题重新表述为基于规则的 RL 任务。然而,我们观察到,直接将现有 RL 算法(如 Reinforce++)应用于奖励建模,往往会因这些算法的固有局限而导致训练不稳定甚至崩溃。为解决这一问题,我们提出 StableReinforce 算法,对现有 RL 方法的训练损失、优势估计策略和奖励设计进行了改进。这些改进带来了更稳定的训练动态和更优的性能。为支持 MRM 训练,我们从多个数据集中收集了 20 万条偏好数据。我们的奖励模型 R1-Reward 在该数据集上使用 StableReinforce 算法训练,在多模态奖励建模基准上显著提升了性能。与此前的 SOTA 模型相比,R1-Reward 在 VL Reward-Bench 上提升了 8.4%,在 Multimodal Reward Bench 上提升了 14.3%。此外,增加推理计算量可进一步提升 R1-Reward 的性能,凸显了 RL 算法在优化 MRM 方面的潜力。
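StableReinforce 的具体损失与优势估计公式见原文。作为背景,下面用一个 PPO 风格的裁剪策略梯度草图,示意摘要所述「改进训练损失与优势估计以稳定训练」的两类常见手段(比率裁剪、优势截断);这并非 StableReinforce 本身,函数名与超参数均为示例假设。

```python
import torch

def clipped_pg_loss(logp_new, logp_old, advantages, clip_eps=0.2, adv_clip=3.0):
    """PPO 风格的裁剪策略梯度损失示意(非 StableReinforce 原始公式)。

    logp_new / logp_old: 新旧策略下动作的对数概率;advantages: 优势估计。
    """
    advantages = advantages.clamp(-adv_clip, adv_clip)  # 截断极端优势,抑制离群样本
    ratio = torch.exp(logp_new - logp_old)              # 重要性采样比率
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()        # 取较保守一侧,稳定更新
```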
Sentient Agent as a Judge:评估大语言模型中的高阶社会认知
- 标题: Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
- 作者: Bang Zhang, Ruotian Ma, Qingxuan Jiang, Peisong Wang, Jiaqi Chen, Zheng Xie, Xingyu Chen, Yue Wang, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
- 日期: 2025-05-01
- ArXiv主页: https://arxiv.org/abs/2505.02847
- 论文链接: https://arxiv.org/pdf/2505.02847
英文摘要
Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge. To bridge the gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM’s higher-order social cognition. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4x) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
中文摘要
评估大型语言模型(LLM)对人的理解程度(而不仅仅是对文本的理解)仍是一个开放的挑战。为弥合这一差距,我们提出 Sentient Agent as a Judge(SAGE),一个衡量 LLM 高阶社会认知的自动化评估框架。SAGE 实例化一个有感知代理(Sentient Agent),在交互过程中模拟类人的情感变化和内心想法,从而在多轮对话中对被测模型进行更贴近真实的评估。在每一轮中,该代理都会推理:(i)其情绪如何变化,(ii)当前感受如何,以及(iii)应当如何回复,从而产生数值化的情绪轨迹和可解释的内心想法。在 100 个支持性对话场景上的实验表明,最终的情绪得分与 Barrett-Lennard 关系量表(BLRI)评分及话语级共情指标高度相关,验证了其心理保真度。我们还构建了一个涵盖 18 个商业和开源模型的公开 Sentient 排行榜,揭示了前沿系统(GPT-4o-Latest、Gemini2.5-Pro)与较早基线之间的巨大差距(最高达 4 倍),而这种差距并未体现在传统排行榜(如 Arena)中。因此,SAGE 为追踪语言代理朝着真正具备共情能力和社交能力方向的进展,提供了一个有原则、可扩展且可解释的工具。
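SAGE 的评估循环可概括为:有感知代理每轮先推理情绪变化与内心想法并给出数值情绪分,再生成回复,由被测模型作答,如此往复得到一条情绪轨迹。下面是一个极简流程示意,judge_llm、tested_llm 等接口均为假设的占位函数,并非论文发布的 API。

```python
def run_sage_episode(judge_llm, tested_llm, scenario, max_turns=5):
    """SAGE 多轮评估流程的示意(接口均为假设)。"""
    history, emotions = [], []
    for _ in range(max_turns):
        # 代理推理:(i) 情绪如何变化 (ii) 当前感受 (iii) 应如何回复
        agent = judge_llm(scenario=scenario, history=history)
        emotions.append(agent["emotion_score"])  # 记录数值情绪轨迹
        history.append(("agent", agent["reply"]))

        reply = tested_llm(history=history)      # 被测模型的回应
        history.append(("model", reply))
    return emotions  # 例如以最终情绪分作为该场景的评估结果
```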
PrimitiveAnything:基于自回归 Transformer 的人工构建 3D 基元装配生成
- 标题: PrimitiveAnything: Human-Crafted 3D Primitive Assembly Generation with Auto-Regressive Transformer
- 作者: Jingwen Ye, Yuze He, Yanning Zhou, Yiqin Zhu, Kaiwen Xiao, Yong-Jin Liu, Wei Yang, Xiao Han
- 日期: 2025-05-07
- ArXiv主页: https://arxiv.org/abs/2505.04622
- 论文链接: https://arxiv.org/pdf/2505.04622
- 项目链接: https://primitiveanything.github.io/
- gitHub仓库: https://github.com/PrimitiveAnything/PrimitiveAnything
英文摘要
Shape primitive abstraction, which decomposes complex 3D shapes into simple geometric elements, plays a crucial role in human visual cognition and has broad applications in computer vision and graphics. While recent advances in 3D content generation have shown remarkable progress, existing primitive abstraction methods either rely on geometric optimization with limited semantic understanding or learn from small-scale, category-specific datasets, struggling to generalize across diverse shape categories. We present PrimitiveAnything, a novel framework that reformulates shape primitive abstraction as a primitive assembly generation task. PrimitiveAnything includes a shape-conditioned primitive transformer for auto-regressive generation and an ambiguity-free parameterization scheme to represent multiple types of primitives in a unified manner. The proposed framework directly learns the process of primitive assembly from large-scale human-crafted abstractions, enabling it to capture how humans decompose complex shapes into primitive elements. Through extensive experiments, we demonstrate that PrimitiveAnything can generate high-quality primitive assemblies that better align with human perception while maintaining geometric fidelity across diverse shape categories. It benefits various 3D applications and shows potential for enabling primitive-based user-generated content (UGC) in games. Project page: https://primitiveanything.github.io
中文摘要
形状基元抽象(shape primitive abstraction)将复杂的 3D 形状分解为简单的几何元素,在人类视觉认知中起着至关重要的作用,并在计算机视觉和图形学中有广泛应用。尽管 3D 内容生成近期取得了显著进展,但现有的基元抽象方法要么依赖语义理解有限的几何优化,要么从小规模、特定类别的数据集中学习,难以在多样的形状类别间泛化。我们提出 PrimitiveAnything,一个将形状基元抽象重新表述为基元装配生成任务的新框架。PrimitiveAnything 包含一个用于自回归生成的形状条件基元 transformer,以及一种无歧义的参数化方案,以统一的方式表示多种类型的基元。该框架直接从大规模人工构建的抽象中学习基元装配过程,使其能够捕捉人类如何将复杂形状分解为基元元素。大量实验表明,PrimitiveAnything 能够生成更符合人类感知的高质量基元装配,同时在多样的形状类别上保持几何保真度。它有益于多种 3D 应用,并展现出在游戏中支持基于基元的用户生成内容(UGC)的潜力。项目页面:https://primitiveanything.github.io
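基元装配的自回归生成大致可理解为:以输入形状为条件,逐个解码下一个基元的类型与统一参数化下的连续参数,直到输出终止标记。下面是一个解码循环草图,model 的接口与返回字段均为假设,仅用于说明流程。

```python
def decode_primitives(model, shape_cond, max_primitives=32):
    """形状条件下自回归解码基元装配的示意(接口为假设)。"""
    primitives = []
    for _ in range(max_primitives):
        nxt = model(shape_cond, primitives)  # 以已生成的基元序列为上下文预测下一个
        if nxt["is_eos"]:                    # 终止标记:装配完成
            break
        # 无歧义的统一参数化:不同类型基元共享同一组参数槽位
        primitives.append({"type": nxt["type"], "params": nxt["params"]})
    return primitives
```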