
【论文速递】2025年第26周(Jun-22-28)(Robotics/Embodied AI/LLM)

中文使用 googletrans 翻译,翻译不对的地方以英文为准

目录

  • 拖放式LLM:零样本提示到权重生成
    • 英文摘要
    • 中文摘要
  • 法线之光:通用光度立体的统一特征表示
    • 英文摘要
    • 中文摘要
  • 视觉引导分块即你所需:通过多模态文档理解增强RAG
    • 英文摘要
    • 中文摘要
  • OmniGen2:迈向高级多模态生成的探索
    • 英文摘要
    • 中文摘要
  • Matrix-Game:交互式世界基础模型
    • 英文摘要
    • 中文摘要
  • FineWeb2:一条管道扩展所有语言,让预训练数据处理适配每种语言
    • 英文摘要
    • 中文摘要
  • ShareGPT-4o-Image:将多模态模型对齐到GPT-4o级图像生成
    • 英文摘要
    • 中文摘要
  • MMSearch-R1:激励LMM学会搜索
    • 英文摘要
    • 中文摘要
  • JarvisArt:通过智能照片修饰代理解放人类艺术创造力
    • 英文摘要
    • 中文摘要
  • PAROAttention:面向视觉生成模型中高效稀疏与量化注意力的模式感知重排序
    • 英文摘要
    • 中文摘要
  • AnimaX:用联合视频-姿态扩散模型为3D静物赋予动画
    • 英文摘要
    • 中文摘要
  • LongWriter-Zero:通过强化学习掌握超长文本生成
    • 英文摘要
    • 中文摘要
  • Hunyuan-GameCraft:基于混合历史条件的高动态交互式游戏视频生成
    • 英文摘要
    • 中文摘要
  • Mind2Web 2:用Agent-as-a-Judge评估代理式搜索
    • 英文摘要
    • 中文摘要
  • Skywork-SWE:揭示LLM软件工程的数据扩展定律
    • 英文摘要
    • 中文摘要
  • OctoThinker:中期训练激励强化学习扩展
    • 英文摘要
    • 中文摘要
  • 面向大型语言模型鲁棒4位量化的离群值安全预训练
    • 英文摘要
    • 中文摘要
  • Inverse-and-Edit:通过循环一致性模型实现高效快速的图像编辑
    • 英文摘要
    • 中文摘要
  • FaSTA*:带子程序挖掘的快慢工具路径智能体,用于高效多轮图像编辑
    • 英文摘要
    • 中文摘要
  • 专家链(Chain-of-Experts):解锁混合专家模型的通信能力
    • 英文摘要
    • 中文摘要
  • WorldVLA:迈向自回归动作世界模型
    • 英文摘要
    • 中文摘要
  • MADrive:记忆增强的驾驶场景建模
    • 英文摘要
    • 中文摘要
  • VIKI-R:通过强化学习协调具身多智能体合作
    • 英文摘要
    • 中文摘要
  • OAgents:构建有效代理的实证研究
    • 英文摘要
    • 中文摘要
  • 视觉作为方言:通过文本对齐表示统一视觉理解与生成
    • 英文摘要
    • 中文摘要
  • RLPR:将RLVR外推到无验证器的通用领域
    • 英文摘要
    • 中文摘要

拖放式LLM:零样本提示到权重生成

  • 标题: Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights
  • 作者: Zhiyuan Liang, Dongwen Tang, Yuhao Zhou, Xuanlei Zhao, Mingjia Shi, Wangbo Zhao, Zekai Li, Peihao Wang, Konstantin Schürholt, Damian Borth, Michael M. Bronstein, Yang You, Zhangyang Wang, Kai Wang
  • 日期: 2025-06-19
  • ArXiv主页: https://arxiv.org/abs/2506.16406
  • 论文链接: https://arxiv.org/pdf/2506.16406
  • 项目链接: https://jerryliang24.github.io/DnD
  • gitHub仓库: https://github.com/jerryliang24/Drag-and-Drop-LLMs

英文摘要

Modern Parameter-Efficient Fine-Tuning (PEFT) methods such as low-rank adaptation (LoRA) reduce the cost of customizing large language models (LLMs), yet still require a separate optimization run for every downstream dataset. We introduce Drag-and-Drop LLMs (\textit{DnD)}, a prompt-conditioned parameter generator that eliminates per-task training by mapping a handful of unlabeled task prompts directly to LoRA weight updates. A lightweight text encoder distills each prompt batch into condition embeddings, which are then transformed by a cascaded hyper-convolutional decoder into the full set of LoRA matrices. Once trained in a diverse collection of prompt-checkpoint pairs, DnD produces task-specific parameters in seconds, yielding i) up to 12,000times lower overhead than full fine-tuning, ii) average gains up to 30% in performance over the strongest training LoRAs on unseen common-sense reasoning, math, coding, and multimodal benchmarks, and iii) robust cross-domain generalization despite never seeing the target data or labels. Our results demonstrate that prompt-conditioned parameter generation is a viable alternative to gradient-based adaptation for rapidly specializing LLMs. Our project is available at https://jerryliang24.github.io/DnD{https://jerryliang24.github.io/DnD}.

中文摘要

现代参数高效微调(PEFT)方法(如低秩自适应 LoRA)降低了定制大型语言模型(LLM)的成本,但仍需为每个下游数据集单独运行一次优化。我们提出拖放式LLM(DnD),一种以提示为条件的参数生成器,它把少量未标注的任务提示直接映射为 LoRA 权重更新,从而省去逐任务训练。一个轻量级文本编码器将每批提示蒸馏为条件嵌入,再由级联的超卷积解码器将其变换为完整的 LoRA 矩阵集合。在多样的"提示-检查点"配对上训练后,DnD 可在数秒内生成任务专属参数,带来:i)相比完整微调最多低 12000 倍的开销;ii)在未见过的常识推理、数学、代码和多模态基准上,相比最强的训练所得 LoRA 平均最高提升 30% 的性能;iii)尽管从未见过目标数据或标签,仍具备稳健的跨域泛化能力。我们的结果表明,以提示为条件的参数生成是基于梯度的适配之外、用于快速定制 LLM 的可行替代方案。项目地址:https://jerryliang24.github.io/DnD。
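
按摘要描述,DnD 的核心流程是:文本编码器把一批提示压缩成条件嵌入,再由级联超卷积解码器输出整组 LoRA 矩阵。下面给出一个极简的 PyTorch 示意(纯属假设性草图,模块结构、维度与超参均为演示用,并非论文官方实现;解码器这里用 MLP 代替超卷积),只为说明"提示条件 → LoRA 权重"这一映射形式。

```python
import torch
import torch.nn as nn

class PromptToLoRA(nn.Module):
    """示意:把一批提示的条件嵌入映射为每层一对 LoRA 低秩矩阵 (A, B)。
    结构与维度均为演示用假设,并非论文官方实现。"""
    def __init__(self, cond_dim=768, hidden=1024, d_model=2048, rank=8, num_layers=4):
        super().__init__()
        self.rank, self.d_model, self.num_layers = rank, d_model, num_layers
        out_dim = num_layers * 2 * d_model * rank      # 每层输出一对 (A, B)
        self.decoder = nn.Sequential(                  # 以 MLP 代替论文中的级联超卷积解码器
            nn.Linear(cond_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, cond_emb):                       # cond_emb: [B, cond_dim],来自轻量文本编码器
        flat = self.decoder(cond_emb.mean(dim=0, keepdim=True))   # 聚合整个提示批
        flat = flat.view(self.num_layers, 2, self.d_model, self.rank)
        return [(flat[i, 0], flat[i, 1]) for i in range(self.num_layers)]

# 用法示意:cond_emb 可由任意句向量模型得到(此处用随机张量占位)
generator = PromptToLoRA()
cond_emb = torch.randn(16, 768)                        # 16 条未标注任务提示的嵌入
lora_pairs = generator(cond_emb)
delta_w = lora_pairs[0][1] @ lora_pairs[0][0].T        # 第 0 层的权重更新,形状 [d_model, d_model]
print(delta_w.shape)
```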


法线之光:通用光度立体的统一特征表示

  • 标题: Light of Normals: Unified Feature Representation for Universal Photometric Stereo

  • 作者: Hong Li, Houyuan Chen, Chongjie Ye, Zhaoxi Chen, Bohan Li, Shaocong Xu, Xianda Guo, Xuhui Liu, Yikai Wang, Baochang Zhang, Satoshi Ikehata, Boxin Shi, Anyi Rao, Hao Zhao

  • 日期: 2025-06-23

  • ArXiv主页: https://arxiv.org/abs/2506.18882

  • 论文链接: https://arxiv.org/pdf/2506.18882

  • gitHub仓库: https://github.com/houyuanchen111/LINO_UniPS

英文摘要

Universal photometric stereo (PS) aims to recover high-quality surface normals from objects under arbitrary lighting conditions without relying on specific illumination models. Despite recent advances such as SDM-UniPS and Uni MS-PS, two fundamental challenges persist: 1) the deep coupling between varying illumination and surface normal features, where ambiguity in observed intensity makes it difficult to determine whether brightness variations stem from lighting changes or surface orientation; and 2) the preservation of high-frequency geometric details in complex surfaces, where intricate geometries create self-shadowing, inter-reflections, and subtle normal variations that conventional feature processing operations struggle to capture accurately.

中文摘要

通用光度立体(PS)旨在不依赖特定光照模型的前提下,从任意光照条件下拍摄的物体中恢复高质量的表面法线。尽管已有 SDM-UniPS、Uni MS-PS 等最新进展,两个根本性挑战仍然存在:1)变化光照与表面法线特征之间的深度耦合,观测强度的歧义使得难以判断亮度变化究竟来自光照变化还是表面朝向;2)复杂表面上高频几何细节的保持,精细几何会产生自阴影、互反射和细微的法线变化,常规特征处理操作难以准确捕捉。


视觉引导分块即你所需:通过多模态文档理解增强RAG

  • 标题: Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding
  • 作者: Vishesh Tripathi, Tanmay Odapally, Indraneel Das, Uday Allu, Biddwan Ahmed
  • 日期: 2025-06-19
  • ArXiv主页: https://arxiv.org/abs/2506.16035
  • 论文链接: https://arxiv.org/pdf/2506.16035

英文摘要

Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies across page boundaries. We present a novel multimodal document chunking approach that leverages Large Multimodal Models (LMMs) to process PDF documents in batches while maintaining semantic coherence and structural integrity. Our method processes documents in configurable page batches with cross-batch context preservation, enabling accurate handling of tables spanning multiple pages, embedded visual elements, and procedural content. We evaluate our approach on a curated dataset of PDF documents with manually crafted queries, demonstrating improvements in chunk quality and downstream RAG performance. Our vision-guided approach achieves better accuracy compared to traditional vanilla RAG systems, with qualitative analysis showing superior preservation of document structure and semantic coherence.

中文摘要

检索增强生成(RAG)系统彻底改变了信息检索与问答,但传统基于文本的分块方法难以应对复杂的文档结构、跨页表格、嵌入式图形以及跨页面边界的上下文依赖。我们提出一种新的多模态文档分块方法,利用大型多模态模型(LMM)按批处理 PDF 文档,同时保持语义连贯性与结构完整性。该方法以可配置的页批处理文档,并在批与批之间保留上下文,从而能够准确处理跨多页的表格、嵌入式视觉元素和过程性内容。我们在一个配有人工构造查询的精选 PDF 文档数据集上进行评估,结果显示分块质量和下游 RAG 性能均有提升。与传统的朴素 RAG 系统相比,这种视觉引导的方法取得了更高的准确率,定性分析也表明其在文档结构与语义连贯性的保持上更胜一筹。
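
摘要中的关键流程是"按可配置的页批处理 PDF,并在批与批之间保留上下文"。下面用 Python 给出一个示意性的批处理循环;其中 call_lmm 是假设的占位函数(并非论文或某个真实库的 API),实际使用时应替换为所用多模态大模型的接口。

```python
from typing import List

def call_lmm(images: List[bytes], prompt: str) -> List[str]:
    """占位:此处应调用真实的多模态大模型;为保证可运行,返回占位分块。"""
    return [f"[第{i + 1}页的分块占位]" for i in range(len(images))]

def chunk_pdf_with_context(pages: List[bytes], batch_size: int = 4,
                           context_limit: int = 2000) -> List[str]:
    """示意:按页批调用 LMM 做分块,并把上一批结果的尾部作为上下文带入下一批,
    以衔接跨页表格与段落(流程按摘要推测,细节为假设)。"""
    chunks, carry_context = [], ""
    for start in range(0, len(pages), batch_size):
        batch = pages[start:start + batch_size]
        prompt = (
            "请在保持表格、插图与语义完整的前提下,把这些页面切分为检索块。\n"
            f"上一批的结尾上下文(供跨页内容衔接参考):\n{carry_context}"
        )
        batch_chunks = call_lmm(images=batch, prompt=prompt)
        chunks.extend(batch_chunks)
        # 只保留最后一块的尾部作为跨批上下文,控制 token 开销
        if batch_chunks:
            carry_context = batch_chunks[-1][-context_limit:]
    return chunks

print(chunk_pdf_with_context([b"page1", b"page2", b"page3"], batch_size=2))
```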


OmniGen2:迈向高级多模态生成的探索

  • 标题: OmniGen2: Exploration to Advanced Multimodal Generation
  • 作者: Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu
  • 日期: 2025-06-23
  • ArXiv主页: https://arxiv.org/abs/2506.18871
  • 论文链接: https://arxiv.org/pdf/2506.18871
  • 项目链接: https://vectorspacelab.github.io/OmniGen2/
  • gitHub仓库: https://github.com/VectorSpaceLab/OmniGen2

英文摘要

In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2

中文摘要

在这项工作中,我们提出 OmniGen2,一个多功能的开源生成模型,旨在为文本到图像、图像编辑和上下文内生成(in-context generation)等多种生成任务提供统一解决方案。与 OmniGen v1 不同,OmniGen2 为文本和图像模态设计了两条互不共享参数的解码路径,并使用解耦的图像分词器。这一设计使 OmniGen2 可以直接构建在现有多模态理解模型之上,无需重新适配 VAE 输入,从而保留原有的文本生成能力。为了便于训练 OmniGen2,我们开发了覆盖图像编辑与上下文内生成数据的完整数据构建管道。此外,我们为图像生成任务引入了一种反思(reflection)机制,并基于 OmniGen2 整理了专门的反思数据集。尽管参数规模相对适中,OmniGen2 仍在文本到图像、图像编辑等多项任务基准上取得了有竞争力的结果。为进一步评估上下文内生成(也称主体驱动任务),我们提出了名为 OmniContext 的新基准;在一致性方面,OmniGen2 在开源模型中达到了最先进水平。我们将发布模型、训练代码、数据集及数据构建管道,以支持该领域的后续研究。项目页面:https://vectorspacelab.github.io/OmniGen2;GitHub:https://github.com/VectorSpaceLab/OmniGen2


Matrix-Game:交互式世界基础模型

  • 标题: Matrix-Game: Interactive World Foundation Model
  • 作者: Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, Yahui Zhou
  • 日期: 2025-06-23
  • ArXiv主页: https://arxiv.org/abs/2506.18701
  • 论文链接: https://arxiv.org/pdf/2506.18701
  • 项目链接: https://matrix-game-homepage.github.io
  • gitHub仓库: https://github.com/SkyworkAI/Matrix-Game

英文摘要

We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix-Game.

中文摘要

我们提出 Matrix-Game,一个用于可控游戏世界生成的交互式世界基础模型。Matrix-Game 采用两阶段管道训练:先在大规模无标注数据上进行预训练以理解环境,再用带动作标注的数据训练交互式视频生成。为此,我们构建了 Matrix-Game-MC,一个全面的 Minecraft 数据集,包含超过 2700 小时的无标注游戏视频片段,以及超过 1000 小时带有细粒度键鼠动作标注的高质量片段。我们的模型采用可控的"图像到世界"生成范式,以参考图像、运动上下文和用户操作为条件。Matrix-Game 拥有超过 170 亿参数,可对角色动作和相机运动进行精确控制,同时保持高视觉质量和时间连贯性。为评估性能,我们提出 GameWorld Score,一个统一基准,从视觉质量、时间质量、动作可控性和物理规则理解四个方面度量 Minecraft 世界生成。大量实验表明,Matrix-Game 在所有指标上都稳定优于此前的开源 Minecraft 世界模型(包括 Oasis 和 MineWorld),在可控性和物理一致性上的优势尤为明显。双盲人工评估进一步证实了 Matrix-Game 的优越性,凸显其在多样游戏场景中生成感知上逼真且可精确控制的视频的能力。为促进交互式"图像到世界"生成的后续研究,我们将在 https://github.com/SkyworkAI/Matrix-Game 开源 Matrix-Game 模型权重和 GameWorld Score 基准。


FineWeb2:一条管道扩展所有语言,让预训练数据处理适配每种语言

  • 标题: FineWeb2: One Pipeline to Scale Them All – Adapting Pre-Training Data Processing to Every Language
  • 作者: Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf
  • 日期: 2025-06-26
  • ArXiv主页: https://arxiv.org/abs/2506.20920
  • 论文链接: https://arxiv.org/pdf/2506.20920
  • 项目链接: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2

英文摘要

Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.

中文摘要

预训练最先进的大语言模型(LLM)需要海量干净且多样的文本数据。尽管大规模高质量英文预训练数据集的开放建设近来进展显著,训练高性能的多语言 LLM 仍然是一个挑战,很大程度上是因为很难把过滤与去重管道逐一适配到大量语言上。在这项工作中,我们提出了一条基于 FineWeb 的新预训练数据集构建管道,它可以自动适配以支持任何语言。我们在九种不同语言上对管道的各项设计选择进行了细致的消融实验,并以一组有意义、信息量充分的评测任务为指导,这些任务是通过基于可量化标准的全新选择流程确定的。最终我们证明,该管道可用于构建非英语语料库,训练出的模型性能优于使用既有数据集的结果。我们还提出了一种简单且有原则的数据再平衡方法,同时考虑重复次数与质量,带来额外的性能提升。最后,我们使用近 100 个 Common Crawl 快照把管道扩展到 1000 多种语言,构建了 FineWeb2:一个 20TB(50 亿文档)的新多语言数据集,并连同管道、训练和评测代码库一起发布。
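
摘要提到"同时考虑重复次数与质量的数据再平衡"。下面是一个概念性的小示意,说明这类再平衡在代码层面大致长什么样;评分与上采样公式完全是演示用假设,并非论文中的具体做法。

```python
import math

def rebalance_upsample(doc_stats):
    """示意:根据去重前的重复次数与质量分,为每篇文档决定保留份数。
    doc_stats: [(doc_id, dup_count, quality)],quality ∈ [0, 1]。
    规则为演示假设:质量越高且原本被大量重复的文档,适度多保留几份。"""
    plan = []
    for doc_id, dup_count, quality in doc_stats:
        copies = 1 + int(quality * math.log1p(dup_count))  # 去重后按质量加权的副本数
        plan.append((doc_id, min(copies, 5)))               # 设上限,避免过度放大
    return plan

print(rebalance_upsample([("a", 0, 0.9), ("b", 120, 0.8), ("c", 500, 0.2)]))
```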


ShareGPT-4o-Image:将多模态模型对齐到GPT-4o级图像生成

  • 标题: ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

  • 作者: Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, Benyou Wang

  • 日期: 2025-06-22

  • ArXiv主页: https://arxiv.org/abs/2506.18095

  • 论文链接: https://arxiv.org/pdf/2506.18095

  • gitHub仓库: https://github.com/FreedomIntelligence/ShareGPT-4o-Image

英文摘要

Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image data, all synthesized using GPT-4o’s image generation capabilities for distilling its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on an 8 A800-GPU machine. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.

中文摘要

多模态生成模型的最新进展解锁了照片级真实、遵循指令的图像生成,但 GPT-4o-Image 等领先系统仍是闭源且难以获取的。为了让这些能力普惠化,我们提出 ShareGPT-4o-Image:首个包含 4.5 万条文本到图像、4.6 万条文本加图像到图像数据的数据集,全部使用 GPT-4o 的图像生成能力合成,用于蒸馏其先进的图像生成能力。基于该数据集,我们开发了 Janus-4o,一个同时支持文本到图像以及文本加图像到图像生成的多模态大语言模型。Janus-4o 不仅在文本到图像生成上显著超越其前身 Janus-Pro,还新增了文本加图像到图像的生成能力。值得注意的是,它仅用 9.1 万条合成样本、在一台 8 卡 A800 机器上训练 6 小时,就在从零开始的文本加图像到图像生成中取得了令人印象深刻的表现。我们希望 ShareGPT-4o-Image 和 Janus-4o 的发布能推动照片级真实、遵循指令的图像生成的开放研究。


MMSearch-R1:激励LMM学会搜索

  • 标题: MMSearch-R1: Incentivizing LMMs to Search

  • 作者: Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, Ziwei Liu

  • 日期: 2025-06-25

  • ArXiv主页: https://arxiv.org/abs/2506.20670

  • 论文链接: https://arxiv.org/pdf/2506.20670

  • gitHub仓库: https://github.com/EvolvingLMMs-Lab/multimodal-search-r1

英文摘要

Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them guided by an outcome-based reward with a search penalty. To support training, We collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.

中文摘要

鉴于真实世界信息的复杂性和动态性,大型多模态模型(LMM)要在现实场景中稳健部署,就需要能够访问外部知识源。现有方法,如检索增强生成(RAG)和提示工程化的搜索代理,依赖僵化的流水线,往往导致低效或过度的搜索行为。我们提出 MMSearch-R1,首个端到端强化学习框架,使 LMM 能够在真实互联网环境中按需进行多轮搜索。该框架同时集成图像和文本搜索工具,让模型在基于结果的奖励和搜索惩罚的引导下,学会推理何时以及如何调用它们。为支持训练,我们通过半自动化管道收集了覆盖多样视觉与文本知识需求的多模态搜索 VQA 数据集,并构建了一个同时包含"需要搜索"和"无需搜索"样本的搜索均衡子集,事实证明这对于塑造高效、按需的搜索行为至关重要。在知识密集型和信息检索型 VQA 任务上的大量实验表明,我们的模型不仅优于同等规模的基于 RAG 的基线,还在把搜索调用次数减少 30% 以上的同时,达到了更大规模 RAG 模型的性能。我们进一步分析了关键的实证发现,为推进多模态搜索研究提供可操作的见解。
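
摘要的关键设计是"基于结果的奖励 + 搜索惩罚",促使模型只在必要时调用搜索。下面是一个极简的奖励函数示意;惩罚系数和具体判分方式均为假设,论文中的设定以原文为准。

```python
def outcome_reward(answer_correct: bool, num_search_calls: int,
                   search_penalty: float = 0.1) -> float:
    """示意:答对得 1 分,每次调用图像/文本搜索扣一个小惩罚;答错为 0。
    search_penalty 为演示用假设值。"""
    if not answer_correct:
        return 0.0
    return max(0.0, 1.0 - search_penalty * num_search_calls)

# 两条轨迹:不搜索直接答对 vs. 搜索两次后答对
print(outcome_reward(True, 0), outcome_reward(True, 2))   # 1.0 0.8
```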


JarvisArt:通过智能照片修饰代理解放人类艺术创造力

  • 标题: JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent
  • 作者: Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding, Wenbo Li, Shuicheng Yan
  • 日期: 2025-06-21
  • ArXiv主页: https://arxiv.org/abs/2506.17612
  • 论文链接: https://arxiv.org/pdf/2506.17612
  • 项目链接: https://jarvisart.vercel.app/
  • gitHub仓库: https://github.com/LYL1015/JarvisArt

英文摘要

Photo retouching has become integral to contemporary visual storytelling, enabling users to capture aesthetics and express creativity. While professional tools such as Adobe Lightroom offer powerful capabilities, they demand substantial expertise and manual effort. In contrast, existing AI-based solutions provide automation but often suffer from limited adjustability and poor generalization, failing to meet diverse and personalized editing needs. To bridge this gap, we introduce JarvisArt, a multi-modal large language model (MLLM)-driven agent that understands user intent, mimics the reasoning process of professional artists, and intelligently coordinates over 200 retouching tools within Lightroom. JarvisArt undergoes a two-stage training process: an initial Chain-of-Thought supervised fine-tuning to establish basic reasoning and tool-use skills, followed by Group Relative Policy Optimization for Retouching (GRPO-R) to further enhance its decision-making and tool proficiency. We also propose the Agent-to-Lightroom Protocol to facilitate seamless integration with Lightroom. To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments, paving a new avenue for intelligent photo retouching. Notably, it outperforms GPT-4o with a 60% improvement in average pixel-level metrics on MMArt-Bench for content fidelity, while maintaining comparable instruction-following capabilities. Project Page: https://jarvisart.vercel.app/.

中文摘要

照片修饰已成为当代视觉叙事不可或缺的一部分,让用户得以捕捉美感、表达创造力。Adobe Lightroom 等专业工具虽然功能强大,却需要大量专业知识与手动操作;而现有的 AI 方案虽然实现了自动化,却往往可调性有限、泛化能力不足,难以满足多样化、个性化的编辑需求。为弥补这一差距,我们提出 JarvisArt,一个由多模态大语言模型(MLLM)驱动的智能体:它理解用户意图、模仿专业艺术家的推理过程,并能智能地协调 Lightroom 中 200 多种修饰工具。JarvisArt 采用两阶段训练:先用思维链监督微调建立基本的推理和工具使用能力,再用面向修饰的分组相对策略优化(GRPO-R)进一步提升决策与工具熟练度。我们还提出 Agent-to-Lightroom 协议,以实现与 Lightroom 的无缝集成。为评估性能,我们构建了 MMArt-Bench,一个基于真实用户编辑的新基准。JarvisArt 展示了友好的交互体验、出色的泛化能力,以及对全局与局部调整的细粒度控制,为智能照片修饰开辟了新途径。值得注意的是,在 MMArt-Bench 上衡量内容保真度的平均像素级指标上,它比 GPT-4o 高出 60%,同时保持了相当的指令跟随能力。项目页面:https://jarvisart.vercel.app/。


PAROAttention:面向视觉生成模型中高效稀疏与量化注意力的模式感知重排序

  • 标题: PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models
  • 作者: Tianchen Zhao, Ke Hong, Xinhao Yang, Xuefeng Xiao, Huixia Li, Feng Ling, Ruiqi Xie, Siqi Chen, Hongyu Zhu, Yichong Zhang, Yu Wang
  • 日期: 2025-06-19
  • ArXiv主页: https://arxiv.org/abs/2506.16054
  • 论文链接: https://arxiv.org/pdf/2506.16054
  • 项目链接: https://a-suozhang.xyz/paroattn.github.io/

英文摘要

In visual generation, the quadratic complexity of attention mechanisms results in high memory and computational costs, especially for longer token sequences required in high-resolution image or multi-frame video generation. To address this, prior research has explored techniques such as sparsification and quantization. However, these techniques face significant challenges under low density and reduced bitwidths. Through systematic analysis, we identify that the core difficulty stems from the dispersed and irregular characteristics of visual attention patterns. Therefore, instead of introducing specialized sparsification and quantization design to accommodate such patterns, we propose an alternative strategy: reorganizing the attention pattern to alleviate the challenges. Inspired by the local aggregation nature of visual feature extraction, we design a novel Pattern-Aware token ReOrdering (PARO) technique, which unifies the diverse attention patterns into a hardware-friendly block-wise pattern. This unification substantially simplifies and enhances both sparsification and quantization. We evaluate the performance-efficiency trade-offs of various design choices and finalize a methodology tailored for the unified pattern. Our approach, PAROAttention, achieves video and image generation with lossless metrics, and nearly identical results from full-precision (FP) baselines, while operating at notably lower density (~20%-30%) and bitwidth (INT8/INT4), achieving a 1.9x to 2.7x end-to-end latency speedup.

中文摘要

在视觉生成中,注意力机制的二次复杂性导致高内存和计算成本,特别是对于高分辨率图像或多帧视频生成中所需的较长令牌序列。为了解决这个问题,之前的研究已经探索了稀疏化和量化等技术。然而,在低密度和减小的位宽下,这些技术面临着重大挑战。通过系统分析,我们发现核心难点源于视觉注意力模式的分散和不规则特征。因此,我们提出了一种替代策略,而不是引入专门的稀疏化和量化设计来适应这些模式:重组注意力模式以缓解挑战。受视觉特征提取的局部聚合特性的启发,我们设计了一种新颖的模式感知令牌重排序(PARO)技术,该技术将不同的注意力模式统一为硬件友好的逐块模式。这种统一大大简化和增强了稀疏化和量化。我们评估了各种设计选择的性能效率权衡,并最终确定了一种为统一模式量身定制的方法。我们的方法“PAROAttention”实现了无损指标的视频和图像生成,并且从全精度(FP)基线获得了几乎相同的结果,同时以明显较低的密度(~20%-30%)和位宽(INT8/INT4)运行,实现了1.9x到2.7x的端到端延迟加速。
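
PARO 的思路是先对 token 重排序,把分散、不规则的注意力模式整理成硬件友好的块状模式,再做块级稀疏与量化。下面给出一个最小 PyTorch 示意:按给定的重排索引置换 Q/K/V,然后按块估计重要性、只保留得分最高的块。重排索引的获取方式与块稀疏策略均为演示假设,并非论文实现。

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, perm, block=64, keep_ratio=0.3):
    """示意:q/k/v 形状 [L, d],要求 L 能被 block 整除;perm 为长度 L 的重排索引
    (论文中由注意力模式统计离线得到,此处假设已给出)。"""
    q, k, v = q[perm], k[perm], v[perm]                  # 模式感知的 token 重排序
    scores = q @ k.T / q.shape[-1] ** 0.5                # [L, L]
    L, nb = q.shape[0], q.shape[0] // block
    blk = scores.reshape(nb, block, nb, block)
    blk_score = blk.abs().mean(dim=(1, 3))               # 每个 block x block 块的平均幅值
    k_keep = max(1, int(keep_ratio * nb * nb))
    thresh = blk_score.flatten().topk(k_keep).values.min()
    blk_keep = (blk_score >= thresh) | torch.eye(nb, dtype=torch.bool)  # 保留对角块,避免整行被屏蔽
    mask = blk_keep.repeat_interleave(block, 0).repeat_interleave(block, 1)
    scores = scores.masked_fill(~mask, float("-inf"))
    out = F.softmax(scores, dim=-1) @ v                  # 块级稀疏注意力
    return out[torch.argsort(perm)]                      # 还原原始 token 顺序

q = k = v = torch.randn(256, 64)
perm = torch.randperm(256)                               # 实际应来自模式统计,此处随机仅作占位
print(block_sparse_attention(q, k, v, perm).shape)       # torch.Size([256, 64])
```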


AnimaX:用联合视频-姿态扩散模型为3D静物赋予动画

  • 标题: AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models
  • 作者: Zehuan Huang, Haoran Feng, Yangtian Sun, Yuanchen Guo, Yanpei Cao, Lu Sheng
  • 日期: 2025-06-24
  • ArXiv主页: https://arxiv.org/abs/2506.19851
  • 论文链接: https://arxiv.org/pdf/2506.19851
  • 项目链接: https://anima-x.github.io/
  • gitHub仓库: https://github.com/anima-x/anima-x

英文摘要

We present AnimaX, a feed-forward 3D animation framework that bridges the motion priors of video diffusion models with the controllable structure of skeleton-based animation. Traditional motion synthesis methods are either restricted to fixed skeletal topologies or require costly optimization in high-dimensional deformation spaces. In contrast, AnimaX effectively transfers video-based motion knowledge to the 3D domain, supporting diverse articulated meshes with arbitrary skeletons. Our method represents 3D motion as multi-view, multi-frame 2D pose maps, and enables joint video-pose diffusion conditioned on template renderings and a textual motion prompt. We introduce shared positional encodings and modality-aware embeddings to ensure spatial-temporal alignment between video and pose sequences, effectively transferring video priors to motion generation task. The resulting multi-view pose sequences are triangulated into 3D joint positions and converted into mesh animation via inverse kinematics. Trained on a newly curated dataset of 160,000 rigged sequences, AnimaX achieves state-of-the-art results on VBench in generalization, motion fidelity, and efficiency, offering a scalable solution for category-agnostic 3D animation. Project page: https://anima-x.github.io/{https://anima-x.github.io/}.

中文摘要

我们提出 AnimaX,一个前馈式 3D 动画框架,它把视频扩散模型中的运动先验与基于骨骼的动画的可控结构结合起来。传统的运动合成方法要么局限于固定的骨骼拓扑,要么需要在高维形变空间中进行代价高昂的优化。相比之下,AnimaX 将基于视频的运动知识有效迁移到 3D 域,支持任意骨骼的多样化铰接网格。我们的方法把 3D 运动表示为多视角、多帧的 2D 姿态图,并实现以模板渲染图和文本运动提示为条件的视频-姿态联合扩散。我们引入共享位置编码和模态感知嵌入,确保视频序列与姿态序列在时空上的对齐,从而把视频先验有效迁移到运动生成任务。得到的多视角姿态序列经三角化得到 3D 关节位置,再通过逆运动学转换为网格动画。AnimaX 在新整理的 16 万条绑定(rigged)动画序列数据集上训练,在 VBench 的泛化性、运动保真度和效率上均取得最先进的结果,为类别无关的 3D 动画提供了可扩展的方案。项目页面:https://anima-x.github.io/。


LongWriter-Zero:通过强化学习掌握超长文本生成

  • 标题: LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
  • 作者: Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li
  • 日期: 2025-06-23
  • ArXiv主页: https://arxiv.org/abs/2506.18841
  • 论文链接: https://arxiv.org/pdf/2506.18841

英文摘要

Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on ‘‘teaching’’, which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B. We open-source our data and model checkpoints under https://huggingface.co/THU-KEG/LongWriter-Zero-32B

中文摘要

大语言模型(LLM)的超长文本生成是一个需求广泛的场景,但由于最大生成长度限制以及随序列变长而出现的整体质量下降,它仍是一个重大挑战。以 LongWriter 为代表的既有方法通常依赖"教学",即在合成的长文输出上做监督微调(SFT)。然而这种策略严重依赖合成 SFT 数据,这类数据构造困难且成本高昂,往往缺乏连贯性和一致性,还容易显得过于人工化、结构单调。在这项工作中,我们提出一种基于激励的方法:完全从零开始、不依赖任何标注或合成数据,利用强化学习(RL)促使 LLM 涌现出超长、高质量的文本生成能力。我们从一个基础模型出发进行 RL 训练(类似 R1-Zero),引导模型在写作过程中进行有助于规划和润色的推理。为此,我们使用专门的奖励模型,引导 LLM 改进长度控制、写作质量和结构格式。实验评估表明,基于 Qwen2.5-32B 训练的 LongWriter-Zero 模型在长文写作任务上稳定优于传统 SFT 方法,在 WritingBench 和 Arena-Write 的所有指标上达到最先进水平,甚至超过了 DeepSeek R1、Qwen3-235B 等 100B+ 模型。我们在 https://huggingface.co/THU-KEG/LongWriter-Zero-32B 开源了数据和模型检查点。


Hunyuan-GameCraft:基于混合历史条件的高动态交互式游戏视频生成

  • 标题: Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition
  • 作者: Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, Qinglin Lu
  • 日期: 2025-06-20
  • ArXiv主页: https://arxiv.org/abs/2506.17201
  • 论文链接: https://arxiv.org/pdf/2506.17201

英文摘要

Recent advances in diffusion-based and controllable video generation have enabled high-quality and temporally coherent video synthesis, laying the groundwork for immersive interactive gaming experiences. However, current methods face limitations in dynamics, generality, long-term consistency, and efficiency, which limit the ability to create various gameplay videos. To address these gaps, we introduce Hunyuan-GameCraft, a novel framework for high-dynamic interactive video generation in game environments. To achieve fine-grained action control, we unify standard keyboard and mouse inputs into a shared camera representation space, facilitating smooth interpolation between various camera and movement operations. Then we propose a hybrid history-conditioned training strategy that extends video sequences autoregressively while preserving game scene information. Additionally, to enhance inference efficiency and playability, we achieve model distillation to reduce computational overhead while maintaining consistency across long temporal sequences, making it suitable for real-time deployment in complex interactive environments. The model is trained on a large-scale dataset comprising over one million gameplay recordings across over 100 AAA games, ensuring broad coverage and diversity, then fine-tuned on a carefully annotated synthetic dataset to enhance precision and control. The curated game scene data significantly improves the visual fidelity, realism and action controllability. Extensive experiments demonstrate that Hunyuan-GameCraft significantly outperforms existing models, advancing the realism and playability of interactive game video generation.

中文摘要

基于扩散的可控视频生成的最新进展,使高质量、时间连贯的视频合成成为可能,为沉浸式交互游戏体验奠定了基础。然而,现有方法在动态性、通用性、长期一致性和效率方面仍有局限,难以生成多样化的游戏玩法视频。为弥补这些不足,我们提出 Hunyuan-GameCraft,一个面向游戏环境的高动态交互式视频生成新框架。为了实现细粒度的动作控制,我们把标准键盘和鼠标输入统一到一个共享的相机表示空间,便于在各种相机与移动操作之间平滑插值。随后我们提出一种混合历史条件的训练策略,在保留游戏场景信息的同时自回归地扩展视频序列。此外,为提升推理效率和可玩性,我们通过模型蒸馏降低计算开销,同时保持长时间序列上的一致性,使其适合在复杂交互环境中实时部署。该模型在一个大规模数据集上训练,数据集涵盖 100 多款 AAA 游戏、超过一百万条游戏录像,保证了广泛的覆盖面和多样性;随后在精心标注的合成数据集上微调,以提升精度和可控性。精选的游戏场景数据显著提升了视觉保真度、真实感和动作可控性。大量实验表明,Hunyuan-GameCraft 显著优于现有模型,推动了交互式游戏视频生成的真实感与可玩性。


Mind2Web 2:用Agent-as-a-Judge评估代理式搜索

  • 标题: Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
  • 作者: Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
  • 日期: 2025-06-26
  • ArXiv主页: https://arxiv.org/abs/2506.21506
  • 论文链接: https://arxiv.org/pdf/2506.21506
  • 项目链接: https://osu-nlp-group.github.io/Mind2Web-2

英文摘要

Agentic search such as Deep Research systems, where large language models autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1,000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of nine frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, showing a great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.

中文摘要

以 Deep Research 系统为代表的代理式搜索,让大语言模型自主浏览网页、综合信息并返回带完整引用的答案,这是用户与网络规模信息交互方式的一次重大转变。尽管它有望带来更高的效率和认知卸载,代理式搜索日益增长的复杂性和开放性已经超出了现有评测基准和方法的能力范围,后者大多假设搜索过程较短、答案是静态的。本文提出 Mind2Web 2,一个包含 130 个真实、高质量、长时程任务的基准,这些任务需要实时网页浏览和大量信息综合,构建过程投入了超过 1000 小时的人力。为了解决评估随时间变化且结构复杂的答案这一难题,我们提出了一个新颖的 Agent-as-a-Judge 框架:基于树状结构的评分细则构建任务专属的评审代理,自动评估答案的正确性和来源归因。我们对九个前沿代理式搜索系统以及人类表现进行了全面评估,并通过详细的错误分析为后续发展提炼洞见。表现最好的系统 OpenAI Deep Research 已经能在只花一半时间的情况下达到人类表现的 50-70%,显示出巨大潜力。总体而言,Mind2Web 2 为开发和评测下一代代理式搜索系统提供了严谨的基础。


Skywork-SWE:揭示LLM软件工程的数据扩展定律

  • 标题: Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs
  • 作者: Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Xuchen Song, Yang Liu, Yahui Zhou
  • 日期: 2025-06-24
  • ArXiv主页: https://arxiv.org/abs/2506.19290
  • 论文链接: https://arxiv.org/pdf/2506.19290
  • 项目链接: https://quixotic-sting-239.notion.site/eb17f379610040ceb54da5d5d24065bd

英文摘要

Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time-consuming, as it heavily relies on manual annotation for code file filtering and the setup of dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub-sourced instances. To this end, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. We have carefully curated over 8,000 successfully runtime-validated training trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE model on these trajectories, we uncover a striking data scaling phenomenon: the trained model’s performance for software engineering capabilities in LLMs continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state-of-the-art (SOTA) among the Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. Furthermore, with the incorporation of test-time scaling techniques, the performance further improves to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter models. We release the Skywork-SWE-32B model checkpoint to accelerate future research.

中文摘要

软件工程(SWE)最近成为下一代 LLM 智能体的关键试验场,它要求模型在两个关键维度上具备原生能力:持续的迭代式问题求解(例如超过 50 轮交互)和长上下文依赖解析(例如超过 32k token)。然而,SWE 的数据构建过程出了名地耗时:它高度依赖人工标注来筛选代码文件,并需要搭建专门的运行时环境来执行和验证单元测试。因此,大多数现有数据集仅限于几千个来自 GitHub 的实例。为此,我们提出一条可增量扩展的自动化数据构建管道,系统性地扩大 SWE 数据集的规模和多样性。我们的数据集包含来自 2531 个不同 GitHub 仓库的 10169 个真实 Python 任务实例,每个实例都配有自然语言描述的任务和用于自动化单元测试验证的专用运行环境镜像。我们从该数据集中精心筛选出 8000 多条通过运行时验证的训练轨迹。在这些轨迹上微调 Skywork-SWE 模型时,我们发现了一个显著的数据扩展现象:随着数据规模增加,模型在 LLM 软件工程能力上的表现持续提升,没有出现饱和迹象。值得注意的是,在不使用验证器或多次 rollout 的情况下,Skywork-SWE 在 SWE-bench Verified 基准上取得 38.0% 的 pass@1 准确率,在基于 OpenHands 智能体框架、以 Qwen2.5-Coder-32B 为底座的 LLM 中创造了新的最先进(SOTA)水平。进一步结合测试时扩展技术后,性能提升到 47.0%,超过了此前 32B 以下参数模型的 SOTA 结果。我们发布了 Skywork-SWE-32B 模型检查点,以加速后续研究。


OctoThinker:中期训练激励强化学习扩展

  • 标题: OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

  • 作者: Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu

  • 日期: 2025-06-25

  • ArXiv主页: https://arxiv.org/abs/2506.20512

  • 论文链接: https://arxiv.org/pdf/2506.20512

  • gitHub仓库: https://github.com/GAIR-NLP/OctoThinker

英文摘要

Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing RL-scalable foundation models of the next generation. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long-CoT improves reasoning depth, it can also induce verbosity of model responses and unstability of RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these insights, we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay. This yields OctoThinker, a family of models demonstrating strong RL compatibility and closing the performance gap with more RL-friendly model families, i.e., Qwen. We hope our work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated math reasoning-intensive corpus of over 70 billion tokens (i.e., MegaMath-Web-Pro-Max).

中文摘要

不同的基础语言模型家族(如 Llama 和 Qwen)在强化学习(RL)后训练阶段表现出不同的行为,尤其是在推理密集型任务上。是什么让一个基础语言模型适合强化学习?深入理解这一问题,对于开发下一代可随 RL 扩展的基础模型至关重要。在这项工作中,我们研究中期训练(mid-training)策略如何塑造 RL 动态,重点考察两个代表性模型家族:Qwen 和 Llama。我们的研究表明:(1)MegaMath-Web-Pro 等高质量数学语料能显著提升基础模型和 RL 的表现,而现有替代品(如 FineMath-4plus)做不到这一点;(2)进一步加入问答风格数据,尤其是长思维链(CoT)推理样例,能增强 RL 效果,而指令数据会进一步放大这一效应;(3)长 CoT 虽然能加深推理,但也会让模型输出变得冗长、使 RL 训练不稳定,凸显了数据格式的重要性;(4)扩大中期训练规模始终能带来更强的下游 RL 性能。基于这些发现,我们提出一种两阶段的中期训练策略"Stable-then-Decay":先以恒定学习率在 2000 亿 token 上训练基础模型,再在三个聚焦 CoT 的分支上以学习率衰减各训练 200 亿 token。由此得到 OctoThinker 系列模型,它们表现出很强的 RL 兼容性,缩小了与 Qwen 等对 RL 更友好的模型家族之间的性能差距。我们希望这项工作能为 RL 时代基础模型的预训练策略提供参考。为支持后续研究,我们开源了模型以及一个超过 700 亿 token 的数学推理密集型语料库(MegaMath-Web-Pro-Max)。
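
摘要中的"Stable-then-Decay"两阶段中期训练可以概括为:先以恒定学习率训练 200B token,再在 20B token 内进行衰减。下面用一个简单的学习率调度函数示意这一阶段划分;学习率的具体数值和衰减形状(此处取线性)是演示用假设,并非论文设置。

```python
def stable_then_decay_lr(tokens_seen: float, base_lr: float = 3e-4,
                         stable_tokens: float = 200e9, decay_tokens: float = 20e9,
                         min_lr: float = 3e-5) -> float:
    """示意:前 stable_tokens 保持恒定学习率,之后在 decay_tokens 内线性衰减到 min_lr。
    阶段划分(200B 恒定 + 20B 衰减)取自摘要,其余数值为假设。"""
    if tokens_seen <= stable_tokens:
        return base_lr
    frac = min(1.0, (tokens_seen - stable_tokens) / decay_tokens)
    return base_lr + (min_lr - base_lr) * frac

for t in (50e9, 200e9, 210e9, 220e9):
    print(f"{t / 1e9:.0f}B tokens -> lr = {stable_then_decay_lr(t):.2e}")
```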


面向大型语言模型鲁棒4位量化的离群值安全预训练

  • 标题: Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

  • 作者: Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang

  • 日期: 2025-06-24

  • ArXiv主页: https://arxiv.org/abs/2506.19697

  • 论文链接: https://arxiv.org/pdf/2506.19697

  • gitHub仓库: https://github.com/dmis-lab/Outlier-Safe-Pre-Training?tab=readme-ov-file#using-vllm

英文摘要

Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.

中文摘要

大语言模型(LLM)中的极端激活离群值会严重损害量化性能,阻碍高效的端侧部署。尽管逐通道操作和自适应梯度缩放被认为是其成因,实际的缓解手段仍然充满挑战。我们提出离群值安全预训练(Outlier-Safe Pre-Training,OSP),一套主动防止离群值形成、而非依赖事后补救的实用准则。OSP 结合了三项关键创新:(1)Muon 优化器,在保持训练效率的同时消除特权基(privileged bases);(2)单尺度 RMSNorm,防止逐通道放大;(3)可学习的嵌入投影,重新分配来自嵌入矩阵的激活幅值。我们通过在 1 万亿 token 上训练一个 14 亿参数模型来验证 OSP,这是第一个在生产规模下训练且不产生此类离群值的 LLM。在激进的 4 位量化下,我们的 OSP 模型在 10 个基准上取得 35.7 的平均分(Adam 训练的模型为 26.5),训练开销仅增加 2%。值得注意的是,OSP 模型的超额峰度接近于零(0.04),而标准模型则高达 1818.56,从根本上改变了 LLM 的量化行为。我们的工作表明,离群值并非 LLM 的固有属性,而是训练策略的结果,这为更高效的 LLM 部署铺平了道路。源代码和预训练检查点见 https://github.com/dmis-lab/Outlier-Safe-Pre-Training。
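
OSP 的三项设计之一是"单尺度 RMSNorm",即去掉逐通道的可学习增益以避免通道级放大。下面是一个最小 PyTorch 示意,把常规 RMSNorm 的逐通道 gain 换成单个标量;这只是按摘要含义做的推测性实现,细节以论文为准。

```python
import torch
import torch.nn as nn

class SingleScaleRMSNorm(nn.Module):
    """示意:RMSNorm 变体,用单个标量缩放代替逐通道可学习增益,
    去掉可能放大个别通道、催生激活离群值的自由度(按摘要推测)。"""
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(()))   # 单一标量,而非 shape=[hidden] 的向量
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.scale * x / rms

x = torch.randn(2, 16, 512)
print(SingleScaleRMSNorm()(x).shape)   # torch.Size([2, 16, 512])
```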


Inverse-and-Edit:通过循环一致性模型实现高效快速的图像编辑

  • 标题: Inverse-and-Edit: Effective and Fast Image Editing by Cycle Consistency Models

  • 作者: Ilia Beletskii, Andrey Kuznetsov, Aibek Alanov

  • 日期: 2025-06-23

  • ArXiv主页: https://arxiv.org/abs/2506.19103

  • 论文链接: https://arxiv.org/pdf/2506.19103

  • gitHub仓库: https://github.com/ControlGenAI/Inverse-and-Edit

英文摘要

Recent advances in image editing with diffusion models have achieved impressive results, offering fine-grained control over the generation process. However, these methods are computationally intensive because of their iterative nature. While distilled diffusion models enable faster inference, their editing capabilities remain limited, primarily because of poor inversion quality. High-fidelity inversion and reconstruction are essential for precise image editing, as they preserve the structural and semantic integrity of the source image. In this work, we propose a novel framework that enhances image inversion using consistency models, enabling high-quality editing in just four steps. Our method introduces a cycle-consistency optimization strategy that significantly improves reconstruction accuracy and enables a controllable trade-off between editability and content preservation. We achieve state-of-the-art performance across various image editing tasks and datasets, demonstrating that our method matches or surpasses full-step diffusion models while being substantially more efficient. The code of our method is available on GitHub at https://github.com/ControlGenAI/Inverse-and-Edit.

中文摘要

基于扩散模型的图像编辑近来取得了令人印象深刻的成果,能对生成过程进行细粒度控制。然而,由于迭代式的生成过程,这些方法计算开销很大。蒸馏后的扩散模型虽然推理更快,但编辑能力仍然有限,主要原因是反演(inversion)质量较差。高保真的反演与重建对精确的图像编辑至关重要,因为它们保留了源图像的结构和语义完整性。在这项工作中,我们提出一个新框架,利用一致性模型增强图像反演,只需四步即可实现高质量编辑。我们的方法引入循环一致性优化策略,显著提升重建精度,并能在可编辑性与内容保持之间进行可控的权衡。我们在多种图像编辑任务和数据集上取得了最先进的性能,表明该方法在效率大幅提升的同时,达到甚至超过了全步数扩散模型。代码见 https://github.com/ControlGenAI/Inverse-and-Edit。
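
摘要的核心是"循环一致性优化":让"反演再重建"后的图像尽量贴近原图,从而在少步模型上获得高保真反演。下面给出一个概念性的损失函数示意;invert 与 reconstruct 只是占位的可调用对象(假设),并非论文中的具体网络或训练流程。

```python
import torch

def cycle_consistency_loss(x, invert, reconstruct, lam=1.0):
    """示意:x 为原图张量;invert 把图像映成潜变量/噪声,reconstruct 再还原回图像。
    损失鼓励 reconstruct(invert(x)) ≈ x;二者均为假设的占位函数。"""
    z = invert(x)
    x_rec = reconstruct(z)
    return lam * torch.mean((x_rec - x) ** 2)

# 占位演示:用恒等映射代替真实的反演/重建网络
x = torch.rand(1, 3, 64, 64)
print(cycle_consistency_loss(x, invert=lambda t: t, reconstruct=lambda t: t))  # tensor(0.)
```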


FaSTA*:带子程序挖掘的快慢工具路径智能体,用于高效多轮图像编辑

  • 标题: FaSTA^*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

  • 作者: Advait Gupta, Rishie Raj, Dang Nguyen, Tianyi Zhou

  • 日期: 2025-06-26

  • ArXiv主页: https://arxiv.org/abs/2506.20911

  • 论文链接: https://arxiv.org/pdf/2506.20911

  • gitHub仓库: https://github.com/tianyi-lab/FaSTAR

英文摘要

We develop a cost-efficient neurosymbolic agent to address challenging multi-turn image editing tasks such as "Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow.‘’ It combines the fast, high-level subtask planning by large language models (LLMs) with the slow, accurate, tool-use, and local A^* search per subtask to find a cost-efficient toolpath – a sequence of calls to AI tools. To save the cost of A^* on similar subtasks, we perform inductive reasoning on previously successful toolpaths via LLMs to continuously extract/refine frequently used subroutines and reuse them as new tools for future tasks in an adaptive fast-slow planning, where the higher-level subroutines are explored first, and only when they fail, the low-level A^* search is activated. The reusable symbolic subroutines considerably save exploration cost on the same types of subtasks applied to similar images, yielding a human-like fast-slow toolpath agent "FaSTA^*‘’: fast subtask planning followed by rule-based subroutine selection per subtask is attempted by LLMs at first, which is expected to cover most tasks, while slow A^* search is only triggered for novel and challenging subtasks. By comparing with recent image editing approaches, we demonstrate FaSTA^* is significantly more computationally efficient while remaining competitive with the state-of-the-art baseline in terms of success rate.

中文摘要

我们开发了一种低成本的神经符号智能体,用于处理具有挑战性的多轮图像编辑任务,例如"检测图中的长椅并把它改成粉色;同时移除那只猫以获得更清晰的视野,并把墙壁改成黄色"。它把大语言模型(LLM)快速的高层子任务规划,与针对每个子任务的缓慢、精确的工具调用及局部 A* 搜索结合起来,以找到低成本的工具路径,即一串对 AI 工具的调用。为了在相似子任务上省去 A* 的开销,我们通过 LLM 对以往成功的工具路径做归纳推理,不断提取、精炼常用子程序,并把它们作为新工具复用到未来任务中,形成一种自适应的快慢规划:先尝试更高层的子程序,只有当其失败时才启动底层的 A* 搜索。可复用的符号化子程序大幅节省了相似图像上同类子任务的探索成本,造就了一个类人的快慢工具路径智能体 FaSTA*:先由 LLM 做快速的子任务规划并按规则为每个子任务选择子程序,这足以覆盖大多数任务;只有面对新颖且困难的子任务时才触发缓慢的 A* 搜索。与近期的图像编辑方法相比,我们证明 FaSTA* 在计算上显著更高效,同时在成功率上与最先进的基线保持竞争力。
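
按摘要,FaSTA* 的"慢"路径是在工具调用序列上做 A* 搜索。下面用一个极简的 A* 框架示意"按成本在工具序列空间中搜索"这一思路;工具集合、成本、启发函数和目标判定全部是玩具化的演示假设,与论文的具体工具和代价设计无关。

```python
import heapq
import itertools
from typing import Callable, List, Tuple

def astar_toolpath(start_state, tools: List[Tuple[str, Callable, float]],
                   is_goal: Callable, heuristic: Callable, max_depth: int = 4):
    """示意:tools 为 (名称, 状态转移函数, 调用成本) 列表;
    在工具调用序列空间上做 A*,返回成本最低的工具路径。全部为演示假设。"""
    counter = itertools.count()                     # 打破优先级并列,避免比较不可比对象
    frontier = [(heuristic(start_state), 0.0, next(counter), start_state, [])]
    while frontier:
        _, g, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, g
        if len(path) >= max_depth:
            continue
        for name, apply_tool, cost in tools:
            nxt = apply_tool(state)
            heapq.heappush(frontier, (g + cost + heuristic(nxt), g + cost,
                                      next(counter), nxt, path + [name]))
    return None, float("inf")

# 玩具例子:状态是一个整数,目标是达到 >= 3,两个"工具"分别 +1 和 +2
tools = [("plus1", lambda s: s + 1, 1.0), ("plus2", lambda s: s + 2, 1.8)]
print(astar_toolpath(0, tools, is_goal=lambda s: s >= 3, heuristic=lambda s: max(0, 3 - s)))
```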


专家链(Chain-of-Experts):解锁混合专家模型的通信能力

  • 标题: Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
  • 作者: Zihan Wang, Rui Pan, Jiarui Yao, Robert Csordas, Linjie Li, Lu Yin, Jiajun Wu, Tong Zhang, Manling Li, Shiwei Liu
  • 日期: 2025-06-23
  • ArXiv主页: https://arxiv.org/abs/2506.18945
  • 论文链接: https://arxiv.org/pdf/2506.18945

英文摘要

We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model’s representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE’s benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.

中文摘要

我们提出专家链(Chain-of-Experts,CoE),一种新的混合专家(MoE)架构,在每一层内部引入专家之间的串行通信。与专家彼此独立、并行计算的传统 MoE 模型不同,CoE 让 token 在层内沿一条专家链被迭代处理。为了支持跨迭代的动态专家选择,CoE 在层内的每个迭代步骤都配备专门的路由器。这一设计允许 token 在每次迭代中重新评估并选择不同的专家,而不是被静态分配。因此,CoE 引入了一种灵活的路由机制,增加了专家组合的多样性,丰富了模型的表示能力。在固定计算量下,CoE 表现出更好的性能:在数学推理任务上,相比标准 MoE,它把验证损失从 1.20 降到 1.12。除了性能之外,CoE 还提供了一个新的扩展维度:通过专家迭代获得深度,它与传统的宽度/深度扩展互补。例如,使用 2 倍迭代可以达到 3 倍专家选择(宽度)的性能,同时相比其他扩展策略把内存占用降低 17.6%-42%。我们的分析表明,CoE 的收益来自其迭代式残差结构,以及由迭代路由带来的更强的专家专业化,二者共同解锁了更有表现力的表示。代码见 https://github.com/ZihanWang314/coe。
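
CoE 的关键点是:同一层内让 token 依次经过若干次专家计算,每次迭代都有独立路由器,并配合残差连接。下面是一个最小的 PyTorch 示意,按摘要的描述推测性地搭出这一结构;专家数、迭代次数、top-k 等均为演示假设,并非官方实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainOfExpertsLayer(nn.Module):
    """示意:层内进行 T 次串行的专家计算;每次迭代有自己的路由器,
    token 以残差方式累加所选专家的输出(结构按摘要推测)。"""
    def __init__(self, d_model=256, d_ff=512, num_experts=4, iterations=2, top_k=1):
        super().__init__()
        self.iterations, self.top_k = iterations, top_k
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.routers = nn.ModuleList([nn.Linear(d_model, num_experts) for _ in range(iterations)])

    def forward(self, x):                                 # x: [B, L, d_model]
        h = x
        for t in range(self.iterations):
            weights = F.softmax(self.routers[t](h), dim=-1)   # 每次迭代重新路由
            topw, topi = weights.topk(self.top_k, dim=-1)     # 稀疏选择 top-k 专家
            out = torch.zeros_like(h)
            for e, expert in enumerate(self.experts):
                sel = (topi == e).any(dim=-1)                 # 哪些 token 选中了专家 e
                if sel.any():
                    w = (topw * (topi == e)).sum(dim=-1, keepdim=True)  # 该专家的门控权重
                    out[sel] = out[sel] + w[sel] * expert(h[sel])
            h = h + out                                        # 迭代间的残差连接
        return h

layer = ChainOfExpertsLayer()
print(layer(torch.randn(2, 10, 256)).shape)    # torch.Size([2, 10, 256])
```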


WorldVLA:迈向自回归动作世界模型

  • 标题: WorldVLA: Towards Autoregressive Action World Model
  • 作者: Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, Hao Chen
  • 日期: 2025-06-26
  • ArXiv主页: https://arxiv.org/abs/2506.21539
  • 论文链接: https://arxiv.org/pdf/2506.21539
  • 项目链接: https://github.com/alibaba-damo-academy/WorldVLA
  • gitHub仓库: https://github.com/alibaba-damo-academy/WorldVLA

英文摘要

We present WorldVLA, an autoregressive action world model that unifies action and image understanding and generation. Our WorldVLA intergrates Vision-Language-Action (VLA) model and world model in one single framework. The world model predicts future images by leveraging both action and image understanding, with the purpose of learning the underlying physics of the environment to improve action generation. Meanwhile, the action model generates the subsequent actions based on image observations, aiding in visual understanding and in turn helps visual generation of the world model. We demonstrate that WorldVLA outperforms standalone action and world models, highlighting the mutual enhancement between the world model and the action model. In addition, we find that the performance of the action model deteriorates when generating sequences of actions in an autoregressive manner. This phenomenon can be attributed to the model’s limited generalization capability for action prediction, leading to the propagation of errors from earlier actions to subsequent ones. To address this issue, we propose an attention mask strategy that selectively masks prior actions during the generation of the current action, which shows significant performance improvement in the action chunk generation task.

中文摘要

我们提出 WorldVLA,一个统一动作与图像的理解和生成的自回归动作世界模型。WorldVLA 把视觉-语言-动作(VLA)模型与世界模型整合进同一个框架:世界模型同时利用动作理解和图像理解来预测未来图像,其目的是学习环境的底层物理规律,从而改进动作生成;动作模型则基于图像观测生成后续动作,既有助于视觉理解,反过来也帮助世界模型的视觉生成。我们证明 WorldVLA 优于各自独立的动作模型和世界模型,凸显了二者之间的相互增强。此外,我们发现以自回归方式生成动作序列时,动作模型的性能会变差。这一现象可以归因于模型在动作预测上的泛化能力有限,导致早期动作的误差向后续动作传播。为了解决这个问题,我们提出一种注意力掩码策略,在生成当前动作时选择性地屏蔽之前的动作,该策略在动作块(action chunk)生成任务上带来了显著的性能提升。
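
摘要提出的注意力掩码策略是:生成当前动作时选择性屏蔽之前生成的动作 token,以阻断误差累积。下面构造这样一个掩码的最小示意;token 类型布局和具体屏蔽规则是按摘要推测的简化版本,并非论文实现。

```python
import torch

def build_action_mask(token_types):
    """示意:token_types 为长度 L 的列表,元素为 "img" 或 "act"。
    返回 [L, L] 的布尔掩码(True 表示允许注意)。规则(按摘要推测):
    在普通因果掩码之上,动作 token 不再注意更早的动作 token,只看图像/观测 token。"""
    L = len(token_types)
    mask = torch.tril(torch.ones(L, L, dtype=torch.bool))   # 先建因果掩码
    for i, ti in enumerate(token_types):
        if ti == "act":
            for j in range(i):
                if token_types[j] == "act":
                    mask[i, j] = False                       # 屏蔽之前的动作,阻断误差传播
    return mask

types = ["img", "img", "act", "img", "act", "act"]
print(build_action_mask(types).int())
```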


MADrive:记忆增强的驾驶场景建模

  • 标题: MADrive: Memory-Augmented Driving Scene Modeling
  • 作者: Polina Karpikova, Daniil Selikhanovych, Kirill Struminsky, Ruslan Musaev, Maria Golitsyna, Dmitry Baranchuk
  • 日期: 2025-06-26
  • ArXiv主页: https://arxiv.org/abs/2506.21520
  • 论文链接: https://arxiv.org/pdf/2506.21520
  • 项目链接: https://yandex-research.github.io/madrive/

英文摘要

Recent advances in scene reconstruction have pushed toward highly realistic modeling of autonomous driving (AD) environments using 3D Gaussian splatting. However, the resulting reconstructions remain closely tied to the original observations and struggle to support photorealistic synthesis of significantly altered or novel driving scenarios. This work introduces MADrive, a memory-augmented reconstruction framework designed to extend the capabilities of existing scene reconstruction methods by replacing observed vehicles with visually similar 3D assets retrieved from a large-scale external memory bank. Specifically, we release MAD-Cars, a curated dataset of ~70K 360° car videos captured in the wild and present a retrieval module that finds the most similar car instances in the memory bank, reconstructs the corresponding 3D assets from video, and integrates them into the target scene through orientation alignment and relighting. The resulting replacements provide complete multi-view representations of vehicles in the scene, enabling photorealistic synthesis of substantially altered configurations, as demonstrated in our experiments. Project page: https://yandex-research.github.io/madrive/

中文摘要

场景重建的最新进展推动了使用 3D 高斯泼溅(3D Gaussian splatting)对自动驾驶(AD)环境进行高度逼真的建模。然而,得到的重建结果仍与原始观测紧密绑定,难以对发生大幅改变或全新的驾驶场景进行照片级真实的合成。这项工作提出 MADrive,一个记忆增强的重建框架,旨在扩展现有场景重建方法的能力:用从大规模外部记忆库中检索到的视觉相似 3D 资产替换场景中观测到的车辆。具体而言,我们发布了 MAD-Cars,一个在真实环境中采集的约 7 万段 360° 车辆视频的精选数据集,并提出一个检索模块,它在记忆库中找到最相似的车辆实例、从视频重建对应的 3D 资产,再通过朝向对齐和重打光将其融入目标场景。由此得到的替换提供了场景中车辆完整的多视角表示,使大幅修改后的场景配置也能进行照片级真实的合成,这在我们的实验中得到了验证。项目页面:https://yandex-research.github.io/madrive/


VIKI-R:通过强化学习协调具身多智能体合作

  • 标题: VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
  • 作者: Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin
  • 日期: 2025-06-10
  • ArXiv主页: https://arxiv.org/abs/2506.09049
  • 论文链接: https://arxiv.org/pdf/2506.09049
  • 项目链接: https://faceong.github.io/VIKI-R/
  • gitHub仓库: https://github.com/MARS-EAI/VIKI-R

英文摘要

Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baselines method across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.

中文摘要

在动态环境中协调多个具身智能体仍是人工智能的核心挑战,它既需要感知驱动的推理,也需要可扩展的合作策略。近期工作利用大语言模型(LLM)做多智能体规划,也有少数工作开始探索用视觉-语言模型(VLM)做视觉推理,但这些基于 VLM 的方法对多样化具身形态的支持仍然有限。在这项工作中,我们提出 VIKI-Bench,首个面向具身多智能体合作的分层基准,包含三个结构化层级:智能体激活、任务规划和轨迹感知。VIKI-Bench 涵盖多样的机器人具身形态、多视角视觉观测,以及用于评估基于视觉输入的推理的结构化监督信号。为展示 VIKI-Bench 的价值,我们提出 VIKI-R,一个两阶段框架:先用思维链标注的示范对预训练视觉-语言模型(VLM)进行微调,再在多层级奖励信号下进行强化学习。大量实验表明,VIKI-R 在所有任务层级上显著优于基线方法。此外,我们发现强化学习能让异构智能体之间涌现出组合式的合作模式。VIKI-Bench 与 VIKI-R 共同提供了一个统一的试验平台和方法,用于推进具身 AI 系统中由视觉驱动的多智能体合作。


OAgents:构建有效代理的实证研究

  • 标题: OAgents: An Empirical Study of Building Effective Agents
  • 作者: He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Yi Yao, Hanhao Li, Ningning Wang, Pai Liu, Tianhao Peng, Xin Gui, Xiaowan Li, Yuhui Liu, Yuchen Eleanor Jiang, Jun Wang, Changwang Zhang, Xiangru Tang, Ge Zhang, Jian Yang, Minghao Liu, Xitong Gao, Jiaheng Liu, Wangchunshu Zhou
  • 日期: 2025-06-17
  • ArXiv主页: https://arxiv.org/abs/2506.15741
  • 论文链接: https://arxiv.org/pdf/2506.15741

英文摘要

Recently, Agentic AI has become an increasingly popular research field. However, we argue that current agent research practices lack standardization and scientific rigor, making it hard to conduct fair comparisons among methods. As a result, it is still unclear how different design choices in agent frameworks affect effectiveness, and measuring their progress remains challenging. In this work, we conduct a systematic empirical study on GAIA benchmark and BrowseComp to examine the impact of popular design choices in key agent components in a fair and rigorous manner. We find that the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, with significant variance between random runs. Therefore, we introduce a more robust evaluation protocol to stabilize comparisons. Our study reveals which components and designs are crucial for effective agents, while others are redundant, despite seeming logical. Based on our findings, we build and open-source OAgents, a new foundation agent framework that achieves state-of-the-art performance among open-source projects. OAgents offers a modular design for various agent components, promoting future research in Agentic AI.

中文摘要

近来,代理式 AI(Agentic AI)已成为日益热门的研究领域。然而我们认为,当前的智能体研究实践缺乏标准化和科学严谨性,难以在方法之间进行公平比较。因此,智能体框架中不同的设计选择如何影响效果仍不清楚,衡量它们的进展也依然困难。在这项工作中,我们在 GAIA 基准和 BrowseComp 上开展系统的实证研究,以公平且严格的方式考察关键智能体组件中各种流行设计选择的影响。我们发现,由于缺乏标准评测协议,以往的工作(即便是开源的)也难以复现,随机运行之间存在显著方差。为此,我们引入了更稳健的评测协议来稳定比较。我们的研究揭示了哪些组件和设计对有效的智能体至关重要,而另一些设计虽然看似合理,实则冗余。基于这些发现,我们构建并开源了 OAgents,一个在开源项目中达到最先进性能的新基础智能体框架。OAgents 为各类智能体组件提供模块化设计,促进代理式 AI 的后续研究。


视觉作为方言:通过文本对齐表示统一视觉理解与生成

  • 标题: Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
  • 作者: Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang
  • 日期: 2025-06-23
  • ArXiv主页: https://arxiv.org/abs/2506.18898
  • 论文链接: https://arxiv.org/pdf/2506.18898
  • 项目链接: https://tar.csuhan.com
  • gitHub仓库: https://github.com/csuhan/Tar

英文摘要

This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model’s (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com

中文摘要

本文提出一个多模态框架,尝试在共享的离散语义表示中统一视觉理解与生成。其核心是文本对齐分词器(Text-Aligned Tokenizer,TA-Tok),它使用一个由大语言模型(LLM)词表投影得到的文本对齐码本,把图像转换为离散 token。通过把视觉和文本整合到一个词表扩展后的统一空间,我们的多模态 LLM(Tar)无需针对特定模态的设计,就能通过共享接口实现跨模态的输入与输出。此外,我们提出尺度自适应的编码与解码以在效率和视觉细节之间取得平衡,并用一个生成式反分词器(de-tokenizer)产生高保真的视觉输出。为满足不同的解码需求,我们使用两个互补的反分词器:一个快速的自回归模型和一个基于扩散的模型。为增强模态融合,我们研究了更先进的预训练任务,在视觉理解和生成两方面都取得了改进。跨基准的实验表明,Tar 达到或超过现有多模态 LLM 方法,并具有更快的收敛速度和更高的训练效率。代码、模型和数据见 https://tar.csuhan.com


RLPR:将RLVR外推到无验证器的通用领域

  • 标题: RLPR: Extrapolating RLVR to General Domains without Verifiers
  • 作者: Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, Tat-Seng Chua
  • 日期: 2025-06-23
  • ArXiv主页: https://arxiv.org/abs/2506.18254
  • 论文链接: https://arxiv.org/pdf/2506.18254
  • 项目链接: https://github.com/OpenBMB/RLPR
  • gitHub仓库: https://github.com/OpenBMB/RLPR

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This primary limitation stems from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address the challenge, our key observation is that LLM’s intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM’s own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial to make it work, and propose prob-to-reward and stabilizing methods to ensure a precise and stable reward from LLM intrinsic probabilities. Comprehensive experiments in four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models. Notably, RLPR outperforms concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses strong verifier-model-dependent approaches General-Reasoner by 1.6 average points across seven benchmarks.

中文摘要

带可验证奖励的强化学习(RLVR)在提升 LLM 推理能力方面展现出可观的潜力,但其成功目前仍主要局限于数学和代码领域。这一根本限制源于对领域专用验证器的严重依赖,带来了过高的复杂性和有限的可扩展性。为应对这一挑战,我们的关键观察是:LLM 生成正确自由形式答案的内在概率,直接反映了它对推理奖励的自我评估(即推理过程在多大程度上导向了正确答案)。基于这一洞察,我们提出 RLPR,一个简单的免验证器框架,把 RLVR 外推到更广泛的通用领域。RLPR 用 LLM 自身对参考答案的 token 概率得分作为奖励信号,并在训练中最大化期望奖励。我们发现,处理这种噪声概率奖励的高方差是让方法奏效的关键,并为此提出 prob-to-reward 变换和稳定化方法,确保从 LLM 内在概率中获得精确而稳定的奖励。在四个通用领域基准和三个数学基准上的全面实验表明,RLPR 在基于 Gemma、Llama 和 Qwen 的模型上都能稳定提升这两类领域的推理能力。值得注意的是,RLPR 在 TheoremQA 上比同期的 VeriFree 高 7.6 分、在 Minerva 上高 7.5 分,甚至在七个基准上以平均 1.6 分的优势超过了依赖验证器模型的强方法 General-Reasoner。
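
RLPR 的奖励信号是"模型对参考答案本身的 token 概率"。下面用 Hugging Face transformers 给一个计算参考答案平均 token 概率的最小示意;模型名 gpt2 仅为跑通流程的占位,论文中的 prob-to-reward 去偏与方差稳定化在此省略,分词边界也只做了近似处理。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def reference_prob_reward(model, tokenizer, prompt: str, reference: str) -> float:
    """示意:把"提示 + 参考答案"拼接后前向一次,取参考答案部分 token 的平均概率作为奖励。
    这是对 RLPR 思路的粗略复刻;prompt/reference 的分词边界按提示长度近似对齐。"""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + reference, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                      # [1, T, V]
    probs = torch.softmax(logits[0, :-1], dim=-1)            # 位置 t 预测 token t+1
    target = full_ids[0, 1:]
    token_probs = probs[torch.arange(target.numel()), target]
    ref_probs = token_probs[prompt_len - 1:]                 # 只保留参考答案段
    return ref_probs.mean().item()

# 占位演示:实际应换成被训练的策略模型与任务提示
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
print(reference_prob_reward(lm, tok, "Q: 2+2=? A:", " 4"))
```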

