
【论文速递】2025年第21周(May-18-24)(Robotics/Embodied AI/LLM)

中文使用 googletrans 翻译,翻译不对的地方以英文为准

目录

  • Qwen3 技术报告
    • 英文摘要
    • 中文摘要
  • 统一多模态预训练中的涌现特性
    • 英文摘要
    • 中文摘要
  • 语言模型的模型链学习
    • 英文摘要
    • 中文摘要
  • NovelSeek:当代理人成为科学家时 - 构建闭环系统从假设到验证
    • 英文摘要
    • 中文摘要
  • Web-Shepherd:推进PRMS用于加强网络代理
    • 英文摘要
    • 中文摘要
  • MMADA:多模式大扩散语言模型
    • 英文摘要
    • 中文摘要
  • AdaptThink:推理模型可以学会何时思考
    • 英文摘要
    • 中文摘要
  • 量化感知训练的缩放定律
    • 英文摘要
    • 中文摘要
  • SageAttention3:用于推理的微缩放 FP4 注意力及对 8 比特训练的探索
    • 英文摘要
    • 中文摘要
  • 缩放推理,失去控制:评估大型推理模型的指令遵循
    • 英文摘要
    • 中文摘要
  • GuardReasoner-VL:通过加强推理保护VLM
    • 英文摘要
    • 中文摘要
  • AdaCoT:通过强化学习实现帕累托最优的自适应思维链触发
    • 英文摘要
    • 中文摘要
  • 工具星:通过加强学习赋予LLM脑的多工具推理器
    • 英文摘要
    • 中文摘要
  • 视觉计划:让我们只用图像思考
    • 英文摘要
    • 中文摘要
  • 扩散与自回归语言模型:文本嵌入观点
    • 英文摘要
    • 中文摘要
  • MMLongBench:对长上下文视觉语言模型进行基准测试
    • 英文摘要
    • 中文摘要
  • 像素推理器:通过好奇心驱动的强化学习激励像素空间推理
    • 英文摘要
    • 中文摘要
  • UNIVG-R1:通过增强学习的推理指导通用视觉接地
    • 英文摘要
    • 中文摘要
  • Thinkless:LLM 学会何时思考
    • 英文摘要
    • 中文摘要
  • DELTA注意:Delta校正的快速准确的稀疏注意力推断
    • 英文摘要
    • 中文摘要
  • Kris-Bench:基准测试下一级智能图像编辑模型
    • 英文摘要
    • 中文摘要
  • 通过用户界面分解和合成来缩放计算机使用接地
    • 英文摘要
    • 中文摘要
  • 有效的计算机使用代理培训
    • 英文摘要
    • 中文摘要
  • QuickVideo:使用系统算法共同设计的实时视频理解
    • 英文摘要
    • 中文摘要
  • 这次是不同的:时间序列基础模型的可观察性观点
    • 英文摘要
    • 中文摘要
  • 大型语言模型预训练中的模型合并
    • 英文摘要
    • 中文摘要
  • 奖励推理模型
    • 英文摘要
    • 中文摘要
  • 更快的视频扩散,可训练的稀疏关注
    • 英文摘要
    • 中文摘要

Qwen3 技术报告

  • 标题: Qwen3 Technical Report
  • 作者: An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, Zihan Qiu
  • 日期: 2025-05-14
  • ArXiv主页: https://arxiv.org/abs/2505.09388
  • 论文链接: https://arxiv.org/pdf/2505.09388
  • 项目链接: https://qwenlm.github.io/blog/qwen3/
  • gitHub仓库: https://github.com/QwenLM/Qwen3

英文摘要

In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models–such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)–and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.

中文摘要

在这项工作中,我们介绍了 Qwen3,这是 Qwen 模型家族的最新版本。Qwen3 包含一系列旨在提升性能、效率和多语言能力的大型语言模型(LLM)。Qwen3 系列同时涵盖稠密架构和专家混合(MoE)架构的模型,参数规模从 0.6B 到 235B。Qwen3 的一项关键创新是将思考模式(用于复杂的多步推理)与非思考模式(用于快速的、由上下文驱动的响应)整合到一个统一框架中。这消除了在聊天优化模型(如 GPT-4o)与专用推理模型(如 QwQ-32B)等不同模型之间切换的需求,并可根据用户查询或聊天模板进行动态模式切换。同时,Qwen3 引入了思考预算机制,允许用户在推理过程中自适应地分配计算资源,从而根据任务复杂度平衡时延与性能。此外,通过利用旗舰模型的知识,我们大幅降低了构建较小规模模型所需的计算资源,同时确保其具有高度竞争力的性能。实证评估表明,Qwen3 在代码生成、数学推理、智能体任务等多种基准上取得了最先进的结果,可与更大的 MoE 模型和专有模型相竞争。与其前代 Qwen2.5 相比,Qwen3 将多语言支持从 29 种扩展到 119 种语言和方言,通过改进的跨语言理解与生成能力增强了全球可及性。为了促进可复现性和社区驱动的研发,所有 Qwen3 模型均在 Apache 2.0 许可下公开可用。
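
下面给出一个基于 transformers 的最小用法示意,演示如何通过聊天模板参数在思考/非思考模式之间切换。其中 `enable_thinking` 参数参考 Qwen3 官方模型卡中的用法,模型名与生成参数仅为示例,具体以官方文档为准。

```python
# 最小示意:通过聊天模板切换 Qwen3 的思考 / 非思考模式
# 假设:enable_thinking 参数与官方示例一致,具体以 Qwen3 官方文档为准
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "证明根号 2 是无理数。"}]

# enable_thinking=True 时模型先生成 <think>...</think> 推理内容再作答;
# 设为 False 则直接给出简短回答,适合简单查询以节省推理开销
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```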


统一多模态预训练中的涌现特性

  • 标题: Emerging Properties in Unified Multimodal Pretraining
  • 作者: Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan
  • 日期: 2025-05-20
  • ArXiv主页: https://arxiv.org/abs/2505.14683
  • 论文链接: https://arxiv.org/pdf/2505.14683

英文摘要

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/

中文摘要

统一的多模态理解与生成能力已在前沿专有系统中展现出令人印象深刻的表现。在这项工作中,我们提出 BAGEL,一个原生支持多模态理解与生成的开源基础模型。BAGEL 是一个统一的、仅解码器(decoder-only)模型,在取自大规模交错的文本、图像、视频和网页数据的数万亿 token 上进行预训练。在如此多样化的多模态交错数据上扩展训练后,BAGEL 在复杂多模态推理中表现出涌现能力。因此,它在标准基准上的多模态生成与理解两方面都显著优于现有开源统一模型,同时展现出自由形式图像编辑、未来帧预测、3D 操作和世界导航等高级多模态推理能力。为了给多模态研究创造更多机会,我们分享了关键发现、预训练细节和数据构建协议,并向社区发布代码与检查点。项目页面见 https://bagel-ai.org/


语言模型的模型链学习

  • 标题: Chain-of-Model Learning for Language Model

  • 作者: Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, Yongliang Shen, Cen LU, Zihao Li, Zifan Song, Caihua Shan, Yansen Wang, Kan Ren, Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu

  • 日期: 2025-05-17

  • ArXiv主页: https://arxiv.org/abs/2505.11820

  • 论文链接: https://arxiv.org/pdf/2505.11820

  • gitHub仓库: https://github.com/microsoft/CoLM

英文摘要

In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer as a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain from the output representations can only view all of its preceding chains in the input representations. Consequently, the model built upon CoM framework can progressively scale up the model size by increasing the chains based on the previous models (i.e., chains), and offer multiple sub-models at varying sizes for elastic inference by using different chain numbers. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of Transformer architecture. Based on CoLM, we further introduce CoLM-Air by introducing a KV sharing mechanism, that computes all keys and values within the first chain and then shares across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching, prefilling acceleration and so on. Experimental results demonstrate our CoLM family can achieve comparable performance to the standard Transformer, while simultaneously enabling greater flexibility, such as progressive scaling to improve training efficiency and offer multiple varying model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.

中文摘要

在本文中,我们提出了一种新颖的学习范式,称为模型链(COM),该范式将因果关系纳入每一层的隐藏状态作为链式样式,从而在模型训练中引入了极大的扩展效率,并在部署中推断了灵活性。我们介绍了代表链(COR)的概念,该概念将每个层的隐藏状态制定为在隐藏维度级别上多个子代理(即链)的组合。在每一层中,来自输出表示的每个链只能在输入表示中查看其所有先前的链。因此,基于COM框架构建的模型可以通过基于先前的模型(即链)增加链条逐渐扩大模型大小,并通过使用不同的链数来提供多个以不同尺寸的弹性推理的子模型。基于这一原则,我们设计了语言链模型(COLM),该链将COM的想法纳入了变压器体系结构的每个层。基于COLM,我们通过引入KV共享机制进一步引入Colm-Air,该机制计算第一个链中的所有键和值,然后在所有链中共享。该设计显示了其他可扩展性,例如启用无缝的LM开关,预填充加速度等。实验结果表明,我们的COLM家族可以实现与标准变压器的可比性能,同时可以实现更大的灵活性,例如渐进式扩展以提高训练效率,并为弹性推理提供多个变化的模型大小,为构建语言模型的新方法铺平了一种方法。我们的代码将来将在以下网址发布:https://github.com/microsoft/colm。
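
下面是一个概念性示意(并非论文官方实现),用分块下三角掩码的线性层说明表示链(CoR)的核心约束:输出的第 i 条链只依赖输入的前 i 条链;链数与维度均为假设值,弹性推理通过只取前若干条链实现。

```python
# 概念示意:表示链(CoR)的分块因果线性层,输出块 i 只连接输入块 0..i
# 假设:hidden 维度被均分为 n_chains 条链;这不是论文的官方实现
import torch
import torch.nn as nn

class ChainLinear(nn.Module):
    def __init__(self, hidden_size, n_chains):
        super().__init__()
        assert hidden_size % n_chains == 0
        self.n_chains = n_chains
        self.chunk = hidden_size // n_chains
        self.weight = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.02)
        # 分块下三角掩码:输出块 i 只能看到输入块 0..i
        block_mask = torch.tril(torch.ones(n_chains, n_chains))
        self.register_buffer(
            "mask",
            block_mask.repeat_interleave(self.chunk, dim=0).repeat_interleave(self.chunk, dim=1),
        )

    def forward(self, x, n_active=None):
        # 弹性推理:只使用前 n_active 条链;由于分块下三角结构,左上子块是自洽的
        d = (n_active or self.n_chains) * self.chunk
        w = (self.weight * self.mask)[:d, :d]
        return x[..., :d] @ w.T

layer = ChainLinear(hidden_size=64, n_chains=4)
x = torch.randn(2, 10, 64)
print(layer(x).shape, layer(x, n_active=2).shape)  # torch.Size([2, 10, 64]) torch.Size([2, 10, 32])
```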


NovelSeek:当代理人成为科学家时 - 构建闭环系统从假设到验证

  • 标题: NovelSeek: When Agent Becomes the Scientist – Building Closed-Loop System from Hypothesis to Verification
  • 作者: NovelSeek Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Runmin Ma, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai
  • 日期: 2025-05-22
  • ArXiv主页: https://arxiv.org/abs/2505.16938
  • 论文链接: https://arxiv.org/pdf/2505.16938

英文摘要

Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. NovelSeek highlights three key advantages: 1) Scalability: NovelSeek has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: NovelSeek provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: NovelSeek has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, it increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.52 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.

中文摘要

人工智能(AI)正在加快科学研究范式的转变,不仅提高了研究效率,而且还推动了创新。我们介绍了NovelSeek,这是一个统一的闭环多代理框架,以在各个科学研究领域进行自主科学研究(ASR),使研究人员能够以前所未有的速度和精确度解决这些领域中复杂的问题。NovelSeek强调了三个关键优势:1)可伸缩性:NovelSeek在12个科学研究任务中证明了其多功能性,能够产生创新的思想以增强基线代码的性能。2)互动性:NovelSeek为自动端到端流程中人类专家反馈和多代理互动提供了界面,从而使域专家知识的无缝集成。3)效率:与人类努力相比,NovelSeek在几个科学领域取得了有希望的表现增长。例如,在反应产量预测中,它在短短12小时内从27.6%增加到35.4%。在增强剂活性预测中,精度从0.52上升到0.79,只有4个小时的处理;在2D语义细分中,精度从仅30小时内的78.8%提高到81.0%。


Web-Shepherd:推进PRMS用于加强网络代理

  • 标题: Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

  • 作者: Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo

  • 日期: 2025-05-21

  • ArXiv主页: https://arxiv.org/abs/2505.15277

  • 论文链接: https://arxiv.org/pdf/2505.15277

  • gitHub仓库: https://github.com/kyle8581/Web-Shepherd

英文摘要

Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, in this work, we propose the first process reward model (PRM) called Web-Shepherd which could assess web navigation trajectories in a step-level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce the WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance, in 10 times less cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.

中文摘要

Web 导航是一个独特的领域,可以自动化许多重复性的现实任务,同时也颇具挑战性,因为它需要超出典型多模态大语言模型(MLLM)任务范围的长时程序贯决策。然而,迄今为止一直缺乏可在训练和测试阶段同时使用的、面向 Web 导航的专门奖励模型。尽管速度和成本效益非常重要,先前的工作仍把 MLLM 当作奖励模型使用,这给实际部署带来了重大限制。为了解决这一问题,我们在这项工作中提出了第一个过程奖励模型(PRM),称为 Web-Shepherd,可以在步骤级别评估 Web 导航轨迹。为此,我们首先构建了 WebPRM Collection,这是一个包含 4 万个步骤级偏好对、并带有覆盖多种领域与难度级别的标注清单的大规模数据集。接着,我们还推出了 WebRewardBench,这是第一个用于评估 PRM 的元评估基准。在实验中,我们观察到 Web-Shepherd 在 WebRewardBench 上的准确率比使用 GPT-4o 高约 30 个点。此外,在 WebArena-lite 上以 GPT-4o-mini 作为策略、以 Web-Shepherd 作为验证器进行测试时,相比用 GPT-4o-mini 作验证器,我们取得了高出 10.9 个点的性能,而成本只有其约十分之一。我们的模型、数据集和代码在 LINK 公开可用。
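
下面是一个概念性示意(非官方实现),说明步骤级过程奖励模型在测试时作为验证器的用法:按清单为候选动作打分并重排序。其中 `dummy_score_fn` 只是占位打分函数,真实场景应替换为 Web-Shepherd 这类 PRM 的推理接口。

```python
# 概念示意:测试时把步骤级 PRM 当作验证器,对策略采样出的候选动作打分并重排序
# 假设:score_fn 是给 (任务指令, 清单, 交互历史, 候选动作) 打分的模型接口,这里用占位实现
def dummy_score_fn(instruction, checklist, history, action):
    # 占位:真实场景应由 Web-Shepherd 这类过程奖励模型输出 0~1 的步骤奖励
    return 0.5 + 0.1 * sum(item in action for item in checklist)

def best_next_action(instruction, checklist, history, candidates, score_fn=dummy_score_fn):
    scored = [(score_fn(instruction, checklist, history, a), a) for a in candidates]
    return max(scored)[1]

checklist = ["search", "submit"]
candidates = [
    "click(ad_banner)",
    "type(search_box, 'arxiv 2505.15277')",
    "click(search_button)",
]
print(best_next_action("找到这篇论文的页面", checklist, history=[], candidates=candidates))
```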


MMADA:多模式大扩散语言模型

  • 标题: MMaDA: Multimodal Large Diffusion Language Models
  • 作者: Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang
  • 日期: 2025-05-21
  • ArXiv主页: https://arxiv.org/abs/2505.15809
  • 论文链接: https://arxiv.org/pdf/2505.15809
  • 项目链接: https://huggingface.co/spaces/Gen-Verse/MMaDA
  • gitHub仓库: https://github.com/Gen-Verse/MMaDA

英文摘要

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model’s ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA’s effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA

中文摘要

我们介绍了MMADA,这是一种新颖的多模式扩散基础模型,旨在在文本推理,多模式理解和文本到图像生成等各个领域中实现卓越的性能。该方法通过三个关键创新来区分:(i)MMADA采用具有共同概率表述和模态性设计设计的统一扩散体系结构,从而消除了对模态特异性组件的需求。该体系结构可确保跨不同数据类型的无缝集成和处理。(ii)我们实施了混合的长期思考(COT)微调策略,该策略策划了跨模式的统一COT格式。通过使文本和视觉域之间的推理过程保持一致,该策略促进了最终强化学习(RL)阶段的冷启动训练,从而增强了模型从一开始就处理复杂任务的能力。(iii)我们提出了Unigrpo,这是一种专门针对扩散基础模型量身定制的统一基于策略梯度的RL算法。利用多元化的奖励建模,Unigrpo统一了推理和发电任务的训练后培训,从而确保了一致的绩效提高。实验结果表明,MMADA-8B作为统一的多模式基础模型具有强大的概括能力。它超过了文本推理中的Llama-3-7b和Qwen2-7b等强大的模型,在多模式理解中超过了Show-O和Seed-X,并且在文本到图像生成中擅长SDXL和Janus。这些成就凸显了MMADA在弥合统一扩散体系结构内训练和训练后之间的差距方面的有效性,从而为未来的研究和开发提供了全面的框架。我们在:https://github.com/gen-verse/mmada上开放代码和训练有素的模型


AdaptThink:推理模型可以学会何时思考

  • 标题: AdaptThink: Reasoning Models Can Learn When to Think

  • 作者: Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, Juanzi Li

  • 日期: 2025-05-19

  • ArXiv主页: https://arxiv.org/abs/2505.13417

  • 论文链接: https://arxiv.org/pdf/2505.13417

  • gitHub仓库: https://github.com/THU-KEG/AdaptThink

英文摘要

Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our codes and models are available at https://github.com/THU-KEG/AdaptThink.

中文摘要

最近,大型推理模型通过类似人类的深度思考,在各种任务上取得了令人印象深刻的表现。然而,冗长的思考过程大幅增加了推理开销,使效率成为关键瓶颈。在这项工作中,我们首先证明,对于相对简单的任务,让推理模型跳过思考、直接生成最终答案的 NoThinking 方式在性能和效率上都是更好的选择。受此启发,我们提出 AdaptThink,一种新颖的强化学习算法,教推理模型根据问题难度自适应地选择最优思考模式。具体而言,AdaptThink 包含两个核心组件:(1)一个约束优化目标,在保持整体性能的前提下鼓励模型选择 NoThinking;(2)一种重要性采样策略,在 on-policy 训练期间平衡 Thinking 与 NoThinking 样本,从而实现冷启动,并让模型在整个训练过程中同时探索和利用两种思考模式。实验表明,AdaptThink 显著降低了推理成本,同时进一步提升了性能。值得注意的是,在三个数学数据集上,AdaptThink 将 DeepSeek-R1-Distill-Qwen-1.5B 的平均响应长度减少了 53%,并将其准确率提高了 2.4%,凸显了自适应思考模式选择在优化推理质量与效率平衡方面的潜力。我们的代码和模型可在 https://github.com/THU-KEG/AdaptThink 获取。
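
下面用一小段示意代码说明摘要中两个核心组件的直觉(非论文官方实现):以参考精度为基线、对 NoThinking 回答给予额外奖励的约束式奖励整形,以及按固定比例混合 Thinking/NoThinking 样本的采样;其中 delta、ref_acc、混合比例均为假设值。

```python
# 概念示意(非官方实现):AdaptThink 式的"约束 + 重要性采样"奖励整形
# 假设:delta 为鼓励 NoThinking 的奖励加成,ref_acc 为参考模型精度,0.5 的混合比例仅为示意
import random

def shaped_reward(is_nothinking, correct, ref_acc, delta=0.05):
    """正确性奖励;在不低于参考精度的前提下,额外奖励跳过思考的回答。"""
    r = 1.0 if correct else 0.0
    if is_nothinking:
        r += delta          # 约束优化中惩罚/拉格朗日项的一种简化写法
    return r - ref_acc      # 以参考精度为基线,近似"保持整体性能"的约束

def sample_mode(p_nothinking=0.5):
    """重要性采样:在 on-policy 训练中按固定比例混合两种模式,便于冷启动并兼顾探索。"""
    return random.random() < p_nothinking

for _ in range(3):
    mode = sample_mode()
    print("NoThinking" if mode else "Thinking",
          shaped_reward(mode, correct=True, ref_acc=0.7))
```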


量化感知训练的缩放定律

  • 标题: Scaling Law for Quantization-Aware Training
  • 作者: Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, Zhiheng Liu, Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, Ping Luo
  • 日期: 2025-05-20
  • ArXiv主页: https://arxiv.org/abs/2505.14302
  • 论文链接: https://arxiv.org/pdf/2505.14302

英文摘要

Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.

中文摘要

大型语言模型(LLMS)需要大量的计算和内存资源,从而构成部署挑战。量化感知培训(QAT)通过在保持性能的同时降低模型精度来解决这些挑战。但是,尚不清楚QAT的缩放行为,尤其是在4位精度(W4A4)时的缩放行为。现有的QAT缩放定律通常会忽略关键因素,例如训练令牌和量化粒度的数量,从而限制了其适用性。本文提出了针对QAT的统一缩放定律,该定律定量将量化误差定为模型大小,训练数据量和量化组大小的函数。通过268个QAT实验,我们表明量化误差随着模型尺寸的增加而降低,但随着更多的训练令牌和更粗的量化粒度而上升。为了识别W4A4量化误差的源,我们将其分解为重量和激活组件。这两个组件都遵循W4A4量化误差的总体趋势,但具有不同的敏感性。具体而言,随着更多的训练令牌,重量量化误差会更快地增加。进一步的分析表明,由异常值引起的FC2层中的激活量化误差是W4A4 QAT量化误差的主要瓶颈。通过应用混合精确量化以解决此瓶颈,我们证明了重量和激活量化误差可以收敛到相似的水平。此外,随着更多训练数据,权重量化误差最终超出了激活量误差,这表明在这种情况下减少权重量误差也很重要。这些发现为改善QAT研究和开发提供了关键见解。
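
下面给出一个示意性的幂律形式(系数与具体函数形式均为假设,以论文的拟合结果为准),仅用于说明摘要的定性结论:量化误差随模型规模 N 增大而下降,随训练 token 数 D 和量化组大小 G 增大而上升。

```python
# 示意:把 QAT 量化误差写成模型规模 N、训练 token 数 D、量化组大小 G 的幂律函数
# 注意:k、alpha、beta、gamma 均为虚构系数,具体函数形式以论文拟合结果为准
def qat_error(N, D, G, k=1.0, alpha=0.3, beta=0.2, gamma=0.4):
    return k * (D ** alpha) * (G ** beta) / (N ** gamma)

# 定性检查:N 增大 -> 误差降低;D 或 G 增大 -> 误差升高
print(qat_error(N=1e9, D=1e11, G=128) > qat_error(N=7e9, D=1e11, G=128))   # True
print(qat_error(N=1e9, D=1e12, G=128) > qat_error(N=1e9, D=1e11, G=128))   # True
print(qat_error(N=1e9, D=1e11, G=256) > qat_error(N=1e9, D=1e11, G=128))   # True
```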


SageAttention3:用于推理的微缩放 FP4 注意力及对 8 比特训练的探索

  • 标题: SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
  • 作者: Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen
  • 日期: 2025-05-16
  • ArXiv主页: https://arxiv.org/abs/2505.11594
  • 论文链接: https://arxiv.org/pdf/2505.11594
  • 项目链接: https://github.com/thu-ml/SageAttention
  • gitHub仓库: https://github.com/thu-ml/SageAttention

英文摘要

The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.

中文摘要

由于注意力的时间复杂度是二次的,其效率十分重要。我们通过两项关键贡献提升注意力的效率:首先,我们利用 Blackwell GPU 中新的 FP4 Tensor Core 来加速注意力计算。我们的实现在 RTX5090 上达到 1038 TOPS,相比 RTX5090 上最快的 FlashAttention 有 5 倍加速。实验表明,我们的 FP4 注意力可以以即插即用的方式加速各种模型的推理。其次,我们率先将低比特注意力用于训练任务。现有的低比特注意力工作(如 FlashAttention3 和 SageAttention)只关注推理,但训练大模型的效率同样重要。为了探索低比特注意力能否有效应用于训练任务,我们为前向和反向传播设计了一种准确高效的 8 比特注意力。实验表明,8 比特注意力在微调任务中可以做到无损,但在预训练任务中收敛较慢。代码将在 https://github.com/thu-ml/SageAttention 提供。
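
下面是一个用 NumPy 写的微缩放(microscaling)量化概念示意(非 SageAttention3 的官方实现):每个小块共享一个缩放因子,块内数值舍入到 FP4(E2M1)可表示的离散网格;块大小等参数为假设值。

```python
# 概念示意(非官方实现):微缩放 FP4 量化,每个 block 共享一个缩放因子
import numpy as np

FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 的非负可表示值

def quantize_fp4_blockwise(x, block=16):
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0 + 1e-12   # 每个 block 一个缩放因子
    mag = np.abs(x) / scale
    idx = np.abs(mag[..., None] - FP4_VALUES).argmin(axis=-1)    # 最近邻舍入到 FP4 网格
    q = FP4_VALUES[idx] * np.sign(x)
    return q, scale                                              # 反量化值 = q * scale

def dequantize(q, scale):
    return q * scale

x = np.random.randn(4, 16).astype(np.float32)
q, s = quantize_fp4_blockwise(x)
print("块内量化误差(均方):", float(((dequantize(q, s) - x) ** 2).mean()))
```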


缩放推理,失去控制:评估大型推理模型的指令遵循

  • 标题: Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models

  • 作者: Tingchen Fu, Jiawei Gu, Yafu Li, Xiaoye Qu, Yu Cheng

  • 日期: 2025-05-20

  • ArXiv主页: https://arxiv.org/abs/2505.14810

  • 论文链接: https://arxiv.org/pdf/2505.14810

  • gitHub仓库: https://github.com/TingchenFu/MathIF

英文摘要

Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at https://github.com/TingchenFu/MathIF.

中文摘要

指导跟踪对于将大语言模型(LLM)与用户意图保持一致至关重要。尽管最近以推理为导向的模型在复杂的数学问题上表现出令人印象深刻的表现,但它们遵守自然语言指令的能力仍然没有得到充实的态度。在这项工作中,我们介绍了Mathif,这是一种专门的基准测试,用于评估数学推理任务中的指导跟踪。我们的经验分析揭示了扩大推理能力和保持可控性之间的一致张力,因为这些模型通常更有效地难以遵守用户指令。我们发现,通过蒸馏的长长的链条进行了调整的模型,或者通过以推理为导向的强化学习经常在指导依从性中降低,尤其是在发电长度增加时。此外,我们表明,即使是简单的干预措施也可以部分恢复服从,尽管以推理性能为代价。这些发现突出了当前LLM培训范式中的根本张力,并激发了对更多的指导性推理模型的需求。我们在https://github.com/tingchenfu/mathif上发布代码和数据。


GuardReasoner-VL:通过加强推理保护VLM

  • 标题: GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning

  • 作者: Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, Bryan Hooi

  • 日期: 2025-05-16

  • ArXiv主页: https://arxiv.org/abs/2505.11049

  • 论文链接: https://arxiv.org/pdf/2505.11049

  • gitHub仓库: https://github.com/yueliu1999/GuardReasoner-VL

英文摘要

To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to deliberatively reason before making moderation decisions via online RL. First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, based on it, we cold-start our model’s reasoning ability via SFT. In addition, we further enhance reasoning regarding moderation through online RL. Concretely, to enhance diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. Besides, we use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model. Remarkably, it surpasses the runner-up by 19.27% F1 score on average. We release data, code, and models (3B/7B) of GuardReasoner-VL at https://github.com/yueliu1999/GuardReasoner-VL/

中文摘要

为了提高VLM的安全性,本文介绍了一种新颖的基于推理的VLM Guard模型,名为GuardReasonier-VL。核心思想是在通过在线RL做出节制决策之前,将警卫模型激励为有意义的原因。首先,我们构建了具有123K样本和631K推理步骤,跨越文本,图像和文本图像输入的推理语料库。然后,基于它,我们通过SFT启动了模型的推理能力。此外,我们进一步增强了通过在线RL进行节制的推理。具体地说,为了增强样品的多样性和难度,我们通过提出的安全意识数据串联进行排斥采样,然后进行数据增强。此外,我们使用动态剪辑参数来鼓励在早期阶段进行探索,并在后期的阶段进行剥削。为了平衡性能和令牌效率,我们设计了一个长度感知的安全奖励,以整合准确性,格式和代币成本。广泛的实验证明了我们模型的优势。值得注意的是,它平均超过了亚军的F1得分19.27%。我们在https://github.com/yueliu1999/guardreasoner-vl/上发布GuardReasoner-VL的数据,代码和型号(3B/7B)


AdaCoT:通过强化学习实现帕累托最优的自适应思维链触发

  • 标题: AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning
  • 作者: Chenwei Lou, Zewei Sun, Xinnian Liang, Meng Qu, Wei Shen, Wenqi Wang, Yuntao Li, Qingping Yang, Shuangzhi Wu
  • 日期: 2025-05-17
  • ArXiv主页: https://arxiv.org/abs/2505.11896
  • 论文链接: https://arxiv.org/pdf/2505.11896

英文摘要

Large Language Models (LLMs) have demonstrated remarkable capabilities but often face challenges with tasks requiring sophisticated reasoning. While Chain-of-Thought (CoT) prompting significantly enhances reasoning, it indiscriminately generates lengthy reasoning steps for all queries, leading to substantial computational costs and inefficiency, especially for simpler inputs. To address this critical issue, we introduce AdaCoT (Adaptive Chain-of-Thought), a novel framework enabling LLMs to adaptively decide when to invoke CoT. AdaCoT framed adaptive reasoning as a Pareto optimization problem that seeks to balance model performance with the costs associated with CoT invocation (both frequency and computational overhead). We propose a reinforcement learning (RL) based method, specifically utilizing Proximal Policy Optimization (PPO), to dynamically control the CoT triggering decision boundary by adjusting penalty coefficients, thereby allowing the model to determine CoT necessity based on implicit query complexity. A key technical contribution is Selective Loss Masking (SLM), designed to counteract decision boundary collapse during multi-stage RL training, ensuring robust and stable adaptive triggering. Experimental results demonstrate that AdaCoT successfully navigates the Pareto frontier, achieving substantial reductions in CoT usage for queries not requiring elaborate reasoning. For instance, on our production traffic testset, AdaCoT reduced CoT triggering rates to as low as 3.18% and decreased average response tokens by 69.06%, while maintaining high performance on complex tasks.

中文摘要

大型语言模型(LLM)已展现出非凡的能力,但在需要复杂推理的任务上仍常常面临挑战。思维链(CoT)提示虽能显著增强推理,却会不加区分地为所有查询生成冗长的推理步骤,带来可观的计算成本和低效率,对简单输入尤其如此。为了解决这一关键问题,我们提出 AdaCoT(自适应思维链),一个使 LLM 能够自适应决定何时调用 CoT 的新框架。AdaCoT 将自适应推理建模为帕累托优化问题,力求在模型性能与 CoT 调用相关的成本(频率和计算开销)之间取得平衡。我们提出一种基于强化学习(RL)的方法,具体采用近端策略优化(PPO),通过调整惩罚系数来动态控制 CoT 触发的决策边界,从而让模型依据隐式的查询复杂度判断是否需要 CoT。一个关键的技术贡献是选择性损失掩蔽(SLM),用于抵消多阶段 RL 训练中的决策边界坍塌,确保自适应触发的稳健与稳定。实验结果表明,AdaCoT 成功地贴近帕累托前沿,对无需精细推理的查询大幅减少了 CoT 的使用。例如,在我们的生产流量测试集上,AdaCoT 将 CoT 触发率降低至 3.18%,平均响应 token 数减少 69.06%,同时在复杂任务上保持高性能。


工具星:通过加强学习赋予LLM脑的多工具推理器

  • 标题: Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
  • 作者: Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, Ji-Rong Wen
  • 日期: 2025-05-22
  • ArXiv主页: https://arxiv.org/abs/2505.16410
  • 论文链接: https://arxiv.org/pdf/2505.16410
  • 项目链接: https://github.com/dongguanting/Tool-Star/
  • gitHub仓库: https://github.com/dongguanting/Tool-Star

英文摘要

Recently, large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). However, leveraging the RL algorithm to empower effective multi-tool collaborative reasoning in LLMs remains an open challenge. In this paper, we introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training. To address the scarcity of tool-use data, we propose a general tool-integrated reasoning data synthesis pipeline, which combines tool-integrated prompting with hint-based sampling to automatically and scalably generate tool-use trajectories. A subsequent quality normalization and difficulty-aware classification process filters out low-quality samples and organizes the dataset from easy to hard. Furthermore, we propose a two-stage training framework to enhance multi-tool collaborative reasoning by: (1) cold-start fine-tuning, which guides LLMs to explore reasoning patterns via tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with hierarchical reward design, which reinforces reward understanding and promotes effective tool collaboration. Experimental analyses on over 10 challenging reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star. The code is available at https://github.com/dongguanting/Tool-Star.

中文摘要

最近,大型语言模型(LLMS)通过大规模增强学习(RL)表现出了显着的推理能力。但是,利用RL算法来授权LLMS中有效的多工具协作推理能力仍然是一个悬而未决的挑战。在本文中,我们介绍了Tool-Star,这是一种基于RL的框架,旨在使LLMS在逐步推理期间自主调用多个外部工具。工具星将六种类型的工具集成在一起,并将系统设计纳入数据综合和培训。为了解决工具使用数据的稀缺性,我们提出了一个通用工具集成的推理数据合成管道,该管道将工具集成的提示与基于提示的采样结合起来,以自动且可扩展地生成工具使用轨迹。随后的质量归一化和困难的分类过程过滤了低质量的样本,并将数据集从易于到硬化组织。此外,我们提出了一个两阶段的培训框架,以通过以下方式增强多工具协作推理:(1)冷启动微型调整,它指导LLMS通过工具 - 发动机反馈来探索推理模式;(2)具有分层奖励设计的多工具自我批评RL算法,从而增强了奖励理解并促进有效的工具协作。实验性分析对10多种挑战性推理基准进行了强调刀具明星的有效性和效率。该代码可在https://github.com/dongguanting/tool-star上找到。


视觉计划:让我们只用图像思考

  • 标题: Visual Planning: Let’s Think Only with Images

  • 作者: Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić

  • 日期: 2025-05-16

  • ArXiv主页: https://arxiv.org/abs/2505.11409

  • 论文链接: https://arxiv.org/pdf/2505.11409

  • gitHub仓库: https://github.com/yix8/VisualPlanning

英文摘要

Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.

中文摘要

大型语言模型(LLM)及其多模式扩展(MLLM)的最新进展具有跨不同任务的机器推理。但是,这些模型主要依靠纯文本作为表达和结构推理的媒介,即使存在视觉信息。在这项工作中,我们认为语言可能并不总是是推理的最自然或有效的方式,尤其是在涉及空间和几何信息的任务中。在此激励的情况下,我们提出了一个新的范式,视觉计划,该范式可以通过独立于文本的纯粹视觉表示进行计划。在此范式中,计划是通过对视觉域中逐步推断的图像序列进行执行的,类似于人类如何绘制或可视化未来的动作。我们介绍了一个新颖的增强学习框架,通过强化学习(VPRL)的视觉计划,并由GRPO授权进行训练后大视力模型,从而在选择代表性的视觉导航任务,Frozenlake,Maze,Maze和Minibehavior方面进行了实质性改进。我们的视觉计划范式优于在文本空间中进行推理的所有其他计划变体。我们的结果将视觉规划确定为基于语言推理的可行且有希望的替代方案,为受益于直观的,基于图像的推理的任务开辟了新的途径。


扩散与自回归语言模型:文本嵌入观点

  • 标题: Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
  • 作者: Siyue Zhang, Yilun Zhao, Liyuan Geng, Arman Cohan, Anh Tuan Luu, Chen Zhao
  • 日期: 2025-05-21
  • ArXiv主页: https://arxiv.org/abs/2505.15045
  • 论文链接: https://arxiv.org/pdf/2505.15045

英文摘要

Large language model (LLM)-based embedding models, benefiting from large scale pre-training and post-training, have begun to surpass BERT and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, We propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, 2% on instruction-following retrieval, and achieve competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.

中文摘要

大型语言模型(LLM)基于大规模培训和训练后受益的嵌入模型已开始超过通用文本嵌入任务(例如文档检索)的基于BERT和T5的模型。但是,LLM嵌入的基本局限性在于自回归预训练期间使用的单向关注,这与文本嵌入任务的双向性质失误。为此,我们建议采用用于文本嵌入的扩散语言模型,这是由于它们固有的双向体系结构以及最近在匹配或超越LLM的成功的动机,尤其是在推理任务上。我们介绍了扩散语言嵌入模型的首次系统研究,该模型的表现优于基于LLM的嵌入模型,在长期检索上的含量为20%,在推理密集型检索方面的表现为8%,在遵循指导遵循的检索中为2%,并在传统的文本嵌入基准上实现竞争性能。我们的分析验证了双向关注对于编码长期和复杂文本中的全球环境至关重要。
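
作为补充,下面给出文本嵌入任务中常见的按 mask 均值池化的最小示意:双向注意力意味着每个位置的隐藏状态都能看到整句,池化后的向量更适合作为整句表示。这里用随机张量代替真实模型输出,并非论文的具体做法。

```python
# 概念示意:从(双向注意力的)语言模型隐藏状态得到文本嵌入的常见做法:按 mask 做均值池化
# 假设:hidden 为任意模型最后一层输出,形状 (batch, seq_len, dim);这里用随机张量代替真实模型
import torch
import torch.nn.functional as F

def mean_pool(hidden, attention_mask):
    mask = attention_mask.unsqueeze(-1).float()            # (B, L, 1)
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
    return F.normalize(emb, dim=-1)                        # 归一化后可直接做余弦相似度检索

hidden = torch.randn(2, 8, 16)
attention_mask = torch.tensor([[1] * 8, [1] * 5 + [0] * 3])
emb = mean_pool(hidden, attention_mask)
print(emb.shape, (emb[0] @ emb[1]).item())                 # torch.Size([2, 16]) 及两条文本的相似度
```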


MMLongBench:对长上下文视觉语言模型进行基准测试

  • 标题: MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
  • 作者: Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman
  • 日期: 2025-05-15
  • ArXiv主页: https://arxiv.org/abs/2505.10610
  • 论文链接: https://arxiv.org/pdf/2505.10610
  • 项目链接: https://zhaowei-wang-nlp.github.io/MMLongBench-page/
  • gitHub仓库: https://github.com/EdinburghNLP/MMLongBench

英文摘要

The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models’ vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.

中文摘要

大型视觉语言模型中上下文窗口的快速扩展催生了长上下文视觉语言模型(LCVLM),它们能够在一次前向传播中处理数百张图像与交错的文本 token。在这项工作中,我们提出 MMLongBench,第一个覆盖多样化长上下文视觉语言任务的基准,用于有效而全面地评估 LCVLM。MMLongBench 由 13,331 个样例组成,涵盖视觉 RAG、多样本(many-shot)ICL 等五类下游任务,并广泛覆盖各种自然图像与合成图像等图像类型。为了评估模型对不同输入长度的鲁棒性,所有样例都通过结合视觉 patch 与文本 token 的跨模态切分方案,以五种标准化输入长度(8K-128K token)交付。通过对 46 个闭源和开源 LCVLM 的全面基准测试,我们对当前模型的视觉语言长上下文能力进行了系统分析。结果表明:i)单一任务上的性能只是整体长上下文能力的弱代理;ii)闭源与开源模型在长上下文视觉语言任务中都面临挑战,未来仍有很大改进空间;iii)推理能力更强的模型往往表现出更好的长上下文性能。凭借广泛的任务覆盖、多样的图像类型和严格的长度控制,MMLongBench 为诊断和推进下一代 LCVLM 提供了此前缺失的基础。
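
下面用一小段代码示意“把图像 patch 与文本 token 折算成统一上下文长度并对齐到标准档位”的思路(非基准的官方构造流程):每张图折算的 patch 数与裁剪策略均为假设。

```python
# 概念示意(非官方实现):把"图像 patch + 文本 token"折算成统一长度并裁剪到标准档位
PATCHES_PER_IMAGE = 256          # 假设每张图折算为 256 个视觉 token
TARGET_LENGTHS = [8_192, 16_384, 32_768, 65_536, 131_072]

def total_tokens(n_text_tokens, n_images):
    return n_text_tokens + n_images * PATCHES_PER_IMAGE

def fit_to_budget(n_text_tokens, n_images, budget):
    """超出预算时丢弃部分图像(仅作示意,真实基准有更精细的构造规则)。"""
    while n_images > 0 and total_tokens(n_text_tokens, n_images) > budget:
        n_images -= 1
    return n_text_tokens, n_images

for budget in TARGET_LENGTHS[:3]:
    print(budget, fit_to_budget(n_text_tokens=4_000, n_images=200, budget=budget))
```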


像素推理器:通过好奇心驱动的强化学习激励像素空间推理

  • 标题: Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
  • 作者: Alex Su, Haozhe Wang, Weimin Ren, Fangzhen Lin, Wenhu Chen
  • 日期: 2025-05-21
  • ArXiv主页: https://arxiv.org/abs/2505.15966
  • 论文链接: https://arxiv.org/pdf/2505.15966

英文摘要

Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidences, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model’s initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.

中文摘要

思维链推理显著提升了大型语言模型(LLM)在各个领域的性能。然而,这一推理过程一直局限于文本空间,限制了其在视觉密集型任务中的有效性。为了解决这一局限,我们提出像素空间推理的概念。在这一新框架中,视觉语言模型(VLM)配备了一套视觉推理操作,例如放大(zoom-in)和选帧(select-frame)。这些操作使 VLM 能够直接检视、查询并从视觉证据中推断,从而提高视觉任务的推理保真度。在 VLM 中培养这种像素空间推理能力面临明显挑战,包括模型初始能力的不均衡,以及它不愿采用新引入的像素空间操作。我们通过两阶段训练来应对这些挑战:第一阶段在合成的推理轨迹上做指令微调,让模型熟悉这些新的视觉操作;随后的强化学习(RL)阶段利用好奇心驱动的奖励方案,在像素空间推理与文本推理之间平衡探索。借助这些视觉操作,VLM 可以与信息丰富的图像或视频等复杂视觉输入交互,主动收集必要的信息。我们证明这一方法在多种视觉推理基准上显著提升了 VLM 的性能。我们的 7B 模型 Pixel Reasoner 在 V* bench 上达到 84%,在 TallyQA-Complex 上达到 74%,在 InfographicsVQA 上达到 84%,是迄今为止开源模型取得的最高准确率。这些结果凸显了像素空间推理的重要性以及我们框架的有效性。
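
下面是两个像素空间视觉操作(zoom-in 与 select-frame)的最小执行器示意(非官方实现):模型在推理中输出操作调用,由执行器返回新的视觉观察;函数签名与参数均为假设。

```python
# 概念示意(非官方实现):像素空间推理中的两个视觉操作 zoom-in / select-frame
from PIL import Image

def zoom_in(image, bbox, scale=2):
    """裁剪 bbox=(left, top, right, bottom) 区域并放大,便于模型查看细节。"""
    region = image.crop(bbox)
    return region.resize((region.width * scale, region.height * scale))

def select_frame(frames, index):
    """从视频帧序列中挑出指定帧,供后续继续推理。"""
    return frames[index]

img = Image.new("RGB", (640, 480), color=(200, 200, 200))   # 占位图像
patch = zoom_in(img, bbox=(100, 100, 300, 250))
print(patch.size)                                           # (400, 300)
frames = [img] * 8
print(select_frame(frames, 3).size)
```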


UNIVG-R1:通过增强学习的推理指导通用视觉接地

  • 标题: UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
  • 作者: Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, Yansong Tang
  • 日期: 2025-05-20
  • ArXiv主页: https://arxiv.org/abs/2505.14231
  • 论文链接: https://arxiv.org/pdf/2505.14231
  • 项目链接: https://amap-ml.github.io/UniVG-R1-page/
  • gitHub仓库: https://github.com/AMAP-ML/UniVG-R1

英文摘要

Traditional visual grounding methods primarily focus on single-image scenarios with simple textual references. However, extending these methods to real-world scenarios that involve implicit and complex instructions, particularly in conjunction with multiple images, poses significant challenges, which is mainly due to the lack of advanced reasoning ability across diverse multi-modal contexts. In this work, we aim to address the more practical universal grounding task, and propose UniVG-R1, a reasoning guided multimodal large language model (MLLM) for universal visual grounding, which enhances reasoning capabilities through reinforcement learning (RL) combined with cold-start data. Specifically, we first construct a high-quality Chain-of-Thought (CoT) grounding dataset, annotated with detailed reasoning chains, to guide the model towards correct reasoning paths via supervised fine-tuning. Subsequently, we perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities. In addition, we identify a difficulty bias arising from the prevalence of easy samples as RL training progresses, and we propose a difficulty-aware weight adjustment strategy to further strengthen the performance. Experimental results demonstrate the effectiveness of UniVG-R1, which achieves state-of-the-art performance on MIG-Bench with a 9.1% improvement over the previous method. Furthermore, our model exhibits strong generalizability, achieving an average improvement of 23.4% in zero-shot performance across four image and video reasoning grounding benchmarks. The project page can be accessed at https://amap-ml.github.io/UniVG-R1-page/.

中文摘要

传统的视觉接地方法主要集中于具有简单文本参考的单片图像。但是,将这些方法扩展到涉及隐式和复杂说明的真实情况,尤其是与多个图像结合使用,这带来了重大挑战,这主要是由于缺乏各种多模式环境中的先进推理能力。在这项工作中,我们旨在解决更实用的通用基础任务,并提出Univg-R1(用于通用视觉接地的推理指导的多模式大型语言模型(MLLM),这通过加强学习(RL)与冷启动数据相结合,增强了推理能力。具体而言,我们首先构建了用详细的推理链注释的高质量链(COT)接地数据集,以通过监督的微调指导模型朝着正确的推理路径。随后,我们执行基于规则的强化学习,以鼓励模型识别正确的推理链,从而激励其推理能力。此外,随着RL培训的进行,我们确定了因简易样本的普遍性而引起的困难偏差,并且我们提出了一种困难的体重调节策略,以进一步增强性能。实验结果证明了Univg-R1的有效性,Univg-R1在MIG基台上实现了最新的性能,比以前的方法提高了9.1%。此外,我们的模型表现出强烈的概括性,在四个图像和视频推理基础基准中,平均零拍摄性能的平均提高为23.4%。可以在https://amap-ml.github.io/univg-r1-page/上访问项目页面。


Thinkless:LLM 学会何时思考

  • 标题: Thinkless: LLM Learns When to Think

  • 作者: Gongfan Fang, Xinyin Ma, Xinchao Wang

  • 日期: 2025-05-19

  • ArXiv主页: https://arxiv.org/abs/2505.13379

  • 论文链接: https://arxiv.org/pdf/2505.13379

  • gitHub仓库: https://github.com/VainF/Thinkless

英文摘要

Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning for all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model’s ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, for concise responses and for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50% - 90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless

中文摘要

具备长思维链推理能力的推理语言模型,在需要复杂逻辑推断的任务上表现出色。然而,对所有查询都进行详尽推理往往带来大量计算浪费,当许多问题本有简单解法时尤其如此。这引出一个开放问题:LLM 能否学会何时思考?为此,我们提出 Thinkless,一个可学习的框架,使 LLM 能够根据任务复杂度和自身能力,自适应地在短形式与长形式推理之间选择。Thinkless 在强化学习范式下训练,并采用两个控制 token:一个用于简洁回答,一个用于详细推理。我们方法的核心是解耦组相对策略优化(DeGRPO)算法,它把混合推理的学习目标分解为两部分:(1)控制推理模式选择的控制 token 损失;(2)提高所生成答案准确率的回答损失。这种解耦形式能够细粒度地控制每个目标的贡献,稳定训练,并有效防止在原始 GRPO 中观察到的坍塌。实验上,在 Minerva Algebra、MATH-500 和 GSM8K 等多个基准上,Thinkless 能把长链思考的使用减少 50%-90%,显著提高推理语言模型的效率。代码见 https://github.com/VainF/Thinkless
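
下面用几行代码示意 DeGRPO 把学习目标拆成“控制 token 损失 + 回答损失”的形式(非官方实现):两项分别计算并用各自权重相加,避免长回答的梯度淹没单个控制 token;权重与归一化方式均为假设。

```python
# 概念示意(非官方实现):DeGRPO 式的解耦损失 = 控制 token 损失 + 回答损失
import torch

def degrpo_loss(ctrl_logprob, resp_logprobs, advantage, w_ctrl=1.0, w_resp=1.0):
    """
    ctrl_logprob:  控制 token(决定短/长推理模式)的 log 概率,标量
    resp_logprobs: 回答部分各 token 的 log 概率,形状 (T,)
    advantage:     该条样本的组内相对优势(GRPO 风格),标量
    """
    ctrl_loss = -advantage * ctrl_logprob                    # 只负责模式选择
    resp_loss = -advantage * resp_logprobs.mean()            # 按回答长度归一,负责答案质量
    return w_ctrl * ctrl_loss + w_resp * resp_loss

loss = degrpo_loss(torch.tensor(-0.7), torch.randn(32).abs().neg(), advantage=0.5)
print(loss)
```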


DELTA注意:Delta校正的快速准确的稀疏注意力推断

  • 标题: Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction

  • 作者: Jeffrey Willette, Heejun Lee, Sung Ju Hwang

  • 日期: 2025-05-16

  • ArXiv主页: https://arxiv.org/abs/2505.11254

  • 论文链接: https://arxiv.org/pdf/2505.11254

  • gitHub仓库: https://github.com/jeffwillette/delta-attention

英文摘要

The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average 36%pt performance increase, recovering 88% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens while only adding a small overhead. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.

中文摘要

Transformer 的注意力机制具有二次复杂度,导致长序列的推理成本和时延很高。不过,注意力矩阵大多是稀疏的,这意味着可以在计算中省略许多项以实现高效推理。稀疏注意力推理方法旨在减轻这种计算负担,但它们也伴随着恼人的性能下降。我们发现造成这种下降的一个原因是:稀疏计算导致注意力输出发生分布偏移。这种分布偏移使解码阶段的查询无法与预填充阶段的相应键良好对齐,从而导致性能下降。我们提出一个简单、新颖且有效的程序来校正这种分布偏移,使稀疏注意力输出的分布更接近二次注意力。我们的方法可以叠加在任何稀疏注意力方法之上,带来平均 36 个百分点的性能提升:在带 sink token 的滑动窗口注意力之上应用时,在 131K 的 RULER 基准上恢复了 88% 的二次注意力准确率,而只增加很小的开销。相对完整的二次注意力,我们的方法可保持约 98.5% 的稀疏度,在处理 1M token 预填充时比 FlashAttention 2 快 32 倍。


Kris-Bench:基准测试下一级智能图像编辑模型

  • 标题: KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models
  • 作者: Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, Xu Yang
  • 日期: 2025-05-22
  • ArXiv主页: https://arxiv.org/abs/2505.16707
  • 论文链接: https://arxiv.org/pdf/2505.16707
  • 项目链接: https://yongliang-wu.github.io/kris_bench_project_page/
  • gitHub仓库: https://github.com/mercurystraw/Kris_Bench

英文摘要

Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.

中文摘要

多模式生成模型的最新进展已在基于教学的图像编辑中取得了重大进展。但是,尽管这些模型产生了视觉上合理的输出,但它们基于知识的推理编辑任务的能力仍然不足。在本文中,我们介绍了Kris-Bench(图像编辑系统基准中基于知识的推理),这是一种诊断基准测试,旨在通过认知知情的镜头评估模型。Kris-Bench从教育理论中借鉴了三种基础知识类型的编辑任务:事实,概念和程序。基于此分类法,我们设计了22项代表性任务,涵盖了7个推理维度并发布1,267个高质量注释的编辑实例。为了支持细粒度的评估,我们提出了一项综合协议,该方案结合了一种新颖的知识合理度量,并通过知识提示增强并通过人类研究进行了校准。10个最先进模型的经验结果揭示了推理性能的显着差距,强调了以知识为中心的基准来推动智能图像编辑系统的开发。


通过用户界面分解和合成来缩放计算机使用接地

  • 标题: Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
  • 作者: Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong
  • 日期: 2025-05-19
  • ArXiv主页: https://arxiv.org/abs/2505.13227
  • 论文链接: https://arxiv.org/pdf/2505.13227
  • 项目链接: https://osworld-grounding.github.io/
  • gitHub仓库: https://github.com/xlang-ai/OSWorld-G

英文摘要

Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.

中文摘要

图形用户界面(GUI)接地,将自然语言指令映射到图形用户界面的特定操作的能力仍然是计算机使用代理开发中的关键瓶颈。当前的基准测试过度简化接地任务是简短的参考表达式,未能捕获需要软件常识,布局理解和精细元素操纵能力的现实世界交互的复杂性。为了解决这些限制,我们介绍了OSWorld-G,这是一个综合基准,包括564个跨不同任务类型的精细注释样本,包括文本匹配,元素识别,布局理解和精确的操作。此外,我们合成并释放最大的计算机使用地接地数据集绝地武士,该数据集包含400万个示例,该示例通过多个任务的脱钩。我们在绝地训练的多尺度模型通过在ScreensPot-V2,ScreenSpot-Pro和我们的OSWorld-G上的现有方法表现出了其有效性。此外,我们证明,通过绝地武士进行的改进的接地直接增强了对复杂计算机任务的一般基础模型的代理能力,在OSWORLD上从5%提高到27%。通过详细的消融研究,我们确定了有助于接地性能的关键因素,并验证将不同界面元素的专业数据结合起来,可以使组合概括为新的接口。所有基准,数据,检查点和代码都是开源的,可在https://osworld-grounding.github.io上找到。


有效的计算机使用代理培训

  • 标题: Efficient Agent Training for Computer Use
  • 作者: Yanheng He, Jiahe Jin, Pengfei Liu
  • 日期: 2025-05-20
  • ArXiv主页: https://arxiv.org/abs/2505.13909
  • 论文链接: https://arxiv.org/pdf/2505.13909
  • 项目链接: https://gair-nlp.github.io/PC-Agent-E/
  • gitHub仓库: https://github.com/GAIR-NLP/PC-Agent-E

英文摘要

Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further improved data quality by synthesizing diverse action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, surpassing the strong Claude 3.7 Sonnet with extended thinking on WindowsAgentArena-V2, an improved benchmark we also released. Furthermore, PC Agent-E demonstrates strong generalizability to different operating systems on OSWorld. Our findings suggest that strong computer use capabilities can be stimulated from a small amount of high-quality trajectory data.

中文摘要

长期以来,扩大高质量的轨迹数据一直是开发类似人类的计算机使用剂的关键瓶颈。我们介绍了PC Agent-E,这是一个有效的代理训练框架,可显着降低对大型人类示范的依赖。从仅312个人类注销的计算机使用轨迹开始,我们通过用Claude 3.7十四行诗综合各种动作决策进一步提高了数据质量。经过对这些丰富的轨迹的培训,我们的PC代理-E模型获得了141%的相对改进,超过了强大的Claude 3.7十四行诗,对Windowsagentarena-V2进行了广泛的思考,我们也发布了改进的基准。此外,PC Agent-E在OSWorld上表现出对不同操作系统的强概括性。我们的发现表明,可以从少量高质量的轨迹数据中刺激强大的计算机使用功能。


QuickVideo:使用系统算法共同设计的实时视频理解

  • 标题: QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

  • 作者: Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, Wenhu Chen

  • 日期: 2025-05-22

  • ArXiv主页: https://arxiv.org/abs/2505.16175

  • 论文链接: https://arxiv.org/pdf/2505.16175

  • gitHub仓库: https://github.com/TIGER-AI-Lab/QuickVideo

英文摘要

Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, the process of converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.

中文摘要

在视频监控、会议纪要、教育讲座分析和体育转播等现实世界应用中,长视频理解已成为一项关键能力。然而,它对 VideoLLM 而言在计算上仍然难以承受,主要源于两个瓶颈:1)顺序视频解码——把原始比特流转换为 RGB 帧的过程,对于长达一小时的视频输入可能需要一分钟;2)LLM 推理需要对多达数百万个 token 进行昂贵的预填充,导致高延迟和高内存占用。为了应对这些挑战,我们提出了 QuickVideo,一种系统与算法协同设计的方案,可大幅加速长视频理解,以支持实时下游应用。它包含三项关键创新:QuickDecoder,一个并行化的基于 CPU 的视频解码器,通过将视频切分为按关键帧对齐、并发处理的区间,实现 2-3 倍加速;QuickPrefill,一种内存高效的预填充方法,利用 KV 缓存剪枝以更少的 GPU 内存支持更多帧;以及一个让 CPU 视频解码与 GPU 推理相互重叠的调度方案。这些组件合在一起,可将长视频输入上的推理时间减少一分钟,即使在有限的硬件上也能实现可扩展的高质量视频理解。实验表明,QuickVideo 能在不同时长和采样率上泛化,使长视频处理在实践中变得可行。
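
其中“CPU 解码与 GPU 推理重叠”的思路可以用一个生产者-消费者式的草图来示意(非官方实现;decode_interval 与 prefill_chunk 均为假设的占位函数):

```python
# 示意:多个 CPU 工作线程并行解码关键帧对齐的区间,GPU 端按块消费,
# 使解码与预填充在时间上重叠。仅为流程草图,非 QuickVideo 官方实现。
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def decode_interval(video_path: str, interval_id: int):
    """占位:把第 interval_id 个关键帧对齐区间解码为 RGB 帧(实际可借助 FFmpeg 等库)。"""
    return f"frames_of_interval_{interval_id}"

def prefill_chunk(frames) -> None:
    """占位:把一段帧送入 VideoLLM 做分块预填充(实际在 GPU 上执行)。"""
    print("prefill:", frames)

def run(video_path: str, num_intervals: int, num_workers: int = 4) -> None:
    frame_q: queue.Queue = queue.Queue(maxsize=num_workers)   # 有界队列提供背压

    def producer():
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            for frames in pool.map(lambda i: decode_interval(video_path, i),
                                   range(num_intervals)):
                frame_q.put(frames)   # 解码结果按顺序入队
        frame_q.put(None)             # 结束标记

    threading.Thread(target=producer, daemon=True).start()
    while (frames := frame_q.get()) is not None:
        prefill_chunk(frames)         # GPU 推理与后续区间的解码重叠进行

if __name__ == "__main__":
    run("demo.mp4", num_intervals=8)
```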


这次是不同的:时间序列基础模型的可观察性观点

  • 标题: This Time is Different: An Observability Perspective on Time Series Foundation Models
  • 作者: Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, Jean Ogier du Terrail, Anna-Monica Toon, Kan Wang, Stephan Xie, David Asker, Ameet Talwalkar, Othmane Abou-Amal
  • 日期: 2025-05-20
  • ArXiv主页: https://arxiv.org/abs/2505.14766
  • 论文链接: https://arxiv.org/pdf/2505.14766
  • 项目链接: https://www.datadoghq.com/blog/ai/toto-boom-unleashed/
  • gitHub仓库: https://github.com/DataDog/toto

英文摘要

We introduce Toto, a time series forecasting foundation model with 151 million parameters. Toto uses a modern decoder-only architecture coupled with architectural innovations designed to account for specific challenges found in multivariate observability time series data. Toto’s pre-training corpus is a mixture of observability data, open datasets, and synthetic data, and is 4-10 times larger than those of leading time series foundation models. Additionally, we introduce BOOM, a large-scale benchmark consisting of 350 million observations across 2,807 real-world time series. For both Toto and BOOM, we source observability data exclusively from Datadog’s own telemetry and internal observability metrics. Extensive evaluations demonstrate that Toto achieves state-of-the-art performance on both BOOM and on established general purpose time series forecasting benchmarks. Toto’s model weights, inference code, and evaluation scripts, as well as BOOM’s data and evaluation code, are all available as open source under the Apache 2.0 License available at https://huggingface.co/Datadog/Toto-Open-Base-1.0 and https://github.com/DataDog/toto.

中文摘要

我们提出了 Toto,一个拥有 1.51 亿参数的时间序列预测基础模型。Toto 采用现代的仅解码器(decoder-only)架构,并结合了为应对多变量可观察性时间序列数据特有挑战而设计的架构创新。Toto 的预训练语料由可观察性数据、开放数据集和合成数据混合而成,规模是领先时间序列基础模型的 4-10 倍。此外,我们推出了 BOOM,一个大规模基准,包含来自 2,807 条真实世界时间序列的 3.5 亿个观测值。对于 Toto 和 BOOM,我们的可观察性数据均完全来自 Datadog 自身的遥测和内部可观察性指标。大量评估表明,Toto 在 BOOM 和既有的通用时间序列预测基准上均达到了最先进的性能。Toto 的模型权重、推理代码和评估脚本,以及 BOOM 的数据和评估代码,均以 Apache 2.0 许可开源,可在 https://huggingface.co/Datadog/Toto-Open-Base-1.0 和 https://github.com/DataDog/toto 获取。
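
作为补充说明,时间序列预测基准对模型打分时常用 MASE(平均绝对缩放误差)这类与尺度无关的误差度量。下面是一个通用的计算草图,仅用于说明这类指标的含义,并非 BOOM 或 Toto 的官方评测代码:

```python
# 示意:MASE = 预测误差的 MAE / 训练段上季节性朴素预测的 MAE。
# 通用度量示例,与具体基准的官方实现无关。
import numpy as np

def mase(y_true: np.ndarray, y_pred: np.ndarray, y_train: np.ndarray, m: int = 1) -> float:
    """m 为季节周期;m=1 时尺度项即“用上一时刻预测下一时刻”的朴素误差。"""
    mae = np.mean(np.abs(y_true - y_pred))
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))   # 训练段朴素预测的平均绝对误差
    return float(mae / scale)

# 用法示意:
# history, future = series[:-horizon], series[-horizon:]
# print(mase(future, model_forecast, history))
```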


模型合并大型语言模型的预培训

  • 标题: Model Merging in Pre-training of Large Language Models
  • 作者: Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Zhou Xun, Liang Xiang, Yonghui Wu
  • 日期: 2025-05-17
  • ArXiv主页: https://arxiv.org/abs/2505.12082
  • 论文链接: https://arxiv.org/pdf/2505.12082

英文摘要

Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.

中文摘要

模型合并已成为增强大型语言模型的一种有前景的技术,但它在大规模预训练中的应用仍相对缺乏探索。在本文中,我们对预训练过程中的模型合并技术进行了全面研究。通过在从数百万到超过 1000 亿参数的稠密和专家混合(MoE)架构上的大量实验,我们证明,合并以恒定学习率训练的检查点不仅能带来显著的性能提升,还能准确预测退火行为。这些改进带来了更高效的模型开发,并显著降低了训练成本。我们针对合并策略和超参数的详细消融研究提供了对底层机制的新见解,同时揭示了新的应用。通过全面的实验分析,我们为开源社区提供了进行有效模型合并的实用预训练指南。
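
检查点合并最直接的一种形式是对同一轮训练中不同步数的检查点做参数平均。下面是一个基于 PyTorch state_dict 的通用加权平均草图(仅为示意,并非论文所采用合并策略的官方实现):

```python
# 示意:对同构模型的多个检查点按权重做参数平均,得到合并后的 state_dict。
import torch

def merge_checkpoints(state_dicts, weights=None):
    """state_dicts: 若干个结构相同的 state_dict;weights 缺省时取简单平均。"""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    assert abs(sum(weights) - 1.0) < 1e-6, "权重应归一化"
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# 用法示意(路径仅为假设):
# ckpts = [torch.load(p, map_location="cpu") for p in ["step_80k.pt", "step_90k.pt", "step_100k.pt"]]
# model.load_state_dict(merge_checkpoints(ckpts))
```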


奖励推理模型

  • 标题: Reward Reasoning Model
  • 作者: Jiaxin Guo, Zewen Chi, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei
  • 日期: 2025-05-20
  • ArXiv主页: https://arxiv.org/abs/2505.14674
  • 论文链接: https://arxiv.org/pdf/2505.14674

英文摘要

Reward models play a critical role in guiding large language models toward outputs that align with human expectations. However, an open challenge remains in effectively utilizing test-time compute to enhance reward model performance. In this work, we introduce Reward Reasoning Models (RRMs), which are specifically designed to execute a deliberate reasoning process before generating final rewards. Through chain-of-thought reasoning, RRMs leverage additional test-time compute for complex queries where appropriate rewards are not immediately apparent. To develop RRMs, we implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities without requiring explicit reasoning traces as training data. Experimental results demonstrate that RRMs achieve superior performance on reward modeling benchmarks across diverse domains. Notably, we show that RRMs can adaptively exploit test-time compute to further improve reward accuracy. The pretrained reward reasoning models are available at https://huggingface.co/Reward-Reasoning.

中文摘要

奖励模型在引导大型语言模型产生符合人类期望的输出方面起着关键作用。然而,如何有效利用测试时计算来提升奖励模型性能仍是一个开放的挑战。在这项工作中,我们提出了奖励推理模型(RRM),其专门设计为在生成最终奖励之前先执行一个深思熟虑的推理过程。借助思维链推理,RRM 能够在恰当奖励并非显而易见的复杂查询上利用额外的测试时计算。为了训练 RRM,我们实现了一个强化学习框架,无需显式的推理轨迹作为训练数据即可培养自我演化的奖励推理能力。实验结果表明,RRM 在跨多个领域的奖励建模基准上取得了优越的性能。值得注意的是,我们展示了 RRM 可以自适应地利用测试时计算来进一步提高奖励准确性。预训练的奖励推理模型可在 https://huggingface.co/Reward-Reasoning 获取。
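
RRM 的核心是“先推理、后给奖励”。下面用一个草图示意如何让模型先输出思维链分析,再解析其末尾的偏好判断(generate 为假设的占位函数,提示词与解析格式也均为本示例自拟,非论文官方实现):

```python
# 示意:让奖励推理模型先写出逐步分析,再在结尾给出对两个候选回答的偏好判断。
import re

def generate(prompt: str) -> str:
    """占位:调用奖励推理模型(例如 https://huggingface.co/Reward-Reasoning 上的权重)。"""
    raise NotImplementedError

def judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        f"问题:{question}\n"
        f"回答 A:{answer_a}\n"
        f"回答 B:{answer_b}\n"
        "请先逐步分析两个回答的正确性与有用性,"
        "最后在单独一行输出“最终判断:A”或“最终判断:B”。"
    )
    output = generate(prompt)                       # 模型在此消耗额外的测试时计算
    match = re.search(r"最终判断[::]\s*([AB])", output)
    return match.group(1) if match else "A"         # 解析失败时的兜底选择
```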


更快的视频扩散,可训练的稀疏关注

  • 标题: Faster Video Diffusion with Trainable Sparse Attention
  • 作者: Peiyuan Zhang, Haofeng Huang, Yongqi Chen, Will Lin, Zhengzhong Liu, Ion Stoica, Eric P. Xing, Hao Zhang
  • 日期: 2025-05-19
  • ArXiv主页: https://arxiv.org/abs/2505.13389
  • 论文链接: https://arxiv.org/pdf/2505.13389
  • gitHub仓库: https://github.com/hao-ai-lab/FastVideo

英文摘要

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles subjecting to block computing layout to ensure hard efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53 times with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6 times and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.

中文摘要

扩展视频扩散 Transformer(DiT)受限于其二次复杂度的 3D 注意力,尽管绝大部分注意力质量只集中在一小部分位置上。我们将这一观察转化为 VSA,一种可训练、硬件高效的稀疏注意力,在训练和推理阶段都可替代全注意力。在 VSA 中,一个轻量级的粗粒度阶段先将 token 池化为分块(tile)并识别出高权重的关键 token;随后细粒度阶段仅在这些被选中的分块内计算 token 级注意力,且遵循分块计算布局以保证硬件效率。这形成了一个可端到端训练的单一可微内核,无需事后剖析,并可维持 FlashAttention3 85% 的 MFU。我们通过预训练从 6000 万到 14 亿参数的 DiT,开展了大规模的消融研究和缩放律实验。VSA 达到了一个帕累托点,在扩散损失不下降的情况下将训练 FLOPS 削减 2.53 倍。对开源 Wan-2.1 模型进行改装后,注意力计算时间加速 6 倍,端到端生成时间从 31 秒降至 18 秒,且质量相当。这些结果确立了可训练稀疏注意力作为全注意力的实用替代方案,也是进一步扩展视频扩散模型的关键推动力。
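
VSA 的“粗选块、细算块内注意力”两阶段流程可以用朴素 PyTorch 写成如下参考草图(无任何核函数级优化,tile、topk 等参数仅为示意取值,非官方实现):

```python
# 示意:粗阶段把 token 池化成分块并为每个查询块选出 top-k 关键块,
# 细阶段只在被选中的分块内做 token 级注意力。仅说明计算流程,非 VSA 官方内核。
import torch
import torch.nn.functional as F

def vsa_like_attention(q, k, v, tile: int = 64, topk: int = 8):
    """q, k, v: [seq, dim],seq 需能被 tile 整除;返回 [seq, dim]。"""
    seq, dim = q.shape
    n_tiles = seq // tile

    # 粗阶段:分块级池化与打分
    q_tile = q.view(n_tiles, tile, dim).mean(dim=1)               # [n_tiles, dim]
    k_tile = k.view(n_tiles, tile, dim).mean(dim=1)
    tile_scores = q_tile @ k_tile.T / dim**0.5                    # [n_tiles, n_tiles]
    keep = tile_scores.topk(min(topk, n_tiles), dim=-1).indices   # 每个查询块保留的关键块

    # 细阶段:仅在被选中的分块内做注意力
    out = torch.zeros_like(q)
    for i in range(n_tiles):
        qi = q[i * tile:(i + 1) * tile]                           # [tile, dim]
        cols = torch.cat([torch.arange(j * tile, (j + 1) * tile) for j in keep[i].tolist()])
        ki, vi = k[cols], v[cols]
        attn = F.softmax(qi @ ki.T / dim**0.5, dim=-1)
        out[i * tile:(i + 1) * tile] = attn @ vi
    return out

# 用法示意:
# q = k = v = torch.randn(1024, 64)
# y = vsa_like_attention(q, k, v)
```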

