【论文速递】2025年第22周(May-25-31)(Robotics/Embodied AI/LLM)
中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- Mutarjim:用小型语言模型推进双向阿拉伯语-英语翻译
- 英文摘要
- 中文摘要
- 将AI效率从以模型为中心的压缩转向以数据为中心的压缩
- 英文摘要
- 中文摘要
- 推理语言模型强化学习的熵机制
- 英文摘要
- 中文摘要
- TabSTAR:具有语义目标感知表示的基础表格模型
- 英文摘要
- 中文摘要
- Paper2Poster:迈向科学论文的多模态海报自动化
- 英文摘要
- 中文摘要
- ScienceBoard:评估真实科学工作流中的多模态自主智能体
- 英文摘要
- 中文摘要
- Table-R1:表格推理的推理时扩展
- 英文摘要
- 中文摘要
- SWE-rebench:用于任务收集与去污染评估软件工程智能体的自动化流水线
- 英文摘要
- 中文摘要
- QwenLong-L1:通过强化学习迈向长上下文大型推理模型
- 英文摘要
- 中文摘要
- MME-Reasoning:MLLM中逻辑推理的综合基准
- 英文摘要
- 中文摘要
- Alchemist:把公开文生图数据变成生成式黄金
- 英文摘要
- 中文摘要
- 通过检索与代码工具将LLM智能体蒸馏为小模型
- 英文摘要
- 中文摘要
- Quartet:原生FP4训练对大语言模型可以是最优的
- 英文摘要
- 中文摘要
- R2R:通过小-大模型令牌路由高效导航分歧推理路径
- 英文摘要
- 中文摘要
- Spatial-MLLM:提升MLLM在基于视觉的空间智能中的能力
- 英文摘要
- 中文摘要
- SynLogic:合成可验证的推理数据以学习逻辑推理及其他
- 英文摘要
- 中文摘要
- 攀登比登顶刻下更深的智慧:论学习推理中的噪声奖励
- 英文摘要
- 中文摘要
- OmniConsistency:从配对风格化数据中学习风格无关的一致性
- 英文摘要
- 中文摘要
- 推理模型很固执:诊断推理模型中的指令覆盖
- 英文摘要
- 中文摘要
- BizFinBench:评估LLM的业务导向真实世界金融基准
- 英文摘要
- 中文摘要
- 探索LLM一步文本生成的潜在能力
- 英文摘要
- 中文摘要
- One RL to See Them All:视觉三重统一强化学习
- 英文摘要
- 中文摘要
- 不要过度思考:偏好更短的思维链以改进LLM推理
- 英文摘要
- 中文摘要
- VF-Eval:评估多模态LLM在AIGC视频上生成反馈的能力
- 英文摘要
- 中文摘要
- Skywork Open Reasoner 1 技术报告
- 英文摘要
- 中文摘要
Mutarjim:用小型语言模型推进双向阿拉伯语-英语翻译
- 标题: Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model
- 作者: Khalil Hennara, Muhammad Hreden, Mohamed Motaism Hamed, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan
- 日期: 2025-05-23
- ArXiv主页: https://arxiv.org/abs/2505.17894
- 论文链接: https://arxiv.org/pdf/2505.17894
英文摘要
We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller models can remain highly competitive. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B, a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus… Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger and proprietary models like GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.
中文摘要
我们介绍 Mutarjim,一个紧凑而强大的双向阿拉伯语-英语翻译语言模型。尽管大规模 LLM 在包括机器翻译在内的自然语言处理任务上取得了令人印象深刻的进展,更小的模型同样可以保持高度竞争力。基于这一洞察,我们在 Kuwain-1.5B(一个面向阿拉伯语和英语定制的语言模型)之上开发了 Mutarjim。尽管规模不大,Mutarjim 凭借优化的两阶段训练方法和精心筛选的高质量训练语料,在多个公认基准上超越了远大于它的模型……实验结果表明,Mutarjim 可与大至其 20 倍的模型相匹敌,同时显著降低计算成本和训练需求。我们还提出了 Tarjama-25,一个旨在克服现有阿拉伯语-英语基准数据集局限(如领域狭窄、句子偏短、英语源偏置)的新基准。Tarjama-25 包含 5,000 个经专家评审的句对,覆盖广泛领域,提供了更全面、更均衡的评估框架。值得注意的是,Mutarjim 在 Tarjama-25 的英译阿任务上取得了最先进的成绩,甚至超过了 GPT-4o mini 等规模大得多的专有模型。我们公开发布 Tarjama-25,以支持未来研究并推进阿拉伯语-英语翻译系统的评估。
将AI效率从以模型为中心的压缩转向以数据为中心的压缩
- 标题: Shifting AI Efficiency From Model-Centric to Data-Centric Compression
- 作者: Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
- 日期: 2025-05-25
- ArXiv主页: https://arxiv.org/abs/2505.19147
- 论文链接: https://arxiv.org/pdf/2505.19147
- 项目链接: https://github.com/xuyang-liu16/Awesome-Token-level-Model-Compression
- gitHub仓库: https://github.com/xuyang-liu16/Awesome-Token-level-Model-Compression
英文摘要
The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community’s advancement.
中文摘要
大型语言模型(LLM)与多模态 LLM(MLLM)的快速发展,历来依赖以模型为中心的扩展,即把参数量从数百万增加到数千亿来换取性能提升。然而,随着模型规模逼近硬件极限,主要的计算瓶颈已从根本上转移到长 token 序列上自注意力的二次方开销,如今其驱动因素是超长文本上下文、高分辨率图像和长视频。在这篇立场论文中,我们认为高效 AI 的研究重心正在从以模型为中心的压缩转向以数据为中心的压缩。我们将 token 压缩定位为新的前沿:通过减少训练或推理期间的 token 数量来提升 AI 效率。通过全面分析,我们首先考察了各领域长上下文 AI 的最新进展,并为现有模型效率策略建立了统一的数学框架,说明为什么 token 压缩代表了应对长上下文开销的关键范式转变。随后,我们系统梳理了 token 压缩的研究版图,分析其基本收益,并指出其在多种场景下的突出优势。此外,我们深入分析了 token 压缩研究当前面临的挑战,并展望了有前景的未来方向。最终,我们的工作旨在为 AI 效率提供新视角、综合现有研究,并推动创新发展,以应对不断增长的上下文长度给 AI 社区带来的挑战。
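自注意力开销随 token 数二次增长,这正是摘要中“以数据为中心的压缩”的切入点。下面是一个极简的数量级示意(只计 QK^T 与 AV 两个主导矩阵乘,系数为粗略假设,与任何具体实现无关):

```python
def attention_flops(n_tokens: int, d_head: int) -> int:
    # 自注意力两个主导矩阵乘 (QK^T 与 AV) 的粗略 FLOPs 估计
    return 4 * n_tokens * n_tokens * d_head

n, d = 100_000, 128
keep = 0.25                      # 假设 token 压缩后仅保留 25% 的 token
ratio = attention_flops(int(n * keep), d) / attention_flops(n, d)
print(ratio)                     # 注意力开销降为原来的约 6.25%
```

可见把 token 数压到 r 倍后,注意力开销近似降为 r^2 倍;这正是 token 压缩相对于只压模型权重的根本优势。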
推理语言模型强化学习的熵机制
- 标题: The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
- 作者: Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, Ning Ding
- 日期: 2025-05-28
- ArXiv主页: https://arxiv.org/abs/2505.22617
- 论文链接: https://arxiv.org/pdf/2505.22617
- gitHub仓库: https://github.com/PRIME-RL/Entropy-Mechanism-of-RL
英文摘要
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. Such phenomenon is consistently observed across vast RL runs without entropy intervention, where the policy entropy dropped sharply at the early training stage, this diminished exploratory ability is always accompanied with the saturation of policy performance. In practice, we establish a transformation equation R=-a*e^H+b between entropy H and downstream performance R. This empirical law strongly indicates that, the policy performance is traded from policy entropy, thus bottlenecked by its exhaustion, and the ceiling is fully predictable H=0, R=-a+b. Our finding necessitates entropy management for continuous exploration toward scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that, the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to its advantage when using Policy Gradient-like algorithms. Empirical study shows that, the values of covariance term and entropy differences matched exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy would decrease monotonically. Through understanding the mechanism behind entropy dynamics, we motivate to control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which clip and apply KL penalty to tokens with high covariances respectively. Experiments show that these methods encourage exploration, thus helping policy escape entropy collapse and achieve better downstream performance.
中文摘要
本文旨在克服用 RL 扩展 LLM 推理能力时的一大障碍,即策略熵的坍缩。在大量未做熵干预的 RL 实验中都能一致观察到这一现象:策略熵在训练早期急剧下降,而探索能力的衰减总是伴随着策略性能的饱和。在实践中,我们在熵 H 与下游性能 R 之间建立了变换方程 R = -a*e^H + b。这一经验定律强烈表明:策略性能是用策略熵换来的,因而会因熵被耗尽而遇到瓶颈,且上限完全可预测,即 H = 0 时 R = -a + b。我们的发现说明,要为 RL 持续扩展计算并保持探索,就必须进行熵管理。为此,我们从理论和实证两方面研究了熵的动力学。推导表明,策略熵的变化由动作概率与 logits 变化之间的协方差驱动;在使用类策略梯度算法时,logits 的变化又与优势成正比。实证研究显示,协方差项的数值与熵的变化量精确吻合,支持了理论结论。此外,协方差项在整个训练过程中大多为正,进一步解释了策略熵为何单调下降。在理解了熵动力学背后的机制后,我们提出通过限制高协方差 token 的更新来控制熵。具体地,我们提出两种简单而有效的技术,即 Clip-Cov 和 KL-Cov,分别对高协方差 token 进行截断和施加 KL 惩罚。实验表明,这些方法鼓励探索,帮助策略摆脱熵坍缩并取得更好的下游性能。
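摘要中的核心推导,即策略熵的变化由 log 概率与 logits 变化之间的协方差驱动,可以用一个单步 softmax 策略的小例子做数值验证(纯示意代码,logits 与“优势”均为随机假设值):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
z = rng.normal(size=16)            # 假想的一步 logits
adv = rng.normal(size=16)          # 假想的各 token 优势
p = softmax(z)

dz = 1e-4 * adv                    # 类策略梯度更新:Δlogits 与优势成正比
dH = entropy(softmax(z + dz)) - entropy(p)

# 一阶理论:ΔH ≈ -Cov_{a~p}(log p(a), Δz(a))
mean_logp = (p * np.log(p)).sum()
mean_dz = (p * dz).sum()
cov = float((p * (np.log(p) - mean_logp) * (dz - mean_dz)).sum())
print(dH, -cov)                    # 两者应几乎相等
```

当优势与 log 概率正相关(模型对拿到高优势的 token 本来就很自信)时协方差为正、熵随之下降;Clip-Cov / KL-Cov 正是通过限制这类高协方差 token 的更新来保住探索。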
TabSTAR:具有语义目标感知表示的基础表格模型
- 标题: TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations
- 作者: Alan Arazi, Eilam Shapira, Roi Reichart
- 日期: 2025-05-23
- ArXiv主页: https://arxiv.org/abs/2505.18125
- 论文链接: https://arxiv.org/pdf/2505.18125
- 项目链接: https://eilamshapira.com/TabSTAR
英文摘要
While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees (GBDTs). However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free-text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Foundation Tabular Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes as input target tokens, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.
中文摘要
尽管深度学习在许多领域取得了显著成功,但它在表格学习任务上长期表现欠佳,该领域仍由梯度提升决策树(GBDT)主导。然而,最近的进展正在为表格基础模型铺平道路,这类模型能够利用真实世界知识并在多样数据集间泛化,尤其当数据包含自由文本时。虽然已有工作探索将语言模型能力引入表格任务,但大多数现有方法使用静态的、与目标无关的文本表示,限制了其有效性。我们提出 TabSTAR:一个具有语义目标感知表示的基础表格模型。TabSTAR 旨在对带文本特征的表格数据实现迁移学习,其架构不含任何数据集特定参数。它解冻一个预训练文本编码器,并以目标 token 作为输入,为模型提供学习任务特定嵌入所需的上下文。在带文本特征的分类任务的已知基准上,TabSTAR 在中型和大型数据集上均达到了最先进水平,并且其预训练阶段展现出随数据集数量变化的扩展规律,为进一步的性能提升提供了路径。
Paper2Poster:迈向科学论文的多模态海报自动化
- 标题: Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
- 作者: Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr
- 日期: 2025-05-27
- ArXiv主页: https://arxiv.org/abs/2505.21497
- 论文链接: https://arxiv.org/pdf/2505.21497
- 项目链接: https://paper2poster.github.io/
- gitHub仓库: https://github.com/Paper2Poster/Paper2Poster
英文摘要
Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i)Visual Quality-semantic alignment with human posters, (ii)Textual Coherence-language fluency, (iii)Holistic Assessment-six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv)PaperQuiz-the poster’s ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a)Parser distills the paper into a structured asset library; the (b)Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the ©Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs-though visually appealing at first glance-often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster - all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.
中文摘要
学术海报生成是科学交流中一项关键而富有挑战性的任务,需要把长篇、图文交错的文档压缩成单页且视觉连贯的版面。为应对这一挑战,我们提出了首个海报生成基准与度量套件:它将近期会议论文与作者设计的海报配对,并从以下方面评估输出:(i) 视觉质量,即与人工海报的语义对齐;(ii) 文本连贯性,即语言流畅度;(iii) 整体评估,由 VLM 担任评审,按六项细粒度的美学与信息标准打分;尤其是 (iv) PaperQuiz,用 VLM 回答自动生成的测验来衡量海报传达论文核心内容的能力。基于该基准,我们提出 PosterAgent,一个自顶向下、视觉在环的多智能体流水线:(a) Parser 将论文提炼成结构化素材库;(b) Planner 把图文对对齐到保持阅读顺序与空间平衡的二叉树布局;(c) Painter-Commenter 循环通过执行渲染代码并利用 VLM 反馈来消除溢出、保证对齐,逐块精修每个面板。在全面评估中,我们发现 GPT-4o 的输出虽然乍看美观,却常常文本嘈杂且 PaperQuiz 得分较低;我们还发现读者参与度是首要的美学瓶颈,因为人工设计的海报在很大程度上依靠视觉语义来传达含义。我们的完全开源变体(例如基于 Qwen-2.5 系列)在几乎所有指标上超越了现有由 4o 驱动的多智能体系统,同时少用 87% 的 token。它能把一篇 22 页的论文变成最终成稿且可编辑的 .pptx 海报,成本仅 0.005 美元。这些发现为下一代全自动海报生成模型指明了清晰方向。代码与数据集见 https://github.com/Paper2Poster/Paper2Poster。
ScienceBoard:评估真实科学工作流中的多模态自主智能体
- 标题: ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
- 作者: Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu
- 日期: 2025-05-26
- ArXiv主页: https://arxiv.org/abs/2505.19897
- 论文链接: https://arxiv.org/pdf/2505.19897
- 项目链接: https://qiushisun.github.io/ScienceBoard-Home/
- gitHub仓库: https://github.com/OS-Copilot/ScienceBoard
英文摘要
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers’ workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.
中文摘要
大型语言模型(LLM)的影响已超出自然语言处理,大力推动了跨学科研究的发展。近来,各种基于 LLM 的智能体被开发出来,在多个方面和领域辅助科学发现。其中,能够像人类一样与操作系统交互的计算机使用型智能体,正在为自动化科学问题求解和处理研究者工作流中的日常事务铺平道路。鉴于这类智能体的变革潜力,我们提出 ScienceBoard,它包含两项互补贡献:(i) 一个真实的多领域环境,具备动态、视觉丰富的科学工作流并集成专业软件,智能体可通过不同接口自主交互,以加速复杂的研究任务与实验;(ii) 一个具有挑战性的基准,由人工整理的 169 个高质量、经严格验证的真实任务组成,覆盖生物化学、天文学、地理信息学等领域的科学发现工作流。对使用最先进骨干模型(如 GPT-4o、Claude 3.7、UI-TARS)的智能体的大量评估表明,尽管有一些可喜结果,它们仍不足以可靠地协助科学家完成复杂工作流,总体成功率仅为 15%。深入分析进一步为解决当前智能体的局限和提炼更有效的设计原则提供了有价值的洞见,为构建更强大的科学发现智能体铺平道路。我们的代码、环境和基准见 https://qiushisun.github.io/ScienceBoard-Home/。
Table-R1:表格推理的推理时扩展
- 标题: Table-R1: Inference-Time Scaling for Table Reasoning
- 作者: Zheyuan Yang, Lyuhao Chen, Arman Cohan, Yilun Zhao
- 日期: 2025-05-29
- ArXiv主页: https://arxiv.org/abs/2505.23621
- 论文链接: https://arxiv.org/pdf/2505.23621
- gitHub仓库: https://github.com/Table-R1/Table-R1
英文摘要
In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as emergence of essential table reasoning skills during RL training.
中文摘要
在这项工作中,我们首次研究表格推理任务上的推理时扩展。我们开发并评估了两种实现推理时扩展的后训练策略:从前沿模型推理轨迹蒸馏,以及带可验证奖励的强化学习(RLVR)。在蒸馏方面,我们引入了一个由 DeepSeek-R1 生成的大规模推理轨迹数据集,用它微调 LLM 得到 Table-R1-SFT 模型。在 RLVR 方面,我们提出任务特定的可验证奖励函数,并应用 GRPO 算法得到 Table-R1-Zero 模型。我们在多样的表格推理任务上评估 Table-R1 系列模型,包括短形式问答、事实核验和自由形式问答。值得注意的是,Table-R1-Zero 仅用一个 7B 参数的 LLM,就达到或超过了 GPT-4.1 与 DeepSeek-R1 的性能,并对域外数据集展现出强泛化能力。大量消融和定性分析揭示了指令微调、模型架构选择与跨任务泛化的收益,以及 RL 训练过程中关键表格推理技能的涌现。
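摘要提到的“特定于任务的可验证奖励函数”在短形式表格 QA 中可以做得非常简单。下面是一个假想的示意(规范化后精确匹配记 1 分,数值答案按数值比较;函数名与规范化规则均为示例假设,并非论文的真实实现):

```python
import re

def table_qa_reward(pred: str, gold: str) -> float:
    """短形式表格 QA 的可验证奖励示意:答案规范化后精确匹配得 1.0,否则 0.0。"""
    def norm(s: str) -> str:
        s = re.sub(r"[\s,$%]", "", s.strip().lower())  # 去空白与常见单位符号
        try:
            return f"{float(s):.6g}"                   # 数值答案统一成数值形式比较
        except ValueError:
            return s
    return 1.0 if norm(pred) == norm(gold) else 0.0

print(table_qa_reward("$1,234", "1234"))    # 1.0
print(table_qa_reward("Paris", " paris "))  # 1.0
print(table_qa_reward("12", "13"))          # 0.0
```

这类奖励无需奖励模型即可程序化验证,正是 RLVR(例如配合 GRPO)可以直接优化的信号形式。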
SWE-rebench:用于任务收集与去污染评估软件工程智能体的自动化流水线
- 标题: SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
- 作者: Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel
- 日期: 2025-05-26
- ArXiv主页: https://arxiv.org/abs/2505.20411
- 论文链接: https://arxiv.org/pdf/2505.20411
英文摘要
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use continuous supply of fresh tasks collected using SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare results of various LLMs on this benchmark to results on SWE-bench Verified and show that performance of some language models might be inflated due to contamination issues.
中文摘要
基于 LLM 的智能体在越来越多的软件工程(SWE)任务中展现出可观能力。然而,该领域的发展面临两大关键挑战。其一,高质量训练数据稀缺,尤其是反映真实 SWE 场景的数据:智能体必须与开发环境交互、执行代码,并根据行动结果调整行为。现有数据集要么局限于一次性代码生成,要么是小规模、人工整理的交互式任务集合,缺乏规模和多样性。其二,新鲜交互式 SWE 任务的缺乏影响了对快速进步的模型的评估,因为静态基准会因污染问题而很快过时。为解决这些局限,我们提出一条新颖、自动化且可扩展的流水线,从多样的 GitHub 仓库中持续抽取真实世界的交互式 SWE 任务。利用该流水线,我们构建了 SWE-rebench,一个包含超过 21,000 个基于 Python 的交互式 SWE 任务的公开数据集,适合大规模地对 SWE 智能体做强化学习。此外,我们利用按 SWE-rebench 方法持续收集的新任务,为智能体软件工程构建了一个无污染的基准。我们将多种 LLM 在该基准上的结果与其在 SWE-bench Verified 上的结果进行比较,表明某些语言模型的性能可能因污染问题而被高估。
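摘要中“无污染的基准”的核心思路,是只用模型训练数据截止日期之后才出现的新任务来评测。一个极简的时间过滤示意(字段名为本文示例假设,并非 SWE-rebench 的真实数据格式):

```python
from datetime import date

def decontaminate(tasks, model_cutoff: date):
    """只保留在模型数据截止日期之后创建的任务,以规避训练集污染(示意)。"""
    return [t for t in tasks if t["created_at"] > model_cutoff]

tasks = [
    {"repo": "a/x", "created_at": date(2024, 11, 3)},
    {"repo": "b/y", "created_at": date(2025, 4, 20)},
]
fresh = decontaminate(tasks, model_cutoff=date(2025, 1, 1))
print([t["repo"] for t in fresh])   # ['b/y']
```

持续抽取新任务的流水线让这种按日期切分的评测可以滚动更新,避免静态基准随时间失效。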
QwenLong-L1:通过强化学习迈向长上下文大型推理模型
- 标题: QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
- 作者: Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan
- 日期: 2025-05-23
- ArXiv主页: https://arxiv.org/abs/2505.17667
- 论文链接: https://arxiv.org/pdf/2505.17667
- gitHub仓库: https://github.com/Tongyi-Zhiwen/QwenLong-L1
英文摘要
Recent large reasoning models (LRMs) have demonstrated strong reasoning capabilities through reinforcement learning (RL). These improvements have primarily been observed within the short-context reasoning tasks. In contrast, extending LRMs to effectively process and reason on long-context inputs via RL remains a critical unsolved challenge. To bridge this gap, we first formalize the paradigm of long-context reasoning RL, and identify key challenges in suboptimal training efficiency and unstable optimization process. To address these issues, we propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Specifically, we utilize a warm-up supervised fine-tuning (SFT) stage to establish a robust initial policy, followed by a curriculum-guided phased RL technique to stabilize the policy evolution, and enhanced with a difficulty-aware retrospective sampling strategy to incentivize the policy exploration. Experiments on seven long-context document question-answering benchmarks demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B, achieving performance on par with Claude-3.7-Sonnet-Thinking, demonstrating leading performance among state-of-the-art LRMs. This work advances the development of practical long-context LRMs capable of robust reasoning across information-intensive environments.
中文摘要
最近的大型推理模型(LRM)通过强化学习(RL)展现了强大的推理能力,但这些提升主要体现在短上下文推理任务中。相比之下,如何通过 RL 让 LRM 有效处理并推理长上下文输入,仍是一个关键的未解难题。为弥合这一差距,我们首先形式化了长上下文推理 RL 的范式,并指出其在训练效率欠佳与优化过程不稳定方面的关键挑战。为解决这些问题,我们提出 QwenLong-L1 框架,通过渐进式上下文扩展使短上下文 LRM 适配长上下文场景。具体而言,我们先用一个预热的监督微调(SFT)阶段建立稳健的初始策略,再用课程引导的分阶段 RL 技术稳定策略演化,并辅以难度感知的回溯采样策略来激励策略探索。在七个长上下文文档问答基准上的实验表明,QwenLong-L1-32B 超越了 OpenAI-o3-mini 和 Qwen3-235B-A22B 等旗舰 LRM,性能与 Claude-3.7-Sonnet-Thinking 相当,在最先进的 LRM 中处于领先。这项工作推动了能在信息密集环境中稳健推理的实用长上下文 LRM 的发展。
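摘要中的“渐进式上下文扩展”与“难度感知的回溯采样”可以合并成一个很小的训练循环骨架(各阶段长度上限、样本字段与权重公式均为本文示例假设,仅示意流程):

```python
import random

STAGE_MAX_LEN = [20_000, 60_000, 120_000]   # 假设的逐阶段上下文长度上限

def retrospective_sample(pool, k, rng):
    """难度感知的回溯采样示意:按 (1 - 历史通过率) 加权,难例被采到的概率更高。"""
    weights = [1.0 - ex["pass_rate"] + 1e-3 for ex in pool]
    return rng.choices(pool, weights=weights, k=k)

rng = random.Random(0)
pool = [{"id": i, "len": 1_000 * (i + 1), "pass_rate": i / 100} for i in range(100)]
for max_len in STAGE_MAX_LEN:
    stage_pool = [ex for ex in pool if ex["len"] <= max_len]  # 当前阶段允许的样本
    batch = retrospective_sample(stage_pool, k=8, rng=rng)
    # 此处对 batch 运行一次 RL 更新(省略)
```

先在短样本上稳住策略,再逐阶段放宽长度上限,同时把采样概率偏向历史通过率低的难例,即可同时覆盖“课程式分阶段”与“回溯采样”两个思想。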
MME-Reasoning:MLLM中逻辑推理的综合基准
- 标题: MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
- 作者: Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, Xiangyu Yue
- 日期: 2025-05-27
- ArXiv主页: https://arxiv.org/abs/2505.21327
- 论文链接: https://arxiv.org/pdf/2505.21327
- 项目链接: https://alpha-innovator.github.io/mmereasoning.github.io/#leaderboard
- gitHub仓库: https://github.com/Alpha-Innovator/MME-Reasoning
英文摘要
Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensively evaluate their reasoning abilities due to the lack of explicit categorization for logical reasoning types and an unclear understanding of reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, which covers all three types of reasoning (i.e., inductive, deductive, and abductive) in its questions. We carefully curate the data to ensure that each question effectively evaluates reasoning ability rather than perceptual skills or knowledge breadth, and extend the evaluation protocols to cover the evaluation of diverse questions. Our evaluation reveals substantial limitations of state-of-the-art MLLMs when subjected to holistic assessments of logical reasoning capabilities. Even the most advanced MLLMs show limited performance in comprehensive logical reasoning, with notable performance imbalances across reasoning types. In addition, we conducted an in-depth analysis of approaches such as ``thinking mode’’ and Rule-based RL, which are commonly believed to enhance reasoning abilities. These findings highlight the critical limitations and performance imbalances of current MLLMs in diverse logical reasoning scenarios, providing comprehensive and systematic insights into the understanding and evaluation of reasoning capabilities.
中文摘要
逻辑推理是人类智能的基本方面,也是多模态大语言模型(MLLM)的一项关键能力。尽管多模态推理取得了显著进展,但由于缺乏对逻辑推理类型的明确分类以及对推理本身理解不清,现有基准无法全面评估模型的推理能力。为此,我们提出 MME-Reasoning,一个旨在评估 MLLM 推理能力的综合基准,其问题覆盖全部三类推理(归纳、演绎与溯因)。我们精心整理数据,确保每个问题有效评估的是推理能力,而非感知技能或知识广度,并扩展了评估协议以覆盖多样化问题的评估。我们的评估揭示,最先进的 MLLM 在接受逻辑推理能力的整体测评时存在明显局限:即便是最先进的 MLLM,在全面的逻辑推理上表现也很有限,且不同推理类型之间存在显著的性能失衡。此外,我们深入分析了“思考模式”和基于规则的 RL 等通常被认为能增强推理能力的方法。这些发现凸显了当前 MLLM 在多样逻辑推理场景中的关键局限与性能失衡,为理解和评估推理能力提供了全面而系统的洞见。
Alchemist:把公开文生图数据变成生成式黄金
- 标题: Alchemist: Turning Public Text-to-Image Data into Generative Gold
- 作者: Valerii Startsev, Alexander Ustyuzhanin, Alexey Kirillov, Dmitry Baranchuk, Sergey Kastryulin
- 日期: 2025-05-25
- ArXiv主页: https://arxiv.org/abs/2505.19297
- 论文链接: https://arxiv.org/pdf/2505.19297
英文摘要
Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness highly depends on the quality of the fine-tuning dataset. Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and the creation of high-quality, general-purpose SFT datasets remains a significant challenge. Current curation methods are often costly and struggle to identify truly impactful samples. This challenge is further complicated by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress. This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. Additionally, we release the fine-tuned models’ weights to the public.
中文摘要
预训练赋予文生图(T2I)模型广泛的世界知识,但仅凭这一点往往不足以达到较高的美学质量和对齐度。因此,监督微调(SFT)对进一步精炼至关重要,而其效果高度取决于微调数据集的质量。现有公开 SFT 数据集往往针对狭窄领域(如动漫或特定艺术风格),构建高质量、通用的 SFT 数据集仍是重大挑战。当前的数据筛选方法通常成本高昂,且难以识别真正有影响力的样本。公开通用数据集的稀缺使这一挑战更加复杂:领先模型往往依赖大规模、专有且缺乏文档的内部数据,阻碍了更广泛的研究进展。本文提出一种创建通用 SFT 数据集的新方法:利用一个预训练生成模型来估计高影响力的训练样本。我们应用该方法构建并发布了 Alchemist,一个紧凑(3,350 个样本)却高效的 SFT 数据集。实验表明,Alchemist 在保持多样性和风格的同时,显著提升了五个公开 T2I 模型的生成质量。此外,我们还向公众发布了微调后模型的权重。
通过检索与代码工具将LLM智能体蒸馏为小模型
- 标题: Distilling LLM Agent into Small Models with Retrieval and Code Tools
- 作者: Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang
- 日期: 2025-05-23
- ArXiv主页: https://arxiv.org/abs/2505.17612
- 论文链接: https://arxiv.org/pdf/2505.17612
- gitHub仓库: https://github.com/ThomasVuNguyen/agent-distillation
英文摘要
Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.
中文摘要
大型语言模型(LLM)擅长复杂推理任务,但计算开销高昂,限制了实际部署。为此,近期工作聚焦于利用教师 LLM 的思维链(CoT)轨迹,把推理能力蒸馏到更小的语言模型(sLM)中。然而,在需要罕见事实知识或精确计算的场景下,这种方法表现不佳:sLM 由于能力有限常常产生幻觉。在这项工作中,我们提出智能体蒸馏(Agent Distillation),一个不仅迁移推理能力,而且把完整的任务求解行为从基于 LLM 的智能体迁移到配备检索与代码工具的 sLM 的框架。我们沿两条互补的轴线改进智能体蒸馏:(1) 提出一种名为 first-thought prefix 的提示方法,以提升教师生成轨迹的质量;(2) 提出自一致的动作生成,以提高小智能体在测试时的鲁棒性。我们在横跨事实与数学领域的八个推理任务上评估该方法,覆盖域内与域外泛化。结果显示,小至 0.5B、1.5B、3B 参数的 sLM,可以达到与用 CoT 蒸馏微调的上一档更大的 1.5B、3B、7B 模型相竞争的性能,展示了智能体蒸馏在构建实用的、会用工具的小型智能体方面的潜力。我们的代码见 https://github.com/Nardien/agent-distillation。
Quartet:原生FP4训练对大语言模型可以是最优的
- 标题: Quartet: Native FP4 Training Can Be Optimal for Large Language Models
- 作者: Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh
- 日期: 2025-05-20
- ArXiv主页: https://arxiv.org/abs/2505.14669
- 论文链接: https://arxiv.org/pdf/2505.14669
- gitHub仓库: https://github.com/IST-DASLab/Quartet
英文摘要
The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training models directly in low-precision arithmetic offers a solution, by improving both computational throughput and energy efficiency. Specifically, NVIDIA’s recent Blackwell architecture facilitates extremely low-precision operations, specifically FP4 variants, promising substantial efficiency gains. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we systematically investigate hardware-supported FP4 training and introduce Quartet, a new approach enabling accurate, end-to-end FP4 training with all the major computations (in e.g. linear layers) being performed in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across varying bit-widths and allows us to identify a “near-optimal” low-precision training technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for NVIDIA Blackwell GPUs, and show that it can achieve state-of-the-art accuracy for FP4 precision, successfully training billion-scale models. Our method demonstrates that fully FP4-based training is a competitive alternative to standard-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
中文摘要
大型语言模型(LLM)的快速发展伴随着前所未有的计算需求增长,最先进模型的训练成本每隔几个月就翻一番。直接以低精度算术训练模型提供了一条出路,可同时提高计算吞吐量和能源效率。具体而言,NVIDIA最近的Blackwell架构支持极低精度运算(特别是FP4变体),有望带来可观的效率收益。然而,当前以FP4精度训练LLM的算法面临明显的精度下降,且往往依赖混合精度回退。在本文中,我们系统研究了硬件支持的FP4训练,并提出Quartet,一种能够实现准确的端到端FP4训练的新方法,所有主要计算(例如线性层)均以低精度执行。通过在Llama类模型上的大量评估,我们揭示了一条新的低精度缩放定律,它量化了不同位宽下的性能权衡,并让我们得以在"精度对计算量"的意义上识别出一种"近乎最优"的低精度训练技术,即Quartet。我们使用为NVIDIA Blackwell GPU定制优化的CUDA内核实现Quartet,并证明它能在FP4精度下达到最先进的准确率,成功训练十亿参数规模的模型。我们的方法表明,完全基于FP4的训练是标准精度和FP8训练的有竞争力的替代方案。我们的代码可在 https://github.com/IST-DASLab/Quartet 获取。
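FP4(E2M1)训练带来的精度损失,可以用一次"量化-反量化"直观感受。下面是一个纯Python草图(按张量取缩放因子、最近邻取格点是常见做法的假设性简化,并非Quartet内核的实现):

```python
# FP4 (E2M1) 可表示的幅值格点
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant_fp4(xs):
    """对一组权重做量化-反量化模拟:按张量缩放到[-6, 6],再取最近的FP4格点。"""
    scale = max(abs(x) for x in xs) / 6.0 or 1.0
    out = []
    for x in xs:
        mag = min(abs(x) / scale, 6.0)
        q = min(FP4_GRID, key=lambda g: abs(g - mag))   # 最近邻取整
        out.append(q * scale * (1 if x >= 0 else -1))
    return out

w = [0.9, -0.3, 0.05, -0.62]
w_q = fake_quant_fp4(w)   # 大值保留较好,小值被迫落到稀疏的格点上
```

可以看到E2M1格点在0附近稠密、远离0稀疏,这正是低精度训练需要用缩放与算法设计去补偿的误差来源。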
R2R:通过大小模型令牌路由高效导航发散的推理路径
- 标题: R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
- 作者: Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
- 日期: 2025-05-27
- ArXiv主页: https://arxiv.org/abs/2505.21600
- 论文链接: https://arxiv.org/pdf/2505.21600
- 项目链接: https://fuvty.github.io/R2R_Project_Page/
- gitHub仓库: https://github.com/thu-nics/R2R
英文摘要
Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing substantial deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs’ reasoning paths. Luckily, we reveal that only a small fraction of tokens genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce **Roads to Rome (R2R)**, a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.
中文摘要
大型语言模型(LLM)以高昂的推理开销为代价实现了令人印象深刻的推理能力,带来了巨大的部署挑战。尽管经过蒸馏的小语言模型(SLM)显著提升了效率,但由于无法复现LLM的推理路径,其性能有所折损。幸运的是,我们发现只有一小部分令牌会真正使LLM与SLM的推理路径产生分歧;大多数生成的令牌要么完全相同,要么只是中性差异,例如缩写或表达上的细微变化。基于这一洞察,我们提出**条条大路通罗马(Roads to Rome, R2R)**,一种神经令牌路由方法:仅在这些关键的路径分歧令牌上选择性调用LLM,而将绝大多数令牌的生成交给SLM。我们还开发了一条自动数据生成管线,用于识别分歧令牌并生成令牌级路由标签,以训练轻量级路由器。我们用R2R组合DeepSeek家族的R1-1.5B与R1-32B模型,并在具有挑战性的数学、编码和问答基准上评估。在平均激活参数量为5.6B的情况下,R2R的平均准确率达到R1-7B的1.6倍,甚至优于R1-14B模型。与R1-32B相比,它在性能相当的前提下带来2.8倍的实际运行时间加速,推进了测试时扩展效率的帕累托前沿。我们的代码可在 https://github.com/thu-nics/r2r 获取。
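R2R的令牌级路由可以抽象为:SLM先提议下一个令牌,路由器判定该位置是否为"路径分歧"令牌,是则改为调用LLM。以下为假设性的玩具草图(slm_next、llm_next、router均为虚构占位函数,并非论文中的神经路由器):

```python
def r2r_decode(prompt, slm_next, llm_next, router, max_tokens=8):
    """逐令牌解码:默认采用SLM的候选令牌,仅在路由器判定分歧时调用LLM。"""
    tokens, llm_calls = [], 0
    for _ in range(max_tokens):
        ctx = prompt + "".join(tokens)
        cand = slm_next(ctx)
        if router(cand):              # 关键的路径分歧令牌:改用LLM
            cand = llm_next(ctx)
            llm_calls += 1
        tokens.append(cand)
    return "".join(tokens), llm_calls

# 玩具示例:SLM每4个位置提议一次"x",路由器将"x"视为分歧令牌并交给LLM
slm = lambda ctx: "x" if len(ctx) % 4 == 0 else "a"
out, calls = r2r_decode("", slm, lambda ctx: "B", lambda tok: tok == "x")
```

在这个8令牌的例子中LLM只被调用2次,体现了"绝大多数令牌由SLM生成"的开销结构。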
Spatial-MLLM:提升MLLM基于视觉的空间智能能力
- 标题: Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
- 作者: Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan
- 日期: 2025-05-29
- ArXiv主页: https://arxiv.org/abs/2505.23747
- 论文链接: https://arxiv.org/pdf/2505.23747
- 项目链接: https://diankun-wu.github.io/Spatial-MLLM/
- gitHub仓库: https://github.com/diankun-wu/Spatial-MLLM
英文摘要
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder-initialized from the backbone of the visual geometry model-to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that our spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.
中文摘要
多模态大语言模型(MLLM)的最新进展显著提升了2D视觉任务的性能,但改进其空间智能仍是一个挑战。现有的3D MLLM总是依赖额外的3D或2.5D数据来引入空间感知,这限制了它们在只有2D输入(例如图像或视频)场景中的实用性。在本文中,我们提出Spatial-MLLM,一个仅凭2D观测进行基于视觉的空间推理的新框架。与依赖为语义理解优化的基于CLIP的视觉编码器的传统视频MLLM不同,我们的关键见解是释放前馈视觉几何基础模型中的强结构先验。具体而言,我们提出一种双编码器架构:一个预训练的2D视觉编码器提取语义特征,一个从视觉几何模型主干初始化的空间编码器提取3D结构特征;随后由连接器将两类特征整合为统一的视觉令牌,以增强空间理解。此外,我们在推理时提出一种空间感知的帧采样策略,选取视频序列中空间信息量大的帧,确保即使在有限的令牌长度下,模型也能聚焦于对空间推理至关重要的帧。除架构改进之外,我们还构建了Spatial-MLLM-120k数据集,并使用监督微调和GRPO在其上训练模型。在多个真实世界数据集上的大量实验表明,我们的Spatial-MLLM在广泛的基于视觉的空间理解与推理任务中达到了最先进的性能。项目页面:https://diankun-wu.github.io/spatial-mllm/。
SynLogic:大规模合成可验证的推理数据,用于学习逻辑推理及其他
- 标题: SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond
- 作者: Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, Junxian He
- 日期: 2025-05-26
- ArXiv主页: https://arxiv.org/abs/2505.19641
- 论文链接: https://arxiv.org/pdf/2505.19641
- 项目链接: https://huggingface.co/datasets/MiniMaxAI/SynLogic
- gitHub仓库: https://github.com/MiniMax-AI/SynLogic
英文摘要
Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhance reasoning abilities in Large Language Models (LLMs). While open-source replication efforts have primarily focused on mathematical and coding domains, methods and resources for developing general reasoning capabilities remain underexplored. This gap is partly due to the challenge of collecting diverse and verifiable reasoning data suitable for RL. We hypothesize that logical reasoning is critical for developing general reasoning capabilities, as logic forms a fundamental building block of reasoning. In this work, we present SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. The SynLogic approach enables controlled synthesis of data with adjustable difficulty and quantity. Importantly, all examples can be verified by simple rules, making them ideally suited for RL with verifiable rewards. In our experiments, we validate the effectiveness of RL training on the SynLogic dataset based on 7B and 32B models. SynLogic leads to state-of-the-art logical reasoning performance among open-source datasets, surpassing DeepSeek-R1-Distill-Qwen-32B by 6 points on BBEH. Furthermore, mixing SynLogic data with mathematical and coding tasks improves the training efficiency of these domains and significantly enhances reasoning generalization. Notably, our mixed training model outperforms DeepSeek-R1-Zero-Qwen-32B across multiple benchmarks. These findings position SynLogic as a valuable resource for advancing the broader reasoning capabilities of LLMs. We open-source both the data synthesis pipeline and the SynLogic dataset at https://github.com/MiniMax-AI/SynLogic.
中文摘要
OpenAI-o1和DeepSeek R1等最新进展证明了强化学习(RL)在增强大语言模型(LLM)推理能力方面的潜力。开源复现工作主要集中在数学和编码领域,而培养通用推理能力的方法与资源仍未得到充分探索。这一缺口部分源于难以收集适合RL的多样且可验证的推理数据。我们假设逻辑推理对培养通用推理能力至关重要,因为逻辑是推理的基本构件。在这项工作中,我们提出SynLogic,一个可大规模生成多样逻辑推理数据的数据合成框架与数据集,涵盖35类不同的逻辑推理任务。SynLogic方法支持以可调的难度和数量受控地合成数据。重要的是,所有样例都可以用简单规则验证,非常适合带可验证奖励的RL。在实验中,我们基于7B和32B模型验证了在SynLogic数据集上进行RL训练的有效性。SynLogic在开源数据集中取得了最先进的逻辑推理性能,在BBEH上超过DeepSeek-R1-Distill-Qwen-32B达6分。此外,将SynLogic数据与数学和编码任务混合训练,可提高这些领域的训练效率,并显著增强推理泛化能力。值得注意的是,我们的混合训练模型在多个基准上优于DeepSeek-R1-Zero-Qwen-32B。这些发现使SynLogic成为推进LLM更广泛推理能力的宝贵资源。我们在 https://github.com/MiniMax-AI/SynLogic 开源了数据合成管线和SynLogic数据集。
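可由规则验证的合成推理数据,其要点是生成器同时产出题目与可程序化核验的答案,验证器即可直接给出RL的0/1奖励。下面是一道极简蕴含链任务的示意(题型与字段均为假设,并非SynLogic的35类任务的官方实现):

```python
import random

def make_task(n, seed=0):
    """合成一道蕴含链逻辑题:给出打乱顺序的 A -> B, B -> C, ...,问起点为真时终点是否为真。"""
    rng = random.Random(seed)
    names = [chr(ord("A") + i) for i in range(n)]                 # 难度可随链长n调节
    facts = [f"{names[i]} -> {names[i + 1]}" for i in range(n - 1)]
    rng.shuffle(facts)
    question = f"已知 {', '.join(facts)};若 {names[0]} 为真,{names[-1]} 是否为真?"
    return question, "yes"                                        # 按构造方式答案恒为 yes

def verify(pred, gold):
    """简单规则验证器:无需人工标注即可核验模型输出。"""
    return pred.strip().lower() == gold

q, gold = make_task(4)
```

这类"生成即自带标准答案"的构造,正是摘要中"所有样例都可以用简单规则验证"的含义。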
攀登过程比顶峰更深地雕琢智慧:论学习推理中的噪声奖励
- 标题: The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
- 作者: Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
- 日期: 2025-05-28
- ArXiv主页: https://arxiv.org/abs/2505.22653
- 论文链接: https://arxiv.org/pdf/2505.22653
- gitHub仓库: https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason
英文摘要
Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function’s outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by only rewarding the appearance of key reasoning phrases (namely reasoning pattern reward, RPR), such as ``first, I need to’‘-without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM’s performance on open-ended tasks. These findings suggest the importance of improving models’ foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.
中文摘要
近期关于通过强化学习(RL)对大语言模型(LLM)进行后训练以提升推理能力的研究,通常聚焦于可以被准确验证和奖励的任务,例如解数学题。相比之下,我们的研究考察奖励噪声的影响,这对使用奖励模型对LLM进行后训练的真实场景更具现实意义。我们发现LLM对大幅度的奖励噪声表现出很强的鲁棒性。例如,在数学任务中手动翻转奖励函数40%的输出,Qwen-2.5-7B模型仍能快速收敛,将数学任务上的表现从5%提升到72%,而使用无噪声奖励训练的模型准确率为75%。令人惊讶的是,仅奖励关键推理短语的出现(即推理模式奖励,RPR),例如"首先,我需要……",而完全不验证答案的正确性,模型也能达到与使用严格正确性验证和准确奖励训练的模型相当的下游性能峰值(Qwen-2.5-7B超过70%的准确率)。认识到推理过程相对最终结果的重要性,我们将RPR与带噪声的奖励模型相结合。RPR帮助校准了噪声奖励模型,减轻了潜在的假阴性,并提升了LLM在开放式任务上的表现。这些发现表明,在预训练阶段提升模型的基础能力十分重要,同时也为改进后训练技术提供了启示。我们的代码和脚本可在 https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason 获取。
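摘要中"手动翻转奖励函数40%的输出"可以包装成一个奖励函数装饰器来模拟。下面是一个假设性示意(reward_fn与样本均为玩具占位,非论文训练代码):

```python
import random

def with_reward_noise(reward_fn, flip_p, rng):
    """以概率 flip_p 翻转 0/1 奖励,用于模拟注入奖励噪声后的RL训练信号。"""
    def noisy(sample):
        r = reward_fn(sample)
        return 1 - r if rng.random() < flip_p else r
    return noisy

clean = lambda s: 1 if s == "correct" else 0
noisy = with_reward_noise(clean, flip_p=0.4, rng=random.Random(0))
# 在1万次调用中,约有40%的奖励被翻转
flipped = sum(noisy("correct") != 1 for _ in range(10_000))
```

被翻转的比例会集中在40%附近,这样的包装器便于在不改动验证器的前提下做奖励噪声消融实验。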
OmniConsistency:从配对风格化数据中学习与风格无关的一致性
- 标题: OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data
- 作者: Yiren Song, Cheng Liu, Mike Zheng Shou
- 日期: 2025-05-24
- ArXiv主页: https://arxiv.org/abs/2505.18445
- 论文链接: https://arxiv.org/pdf/2505.18445
- gitHub仓库: https://github.com/showlab/OmniConsistency
英文摘要
Diffusion models have advanced image stylization significantly, yet two core challenges persist: (1) maintaining consistent stylization in complex scenes, particularly identity, composition, and fine details, and (2) preventing style degradation in image-to-image pipelines with style LoRAs. GPT-4o’s exceptional stylization consistency highlights the performance gap between open-source methods and proprietary models. To bridge this gap, we propose OmniConsistency, a universal consistency plugin leveraging large-scale Diffusion Transformers (DiTs). OmniConsistency contributes: (1) an in-context consistency learning framework trained on aligned image pairs for robust generalization; (2) a two-stage progressive learning strategy decoupling style learning from consistency preservation to mitigate style degradation; and (3) a fully plug-and-play design compatible with arbitrary style LoRAs under the Flux framework. Extensive experiments show that OmniConsistency significantly enhances visual coherence and aesthetic quality, achieving performance comparable to commercial state-of-the-art model GPT-4o.
中文摘要
扩散模型显著推进了图像风格化,但两个核心挑战依然存在:(1)在复杂场景中保持一致的风格化,尤其是身份、构图和细节;(2)防止带风格LoRA的图像到图像管线中的风格退化。GPT-4o出色的风格化一致性凸显了开源方法与专有模型之间的性能差距。为弥合这一差距,我们提出OmniConsistency,一个利用大规模扩散Transformer(DiT)的通用一致性插件。OmniConsistency的贡献包括:(1)一个在对齐图像对上训练、具备稳健泛化能力的上下文一致性学习框架;(2)一种两阶段渐进式学习策略,将风格学习与一致性保持解耦,以缓解风格退化;(3)在Flux框架下与任意风格LoRA兼容的完全即插即用设计。大量实验表明,OmniConsistency显著提升了视觉连贯性和美学质量,达到了与商业最先进模型GPT-4o相当的性能。
推理模型是固执的:诊断推理模型中的指令覆盖
- 标题: Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models
- 作者: Doohyuk Jang, Yoonjeon Kim, Chanjae Park, Hyun Ryu, Eunho Yang
- 日期: 2025-05-22
- ArXiv主页: https://arxiv.org/abs/2505.17225
- 论文链接: https://arxiv.org/pdf/2505.17225
- 项目链接: https://reasoningtrap.github.io/
- gitHub仓库: https://github.com/ReasoningTrap/ReasoningTrap
英文摘要
Large language models have demonstrated remarkable proficiency in long and complex reasoning tasks. However, they frequently exhibit a problematic reliance on familiar reasoning patterns, a phenomenon we term reasoning rigidity. Despite explicit instructions from users, these models often override clearly stated conditions and default to habitual reasoning trajectories, leading to incorrect conclusions. This behavior presents significant challenges, particularly in domains such as mathematics and logic puzzle, where precise adherence to specified constraints is critical. To systematically investigate reasoning rigidity, a behavior largely unexplored in prior work, we introduce a expert-curated diagnostic set, . Our dataset includes specially modified variants of existing mathematical benchmarks, namely AIME and MATH500, as well as well-known puzzles deliberately redesigned to require deviation from familiar reasoning strategies. Using this dataset, we identify recurring contamination patterns that occur when models default to ingrained reasoning. Specifically, we categorize this contamination into three distinctive modes: (i) Interpretation Overload, (ii) Input Distrust, and (iii) Partial Instruction Attention, each causing models to ignore or distort provided instructions. We publicly release our diagnostic set to facilitate future research on mitigating reasoning rigidity in language models.
中文摘要
大型语言模型在长而复杂的推理任务上表现出了卓越的能力。然而,它们经常表现出对熟悉推理模式的过度依赖,我们称这一现象为推理僵化(reasoning rigidity)。即便用户给出了明确的指令,这些模型也常常无视明确陈述的条件,退回到习惯性的推理轨迹,从而得出错误结论。这种行为带来了重大挑战,尤其是在数学和逻辑谜题等领域,精确遵守给定约束至关重要。为了系统研究这种此前基本未被探索的推理僵化行为,我们引入了一个由专家策划的诊断集。我们的数据集包括对现有数学基准(即AIME和MATH500)的特殊修改变体,以及刻意重新设计、要求偏离熟悉推理策略的知名谜题。使用该数据集,我们识别出模型退回固有推理时反复出现的污染模式,并将其归为三种典型模式:(i)解释过载,(ii)输入不信任,(iii)部分指令关注,每种都会导致模型忽略或扭曲所给指令。我们公开发布该诊断集,以促进未来关于缓解语言模型推理僵化的研究。
BizFinBench:面向业务的真实世界金融基准,用于评估LLM
- 标题: BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
- 作者: Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu
- 日期: 2025-05-26
- ArXiv主页: https://arxiv.org/abs/2505.19457
- 论文链接: https://arxiv.org/pdf/2505.19457
- 项目链接: https://hithink-research.github.io/BizFinBench/
- gitHub仓库: https://github.com/HiThink-Research/BizFinBench
英文摘要
Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
中文摘要
大型语言模型在通用任务上表现出色,但在金融、法律、医疗等逻辑密集、对精度要求极高的领域评估其可靠性仍然困难。为此,我们提出BizFinBench,这是第一个专门用于评估LLM在真实金融应用中表现的基准。BizFinBench由6,781条精心标注的中文查询组成,覆盖五个维度:数值计算、推理、信息抽取、预测识别和基于知识的问答,细分为九个细粒度类别。该基准同时包含客观与主观指标。我们还提出IteraJudge,一种新颖的LLM评估方法,可在LLM充当客观指标评估者时减少偏差。我们对25个模型进行了基准测试,包括专有系统和开源系统。大量实验表明,没有模型能在所有任务上占据主导。我们的评估揭示出不同的能力模式:(1)数值计算方面,Claude-3.5-Sonnet(63.18)和DeepSeek-R1(64.04)领先,而Qwen2.5-VL-3B(15.92)等较小模型明显落后;(2)推理方面,专有模型占优(ChatGPT-o3:83.58,Gemini-2.0-Flash:81.15),开源模型落后多达19.49分;(3)信息抽取方面,性能差距最大,DeepSeek-R1得分71.46,而Qwen3-1.7B仅为11.23;(4)预测识别方面,性能差异最小,顶级模型得分介于39.16与50.00之间。我们发现,尽管当前LLM能够胜任常规金融查询,但在需要跨概念推理的复杂场景中仍然吃力。BizFinBench为未来研究提供了一个严格且贴合业务的基准。代码和数据集可在 https://github.com/HiThink-Research/BizFinBench 获取。
探索LLM一步式文本生成的潜在能力
- 标题: Exploring the Latent Capacity of LLMs for One-Step Text Generation
- 作者: Gleb Mezentsev, Ivan Oseledets
- 日期: 2025-05-27
- ArXiv主页: https://arxiv.org/abs/2505.21189
- 论文链接: https://arxiv.org/pdf/2505.21189
- gitHub仓库: https://github.com/Glebzok/OneStepLLMGeneration
英文摘要
A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one specially trained input embedding. In this work, we explore whether such reconstruction is possible without autoregression. We show that frozen LLMs can generate hundreds of accurate tokens in just one forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored capability of LLMs - multi-token generation without iterative decoding. We investigate the behaviour of these embeddings and provide insight into the type of information they encode. We also empirically show that although these representations are not unique for a given text, they form connected and local regions in embedding space - a property that suggests the potential of learning a dedicated encoder into that space.
中文摘要
最近的一项研究表明,大型语言模型(LLM)仅凭一个经过特殊训练的输入嵌入,就能通过自回归生成重建出长得惊人的文本(多达数千个令牌)。在这项工作中,我们探讨在没有自回归的情况下这种重建是否可行。我们证明,仅提供两个学习得到的嵌入,冻结的LLM就能在一次前向传播中生成数百个准确的令牌。这揭示了LLM一种令人惊讶且尚未被充分探索的能力:无需迭代解码的多令牌生成。我们研究了这些嵌入的行为,并洞察了它们所编码的信息类型。我们还通过实验表明,尽管这些表示对给定文本并不唯一,但它们在嵌入空间中形成了连通的局部区域,这一性质暗示了向该空间学习一个专用编码器的潜力。
一个RL看遍一切:视觉三重统一强化学习
- 标题: One RL to See Them All: Visual Triple Unified Reinforcement Learning
- 作者: Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan
- 日期: 2025-05-23
- ArXiv主页: https://arxiv.org/abs/2505.18129
- 论文链接: https://arxiv.org/pdf/2505.18129
- gitHub仓库: https://github.com/MiniMax-AI/One-RL-to-See-Them-All
英文摘要
Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perceptionintensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises triple complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers) , and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Subsequently, Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.
中文摘要
强化学习(RL)已显著提升视觉语言模型(VLM)的推理能力。然而,RL在推理任务之外的应用在很大程度上仍未被探索,尤其是目标检测、视觉定位等感知密集型任务。我们提出V-Triune,一个视觉三重统一强化学习系统,使VLM能在单一训练管线中联合学习视觉推理与感知任务。V-Triune包含三个互补组件:样本级数据格式化(统一多样的任务输入)、验证器级奖励计算(通过专用验证器提供定制奖励)和源级指标监控(在数据源层面诊断问题)。我们进一步引入一种新颖的动态IoU奖励,为V-Triune处理的感知任务提供自适应、渐进且明确的反馈。我们的方法基于开源7B和32B骨干模型,在现成的RL训练框架内实例化。所得模型名为Orsta(One RL to See Them All),在推理和感知任务上均表现出一致的提升。这种广泛能力在很大程度上得益于在多样数据集上的训练,该数据集围绕四个代表性视觉推理任务(数学、谜题、图表和科学)和四个视觉感知任务(定位、检测、计数和OCR)构建。随后,Orsta在MEGA-Bench Core上取得了可观的收益,其各种7B和32B模型变体的提升从+2.1到令人印象深刻的+14.1不等,性能优势还扩展到广泛的下游任务。这些结果凸显了我们统一RL方法对VLM的有效性和可扩展性。V-Triune系统与Orsta模型已在 https://github.com/minimax-ai 公开。
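摘要中的动态IoU奖励大致可以理解为:IoU按标准方式计算,而判定阈值随训练步数逐步收紧,使反馈先宽松后严格。下面是一个纯Python草图(阈值日程为假设值,并非论文的实际超参数):

```python
def iou(a, b):
    """两个 [x1, y1, x2, y2] 框的交并比。"""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def dynamic_iou_reward(pred, gt, step, schedule=((0, 0.5), (1000, 0.75), (2000, 0.95))):
    """训练前期用宽松IoU阈值、后期逐步收紧的0/1奖励(阈值日程为假设示例)。"""
    thr = max(t for s, t in schedule if step >= s)   # 取当前步已生效的最严阈值
    return 1.0 if iou(pred, gt) >= thr else 0.0

# 同一个预测框(IoU约0.67)在训练早期得到奖励,到后期则不再达标
r_early = dynamic_iou_reward([0, 0, 10, 10], [2, 0, 12, 10], step=0)
r_late = dynamic_iou_reward([0, 0, 10, 10], [2, 0, 12, 10], step=2000)
```

这种渐进阈值让粗糙但方向正确的早期预测也能获得学习信号,同时在后期仍保持对精确定位的要求。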
不要想太多:偏好较短的思维链以改进LLM推理
- 标题: Don’t Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning
- 作者: Michael Hassid, Gabriel Synnaeve, Yossi Adi, Roy Schwartz
- 日期: 2025-05-23
- ArXiv主页: https://arxiv.org/abs/2505.17813
- 论文链接: https://arxiv.org/pdf/2505.17813
英文摘要
Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive “thinking” chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that long thinking chains results in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers - up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen using majority voting among these m chains. Basic short-1@k demonstrates similar or even superior performance over standard majority voting in low-compute settings - using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets, while still being substantially faster (up to 33% wall time reduction). Inspired by our results, we finetune an LLM using short, long, and randomly selected reasoning chains. We then observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer “thinking” does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.
中文摘要
推理型大语言模型(LLM)严重依赖扩展测试时计算,通过生成冗长的"思考"链来完成复杂推理任务。这种方法虽然效果出色,却带来了可观的计算成本和推理时间。在这项工作中,我们挑战"思考链越长推理能力越强"这一假设。我们首先证明,对同一问题而言,较短的推理链得到正确答案的可能性显著更高,比针对同一问题采样到的最长链最多准确34.5%。基于这些结果,我们提出short-m@k,一种新颖的推理LLM推断方法。该方法并行执行k次独立生成,并在最先完成的m个思考过程结束后即停止计算,最终答案由这m条链的多数投票选出。基础的short-1@k在低计算预算下表现与标准多数投票相当甚至更优,同时最多节省40%的思考令牌。short-3@k虽然效率略低于short-1@k,但在所有计算预算下都稳定超过多数投票,同时仍然显著更快(实际运行时间最多减少33%)。受这些结果启发,我们分别使用短、长和随机选择的推理链微调LLM,并观察到在较短链上训练会带来更好的性能。我们的发现提示应重新思考当前推理LLM的测试时计算方式,并强调更长的"思考"并不必然转化为更好的性能,反而可能反直觉地导致结果退化。
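short-m@k的核心流程是:并行采样k条链,只保留最先完成的m条(下面以思考长度近似"先完成"),再对其答案做多数投票。以下为纯Python示意(链数据为虚构示例,非论文实现):

```python
from collections import Counter

def short_m_at_k(chains, m):
    """short-m@k:从k条已采样的链中取思考最短的m条,对答案做多数投票。"""
    finished = sorted(chains, key=lambda c: len(c["thinking"]))[:m]
    votes = Counter(c["answer"] for c in finished)
    return votes.most_common(1)[0][0]

# 玩具示例:5条链,各自的思考长度与最终答案
chains = [
    {"thinking": "t" * 120, "answer": "7"},
    {"thinking": "t" * 40,  "answer": "6"},
    {"thinking": "t" * 300, "answer": "7"},
    {"thinking": "t" * 60,  "answer": "6"},
    {"thinking": "t" * 90,  "answer": "5"},
]
ans = short_m_at_k(chains, m=3)
```

注意与全量多数投票的差别:5条链的全量投票会选"7",而只看最短3条则选"6",对应摘要中"较短链更可能正确"的偏好。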
VF-Eval:评估多模态LLM在AIGC视频上生成反馈的能力
- 标题: VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
- 作者: Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao
- 日期: 2025-05-29
- ArXiv主页: https://arxiv.org/abs/2505.23693
- 论文链接: https://arxiv.org/pdf/2505.23693
- gitHub仓库: https://github.com/SighingSnow/VF-EVAL
英文摘要
MLLMs have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs on interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks-coherence validation, error awareness, error type detection, and reasoning evaluation-to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.
中文摘要
近来,多模态大语言模型(MLLM)在视频问答任务上得到了广泛研究。然而,大多数现有评测聚焦于自然视频,忽视了合成视频,例如AI生成内容(AIGC)。与此同时,一些视频生成工作依赖MLLM来评估生成视频的质量,但MLLM解读AIGC视频的能力在很大程度上仍未被充分探索。为此,我们提出一个新基准VF-Eval,它引入四项任务:一致性验证、错误感知、错误类型检测和推理评估,以全面评估MLLM在AIGC视频上的能力。我们在VF-Eval上评测了13个前沿MLLM,发现即便是表现最好的GPT-4.1,也难以在所有任务上保持稳定的良好表现,凸显了该基准的挑战性。此外,为考察VF-Eval在改进视频生成方面的实际应用,我们开展了RePrompt实验,表明让MLLM与人类反馈更紧密对齐有利于视频生成。
Skywork Open Reasoner 1 技术报告
- 标题: Skywork Open Reasoner 1 Technical Report
- 作者: Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, Yahui Zhou
- 日期: 2025-05-28
- ArXiv主页: https://arxiv.org/abs/2505.22312
- 论文链接: https://arxiv.org/pdf/2505.22312
- gitHub仓库: https://github.com/SkyworkAI/Skywork-OR1
英文摘要
The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present Skywork-OR1, an effective and scalable RL implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains, increasing average accuracy across AIME24, AIME25, and LiveCodeBench from 57.8% to 72.8% (+15.0%) for the 32B model and from 43.6% to 57.5% (+13.9%) for the 7B model. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks, while achieving comparable results on LiveCodeBench. The Skywork-OR1-7B and Skywork-OR1-Math-7B models demonstrate competitive reasoning capabilities among models of similar size. We perform comprehensive ablation studies on the core components of our training pipeline to validate their effectiveness. Additionally, we thoroughly investigate the phenomenon of entropy collapse, identify key factors affecting entropy dynamics, and demonstrate that mitigating premature entropy collapse is critical for improved test performance. To support community research, we fully open-source our model weights, training code, and training datasets.
中文摘要
DeepSeek-R1的成功凸显了强化学习(RL)在增强大语言模型(LLM)推理能力方面的重要作用。在这项工作中,我们提出Skywork-OR1,一种面向长思维链(CoT)模型的有效且可扩展的RL实现。基于DeepSeek-R1-Distill模型系列,我们的RL方法取得了显著的性能提升:在AIME24、AIME25和LiveCodeBench上的平均准确率,32B模型从57.8%提升至72.8%(+15.0%),7B模型从43.6%提升至57.5%(+13.9%)。我们的Skywork-OR1-32B模型在AIME24和AIME25基准上超过了DeepSeek-R1和Qwen3-32B,同时在LiveCodeBench上取得可比结果。Skywork-OR1-7B和Skywork-OR1-Math-7B模型在同等规模的模型中展现出有竞争力的推理能力。我们对训练管线的核心组件进行了全面的消融研究以验证其有效性。此外,我们深入研究了熵塌缩现象,识别出影响熵动态的关键因素,并证明缓解过早的熵塌缩对提升测试性能至关重要。为支持社区研究,我们完全开源了模型权重、训练代码和训练数据集。
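摘要中的"熵塌缩"指策略分布的熵在训练中过早骤降。监控它只需对每步的下一令牌分布计算香农熵并观察其随训练的走势。下面是一个最小示意(分布数值为假设示例,非论文数据):

```python
import math

def policy_entropy(probs):
    """单步策略分布的香农熵;训练中平均熵骤降即熵塌缩的信号。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25] * 4                   # 高熵:探索充分
collapsed = [0.97, 0.01, 0.01, 0.01]   # 低熵:接近塌缩
h_u, h_c = policy_entropy(uniform), policy_entropy(collapsed)
```

在RL训练日志中周期性记录这一数值,即可在熵过早逼近0时触发诸如熵正则之类的干预。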