
[Paper Digest] 2025 Week 25 (Jun 15-21) (Robotics / Embodied AI / LLM)


Contents

  • MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
  • MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
  • Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
  • DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
  • Sekai: A Video Dataset towards World Exploration
  • Scaling Test-time Compute for LLM Agents
  • Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback
  • CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following
  • Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency
  • DoTA-RAG: Dynamic of Thought Aggregation RAG
  • Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
  • Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression
  • LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
  • Essential-Web v1.0: 24T tokens of organized web data
  • Discrete Diffusion in Large Language and Multimodal Models: A Survey
  • Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
  • Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
  • Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team
  • GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
  • Effective Policy-Adherent Agents
  • The Diffusion Duality
  • All is Not Lost: LLM Recovery without Checkpoints
  • ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs
  • Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
  • TaskCraft: Automated Generation of Agentic Tasks

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

  • Title: MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
  • Authors: MiniMax, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Zhu, Jian Sun, Jiaqi Zhuang, Jiaren Cai, Jiayuan Song, Jin Zhu, Jingyang Li, Jinhao Tian, Jinli Liu, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kaiyi Feng, Ke Yang, Kecheng Xiao, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Li, Lin Zheng, Linge Du, Lingyu Yang, Lunbin Zeng, Minghui Yu, Mingliang Tao, Mingyuan Chi, Mozhi Zhang, Mujie Lin, Nan Hu, Nongyu Di, Peng Gao, Pengfei Li, Pengyu Zhao, Qibing Ren, Qidi Xu, Qile Li, Qin Wang, Rong Tian, Ruitao Leng, Shaoxiang Chen, Shaoyu Chen, Shengmin Shi, Shitong Weng, Shuchang Guan, Shuqi Yu, Sichen Li, Songquan Zhu, Tengfei Li, Tianchi Cai, Tianrun Liang, Weiyu Cheng, Weize Kong, Wenkai Li, Xiancai Chen, Xiangjun Song, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xinzhu Hou, Xuan Lu, Xun Zou, Xuyang Shen, Yan Gong, Yan Ma, Yang Wang, Yiqi Shi, Yiran Zhong, Yonghong Duan, Yongxiang Fu, Yongyi Hu, Yu Gao, Yuanxiang Fan, Yufeng Yang, Yuhao Li, Yulin Hu, Yunan Huang, Yunji Li, Yunzhi Xu, Yuxin Mao, Yuxuan Shi, Yuze Wenren, Zehan Li, Zelin Li, Zhanxu Tian, Zhengmao Zhu, Zhenhua Fan, Zhenzhen Wu, Zhichao Xu, Zhihang Yu, Zhiheng Lyu, Zhuo Jiang, Zibo Gao, Zijia Wu, Zijian Song, Zijun Sun
  • Date: 2025-06-16
  • arXiv page: https://arxiv.org/abs/2506.13585
  • Paper link: https://arxiv.org/pdf/2506.13585
  • Project link: https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
  • GitHub repo: https://github.com/MiniMax-AI/MiniMax-M1

Abstract

We introduce MiniMax-M1, the world’s first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1’s inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1’s full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.
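CISPO, as described above, clips the importance-sampling weight itself rather than the token update, so no token's gradient is zeroed outright. A minimal NumPy sketch of the weight computation (the clipping thresholds here are illustrative, and in the actual algorithm the clipped weight acts as a constant, i.e. stop-gradient, multiplier on the policy-gradient term):

```python
import numpy as np

def cispo_token_weights(logp_new, logp_old, eps_low=0.5, eps_high=2.0):
    """Token-level importance-sampling weights, CISPO-style.

    PPO-style methods clip the per-token *update*, which can zero out
    gradients for rare but important tokens; CISPO instead clips the
    IS ratio r_t = pi_new / pi_old and keeps every token's gradient,
    just with a bounded weight. Thresholds are illustrative.
    """
    r = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return np.clip(r, eps_low, eps_high)

# Tokens whose ratio explodes are bounded, not dropped.
w = cispo_token_weights([0.0, -0.1, 2.0], [0.0, -0.3, -1.0])
```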


MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation

  • Title: MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
  • Authors: Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Jeff Zhao, Huan He, Yi Han, Yun Feng, Yuechen Jiang, Yupeng Cao, Haohang Li, Yangyang Yu, Xiaoyu Wang, Penglei Gao, Shengyuan Lin, Keyi Wang, Shanshan Yang, Yilun Zhao, Zhiwei Liu, Peng Lu, Jerry Huang, Suyuchen Wang, Triantafillos Papadopoulos, Polydoros Giannouris, Efstathia Soufleri, Nuo Chen, Guojun Xiong, Zhiyang Deng, Yijia Zhao, Mingquan Lin, Meikang Qiu, Kaleb E Smith, Arman Cohan, Xiao-Yang Liu, Jimin Huang, Alejandro Lopez-Lira, Xi Chen, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou, Qianqian Xie
  • Date: 2025-06-16
  • arXiv page: https://arxiv.org/abs/2506.14028
  • Paper link: https://arxiv.org/pdf/2506.14028

Abstract

Recent advances in large language models (LLMs) have accelerated progress in financial NLP and applications, yet existing benchmarks remain limited to monolingual and unimodal settings, often over-relying on simple tasks and failing to reflect the complexity of real-world financial communication. We introduce MultiFinBen, the first multilingual and multimodal benchmark tailored to the global financial domain, evaluating LLMs across modalities (text, vision, audio) and linguistic settings (monolingual, bilingual, multilingual) on domain-specific tasks. We introduce two novel tasks, including PolyFiQA-Easy and PolyFiQA-Expert, the first multilingual financial benchmarks requiring models to perform complex reasoning over mixed-language inputs; and EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks challenging models to extract and reason over information from visual-text financial documents. Moreover, we propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark rather than simply aggregating existing datasets. Extensive evaluation of 22 state-of-the-art models reveals that even the strongest models, despite their general multimodal and multilingual capabilities, struggle dramatically when faced with complex cross-lingual and multimodal tasks in the financial domain. MultiFinBen is publicly released to foster transparent, reproducible, and inclusive progress in financial studies and applications.
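One plausible reading of the "dynamic, difficulty-aware selection mechanism" is bucketing candidate items by how often current models solve them, then sampling each bucket evenly to get a compact, balanced set; the thresholds and quota below are my assumptions, not the paper's actual procedure:

```python
def difficulty_aware_select(items, per_bucket):
    """Bucket benchmark items by average model accuracy, then take an
    equal quota from each bucket.

    `items` is a list of (item_id, mean_accuracy) pairs; the accuracy
    thresholds are illustrative.
    """
    buckets = {"easy": [], "medium": [], "hard": []}
    for item_id, acc in items:
        if acc >= 0.75:
            buckets["easy"].append(item_id)
        elif acc >= 0.35:
            buckets["medium"].append(item_id)
        else:
            buckets["hard"].append(item_id)
    return {name: ids[:per_bucket] for name, ids in buckets.items()}

pool = [("q1", 0.9), ("q2", 0.5), ("q3", 0.1), ("q4", 0.8), ("q5", 0.2)]
selection = difficulty_aware_select(pool, per_bucket=1)
```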


Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

  • Title: Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
  • Authors: Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai
  • Date: 2025-06-12
  • arXiv page: https://arxiv.org/abs/2506.10521
  • Paper link: https://arxiv.org/pdf/2506.10521

Abstract

Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.


DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

  • Title: DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
  • Authors: Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao
  • Date: 2025-06-13
  • arXiv page: https://arxiv.org/abs/2506.11763
  • Paper link: https://arxiv.org/pdf/2506.11763
  • Project link: https://deepresearch-bench.github.io
  • GitHub repo: https://github.com/Ayanami0730/deep_research_bench

Abstract

Deep Research Agents are a prominent category of LLM-based agents. By autonomously orchestrating multistep web exploration, targeted retrieval, and higher-order synthesis, they transform vast amounts of online information into analyst-grade, citation-rich reports–compressing hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. Evaluating DRAs is inherently complex and labor-intensive. We therefore propose two novel methodologies that achieve strong alignment with human judgment. The first is a reference-based method with adaptive criteria to assess the quality of generated research reports. The other framework is introduced to evaluate DRA’s information retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. We have open-sourced DeepResearch Bench and key components of these frameworks at https://github.com/Ayanami0730/deep_research_bench to accelerate the development of practical LLM-based agents.
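The second evaluation framework above scores retrieval via an effective citation count and overall citation accuracy. A minimal sketch, under the assumption that each citation carries a judged `supports` flag (the record format is hypothetical, not the benchmark's actual schema):

```python
def citation_metrics(citations):
    """Effective citations = citations whose source actually backs the
    statement they are attached to (as judged by a human or LLM judge);
    accuracy = their fraction of all citations in the report.

    `citations` is a list of dicts like {"url": ..., "supports": bool}.
    """
    effective = sum(1 for c in citations if c["supports"])
    accuracy = effective / len(citations) if citations else 0.0
    return effective, accuracy

cites = [{"url": "a", "supports": True},
         {"url": "b", "supports": True},
         {"url": "c", "supports": False}]
eff, acc = citation_metrics(cites)
```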


Sekai: A Video Dataset towards World Exploration

  • Title: Sekai: A Video Dataset towards World Exploration
  • Authors: Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, Zizhen Li, Fanrui Zhang, Jiaxin Ai, Zhixiang Wang, Yuwei Wu, Tong He, Jiangmiao Pang, Yu Qiao, Yunde Jia, Kaipeng Zhang
  • Date: 2025-06-18
  • arXiv page: https://arxiv.org/abs/2506.15675
  • Paper link: https://arxiv.org/pdf/2506.15675
  • Project link: https://lixsp11.github.io/sekai-project/
  • GitHub repo: https://github.com/Lixsp11/sekai-codebase

Abstract

Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone-view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.


Scaling Test-time Compute for LLM Agents

  • Title: Scaling Test-time Compute for LLM Agents
  • Authors: King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, Changwang Zhang, Chenghua Lin, Jun Wang, Ge Zhang, Wangchunshu Zhou
  • Date: 2025-06-15
  • arXiv page: https://arxiv.org/abs/2506.12928
  • Paper link: https://arxiv.org/pdf/2506.12928
  • GitHub repo: https://github.com/tmgthb/Autonomous-Agents

Abstract

Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents and investigate the extent to which it improves their effectiveness. Specifically, we explore different test-time scaling strategies, including: (1) parallel sampling algorithms; (2) sequential revision strategies; (3) verifiers and merging methods; (4) strategies for diversifying rollouts. We carefully analyze and ablate the impact of different design strategies on applying test-time scaling to language agents, and have the following findings: 1. Scaling test-time compute can improve the performance of agents. 2. Knowing when to reflect is important for agents. 3. Among different verification and result-merging approaches, the list-wise method performs best. 4. Increasing diversified rollouts exerts a positive effect on the agent’s task performance.
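The best-performing combination reported above, parallel sampling merged by a list-wise verifier, reduces to a best-of-n loop; `generate` and `score_all` below are toy stand-ins for the agent and the verifier model:

```python
import random

def best_of_n(generate, score_all, n, seed=0):
    """Parallel test-time scaling: sample n rollouts, then let a
    list-wise verifier score the whole candidate list at once and
    keep the argmax."""
    rng = random.Random(seed)
    rollouts = [generate(rng) for _ in range(n)]
    scores = score_all(rollouts)  # list-wise: sees all candidates together
    best = max(range(n), key=lambda i: scores[i])
    return rollouts[best]

# Toy stand-ins: answers closer to 10 score higher.
ans = best_of_n(lambda rng: rng.randint(0, 10),
                lambda rs: [-abs(r - 10) for r in rs], n=8)
```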


Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback

  • Title: Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback
  • Authors: Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, Daniel Khashabi
  • Date: 2025-06-13
  • arXiv page: https://arxiv.org/abs/2506.11930
  • Paper link: https://arxiv.org/pdf/2506.11930
  • GitHub repo: https://github.com/JHU-CLSP/Feedback-Friction

Abstract

Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and change their incorrect answers to correct ones. In this paper, we systematically investigate LLMs’ ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 (with and without extended thinking). Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We also perform a rigorous exploration of potential causes of FEEDBACK FRICTION, ruling out factors such as model overconfidence and data familiarity. We hope that highlighting this issue in LLMs and ruling out several apparent causes will help future research in self-improvement.
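The experimental pipeline and the two mitigation strategies (progressive temperature increases, explicit rejection of previously attempted wrong answers) can be sketched as a loop; `solve` and `feedback` below are stand-ins for the solver model and the near-ground-truth feedback generator, and the temperature schedule is illustrative:

```python
def feedback_loop(solve, feedback, max_rounds=5, t0=0.3, t_step=0.2):
    """Solver/feedback loop with the paper's two mitigations: temperature
    rises each round and earlier wrong answers are explicitly banned."""
    banned, fb, temp = set(), None, t0
    for _ in range(max_rounds):
        answer = solve(fb, temp, banned)
        ok, fb = feedback(answer)  # targeted, near-ground-truth feedback
        if ok:
            return answer
        banned.add(answer)         # reject previously attempted answers
        temp += t_step             # progressive temperature increase
    return None

# Toy stand-ins: the solver proposes the first guess it hasn't banned.
target = 7
solver = lambda fb, temp, banned: next(g for g in [3, 5, 7] if g not in banned)
judge = lambda a: (a == target, None if a == target else f"{a} is wrong")
result = feedback_loop(solver, judge)
```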


CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following

  • Title: CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following
  • Authors: Yinghao Ma, Siyou Li, Juntao Yu, Emmanouil Benetos, Akira Maezawa
  • Date: 2025-06-14
  • arXiv page: https://arxiv.org/abs/2506.12285
  • Paper link: https://arxiv.org/pdf/2506.12285
  • GitHub repo: https://github.com/nicolaus625/CMI-bench

Abstract

Recent advances in audio-text large language models (LLMs) have opened new possibilities for music understanding and generation. However, existing benchmarks are limited in scope, often relying on simplified tasks or multi-choice evaluations that fail to reflect the complexity of real-world music analysis. We reinterpret a broad range of traditional MIR annotations as instruction-following formats and introduce CMI-Bench, a comprehensive music instruction following benchmark designed to evaluate audio-text LLMs on a diverse set of music information retrieval (MIR) tasks. These include genre classification, emotion regression, emotion tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking: reflecting core challenges in MIR research. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models, ensuring direct comparability with supervised approaches. We provide an evaluation toolkit supporting all open-source audio-textual LLMs, including LTU, Qwen-audio, SALMONN, MusiLingo, etc. Experiment results reveal significant performance gaps between LLMs and supervised models, along with their culture, chronological and gender bias, highlighting the potential and limitations of current models in addressing MIR tasks. CMI-Bench establishes a unified foundation for evaluating music instruction following, driving progress in music-aware LLMs.


Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency

  • Title: Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency
  • Authors: Chenlong Wang, Yuanning Feng, Dongping Chen, Zhaoyang Chu, Ranjay Krishna, Tianyi Zhou
  • Date: 2025-06-10
  • arXiv page: https://arxiv.org/abs/2506.08343
  • Paper link: https://arxiv.org/pdf/2506.08343

Abstract

Recent advances in large reasoning models have enabled complex, step-by-step reasoning but often introduce significant overthinking, resulting in verbose and redundant outputs that hinder efficiency. In this study, we examine whether explicit self-reflection, signaled by tokens such as “Wait” and “Hmm”, is necessary for advanced reasoning. We propose NoWait, a simple yet effective approach that disables explicit self-reflection by suppressing these tokens during inference. Extensive experiments on ten benchmarks across textual, visual, and video reasoning tasks show that NoWait reduces chain-of-thought trajectory length by up to 27%-51% in five R1-style model series, without compromising model utility. NoWait thus offers a plug-and-play solution for efficient and utility-preserving multimodal reasoning.
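Mechanically, suppressing self-reflection tokens amounts to masking their logits at each decoding step. A framework-agnostic sketch (the keyword list and toy vocabulary are illustrative; NoWait's exact token handling may differ):

```python
NOWAIT_KEYWORDS = {"Wait", "wait", "Hmm", "hmm"}  # illustrative subset

def suppress_tokens(logits, id_to_token):
    """Set the logits of reflection keywords to -inf so they can never
    be sampled; all other tokens are untouched."""
    return [float("-inf") if id_to_token[i] in NOWAIT_KEYWORDS else x
            for i, x in enumerate(logits)]

vocab = {0: "The", 1: "Wait", 2: "answer", 3: "Hmm"}
masked = suppress_tokens([1.0, 5.0, 2.0, 4.0], vocab)
```

In a real decoder this would be registered as a logits processor so the mask is applied at every generation step.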


DoTA-RAG: Dynamic of Thought Aggregation RAG

  • Title: DoTA-RAG: Dynamic of Thought Aggregation RAG
  • Authors: Saksorn Ruangtanusak, Natthapath Rungseesiripak, Peerawat Rojratchadakorn, Monthol Charattrakool, Natapong Nitarach
  • Date: 2025-06-14
  • arXiv page: https://arxiv.org/abs/2506.12571
  • Paper link: https://arxiv.org/pdf/2506.12571

Abstract

In this paper, we introduce DoTA-RAG (Dynamic-of-Thought Aggregation RAG), a retrieval-augmented generation system optimized for high-throughput, large-scale web knowledge indexes. Traditional RAG pipelines often suffer from high latency and limited accuracy over massive, diverse datasets. DoTA-RAG addresses these challenges with a three-stage pipeline: query rewriting, dynamic routing to specialized sub-indexes, and multi-stage retrieval and ranking. We further enhance retrieval by evaluating and selecting a superior embedding model, re-embedding the large FineWeb-10BT corpus. Moreover, we create a diverse Q&A dataset of 500 questions generated via the DataMorgana setup across a broad range of WebOrganizer topics and formats. DoTA-RAG improves the answer correctness score from 0.752 (baseline, using LiveRAG pre-built vector store) to 1.478 while maintaining low latency, and it achieves a 0.929 correctness score on the Live Challenge Day. These results highlight DoTA-RAG’s potential for practical deployment in domains requiring fast, reliable access to large and evolving knowledge sources.
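The three-stage pipeline (query rewriting, dynamic routing to a specialized sub-index, then retrieval plus reranking) can be sketched end to end; the router rule, sub-indexes, and word-overlap scoring below are placeholders, not DoTA-RAG's actual components:

```python
class ToyIndex:
    """Stand-in for a vector sub-index; scores by crude word overlap."""
    def __init__(self, docs):
        self.docs = docs
    def search(self, query, k):
        overlap = lambda d: len(set(query.split()) & set(d.split()))
        return sorted(self.docs, key=overlap, reverse=True)[:k]

def dota_rag_answer(query, rewrite, route, sub_indexes, k=2):
    q = rewrite(query)                    # stage 1: query rewriting
    index = sub_indexes[route(q)]         # stage 2: dynamic routing
    candidates = index.search(q, k * 10)  # stage 3a: cheap retrieval
    overlap = lambda d: len(set(q.split()) & set(d.split()))
    return sorted(candidates, key=overlap, reverse=True)[:k]  # 3b: rerank

indexes = {"finance": ToyIndex(["stock prices fall", "bond yields rise"]),
           "sports": ToyIndex(["the match ended in a draw"])}
docs = dota_rag_answer("Why did STOCK prices fall",
                       rewrite=str.lower,
                       route=lambda q: "finance" if "stock" in q else "sports",
                       sub_indexes=indexes)
```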


Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

  • Title: Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
  • Authors: Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
  • Date: 2025-06-17
  • arXiv page: https://arxiv.org/abs/2506.14965
  • Paper link: https://arxiv.org/pdf/2506.14965
  • Project link: https://guru-reasoning.github.io/
  • GitHub repo: https://github.com/LLM360/Reasoning360

Abstract

Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains–Math, Code, Science, Logic, Simulation, and Tabular–each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360
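The Pass@k metric cited above is usually computed with the standard unbiased estimator: from n samples of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k). A minimal implementation:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k draws
    (without replacement) from n samples with c correct is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist
    return 1.0 - comb(n - c, k) / comb(n, k)
```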


Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression

  • Title: Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression
  • Authors: Dingcheng Zhen, Qian Qiao, Tan Yu, Kangxi Wu, Ziwei Zhang, Siyuan Liu, Shunshun Yin, Ming Tao
  • Date: 2025-06-11
  • arXiv page: https://arxiv.org/abs/2506.09482
  • Paper link: https://arxiv.org/pdf/2506.09482
  • GitHub repo: https://github.com/TransDiff/TransDiff

Abstract

We introduce TransDiff, the first image generation model that marries Autoregressive (AR) Transformer with diffusion models. In this joint modeling framework, TransDiff encodes labels and images into high-level semantic features and employs a diffusion model to estimate the distribution of image samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms other image generation models based on standalone AR Transformer or diffusion models. Specifically, TransDiff achieves a Fréchet Inception Distance (FID) of 1.61 and an Inception Score (IS) of 293.4, and further provides x2 faster inference latency compared to state-of-the-art methods based on AR Transformer and x112 faster inference compared to diffusion-only models. Furthermore, building on the TransDiff model, we introduce a novel image generation paradigm called Multi-Reference Autoregression (MRAR), which performs autoregressive generation by predicting the next image. MRAR enables the model to reference multiple previously generated images, thereby facilitating the learning of more diverse representations and improving the quality of generated images in subsequent iterations. By applying MRAR, the performance of TransDiff is improved, with the FID reduced from 1.61 to 1.42. We expect TransDiff to open up a new frontier in the field of image generation.


LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs

  • Title: LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
  • Authors: Xiaoran Liu, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
  • Date: 2025-06-17
  • arXiv page: https://arxiv.org/abs/2506.14429
  • Paper link: https://arxiv.org/pdf/2506.14429
  • GitHub repo: https://github.com/OpenMOSS/LongLLaDA

Abstract

Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a unique characteristic of diffusion LLMs, unlike auto-regressive LLMs, they maintain remarkably stable perplexity during direct context extrapolation. Furthermore, where auto-regressive models fail outright during the Needle-In-A-Haystack task with context exceeding their pretrained length, we discover diffusion LLMs exhibit a distinct local perception phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. Consequently, this study establishes the first context extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs.
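NTK-based RoPE extrapolation, which LongLLaDA builds on, rescales the rotary base so that low frequencies are stretched more than high ones; a common formulation multiplies the base by s^(d/(d-2)) for scale factor s and head dimension d (exact constants vary across implementations, so treat this as a sketch rather than the paper's code):

```python
def rope_inv_freq(dim, base=10000.0, ntk_scale=1.0):
    """RoPE inverse frequencies; with ntk_scale > 1 the rotary base is
    inflated by scale**(dim/(dim-2)), the usual NTK-aware trick for
    stretching the context window without retraining."""
    if ntk_scale > 1.0:
        base = base * ntk_scale ** (dim / (dim - 2))
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

orig = rope_inv_freq(64)                      # pretrained frequencies
stretched = rope_inv_freq(64, ntk_scale=4.0)  # 4x context extension
```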

中文摘要

大型语言扩散模型(即扩散LLM)已成为NLP研究的重要焦点,大量工作致力于理解其可扩展性和下游任务性能。然而,它们的长上下文能力仍未被探索,缺乏系统分析或上下文扩展方法。在这项工作中,我们提出了首个系统性研究,比较扩散LLM与传统自回归LLM的长上下文性能。我们首先发现了扩散LLM的一个独特特性:与自回归LLM不同,它们在直接上下文外推时保持显著稳定的困惑度。此外,在上下文超过预训练长度的“大海捞针”(Needle-In-A-Haystack)任务中,自回归模型会彻底失败,而我们发现扩散LLM表现出独特的“局部感知”现象,能够成功地从最近的上下文片段中检索信息。我们通过旋转位置编码(RoPE)缩放理论解释了这两种现象。基于这些观察,我们提出了LongLLaDA,一种将LLaDA与基于NTK的RoPE外推相结合的免训练方法。我们的结果验证了已有的外推缩放定律对扩展扩散LLM的上下文窗口仍然有效。此外,我们还识别出扩散LLM优于自回归LLM的长上下文任务,以及其表现不足的任务。因此,这项研究为扩散LLM建立了首个上下文外推方法,同时提供了对推进长上下文扩散LLM未来研究至关重要的理论见解和经验基准。
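摘要中提到的基于NTK的RoPE外推,其常见做法是放大RoPE的base,使最低频率按上下文扩展比例拉伸,而高频分量几乎不变。下面是基于该通用公式的最小示意(并非LongLLaDA的官方实现;`head_dim`、`scale`等均为示例取值):

```python
import math

def rope_inv_freq(head_dim, base=10000.0):
    # Standard RoPE inverse frequencies: theta_i = base^(-2i/d)
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def ntk_scaled_base(base, scale, head_dim):
    # NTK-aware scaling: enlarge the base so the lowest frequency is
    # stretched by `scale` while the highest frequencies stay nearly intact.
    return base * scale ** (head_dim / (head_dim - 2))

head_dim = 128
scale = 4  # e.g. extend a 4k pretrained window toward ~16k
orig = rope_inv_freq(head_dim)
scaled = rope_inv_freq(head_dim, ntk_scaled_base(10000.0, scale, head_dim))
# Highest frequency (i=0) is unchanged; lowest frequency is divided by `scale`.
```

按此构造,`scaled[0] == orig[0]`,而 `scaled[-1] ≈ orig[-1] / scale`,即只拉伸低频、保留高频,这正是NTK式外推区别于简单位置插值之处。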


Essential-Web v1.0:24T token的有组织Web数据

  • 标题: Essential-Web v1.0: 24T tokens of organized web data

  • 作者: Essential AI, Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, Ashish Vaswani

  • 日期: 2025-06-17

  • ArXiv主页: https://arxiv.org/abs/2506.14111

  • 论文链接: https://arxiv.org/pdf/2506.14111

  • gitHub仓库: https://github.com/Essential-AI/eai-taxonomy

英文摘要

Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0

中文摘要

数据在语言模型获取技能和知识的过程中起着最突出的作用。缺乏大规模、组织良好的预训练数据集导致数据管道昂贵且难以获取。我们提出Essential-Web v1.0,一个24万亿token的数据集,其中每个文档都用涵盖主题、格式、内容复杂度和质量的十二类分类体系进行标注。分类标签由EAI-Distill-0.5b生成,这是一个经过微调的0.5B参数模型,其标注一致性与Qwen2.5-32B-Instruct的差距在3%以内。仅使用SQL风格的过滤器,我们就在数学(相对SOTA -8.0%)、Web代码(+14.3%)、STEM(+24.5%)和医疗(+8.6%)领域获得了有竞争力的网络精选数据集。Essential-Web v1.0可在HuggingFace上获取:https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
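摘要称仅用SQL风格的过滤器,就能从带分类标签的文档中筛出领域子集。下面用纯Python做一个示意(`topic`、`quality`等字段名为假设,并非Essential-Web的实际schema):

```python
# Hypothetical documents carrying taxonomy labels; the real Essential-Web
# field names and label values may differ.
docs = [
    {"text": "proof of the triangle inequality ...", "topic": "math", "quality": 4},
    {"text": "celebrity gossip ...",                 "topic": "news", "quality": 2},
    {"text": "2+2=5 lol ...",                        "topic": "math", "quality": 1},
]

# SQL-style filter:
#   SELECT * FROM docs WHERE topic = 'math' AND quality >= 3
math_subset = [d for d in docs if d["topic"] == "math" and d["quality"] >= 3]
```

实际使用时,同样的谓词可以直接写成对HuggingFace数据集的过滤条件;关键点在于领域筛选只依赖预先标注好的分类标签,而无需再训练分类器。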


大语言模型与多模态模型中的离散扩散:综述

  • 标题: Discrete Diffusion in Large Language and Multimodal Models: A Survey

  • 作者: Runpeng Yu, Qi Li, Xinchao Wang

  • 日期: 2025-06-16

  • ArXiv主页: https://arxiv.org/abs/2506.13759

  • 论文链接: https://arxiv.org/pdf/2506.13759

  • gitHub仓库: https://github.com/LiQiiiii/DLLM-Survey

英文摘要

In this work, we provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs). Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel decoding paradigm using full attention and a denoising-based generation strategy. This paradigm naturally enables parallel generation, fine-grained output controllability, and dynamic, response-aware perception. These capabilities are previously difficult to achieve with AR models. Recently, a growing number of industrial-scale proprietary d(M)LLMs, as well as a large number of open-source academic d(M)LLMs, have demonstrated performance comparable to their autoregressive counterparts, while achieving up to 10x acceleration in inference speed. The advancement of discrete diffusion LLMs and MLLMs has been largely driven by progress in two domains. The first is the development of autoregressive LLMs and MLLMs, which has accumulated vast amounts of data, benchmarks, and foundational infrastructure for training and inference. The second contributing domain is the evolution of the mathematical models underlying discrete diffusion. Together, these advancements have catalyzed a surge in dLLMs and dMLLMs research in early 2025. In this work, we present a comprehensive overview of the research in the dLLM and dMLLM domains. We trace the historical development of dLLMs and dMLLMs, formalize the underlying mathematical frameworks, and categorize representative models. We further analyze key techniques for training and inference, and summarize emerging applications across language, vision-language, and biological domains. We conclude by discussing future directions for research and deployment. Paper collection: https://github.com/LiQiiiii/DLLM-Survey

中文摘要

在这项工作中,我们对离散扩散语言模型(dLLM)和离散扩散多模态语言模型(dMLLM)进行了系统综述。与自回归(AR)模型不同,dLLM和dMLLM采用多token并行解码范式,使用全注意力和基于去噪的生成策略。该范式天然支持并行生成、细粒度输出可控性以及动态、响应感知的感知能力。这些能力此前用AR模型很难实现。最近,越来越多的工业级专有d(M)LLM以及大量开源学术d(M)LLM已展现出与自回归模型相当的性能,同时推理速度最高可加速10倍。离散扩散LLM和MLLM的进步主要由两个领域的进展驱动。其一是自回归LLM和MLLM的发展,为训练和推理积累了海量数据、基准和基础设施。其二是离散扩散底层数学模型的演进。这些进展共同促成了2025年初dLLM和dMLLM研究的激增。在这项工作中,我们全面概述了dLLM和dMLLM领域的研究。我们追溯dLLM和dMLLM的历史发展,形式化其底层数学框架,并对代表性模型进行分类。我们进一步分析训练和推理的关键技术,并总结其在语言、视觉-语言和生物领域的新兴应用。最后,我们讨论了未来的研究和部署方向。论文合集:https://github.com/LiQiiiii/DLLM-Survey


Ego-R1:超长第一人称视频推理的工具链思维

  • 标题: Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
  • 作者: Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu
  • 日期: 2025-06-16
  • ArXiv主页: https://arxiv.org/abs/2506.13654
  • 论文链接: https://arxiv.org/pdf/2506.13654
  • 项目链接: https://egolife-ai.github.io/Ego-R1/
  • gitHub仓库: https://github.com/egolife-ai/Ego-R1

英文摘要

We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., in days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions tackling such tasks as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised finetuning (SFT) of a pretrained language model using CoTT data and RL to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from few hours to a week.

中文摘要

我们介绍Ego-R1,一个用于对超长(即跨越数天乃至数周)第一人称视频进行推理的新颖框架,它利用结构化的工具链思维(Chain-of-Tool-Thought, CoTT)过程,由通过强化学习(RL)训练的Ego-R1智能体来编排。受人类解决问题策略的启发,CoTT将复杂推理分解为模块化步骤,RL智能体每步调用一个特定工具,迭代且协作地回答子问题,以处理时间检索和多模态理解等任务。我们设计了一个两阶段训练范式:先用CoTT数据对预训练语言模型进行监督微调(SFT),再进行RL,使智能体能够动态地提出用于长程推理的逐步工具调用。为支持训练,我们构建了名为Ego-R1 Data的数据集,由用于SFT的Ego-CoTT-25K和用于RL的Ego-QA-4.4K组成。此外,我们在新整理的长达一周的视频QA基准Ego-R1 Bench上评估Ego-R1智能体,该基准包含来自混合来源、经人工验证的QA对。大量结果表明,Ego-R1智能体动态的、工具增强的思维链推理能够有效应对理解超长第一人称视频的独特挑战,将时间覆盖范围从数小时显著扩展到一周。


具有可验证奖励的强化学习隐式激励基础LLM中的正确推理

  • 标题: Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
  • 作者: Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, Jiang Bian, Mao Yang
  • 日期: 2025-06-17
  • ArXiv主页: https://arxiv.org/abs/2506.14245
  • 论文链接: https://arxiv.org/pdf/2506.14245

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the Pass@K metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the Pass@K metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, CoT-Pass@K, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using CoT-Pass@K, we observe that RLVR can incentivize the generalization of correct reasoning for all values of K. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.

中文摘要

具有可验证奖励的强化学习(RLVR)已成为提升大语言模型(LLM)推理能力的有前景的范式。然而,一个关键悖论使其功效蒙上阴影:经RLVR调优的模型在求解的Pass@K指标上往往不如其基础模型,由此产生了RLVR只是以牺牲推理多样性为代价、对现有推理路径重新加权的假设。在这项工作中,我们通过找出问题根源来化解这一矛盾:Pass@K指标本身是一种有缺陷的推理度量,因为它会把可能源自不准确或不完整思维链(CoT)的正确最终答案也记为成功。为解决这一问题,我们引入了更精确的评估指标CoT-Pass@K,它要求推理路径和最终答案都正确。我们提供了一个新的理论基础,形式化地说明RLVR与传统RL不同,其独特结构能够激励逻辑完整性。我们的实证结果提供了支持:使用CoT-Pass@K,我们观察到RLVR能够激励对所有K值的正确推理泛化。此外,通过分析训练动态,我们发现这种增强的推理能力在训练过程早期就会出现并平滑泛化。我们的工作为RLVR的作用提供了清晰的视角,为其评估提供了更可靠的方法,并确认了其真正推进机器推理的潜力。
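Pass@K与CoT-Pass@K的区别可以用标准的无偏Pass@K估计量来说明:前者只统计最终答案正确的样本数c,后者要求思维链和答案同时正确。以下为示意实现(论文中的具体估计方式可能有所不同):

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased Pass@K estimator: probability that at least one of k
    # sampled solutions is correct, given c correct among n samples.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Each sample is (answer_correct, cot_correct). CoT-Pass@K credits a
# sample only when BOTH the final answer and the reasoning chain hold.
samples = [(True, True), (True, False), (False, False), (True, True)]
n = len(samples)
c_answer = sum(a for a, _ in samples)     # answer-only correct: 3
c_cot = sum(a and r for a, r in samples)  # answer AND CoT correct: 2

k = 2
p = pass_at_k(n, c_answer, k)    # Pass@2
p_cot = pass_at_k(n, c_cot, k)   # CoT-Pass@2 (strictly not larger)
```

在这个例子里,一条“答案蒙对但推理错误”的样本会抬高Pass@K却不影响CoT-Pass@K,因此后者总是不大于前者,这正是论文用它揭示悖论的原因。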


Xolver:具有整体经验的多代理推理,就像奥林匹克团队一样

  • 标题: Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team
  • 作者: Md Tanzib Hosain, Salman Rahman, Md Kishor Morol, Md Rizwan Parvez
  • 日期: 2025-06-17
  • ArXiv主页: https://arxiv.org/abs/2506.14234
  • 论文链接: https://arxiv.org/pdf/2506.14234

英文摘要

Despite impressive progress on complex reasoning, current large language models (LLMs) typically operate in isolation - treating each problem as an independent attempt, without accumulating or integrating experiential knowledge. In contrast, expert problem solvers - such as Olympiad or programming contest teams - leverage a rich tapestry of experiences: absorbing mentorship from coaches, developing intuition from past problems, leveraging knowledge of tool usage and library functionality, adapting strategies based on the expertise and experiences of peers, continuously refining their reasoning through trial and error, and learning from other related problems even during competition. We introduce Xolver, a training-free multi-agent reasoning framework that equips a black-box LLM with a persistent, evolving memory of holistic experience. Xolver integrates diverse experience modalities, including external and self-retrieval, tool use, collaborative interactions, agent-driven evaluation, and iterative refinement. By learning from relevant strategies, code fragments, and abstract reasoning patterns at inference time, Xolver avoids generating solutions from scratch - marking a transition from isolated inference toward experience-aware language agents. Built on both open-weight and proprietary models, Xolver consistently outperforms specialized reasoning agents. Even with lightweight backbones (e.g., QWQ-32B), it often surpasses advanced models including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high. With o3-mini-high, it achieves new best results on GSM8K (98.1%), AIME’24 (94.4%), AIME’25 (93.7%), Math-500 (99.8%), and LiveCodeBench-V5 (91.6%) - highlighting holistic experience learning as a key step toward generalist agents capable of expert-level reasoning. Code and data are available at https://kagnlp.github.io/xolver.github.io/.

中文摘要

尽管在复杂推理上取得了令人印象深刻的进展,当前的大语言模型(LLM)通常孤立运作:将每个问题视为一次独立尝试,不积累或整合经验知识。相反,专家级问题求解者(例如奥林匹克竞赛或编程竞赛团队)会利用丰富的经验:吸收教练的指导,从过去的问题中发展直觉,利用工具使用和库功能的知识,根据同伴的专长和经验调整策略,通过试错不断完善推理,甚至在比赛期间也从其他相关问题中学习。我们介绍Xolver,一个免训练的多智能体推理框架,它为黑盒LLM配备了持久且不断演化的整体经验记忆。Xolver整合了多种经验模态,包括外部检索与自检索、工具使用、协作交互、智能体驱动的评估以及迭代精炼。通过在推理时学习相关策略、代码片段和抽象推理模式,Xolver避免了从零生成解决方案,标志着从孤立推理向经验感知语言智能体的转变。基于开放权重和专有模型构建的Xolver持续优于专门的推理智能体。即使使用轻量级骨干(如QWQ-32B),它也常常超越包括Qwen3-235B、Gemini 2.5 Pro、o3和o4-mini-high在内的先进模型。借助o3-mini-high,它在GSM8K(98.1%)、AIME'24(94.4%)、AIME'25(93.7%)、Math-500(99.8%)和LiveCodeBench-V5(91.6%)上取得了新的最佳结果,凸显了整体经验学习是迈向具备专家级推理能力的通才智能体的关键一步。代码和数据见 https://kagnlp.github.io/xolver.github.io/。


GenRecal:从大到小视觉语言模型的重新校准后生成

  • 标题: GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
  • 作者: Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
  • 日期: 2025-06-18
  • ArXiv主页: https://arxiv.org/abs/2506.15681
  • 论文链接: https://arxiv.org/pdf/2506.15681
  • 项目链接: https://byungkwanlee.github.io/GenRecal-page/

英文摘要

Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types-differing in vocabulary size, token splits, and token index ordering. To address this challenge of limitation to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs.

中文摘要

视觉语言模型(VLM)的最新进展利用大语言模型(LLM)实现了与GPT-4V等闭源系统相当的性能。然而,由于其巨大的计算需求,在现实场景(尤其是资源受限设备)中部署这些模型仍然具有挑战性。这激发了将大型VLM的知识蒸馏到更小、更高效模型的兴趣。一个关键挑战源于VLM架构的多样性:它们建立在不同的LLM之上,并采用不同的token类型,在词表大小、token切分和token索引顺序上各不相同。为了解决局限于特定VLM类型这一挑战,我们提出了重新校准后生成(GenRecal),一种新颖的、通用的VLM蒸馏框架。GenRecal包含一个重新校准器(Recalibrator),用于在异构VLM之间对齐并适配特征表示,从而实现跨不同类型VLM的有效知识迁移。通过在多个具有挑战性的基准上的大量实验,我们证明GenRecal显著提升了基线性能,最终超越了大规模开源和闭源VLM。


对策略遵循代理的有效红队测试

  • 标题: Effective Red-Teaming of Policy-Adherent Agents
  • 作者: Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby-Tavor
  • 日期: 2025-06-11
  • ArXiv主页: https://arxiv.org/abs/2506.09600
  • 论文链接: https://arxiv.org/pdf/2506.09600

英文摘要

Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercive. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent’s robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks

中文摘要

面向任务的基于LLM的智能体越来越多地应用于有严格策略的领域,例如退款资格或取消规则。挑战在于确保智能体始终遵守这些规则和策略,恰当地拒绝任何违反规则的请求,同时仍保持有益且自然的交互。这要求开发量身定制的设计和评估方法,以确保智能体对恶意用户行为的韧性。我们提出了一种新颖的威胁模型,聚焦于旨在利用策略遵循智能体谋取个人利益的对抗性用户。为此,我们提出CRAFT,一个多智能体红队系统,利用策略感知的说服策略在客服场景中攻破策略遵循智能体,其效果优于DAN提示、情感操纵和胁迫等传统越狱方法。在现有tau-bench基准的基础上,我们引入了tau-break,一个互补基准,旨在严格评估智能体对操纵性用户行为的鲁棒性。最后,我们评估了几种简单而有效的防御策略。尽管这些措施提供了一定保护,但仍有不足,凸显了需要更强大的、研究驱动的保护措施来保护策略遵循智能体免受对抗性攻击。


扩散对偶性

  • 标题: The Diffusion Duality
  • 作者: Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov
  • 日期: 2025-06-12
  • ArXiv主页: https://arxiv.org/abs/2506.10892
  • 论文链接: https://arxiv.org/pdf/2506.10892
  • 项目链接: https://s-sahoo.com/duo/
  • gitHub仓库: https://github.com/s-sahoo/duo

英文摘要

Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: http://s-sahoo.github.io/duo

中文摘要

均匀态(uniform-state)离散扩散模型因其固有的自我纠错能力,有望实现快速文本生成。然而,它们的表现通常不如自回归模型和掩码扩散模型。在这项工作中,我们通过一个关键洞察缩小了这一性能差距:均匀态扩散过程自然地源自一个底层的高斯扩散。我们的方法Duo从高斯扩散迁移了强大的技术,以同时改进训练和采样。首先,我们引入了由高斯过程引导的课程学习策略,通过降低方差使训练速度翻倍。经课程学习训练的模型在7个基准中的3个上以零样本困惑度超越了自回归模型。其次,我们提出离散一致性蒸馏,将一致性蒸馏从连续设置适配到离散设置。该算法将采样加速两个数量级,从而在扩散语言模型中解锁了少步生成。我们在项目页面提供代码和模型检查点:http://s-sahoo.github.io/duo


并非全盘皆失:无检查点的LLM恢复

  • 标题: All is Not Lost: LLM Recovery without Checkpoints
  • 作者: Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen
  • 日期: 2025-06-18
  • ArXiv主页: https://arxiv.org/abs/2506.15461
  • 论文链接: https://arxiv.org/pdf/2506.15461
  • 项目链接: https://www.gensyn.ai/articles/checkfree
  • gitHub仓库: https://github.com/gensyn-ai/CheckFree

英文摘要

Training LLMs on decentralized and wimpy computation nodes, e.g., multiple on-spot instances, lowers the training cost and enables model democratization. The inevitable challenge here is the churn of nodes due to failures and the operator’s scheduling policies, leading to losing a stage - a part of the model. The conventional approaches to recover from failures are to either use checkpointing, where periodically a copy of the entire model is sent to an additional storage, or redundant computation. These approaches yield significant communication and/or computation overhead even in non-failure cases and scale poorly in settings with large models. In this paper, we propose, CheckFree, an efficient recovery method where a failing stage is substituted by a weighted average of the closest neighboring stages. In contrast to the state of the art, CheckFree requires no additional computation or storage. However, because of the nature of averaging neighbouring stages, it can only recover failures of intermediate stages. We further extend our method to CheckFree+ with out-of-order pipeline execution to tolerate crashes of the first and last stages. Thanks to out-of-order pipelining, behaviour of those stages is mimicked by their neighboring ones, which allows CheckFree+ to recover them by simply copying the weights from the immediate neighbour. To be able to recover the (de)embedding layers, CheckFree+ copies those layers to the neighboring stages, which requires relatively small storage overhead. We extensively evaluate our method on LLaMa models of model sizes from 124M to 1.5B with varying failure frequencies. In the case of low and medium failure rates (5-10%), CheckFree and CheckFree+ outperform both checkpointing and redundant computation in terms of convergence in wall-clock time by over 12%. Both of our proposals can be run via our code available at: https://github.com/gensyn-ai/CheckFree.

中文摘要

在去中心化且性能较弱的计算节点(例如多个竞价实例)上训练LLM可以降低训练成本并促进模型民主化。这里不可避免的挑战是节点因故障和运营商调度策略而流失,导致丢失一个阶段(即模型的一部分)。从故障中恢复的常规方法是使用检查点(定期将整个模型的副本发送到额外存储)或冗余计算。这些方法即使在无故障情况下也会产生显著的通信和/或计算开销,并且在大模型设置下扩展性很差。在本文中,我们提出CheckFree,一种高效的恢复方法,用最近相邻阶段的加权平均来替代故障阶段。与现有最先进方法相比,CheckFree不需要额外的计算或存储。然而,由于对相邻阶段取平均的性质,它只能恢复中间阶段的故障。我们进一步将方法扩展为CheckFree+,通过乱序流水线执行来容忍第一个和最后一个阶段的崩溃。得益于乱序流水线,这些阶段的行为会被其相邻阶段模仿,这使得CheckFree+只需从直接邻居复制权重即可恢复它们。为了能够恢复(解)嵌入层,CheckFree+将这些层复制到相邻阶段,这只需要相对较小的存储开销。我们在模型规模从124M到1.5B的LLaMa模型上、在不同故障频率下对该方法进行了广泛评估。在低和中等故障率(5-10%)的情况下,CheckFree和CheckFree+在墙钟时间收敛方面比检查点和冗余计算快12%以上。我们的两个方案均可通过我们的代码运行:https://github.com/gensyn-ai/CheckFree。
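CheckFree的核心思想——用相邻阶段参数的加权平均替代丢失的阶段——可以用如下最小示意来说明(假设相邻阶段各层形状一致;权重方案仅作演示,并非论文的确切实现):

```python
def recover_stage(prev_stage, next_stage, w_prev=0.5, w_next=0.5):
    # CheckFree-style recovery sketch: rebuild a lost pipeline stage as a
    # weighted average of its neighbours' parameters, layer by layer.
    # Assumes neighbouring stages share identical layer names and shapes.
    assert prev_stage.keys() == next_stage.keys()
    return {
        name: [w_prev * a + w_next * b
               for a, b in zip(prev_stage[name], next_stage[name])]
        for name in prev_stage
    }

# Toy "stages" as dicts of flat parameter lists (real stages hold tensors).
stage2 = {"w": [1.0, 2.0], "b": [0.0]}
stage4 = {"w": [3.0, 4.0], "b": [1.0]}
stage3 = recover_stage(stage2, stage4)  # lost stage 3, reconstructed
```

这也直观解释了摘要中的限制:首尾阶段只有单侧邻居可平均,因此需要CheckFree+的乱序流水线与邻居复制来补足。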


ProtoReasoning:原型作为LLM中可泛化推理的基础

  • 标题: ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs

  • 作者: Feng He, Zijun Chen, Xinnian Liang, Tingting Ma, Yunqi Qiu, Shuangzhi Wu, Junchi Yan

  • 日期: 2025-06-18

  • ArXiv主页: https://arxiv.org/abs/2506.15211

  • 论文链接: https://arxiv.org/pdf/2506.15211

  • gitHub仓库: https://github.com/codelion/optillm

英文摘要

Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought (Long CoT) reasoning have demonstrated remarkable cross-domain generalization capabilities. However, the underlying mechanisms supporting such transfer remain poorly understood. We hypothesize that cross-domain generalization arises from shared abstract reasoning prototypes – fundamental reasoning patterns that capture the essence of problems across domains. These prototypes minimize the nuances of the representation, revealing that seemingly diverse tasks are grounded in shared reasoning structures.Based on this hypothesis, we propose ProtoReasoning, a framework that enhances the reasoning ability of LLMs by leveraging scalable and verifiable prototypical representations (Prolog for logical reasoning, PDDL for planning).ProtoReasoning features: (1) an automated prototype construction pipeline that transforms problems into corresponding prototype representations; (2) a comprehensive verification system providing reliable feedback through Prolog/PDDL interpreters; (3) the scalability to synthesize problems arbitrarily within prototype space while ensuring correctness. Extensive experiments show that ProtoReasoning achieves 4.7% improvement over baseline models on logical reasoning (Enigmata-Eval), 6.3% improvement on planning tasks, 4.0% improvement on general reasoning (MMLU) and 1.0% on mathematics (AIME24). Significantly, our ablation studies confirm that learning in prototype space also demonstrates enhanced generalization to structurally similar problems compared to training solely on natural language representations, validating our hypothesis that reasoning prototypes serve as the foundation for generalizable reasoning in large language models.

中文摘要

近期以长思维链(Long CoT)推理训练的大型推理模型(LRM)展现出显著的跨域泛化能力。然而,支撑这种迁移的底层机制仍然知之甚少。我们假设跨域泛化源于共享的抽象推理原型——捕捉跨领域问题本质的基本推理模式。这些原型消除了表示层面的细微差异,揭示出看似多样的任务其实建立在共享的推理结构之上。基于这一假设,我们提出ProtoReasoning,一个通过利用可扩展且可验证的原型表示(用Prolog进行逻辑推理,用PDDL进行规划)来增强LLM推理能力的框架。ProtoReasoning的特点包括:(1)一个自动化原型构建流水线,将问题转化为相应的原型表示;(2)一个通过Prolog/PDDL解释器提供可靠反馈的综合验证系统;(3)可在原型空间内任意合成问题并确保正确性的可扩展性。大量实验表明,ProtoReasoning在逻辑推理(Enigmata-Eval)上比基线模型提升4.7%,在规划任务上提升6.3%,在通用推理(MMLU)上提升4.0%,在数学(AIME24)上提升1.0%。值得注意的是,我们的消融研究证实,与仅在自然语言表示上训练相比,在原型空间中学习也对结构相似的问题表现出更强的泛化能力,验证了我们的假设:推理原型是大语言模型中可泛化推理的基础。


通过跨模态注意力注入实现对齐的新视角图像与几何合成

  • 标题: Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
  • 作者: Min-Seop Kwak, Junho Kim, Sangdoo Yun, Dongyoon Han, Taekyoung Kim, Seungryong Kim, Jin-Hwa Kim
  • 日期: 2025-06-13
  • ArXiv主页: https://arxiv.org/abs/2506.11924
  • 论文链接: https://arxiv.org/pdf/2506.11924
  • 项目链接: https://cvlab-kaist.github.io/MoAI/
  • gitHub仓库: https://github.com/cvlab-kaist/MoAI

英文摘要

We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention distillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between point cloud and filtering erroneously predicted geometry from influencing the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. Project page is available at https://cvlab-kaist.github.io/MoAI.

中文摘要

我们介绍了一个基于扩散的框架,通过变形-修补(warping-and-inpainting)方法执行对齐的新视角图像与几何生成。与需要密集带位姿图像、或局限于域内视角的位姿嵌入生成模型等先前方法不同,我们的方法利用现成的几何预测器来预测从参考图像观察到的部分几何,并将新视角合成表述为对图像和几何的修补(inpainting)任务。为确保生成图像与几何之间的精确对齐,我们提出跨模态注意力蒸馏:在训练和推理期间,将图像扩散分支的注意力图注入并行的几何扩散分支。这种多任务方法实现了协同效应,既促进了几何上稳健的图像合成,也带来了清晰明确的几何预测。我们进一步引入基于邻近性的网格条件化来整合深度和法线线索,在点云之间插值,并过滤掉错误预测的几何以免影响生成过程。实证表明,我们的方法在一系列未见场景中对图像和几何均实现了高保真的外推式视角合成,在插值设置下提供了有竞争力的重建质量,并生成几何对齐的彩色点云以实现完整的3D补全。项目页面见 https://cvlab-kaist.github.io/MoAI。


TaskCraft:代理任务的自动生成

  • 标题: TaskCraft: Automated Generation of Agentic Tasks

  • 作者: Dingfeng Shi, Jingyi Cao, Qianben Chen, Weichen Sun, Weizhen Li, Hongxuan Lu, Fangchen Dong, Tianrui Qin, King Zhu, Minghao Yang, Jian Yang, Ge Zhang, Jiaheng Liu, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou

  • 日期: 2025-06-11

  • ArXiv主页: https://arxiv.org/abs/2506.10055

  • 论文链接: https://arxiv.org/pdf/2506.10055

  • gitHub仓库: https://github.com/OPPO-PersonalAI/TaskCraft

英文摘要

Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce TaskCraft, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future research on agent tuning and evaluation.

中文摘要

代理任务(agentic task)需要具备自主性、工具使用和自适应推理的多步问题求解,正日益成为NLP和AI发展的核心。然而,现有指令数据缺乏工具交互,当前的代理基准依赖昂贵的人工标注,限制了可扩展性。我们介绍TaskCraft,一个自动化工作流,用于生成难度可扩展、多工具、可验证且带有执行轨迹的代理任务。TaskCraft使用基于深度和基于宽度的扩展来扩充原子任务,以构建结构上和层次上都复杂的挑战。实证结果表明,这些任务改善了生成工作流中的提示优化,并增强了对代理基础模型的监督微调。我们提出了一个约36,000个任务、难度各异的大规模合成数据集,以支持未来关于智能体调优和评估的研究。
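摘要中“基于深度的扩展”大致是把一个原子任务的答案作为下一个任务的前提,从而链式加深任务。以下是一个纯示意的草图(并非TaskCraft的实际算法,字段名均为假设):

```python
def depth_extend(task_a, task_b):
    # Illustrative depth-based extension: chain two atomic tasks so that
    # answering task_b first requires solving task_a. Field names here
    # ("question", "steps") are hypothetical, not TaskCraft's schema.
    return {
        "question": (
            f"First, {task_a['question']} "
            f"Then use that result to answer: {task_b['question']}"
        ),
        # The execution trajectory grows with each composition step.
        "steps": task_a["steps"] + task_b["steps"],
    }

t1 = {"question": "find the paper's first author.", "steps": ["search"]}
t2 = {"question": "find that author's affiliation.", "steps": ["lookup"]}
t = depth_extend(t1, t2)
```

宽度扩展则可类比地理解为把多个原子任务并列合并成一个需要同时满足多项约束的问题;两者结合即可把简单任务自动放大为多步、多工具的代理任务。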

