
[Paper Digest] 2025, Week 18 (Apr 27 - May 3) (Robotics / Embodied AI / LLM)


Table of Contents

  • Towards Understanding Camera Motions in Any Video
  • Reinforcement Learning for Reasoning in Large Language Models with One Training Example
  • The Leaderboard Illusion
  • UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities
  • WebThinker: Empowering Large Reasoning Models with Deep Research Capability
  • Sadeed: Advancing Arabic Diacritization Through Small Language Model
  • Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
  • DeepCritic: Deliberate Critique with Large Language Models
  • ReasonIR: Training Retrievers for Reasoning Tasks
  • Phi-4-reasoning Technical Report
  • Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
  • BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
  • A Survey of Interactive Generative Video
  • T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
  • Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
  • 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models
  • Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
  • RepText: Rendering Visual Text via Replicating
  • COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
  • Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks
  • Clinical knowledge in LLMs does not translate to human interactions
  • Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
  • VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
  • LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
  • TesserAct: Learning 4D Embodied World Models

Towards Understanding Camera Motions in Any Video

  • Title: Towards Understanding Camera Motions in Any Video
  • Authors: Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan
  • Date: 2025-04-21
  • arXiv page: https://arxiv.org/abs/2504.15376
  • Paper link: https://arxiv.org/pdf/2504.15376
  • Project link: https://linzhiqiu.github.io/papers/camerabench/
  • GitHub repo: https://github.com/sy77777en/CameraBench

Abstract

We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like “follow” (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.



Reinforcement Learning for Reasoning in Large Language Models with One Training Example

  • Title: Reinforcement Learning for Reasoning in Large Language Models with One Training Example
  • Authors: Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen
  • Date: 2025-04-29
  • arXiv page: https://arxiv.org/abs/2504.20571
  • Paper link: https://arxiv.org/pdf/2504.20571
  • GitHub repo: https://github.com/ypwang61/One-Shot-RLVR

Abstract

We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (many of which yield approximately 30% or greater improvement on MATH500 when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the “grokking” phenomenon. We also show the critical role of promoting exploration (e.g., by adding entropy loss with an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe that applying entropy loss alone, without any outcome reward, significantly enhances Qwen2.5-Math-1.5B’s performance on MATH500 by 27.4%. These findings can inspire future work on RLVR data efficiency and encourage a re-examination of both recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR
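
Two ingredients the abstract calls out are a policy-gradient loss driven by a verifiable 0/1 reward and an entropy bonus that keeps exploration alive. The toy PyTorch sketch below only illustrates how such a loss can be assembled; the reward convention, coefficient value, and token-level averaging are illustrative assumptions, not the paper's exact GRPO/PPO setup.

```python
import torch
import torch.nn.functional as F

def rlvr_loss(logits, sampled_ids, verifiable_reward, entropy_coef=0.01):
    """Toy REINFORCE-style loss with a verifiable (0/1) reward and an entropy bonus.

    logits:            (seq_len, vocab) scores the policy assigned to its own sample
    sampled_ids:       (seq_len,) token ids that were actually sampled
    verifiable_reward: scalar, e.g. 1.0 if the final answer checks out, else 0.0
    """
    log_probs = F.log_softmax(logits, dim=-1)                     # (seq_len, vocab)
    token_logp = log_probs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)

    # Policy-gradient term: raise the likelihood of trajectories whose answer verified.
    pg_loss = -(verifiable_reward * token_logp).mean()

    # Entropy bonus: the paper reports that promoting exploration is critical,
    # and that an entropy term alone already lifts the base model.
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1).mean()

    return pg_loss - entropy_coef * entropy

# Dummy usage: random logits for a 12-token completion over a 50k vocabulary.
logits = torch.randn(12, 50_000, requires_grad=True)
sampled = torch.randint(0, 50_000, (12,))
loss = rlvr_loss(logits, sampled, verifiable_reward=1.0)
loss.backward()
print(float(loss))
```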



The Leaderboard Illusion

  • Title: The Leaderboard Illusion
  • Authors: Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D’Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker
  • Date: 2025-04-29
  • arXiv page: https://arxiv.org/abs/2504.20879
  • Paper link: https://arxiv.org/pdf/2504.20879
  • Project link: https://cohere.com/research/lmarena

Abstract

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena’s evaluation framework and promote fairer, more transparent benchmarking for the field



UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities

  • Title: UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities
  • Authors: Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang
  • Date: 2025-04-29
  • arXiv page: https://arxiv.org/abs/2504.20734
  • Paper link: https://arxiv.org/pdf/2504.20734
  • Project link: https://universalrag.github.io/
  • GitHub repo: https://github.com/wgcyeo/UniversalRAG

Abstract

Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single combined corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. Also, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over modality-specific and unified baselines.
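
The key control flow here is a router that first decides which modality- and granularity-specific corpus a query should hit, and only then retrieves within it. The sketch below shows that flow with a stubbed keyword router and stubbed retrievers; the corpus names and routing rule are placeholders for illustration, not the trained modality-aware router from the paper.

```python
from typing import Callable, Dict, List

# Placeholder retrievers, one per (modality, granularity) corpus.
CORPORA: Dict[str, Callable[[str, int], List[str]]] = {
    "text/paragraph": lambda q, k: [f"paragraph hit for '{q}'"] * k,
    "text/document":  lambda q, k: [f"document hit for '{q}'"] * k,
    "image":          lambda q, k: [f"image hit for '{q}'"] * k,
    "video/clip":     lambda q, k: [f"clip hit for '{q}'"] * k,
    "video/full":     lambda q, k: [f"full-video hit for '{q}'"] * k,
}

def route(query: str) -> str:
    """Stub router: in UniversalRAG this decision is made by a trained
    modality-aware model; a keyword rule stands in for it here."""
    q = query.lower()
    if "diagram" in q or "photo" in q:
        return "image"
    if "scene" in q or "clip" in q:
        return "video/clip"
    if "movie" in q or "lecture" in q:
        return "video/full"
    return "text/document" if "summarize" in q else "text/paragraph"

def universal_rag(query: str, k: int = 3) -> List[str]:
    corpus = route(query)                 # 1) pick the corpus
    return CORPORA[corpus](query, k)      # 2) retrieve only within it

print(universal_rag("Which scene shows the robot picking up the cup?"))
```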



WebThinker: Empowering Large Reasoning Models with Deep Research Capability

  • Title: WebThinker: Empowering Large Reasoning Models with Deep Research Capability
  • Authors: Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, Zhicheng Dou
  • Date: 2025-04-30
  • arXiv page: https://arxiv.org/abs/2504.21776
  • Paper link: https://arxiv.org/pdf/2504.21776
  • Project link: https://foremost-beechnut-8ed.notion.site/WebThinker-Empowering-Large-Reasoning-Models-with-Deep-Research-Capability-d13158a27d924a4b9df7f9ab94066b64
  • GitHub repo: https://github.com/RUC-NLPIR/WebThinker

Abstract

Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.



Sadeed: Advancing Arabic Diacritization Through Small Language Model

  • Title: Sadeed: Advancing Arabic Diacritization Through Small Language Model
  • Authors: Zeina Aldallal, Sara Chrouf, Khalil Hennara, Mohamed Motaism Hamed, Muhammad Hreden, Safwan AlModhayan
  • Date: 2025-04-30
  • arXiv page: https://arxiv.org/abs/2504.21635
  • Paper link: https://arxiv.org/pdf/2504.21635

Abstract

Arabic text diacritization remains a persistent challenge in natural language processing due to the language’s morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model adapted from Kuwain 1.5B Hennara et al. [2025], a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline. Despite utilizing modest computational resources, Sadeed achieves competitive results compared to proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.



Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

  • Title: Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
  • Authors: Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou
  • Date: 2025-04-23
  • arXiv page: https://arxiv.org/abs/2504.16656
  • Paper link: https://arxiv.org/pdf/2504.16656
  • GitHub repo: https://github.com/SkyworkAI/Skywork-R1V

Abstract

We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that harmonizes reward-model guidance with rule-based strategies, thereby addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization. To further enhance training efficiency, we propose the Selective Sample Buffer (SSB) mechanism, which effectively counters the ``Vanishing Advantages’’ dilemma inherent in Group Relative Policy Optimization (GRPO) by prioritizing high-value samples throughout the optimization process. Notably, we observe that excessive reinforcement signals can induce visual hallucinations–a phenomenon we systematically monitor and mitigate through calibrated reward thresholds throughout the training process. Empirical results affirm the exceptional capability of R1V2, with benchmark-leading performances such as 62.6 on OlympiadBench, 79.0 on AIME2024, 63.6 on LiveCodeBench, and 74.0 on MMMU. These results underscore R1V2’s superiority over existing open-source models and demonstrate significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI o4-mini. The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility https://huggingface.co/Skywork/Skywork-R1V2-38B.



DeepCritic: Deliberate Critique with Large Language Models

  • Title: DeepCritic: Deliberate Critique with Large Language Models
  • Authors: Wenkai Yang, Jingwen Chen, Yankai Lin, Ji-Rong Wen
  • Date: 2025-05-01
  • arXiv page: https://arxiv.org/abs/2505.00662
  • Paper link: https://arxiv.org/pdf/2505.00662
  • GitHub repo: https://github.com/RUCBM/DeepCritic

Abstract

As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy and struggling to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing on each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that includes multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.



ReasonIR: Training Retrievers for Reasoning Tasks

  • Title: ReasonIR: Training Retrievers for Reasoning Tasks
  • Authors: Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, Luke Zettlemoyer
  • Date: 2025-04-29
  • arXiv page: https://arxiv.org/abs/2504.20595
  • Paper link: https://arxiv.org/pdf/2504.20595

Abstract

We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, our pipeline creates a challenging and relevant query, along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 without reranker and 36.9 nDCG@10 with reranker on BRIGHT, a widely-used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries; it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can be easily extended to future LLMs; to this end, we open-source our code, data, and model.
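
The training data comes from a synthetic pipeline that, for each document, produces a reasoning-intensive query plus a plausibly related but unhelpful hard negative. A minimal sketch of that data-construction step is below; `generate` stands for whatever LLM call you plug in, and the prompt wording and record format are assumptions for illustration, not the paper's actual prompts.

```python
from typing import Callable, Dict

def make_training_example(document: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Build one (query, positive, hard_negative) record from a single document."""
    query = generate(
        "Write a challenging question that requires reasoning and whose answer "
        f"is supported by this document:\n\n{document}"
    )
    hard_negative = generate(
        "Write a passage that looks topically related to the question below but "
        f"does NOT actually help answer it:\n\nQuestion: {query}"
    )
    return {"query": query, "positive": document, "hard_negative": hard_negative}

# Toy generator so the sketch runs without any API; swap in a real LLM call.
fake_llm = lambda prompt: prompt.splitlines()[0][:60] + " ..."
print(make_training_example("Gradient clipping stabilizes training because ...", fake_llm))
```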



Phi-4-reasoning Technical Report

  • Title: Phi-4-reasoning Technical Report
  • Authors: Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safoora Yousefi, Guoqing Zheng
  • Date: 2025-04-30
  • arXiv page: https://arxiv.org/abs/2504.21318
  • Paper link: https://arxiv.org/pdf/2504.21318

Abstract

We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on carefully curated set of “teachable” prompts-selected for the right level of complexity and diversity-and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.



Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math

  • Title: Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
  • Authors: Haoran Xu, Baolin Peng, Hany Awadalla, Dongdong Chen, Yen-Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, Shuohang Wang, Weijian Xu, Jianfeng Gao, Weizhu Chen
  • Date: 2025-04-30
  • arXiv page: https://arxiv.org/abs/2504.21233
  • Paper link: https://arxiv.org/pdf/2504.21233

Abstract

Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs) by training them to explicitly generate intermediate reasoning steps. While LLMs readily benefit from such techniques, improving reasoning in Small Language Models (SLMs) remains challenging due to their limited model capacity. Recent work by Deepseek-R1 demonstrates that distillation from LLM-generated synthetic data can substantially improve the reasoning ability of SLM. However, the detailed modeling recipe is not disclosed. In this work, we present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward. We apply our method on Phi-4-Mini, a compact 3.8B-parameter model. The resulting Phi-4-Mini-Reasoning model exceeds, on math reasoning tasks, much larger reasoning models, e.g., outperforming DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500. Our results validate that a carefully designed training recipe, with large-scale high-quality CoT data, is effective to unlock strong reasoning capabilities even in resource-constrained small models.



BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs

  • Title: BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
  • Authors: Hongyu Wang, Shuming Ma, Furu Wei
  • Date: 2025-04-25
  • arXiv page: https://arxiv.org/abs/2504.18415
  • Paper link: https://arxiv.org/pdf/2504.18415

Abstract

Efficient deployment of 1-bit Large Language Models (LLMs) is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.
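
The core trick in H-BitLinear is to rotate activations with an online Hadamard transform, which spreads outlier energy across dimensions and makes the distribution more Gaussian-like, so a 4-bit quantizer loses less information. The NumPy sketch below illustrates that transform-then-quantize effect on a toy activation vector with one massive outlier; the symmetric absmax 4-bit quantizer and the power-of-two dimension are illustrative assumptions, not the paper's kernel.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H

def quantize_4bit(x: np.ndarray) -> np.ndarray:
    """Symmetric absmax quantization to the 4-bit grid [-8, 7], then dequantize."""
    scale = np.abs(x).max() / 7.0 + 1e-8
    return np.clip(np.round(x / scale), -8, 7) * scale

rng = np.random.default_rng(0)
d = 256
x = rng.normal(size=d)
x[3] = 40.0                                   # a massive-activation outlier

H = hadamard(d) / np.sqrt(d)                  # orthonormal, so it is easy to invert
plain_err = np.linalg.norm(quantize_4bit(x) - x)
rot_err = np.linalg.norm(H.T @ quantize_4bit(H @ x) - x)
print(f"4-bit error without Hadamard: {plain_err:.3f}, with Hadamard: {rot_err:.3f}")
```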



A Survey of Interactive Generative Video

  • Title: A Survey of Interactive Generative Video
  • Authors: Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Hao Chen, Xihui Liu
  • Date: 2025-04-30
  • arXiv page: https://arxiv.org/abs/2504.21853
  • Paper link: https://arxiv.org/pdf/2504.21853

Abstract

Interactive Generative Video (IGV) has emerged as a crucial technology in response to the growing demand for high-quality, interactive video content across various domains. In this paper, we define IGV as a technology that combines generative capabilities to produce diverse high-quality video content with interactive features that enable user engagement through control signals and responsive feedback. We survey the current landscape of IGV applications, focusing on three major domains: 1) gaming, where IGV enables infinite exploration in virtual worlds; 2) embodied AI, where IGV serves as a physics-aware environment synthesizer for training agents in multimodal interaction with dynamically evolving scenes; and 3) autonomous driving, where IGV provides closed-loop simulation capabilities for safety-critical testing and validation. To guide future development, we propose a comprehensive framework that decomposes an ideal IGV system into five essential modules: Generation, Control, Memory, Dynamics, and Intelligence. Furthermore, we systematically analyze the technical challenges and future directions in realizing each component for an ideal IGV system, such as achieving real-time generation, enabling open-domain control, maintaining long-term coherence, simulating accurate physics, and integrating causal reasoning. We believe that this systematic analysis will facilitate future research and development in the field of IGV, ultimately advancing the technology toward more sophisticated and practical applications.



T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

  • Title: T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
  • Authors: Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li
  • Date: 2025-05-01
  • arXiv page: https://arxiv.org/abs/2505.00703
  • Paper link: https://arxiv.org/pdf/2505.00703
  • GitHub repo: https://github.com/CaraJ7/T2I-R1

Abstract

Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1



Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

  • Title: Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
  • Authors: Zae Myung Kim, Chanwoo Park, Vipul Raheja, Dongyeop Kang
  • Date: 2025-04-28
  • arXiv page: https://arxiv.org/abs/2504.20157
  • Paper link: https://arxiv.org/pdf/2504.20157
  • GitHub repo: https://github.com/minnesotanlp/mpo

Abstract

Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model’s prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model’s prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes a more stable policy optimization, and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO’s meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.
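
What MPO adds on top of standard RLAIF is an outer loop in which a meta-reward model watches recent rollouts and rewrites the judge's rubric prompt so the reward signal keeps pace with the improving policy. The sketch below shows that outer loop with stub callables; the function names, refresh interval, and rubric wording are illustrative assumptions rather than the paper's implementation.

```python
from typing import Callable, List

def mpo_training_loop(
    policy_rollout: Callable[[], str],
    reward_model: Callable[[str, str], float],           # (rubric prompt, response) -> score
    meta_reward_model: Callable[[str, List[str]], str],   # (old rubric, recent rollouts) -> new rubric
    policy_update: Callable[[List[float]], None],
    steps: int = 100,
    refresh_every: int = 10,
) -> str:
    rubric = "Score the response from 1-10 for helpfulness and correctness."
    recent: List[str] = []
    for step in range(steps):
        response = policy_rollout()
        recent.append(response)
        policy_update([reward_model(rubric, response)])
        # Periodically let the meta-reward model refine the rubric so the
        # reward model keeps resisting exploitation by the improving policy.
        if (step + 1) % refresh_every == 0:
            rubric = meta_reward_model(rubric, recent[-refresh_every:])
    return rubric

# Stub components so the sketch runs end to end.
final_rubric = mpo_training_loop(
    policy_rollout=lambda: "a sampled response",
    reward_model=lambda rubric, resp: float(len(resp)) % 10,
    meta_reward_model=lambda rubric, rollouts: rubric + " Penalize repetitive padding.",
    policy_update=lambda rewards: None,
)
print(final_rubric)
```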



100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models

  • Title: 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models
  • Authors: Chong Zhang, Yue Deng, Xiang Lin, Bin Wang, Dianwen Ng, Hai Ye, Xingxuan Li, Yao Xiao, Zhanfeng Mo, Qi Zhang, Lidong Bing
  • Date: 2025-05-01
  • arXiv page: https://arxiv.org/abs/2505.00551
  • Paper link: https://arxiv.org/pdf/2505.00551

Abstract

The recent development of reasoning language models (RLMs) represents a novel evolution in large language models. In particular, the recent release of DeepSeek-R1 has generated widespread social impact and sparked enthusiasm in the research community for exploring the explicit reasoning paradigm of language models. However, the implementation details of the released models have not been fully open-sourced by DeepSeek, including DeepSeek-R1-Zero, DeepSeek-R1, and the distilled small models. As a result, many replication studies have emerged aiming to reproduce the strong performance achieved by DeepSeek-R1, reaching comparable performance through similar training procedures and fully open-source data resources. These works have investigated feasible strategies for supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR), focusing on data preparation and method design, yielding various valuable insights. In this report, we provide a summary of recent replication studies to inspire future research. We primarily focus on SFT and RLVR as two main directions, introducing the details for data construction, method design and training procedure of current replication studies. Moreover, we conclude key findings from the implementation details and experimental results reported by these studies, anticipating to inspire future research. We also discuss additional techniques of enhancing RLMs, highlighting the potential of expanding the application scope of these models, and discussing the challenges in development. By this survey, we aim to help researchers and developers of RLMs stay updated with the latest advancements, and seek to inspire new ideas to further enhance RLMs.



Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

  • Title: Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
  • Authors: Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
  • Date: 2025-04-29
  • arXiv page: https://arxiv.org/abs/2504.20966
  • Paper link: https://arxiv.org/pdf/2504.20966
  • GitHub repo: https://github.com/zaydzuhri/softpick-attention

Abstract

We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M parameter models demonstrate that softpick maintains performance parity with softmax on standard benchmarks while achieving 0% sink rate. The softpick transformer produces hidden states with significantly lower kurtosis (340 vs 33,510) and creates sparse attention maps (46.97% sparsity). Models using softpick consistently outperform softmax when quantized, with particularly pronounced advantages at lower bit precisions. Our analysis and discussion shows how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code is available at https://github.com/zaydzuhri/softpick-attention.
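
A quick way to see why a rectified, not-sum-to-one normalizer removes the sink: when no key is relevant, softmax still has to distribute a full unit of probability, whereas a rectified variant can output near-zero weights everywhere. The NumPy sketch below contrasts the two; the ReLU(e^x - 1) numerator over a sum of |e^x - 1| denominator is our reading of the rectified form and is written naively without numerical stabilization, so treat it as an assumption and check the official repo for the exact definition.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def softpick(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Rectified, not-sum-to-one softmax replacement (naive, unstabilized sketch)."""
    e = np.exp(x) - 1.0
    return np.maximum(e, 0.0) / (np.abs(e).sum() + eps)

# Attention scores where no key is relevant: softmax still hands out ~uniform weight
# (the mass has to go somewhere, feeding the "attention sink"); the rectified form can emit ~0.
scores = np.array([0.0, -0.1, 0.05, -0.2])
print("softmax :", softmax(scores).round(3))
print("softpick:", softpick(scores).round(3))
```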



RepText: Rendering Visual Text via Replicating

  • Title: RepText: Rendering Visual Text via Replicating
  • Authors: Haofan Wang, Yujia Xu, Yimeng Li, Junchen Li, Chaowei Zhang, Jing Wang, Kejia Yang, Zhibo Chen
  • Date: 2025-04-28
  • arXiv page: https://arxiv.org/abs/2504.19724
  • Paper link: https://arxiv.org/pdf/2504.19724
  • Project link: https://reptext.github.io/
  • GitHub repo: https://github.com/Shakker-Labs/RepText

Abstract

Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address these limitations, we start from an naive assumption that text understanding is only a sufficient condition for text rendering, but not a necessary condition. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render, or more precisely, replicate, multilingual visual text in user-specified fonts, without the need to really understand them. Specifically, we adopt the setting from ControlNet and additionally integrate language agnostic glyph and position of rendered text to enable generating harmonized visual text, allowing users to customize text content, font and position on their needs. To improve accuracy, a text perceptual loss is employed along with the diffusion loss. Furthermore, to stabilize rendering process, at the inference phase, we directly initialize with noisy glyph latent instead of random initialization, and adopt region masks to restrict the feature injection to only the text region to avoid distortion of the background. We conducted extensive experiments to verify the effectiveness of our RepText relative to existing works, our approach outperforms existing open-source methods and achieves comparable results to native multi-language closed-source models. To be more fair, we also exhaustively discuss its limitations in the end.



COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

  • Title: COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
  • Authors: Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky
  • Date: 2025-04-30
  • arXiv page: https://arxiv.org/abs/2504.21850
  • Paper link: https://arxiv.org/pdf/2504.21850
  • Project link: https://princetonvisualai.github.io/compact/
  • GitHub repo: https://github.com/princetonvisualai/compact

Abstract

Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This might be partially the result of the fact that Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume, but not the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset explicitly controlling for the compositional complexity of the training examples. The data from COMPACT allows MLLMs to train on combinations of atomic capabilities to learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves comparable performance to the LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several, especially those involving complex multi-capability tasks. For example, COMPACT achieves substantial 83.3% improvement on MMStar and 94.0% improvement on MM-Vet compared to the full-scale VIT on particularly complex questions that require four or more atomic capabilities. COMPACT offers a scalable, data-efficient, visual compositional tuning recipe to improve on complex visual-language tasks.



Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks

  • Title: Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks
  • Authors: Vishnu Sarukkai, Zhiqiang Xie, Kayvon Fatahalian
  • Date: 2025-05-01
  • arXiv page: https://arxiv.org/abs/2505.00234
  • Paper link: https://arxiv.org/pdf/2505.00234

Abstract

Many methods for improving Large Language Model (LLM) agents for sequential decision-making tasks depend on task-specific knowledge engineering–such as prompt tuning, curated in-context examples, or customized observation and action spaces. Using these approaches, agent performance improves with the quality or amount of knowledge engineering invested. Instead, we investigate how LLM agents can automatically improve their performance by learning in-context from their own successful experiences on similar tasks. Rather than relying on task-specific knowledge engineering, we focus on constructing and refining a database of self-generated examples. We demonstrate that even a naive accumulation of successful trajectories across training tasks boosts test performance on three benchmarks: ALFWorld (73% to 89%), Wordcraft (55% to 64%), and InterCode-SQL (75% to 79%)–matching the performance the initial agent achieves if allowed two to three attempts per task. We then introduce two extensions: (1) database-level selection through population-based training to identify high-performing example collections, and (2) exemplar-level selection that retains individual trajectories based on their empirical utility as in-context examples. These extensions further enhance performance, achieving 91% on ALFWorld–matching more complex approaches that employ task-specific components and prompts. Our results demonstrate that automatic trajectory database construction offers a compelling alternative to labor-intensive knowledge engineering.
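
The core loop is simple: keep a database of the agent's own successful trajectories and prepend the most relevant ones as in-context examples on the next task. A minimal sketch with a naive word-overlap retriever is below; the similarity measure and record format are assumptions for illustration, not the paper's trajectory- or database-level selection mechanisms.

```python
from typing import Dict, List

class TrajectoryDB:
    """Stores successful (task, trajectory) pairs and retrieves them as in-context examples."""

    def __init__(self) -> None:
        self.records: List[Dict[str, str]] = []

    def add_if_successful(self, task: str, trajectory: str, success: bool) -> None:
        if success:                                   # naive accumulation of successes
            self.records.append({"task": task, "trajectory": trajectory})

    def retrieve(self, task: str, k: int = 2) -> List[str]:
        """Rank stored examples by word overlap with the new task (stand-in for a real retriever)."""
        words = set(task.lower().split())
        scored = sorted(
            self.records,
            key=lambda r: len(words & set(r["task"].lower().split())),
            reverse=True,
        )
        return [r["trajectory"] for r in scored[:k]]

db = TrajectoryDB()
db.add_if_successful("put the apple in the fridge", "plan: find apple -> open fridge -> place", True)
db.add_if_successful("heat the mug", "plan: find mug -> microwave -> start", True)
print(db.retrieve("put the tomato in the fridge"))
```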



Clinical knowledge in LLMs does not translate to human interactions

  • Title: Clinical knowledge in LLMs does not translate to human interactions
  • Authors: Andrew M. Bean, Rebecca Payne, Guy Parsons, Hannah Rose Kirk, Juan Ciro, Rafael Mosquera, Sara Hincapié Monsalve, Aruna S. Ekanayaka, Lionel Tarassenko, Luc Rocher, Adam Mahdi
  • Date: 2025-04-26
  • arXiv page: https://arxiv.org/abs/2504.18919
  • Paper link: https://arxiv.org/pdf/2504.18919
  • Project link: https://github.com/am-bean/HELPMed
  • GitHub repo: https://github.com/am-bean/HELPMed

Abstract

Global healthcare providers are exploring use of large language models (LLMs) to provide medical advice to the public. LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings. We tested if LLMs can assist members of the public in identifying underlying conditions and choosing a course of action (disposition) in ten medical scenarios in a controlled study with 1,298 participants. Participants were randomly assigned to receive assistance from an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control). Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in less than 34.5% of cases and disposition in less than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice. Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures we find with human participants. Moving forward, we recommend systematic human user testing to evaluate interactive capabilities prior to public deployments in healthcare.



Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think

  • Title: Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
  • Authors: Hasan Abed Al Kader Hammoud, Hani Itani, Bernard Ghanem
  • Date: 2025-04-29
  • arXiv page: https://arxiv.org/abs/2504.20708
  • Paper link: https://arxiv.org/pdf/2504.20708
  • Project link: https://hammoudhasan.github.io/SubthoughtReasoner/
  • GitHub repo: https://github.com/hammoudhasan/SubthoughtReasoner

Abstract

Large Language Models (LLMs) leverage step-by-step reasoning to solve complex problems. Standard evaluation practice involves generating a complete reasoning trace and assessing the correctness of the final answer presented at its conclusion. In this paper, we challenge the reliance on the final answer by posing the following two questions: Does the final answer reliably represent the model’s optimal conclusion? Can alternative reasoning paths yield different results? To answer these questions, we analyze intermediate reasoning steps, termed subthoughts, and propose a method based on our findings. Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues. We start by prompting the model to generate continuations from the end-point of each intermediate subthought. We extract a potential answer from every completed continuation originating from different subthoughts. We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace. Analyzing the consistency among the answers derived from different subthoughts reveals characteristics that correlate with the model’s confidence and correctness, suggesting potential for identifying less reliable answers. Our experiments across various LLMs and challenging mathematical reasoning datasets (AIME2024 and AIME2025) show consistent accuracy improvements, with gains reaching up to 13% and 10% respectively. Implementation is available at: https://github.com/hammoudhasan/SubthoughtReasoner.
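
The procedure described here is: split the trace at linguistic cues into cumulative subthought prefixes, let the model finish the solution from the end of each prefix, extract an answer from every completion, and report the most frequent answer instead of only the one at the end of the original trace. The sketch below shows that aggregation; the cue list, numeric answer-extraction regex, and the `continue_from` stub are illustrative assumptions, not the authors' exact segmentation.

```python
import re
from collections import Counter
from typing import Callable, List

CUES = ("Wait", "Alternatively", "So", "Therefore", "Let me check")

def split_into_subthoughts(trace: str) -> List[str]:
    """Cut the trace right before each cue word and return cumulative prefixes."""
    pattern = r"(?=\b(?:" + "|".join(CUES) + r")\b)"
    pieces = [p for p in re.split(pattern, trace) if p.strip()]
    return ["".join(pieces[: i + 1]) for i in range(len(pieces))]

def extract_answer(text: str) -> str:
    nums = re.findall(r"-?\d+", text)
    return nums[-1] if nums else ""

def answer_by_mode(trace: str, continue_from: Callable[[str], str]) -> str:
    """Aggregate answers from continuations of every subthought prefix by taking the mode."""
    answers = [extract_answer(continue_from(p)) for p in split_into_subthoughts(trace)]
    answers = [a for a in answers if a]
    return Counter(answers).most_common(1)[0][0] if answers else ""

# Stub "model" so the sketch runs: it just echoes the prefix it was given.
trace = "2 + 3 is 5. Wait, times 4 gives 20. Alternatively, check: 5 * 4 = 20. So the answer is 20."
print(answer_by_mode(trace, continue_from=lambda prefix: prefix + " Final answer above."))
```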



VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension

  • Title: VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
  • Authors: Xinyu Chen, Yunxin Li, Haoyuan Shi, Baotian Hu, Wenhan Luo, Yaowei Wang, Min Zhang
  • Date: 2025-04-23
  • arXiv page: https://arxiv.org/abs/2504.17821
  • Paper link: https://arxiv.org/pdf/2504.17821
  • Project link: https://videovista-culturallingo.github.io/
  • GitHub repo: https://github.com/HITsz-TMG/VideoVista

Abstract

Assessing the video comprehension capabilities of multimodal AI systems can effectively measure their understanding and reasoning abilities. Most video evaluation benchmarks are limited to a single language, typically English, and predominantly feature videos rooted in Western cultural contexts. In this paper, we present VideoVista-CulturalLingo, the first video evaluation benchmark designed to bridge cultural, linguistic, and domain divide in video comprehension. Our work differs from existing benchmarks in the following ways: 1) Cultural diversity, incorporating cultures from China, North America, and Europe; 2) Multi-linguistics, with questions presented in Chinese and English-two of the most widely spoken languages; and 3) Broad domain, featuring videos sourced from hundreds of human-created domains. VideoVista-CulturalLingo contains 1,389 videos and 3,134 QA pairs, and we have evaluated 24 recent open-source or proprietary video large models. From the experiment results, we observe that: 1) Existing models perform worse on Chinese-centric questions than Western-centric ones, particularly those related to Chinese history; 2) Current open-source models still exhibit limitations in temporal understanding, especially in the Event Localization task, achieving a maximum score of only 45.2%; 3) Mainstream models demonstrate strong performance in general scientific questions, while open-source models demonstrate weak performance in mathematics.



LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects

  • Title: LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
  • Authors: Guangyi Liu, Pengxiang Zhao, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, Wenhao Wang, Tianze Wu, Linghao Li, Hao Wang, Guanjing Xiong, Yong Liu, Hongsheng Li
  • Date: 2025-04-28
  • arXiv page: https://arxiv.org/abs/2504.19838
  • Paper link: https://arxiv.org/pdf/2504.19838
  • Project link: https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents
  • GitHub repo: https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents

Abstract

With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges, (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and show how LLMs address these issues through advanced language understanding, multimodal perception, and robust decision-making. We then propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents.



TesserAct: Learning 4D Embodied World Models

  • Title: TesserAct: Learning 4D Embodied World Models
  • Authors: Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan
  • Date: 2025-04-29
  • arXiv page: https://arxiv.org/abs/2504.20995
  • Paper link: https://arxiv.org/pdf/2504.20995

Abstract

This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent’s actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms those derived from prior video-based world models.


