【论文速递】2025年第13周(Mar-23-29)(Robotics/Embodied AI/LLM)
中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- Qwen2.5-Omni技术报告
- 英文摘要
- 中文摘要
- 我已面面俱到:通过稀疏自编码器解释大语言模型中的推理特征
- 英文摘要
- 中文摘要
- Video-T1:视频生成的测试时扩展
- 英文摘要
- 中文摘要
- 大型语言模型智能体:方法论、应用与挑战综述
- 英文摘要
- 中文摘要
- Video-R1:在MLLM中强化视频推理
- 英文摘要
- 中文摘要
- 基于下一帧预测的长上下文自回归视频建模
- 英文摘要
- 中文摘要
- 当更少即足够:用于高效图像表示的自适应令牌缩减
- 英文摘要
- 中文摘要
- UI-R1:通过强化学习增强GUI智能体的动作预测
- 英文摘要
- 中文摘要
- 立场论文:交互式生成视频作为下一代游戏引擎
- 英文摘要
- 中文摘要
- Wan:开放且先进的大规模视频生成模型
- 英文摘要
- 中文摘要
- Gemma 3技术报告
- 英文摘要
- 中文摘要
- MAPS:基于大七人格与苏格拉底式引导的多模态科学问题求解多智能体框架
- 英文摘要
- 中文摘要
- Dita:面向通用视觉-语言-动作策略的可扩展扩散Transformer
- 英文摘要
- 中文摘要
- 长上下文语言建模全面综述
- 英文摘要
- 中文摘要
- Open Deep Search:通过开源推理智能体实现搜索民主化
- 英文摘要
- 中文摘要
- MARS:一个融合苏格拉底式引导的自动提示优化多智能体框架
- 英文摘要
- 中文摘要
- 将视觉预训练扩展到4K分辨率
- 英文摘要
- 中文摘要
- RoboFactory:探索具有组合约束的具身智能体协作
- 英文摘要
- 中文摘要
- 挑战推理的界限:大型语言模型的奥林匹克级数学基准
- 英文摘要
- 中文摘要
- 调整大型语言模型后训练以实现多样化的创意写作
- 英文摘要
- 中文摘要
- LEGO-Puzzles:MLLM在多步空间推理上表现如何?
- 英文摘要
- 中文摘要
- 通过随机生成与滚动预算强制实现流模型的推理时扩展
- 英文摘要
- 中文摘要
- 桥接连续和离散令牌以进行自回归视觉生成
- 英文摘要
- 中文摘要
- VBench-2.0:面向内在忠实度推进视频生成基准套件
- 英文摘要
- 中文摘要
- 探索大型多模态模型在视频理解中的幻觉:基准、分析与缓解
- 英文摘要
- 中文摘要
- SimpleRL-Zoo:研究和驯服开放基础模型的零强化学习
- 英文摘要
- 中文摘要
Qwen2.5-Omni技术报告
- 标题: Qwen2.5-Omni Technical Report
- 作者: Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, Junyang Lin
- 日期: 2025-03-26
- ArXiv主页: https://arxiv.org/abs/2503.20215
- 论文链接: https://arxiv.org/pdf/2503.20215
- 项目链接: https://qwenlm.github.io/blog/qwen2.5-omni/
- gitHub仓库: https://github.com/QwenLM/Qwen2.5-Omni
英文摘要
In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE(Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni’s performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
中文摘要
在本报告中,我们提出了Qwen2.5-Omni,这是一种端到端的多模态模型,能够感知文本、图像、音频和视频等多种模态,同时以流式方式生成文本和自然语音回复。为了支持多模态信息的流式输入,音频和视觉编码器均采用分块处理方式。为了将视频输入的时间戳与音频同步,我们以交错方式顺序组织音频和视频,并提出了一种新颖的位置嵌入方法,称为TMRoPE(时间对齐的多模态RoPE)。为了同时生成文本和语音并避免两种模态之间的相互干扰,我们提出了Thinker-Talker架构。在该框架中,Thinker是负责文本生成的大型语言模型,而Talker是一个双轨自回归模型,直接利用Thinker的隐藏表示来输出音频token。Thinker和Talker模型均以端到端的方式进行训练和推理。为了以流式方式解码音频token,我们引入了一个限制感受野的滑动窗口DiT,旨在降低初始数据包延迟。Qwen2.5-Omni与同等规模的Qwen2.5-VL相当,并且优于Qwen2-Audio。此外,Qwen2.5-Omni在Omni-Bench等多模态基准上取得了最先进的性能。值得注意的是,Qwen2.5-Omni在端到端语音指令遵循上的表现与其处理文本输入的能力相当,MMLU和GSM8K等基准可以证明这一点。在语音生成方面,Qwen2.5-Omni的流式Talker在鲁棒性和自然度上优于大多数现有的流式和非流式方案。
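摘要中提到以交错方式按时间组织音频与视频token,并配合时间对齐的位置编码(TMRoPE)。下面给出一个极简示意代码(非官方实现,块划分与token内容均为假设),仅演示"按时间戳交错并记录时间位置"这一思路:

```python
# 极简示意:按时间戳交错排列视频帧块与音频块,并为同一时刻的 token 记录共享的时间位置
# (非 Qwen2.5-Omni 官方实现,块长与 token 内容均为假设)
def interleave_by_time(video_frames, audio_chunks):
    """video_frames / audio_chunks: [(timestamp_sec, tokens), ...],各自已按时间排序。"""
    merged = sorted(video_frames + audio_chunks, key=lambda x: x[0])
    sequence, time_ids = [], []
    for t, tokens in merged:
        sequence.extend(tokens)
        time_ids.extend([t] * len(tokens))  # 同一时刻的 token 共享时间位置,可供时间对齐的位置编码使用
    return sequence, time_ids

video = [(0.0, ["v0a", "v0b"]), (2.0, ["v1a", "v1b"])]   # 假设每 2 秒一个视频块
audio = [(0.0, ["a0"]), (1.0, ["a1"]), (2.0, ["a2"])]    # 假设每 1 秒一个音频块
seq, tids = interleave_by_time(video, audio)
print(seq)    # ['v0a', 'v0b', 'a0', 'a1', 'v1a', 'v1b', 'a2']
print(tids)   # [0.0, 0.0, 0.0, 1.0, 2.0, 2.0, 2.0]
```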
我已面面俱到:通过稀疏自编码器解释大语言模型中的推理特征
- 标题: I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
- 作者: Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets
- 日期: 2025-03-24
- ArXiv主页: https://arxiv.org/abs/2503.18878
- 论文链接: https://arxiv.org/pdf/2503.18878
- gitHub仓库: https://github.com/AIRI-Institute/SAE-Reasoning
英文摘要
Large Language Models (LLMs) have achieved remarkable success in natural language processing. Recent advances have led to the developing of a new class of reasoning LLMs; for example, open-source DeepSeek-R1 has achieved state-of-the-art performance by integrating deep thinking and complex reasoning. Despite these impressive capabilities, the internal reasoning mechanisms of such models remain unexplored. In this work, we employ Sparse Autoencoders (SAEs), a method to learn a sparse decomposition of latent representations of a neural network into interpretable features, to identify features that drive reasoning in the DeepSeek-R1 series of models. First, we propose an approach to extract candidate ‘‘reasoning features’’ from SAE representations. We validate these features through empirical analysis and interpretability methods, demonstrating their direct correlation with the model’s reasoning abilities. Crucially, we demonstrate that steering these features systematically enhances reasoning performance, offering the first mechanistic account of reasoning in LLMs. Code available at https://github.com/AIRI-Institute/SAE-Reasoning
中文摘要
大型语言模型(LLM)在自然语言处理中取得了显著成功。最近的进展催生了一类新的推理LLM;例如,开源的DeepSeek-R1通过整合深度思考与复杂推理实现了最先进的性能。尽管能力令人印象深刻,这类模型的内部推理机制仍未被探索。在这项工作中,我们采用稀疏自编码器(SAE)——一种将神经网络的潜在表示分解为稀疏且可解释特征的方法——来识别驱动DeepSeek-R1系列模型推理的特征。首先,我们提出了一种从SAE表示中提取候选"推理特征"的方法。我们通过实证分析和可解释性方法验证了这些特征,证明它们与模型的推理能力直接相关。至关重要的是,我们证明了对这些特征进行引导(steering)能够系统性地提升推理性能,从而首次给出了LLM推理的机制性解释。代码见 https://github.com/AIRI-Institute/SAE-Reasoning
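下面是稀疏自编码器(SAE)的一个最小化示意(非论文官方实现,维度、稀疏系数与"引导"方式均为假设),用来说明摘要中"把隐藏状态分解为稀疏可解释特征,并沿特征方向做steering"的基本思路:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """极简 SAE:ReLU 编码得到稀疏特征,再线性解码重建隐藏状态(非官方实现)。"""
    def __init__(self, d_model=768, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h):
        f = torch.relu(self.encoder(h))   # 稀疏特征激活
        return self.decoder(f), f         # 重建结果与特征

sae = SparseAutoencoder()
h = torch.randn(8, 768)                   # 假设取自某一层的隐藏状态
h_hat, f = sae(h)
loss = ((h_hat - h) ** 2).mean() + 1e-3 * f.abs().mean()   # 重建误差 + L1 稀疏惩罚(系数为假设值)
loss.backward()

# "引导"(steering)的一种简化做法:沿某个特征的解码方向平移隐藏状态(特征编号与强度均为假设)
h_steered = h + 4.0 * sae.decoder.weight[:, 123]
print(loss.item(), h_steered.shape)
```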
Video-T1:视频生成的测试时扩展
- 标题: Video-T1: Test-Time Scaling for Video Generation
- 作者: Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, Yueqi Duan
- 日期: 2025-03-24
- ArXiv主页: https://arxiv.org/abs/2503.18942
- 论文链接: https://arxiv.org/pdf/2503.18942
- 项目链接: https://liuff19.github.io/Video-T1/
- gitHub仓库: https://github.com/liuff19/Video-T1
英文摘要
With the scale capability of increasing training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have expanded the scaling to test-time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training costs, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt. In this work, we reinterpret the test-time scaling of video generation as a searching problem to sample better trajectories from Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers to provide feedback and heuristic algorithms to guide searching process. Given a text prompt, we first explore an intuitive linear search strategy by increasing noise candidates at inference time. As full-step denoising all frames simultaneously requires heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF) that adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in the quality of videos. Project page: https://liuff19.github.io/Video-T1
中文摘要
随着训练数据、模型规模和计算成本的不断扩大,视频生成在数字创作方面取得了令人印象深刻的成果,使用户能够在各个领域表达创造力。最近,大型语言模型(LLM)领域的研究者将扩展(scaling)延伸到了测试阶段,通过使用更多推理时计算显著提升LLM性能。我们没有通过昂贵的训练来扩大视频基础模型,而是探索测试时扩展(TTS)在视频生成中的威力,旨在回答这样一个问题:如果允许视频生成模型使用可观的推理时计算,在具有挑战性的文本提示下,它能把生成质量提升多少。在这项工作中,我们将视频生成的测试时扩展重新解释为一个搜索问题,即从高斯噪声空间向目标视频分布采样更好的轨迹。具体来说,我们用测试时验证器提供反馈,用启发式算法指导搜索过程,从而构建搜索空间。给定文本提示,我们首先探索一种直观的线性搜索策略,即在推理时增加噪声候选的数量。由于对所有帧同时做全步去噪需要高昂的测试时计算成本,我们进一步设计了一种更高效的视频生成TTS方法,称为帧树(Tree-of-Frames, ToF),以自回归方式自适应地扩展和修剪视频分支。在文本条件视频生成基准上的大量实验表明,增加测试时计算能够持续显著提升视频质量。项目页面:https://liuff19.github.io/Video-T1
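摘要中的"线性搜索策略"本质上是带验证器的 best-of-N:采样多个噪声候选,用测试时验证器打分并保留最好的结果。下面是这一流程的示意(生成器与验证器均为假设的桩函数,非官方实现):

```python
import random

def best_of_n_generation(prompt, generate, verifier, n=8, seed=0):
    """线性搜索示意:采样 n 个噪声候选,用验证器打分取最优(接口均为假设)。"""
    rng = random.Random(seed)
    best_video, best_score = None, float("-inf")
    for _ in range(n):
        noise_seed = rng.randint(0, 2**31 - 1)
        video = generate(prompt, noise_seed)   # 假设的视频生成接口
        score = verifier(prompt, video)        # 假设的测试时验证器接口
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score

# 用桩函数演示控制流;论文中的 Tree-of-Frames 则在逐帧自回归生成时扩展/剪枝分支
video, score = best_of_n_generation(
    "a cat surfing",
    generate=lambda p, s: f"video(seed={s})",
    verifier=lambda p, v: random.Random(v).random(),
)
print(video, round(score, 3))
```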
大型语言模型智能体:方法论、应用与挑战综述
- 标题: Large Language Model Agent: A Survey on Methodology, Applications and Challenges
- 作者: Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, Ming Zhang
- 日期: 2025-03-27
- ArXiv主页: https://arxiv.org/abs/2503.21460
- 论文链接: https://arxiv.org/pdf/2503.21460
- 项目链接: https://huggingface.co/spaces/luojunyu/Agent-Papers
- gitHub仓库: https://github.com/luo-junyu/Awesome-Agent-Papers
英文摘要
The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy, linking architectural foundations, collaboration mechanisms, and evolutionary pathways. We unify fragmented research threads by revealing fundamental connections between agent design principles and their emergent behaviors in complex environments. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time, while also addressing evaluation methodologies, tool applications, practical challenges, and diverse application domains. By surveying the latest developments in this rapidly evolving field, we offer researchers a structured taxonomy for understanding LLM agents and identify promising directions for future research. The collection is available at https://github.com/luo-junyu/Awesome-Agent-Papers.
中文摘要
智能体的时代已经到来,这是由大语言模型的革命性进展所驱动的。具有目标驱动行为和动态适应能力的大型语言模型(LLM)智能体,可能代表着通向通用人工智能的关键途径。本综述通过以方法论为中心的分类法系统地解构了LLM智能体系统,将架构基础、协作机制和演化路径联系起来。我们通过揭示智能体设计原则与其在复杂环境中涌现行为之间的基本联系,统一了零散的研究脉络。我们的工作提供了统一的架构视角,考察智能体如何构建、如何协作以及如何随时间演化,同时还涉及评估方法、工具应用、实践挑战和多样的应用领域。通过综述这一快速发展领域的最新进展,我们为研究人员提供了一个理解LLM智能体的结构化分类法,并指出了未来研究的有前景方向。论文合集见 https://github.com/luo-junyu/Awesome-Agent-Papers。
Video-R1:在MLLM中强化视频推理
- 标题: Video-R1: Reinforcing Video Reasoning in MLLMs
- 作者: Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, Xiangyu Yue
- 日期: 2025-03-27
- ArXiv主页: https://arxiv.org/abs/2503.21776
- 论文链接: https://arxiv.org/pdf/2503.21776
- gitHub仓库: https://github.com/tulerfeng/Video-R1
英文摘要
Inspired by DeepSeek-R1’s success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for eliciting video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-COT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass, etc. Notably, Video-R1-7B attains a 35.8% accuracy on video spatial reasoning benchmark VSI-bench, surpassing the commercial proprietary model GPT-4o. All codes, models, data are released.
中文摘要
受到DeepSeek-R1通过基于规则的强化学习(RL)激发推理能力这一成功的启发,我们引入了Video-R1,首次尝试系统地探索R1范式,以在多模态大语言模型(MLLM)中激发视频推理。然而,直接用GRPO算法对视频推理进行RL训练存在两个主要挑战:(i)视频推理缺乏时序建模;(ii)高质量视频推理数据稀缺。为了解决这些问题,我们首先提出了T-GRPO算法,鼓励模型利用视频中的时序信息进行推理。此外,我们不仅依靠视频数据,还将高质量的图像推理数据纳入训练过程。我们构建了两个数据集:用于SFT冷启动的Video-R1-COT-165k和用于RL训练的Video-R1-260k,均包含图像和视频数据。实验结果表明,Video-R1在视频推理基准测试(如VideoMMMU和VSI-Bench)以及通用视频基准测试(包括MVBench和TempCompass等)上均取得了显著提升。值得注意的是,Video-R1-7B在视频空间推理基准测试VSI-Bench上的准确率达到了35.8%,超过了商业专有模型GPT-4o。所有代码、模型和数据均已发布。
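GRPO 的核心是对同一提示下一组回答的奖励做组内标准化,得到相对优势;T-GRPO 在此基础上进一步鼓励利用时序信息(具体设计以原文为准)。下面是组内相对优势计算的一个最小示意(非官方实现):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO 式组内相对优势:对同一提示采样的一组回答做奖励标准化(示意)。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

# 假设同一视频问题采样了 6 个回答,规则奖励(如答案是否正确)为 0/1
rewards = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```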
基于下一帧预测的长上下文自回归视频建模
- 标题: Long-Context Autoregressive Video Modeling with Next-Frame Prediction
- 作者: Yuchao Gu, Weijia Mao, Mike Zheng Shou
- 日期: 2025-03-25
- ArXiv主页: https://arxiv.org/abs/2503.19325
- 论文链接: https://arxiv.org/pdf/2503.19325
- 项目链接: https://farlongctx.github.io/
- gitHub仓库: https://github.com/showlab/FAR
英文摘要
Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts. To investigate long-context video modeling, we introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Just as language models learn causal dependencies between tokens (i.e., Token AR), FAR models temporal causal dependencies between continuous frames, achieving better convergence than Token AR and video diffusion transformers. Building on FAR, we observe that long-context vision modeling faces challenges due to visual redundancy. Existing RoPE lacks effective temporal decay for remote context and fails to extrapolate well to long video sequences. Additionally, training on long videos is computationally expensive, as vision tokens grow much faster than language tokens. To tackle these issues, we propose balancing locality and long-range dependency. We introduce FlexRoPE, an test-time technique that adds flexible temporal decay to RoPE, enabling extrapolation to 16x longer vision contexts. Furthermore, we propose long short-term context modeling, where a high-resolution short-term context window ensures fine-grained temporal consistency, while an unlimited long-term context window encodes long-range information using fewer tokens. With this approach, we can train on long video sequences with a manageable token context length. We demonstrate that FAR achieves state-of-the-art performance in both short- and long-video generation, providing a simple yet effective baseline for video autoregressive modeling.
中文摘要
长上下文自回归建模显著推进了语言生成,但视频生成仍难以充分利用扩展的时间上下文。为了研究长上下文视频建模,我们引入了帧自回归(FAR),这是一个用于视频自回归建模的强基线。正如语言模型学习token之间的因果依赖(即Token AR)一样,FAR对连续帧之间的时序因果依赖建模,比Token AR和视频扩散Transformer取得了更好的收敛性。基于FAR,我们观察到长上下文视觉建模由于视觉冗余而面临挑战。现有的RoPE对远程上下文缺乏有效的时间衰减,也难以外推到长视频序列。此外,由于视觉token的增长速度远快于语言token,在长视频上训练的计算开销很大。为了解决这些问题,我们提出平衡局部性与长程依赖。我们引入了FlexRoPE,一种测试时技术,为RoPE添加灵活的时间衰减,使其能够外推到16倍长的视觉上下文。此外,我们提出长短期上下文建模:高分辨率的短期上下文窗口保证细粒度的时间一致性,而不受限的长期上下文窗口用更少的token编码长程信息。通过这种方法,我们能够在可控的token上下文长度下训练长视频序列。我们证明,FAR在短视频和长视频生成上都达到了最先进的性能,为视频自回归建模提供了一个简单而有效的基线。
当更少即足够:用于高效图像表示的自适应令牌缩减
- 标题: When Less is Enough: Adaptive Token Reduction for Efficient Image Representation
- 作者: Eduard Allakhverdov, Elizaveta Goncharova, Andrey Kuznetsov
- 日期: 2025-03-20
- ArXiv主页: https://arxiv.org/abs/2503.16660
- 论文链接: https://arxiv.org/pdf/2503.16660
英文摘要
Vision encoders typically generate a large number of visual tokens, providing information-rich representations but significantly increasing computational demands. This raises the question of whether all generated tokens are equally valuable or if some of them can be discarded to reduce computational costs without compromising quality. In this paper, we introduce a new method for determining feature utility based on the idea that less valuable features can be reconstructed from more valuable ones. We implement this concept by integrating an autoencoder with a Gumbel-Softmax selection mechanism, that allows identifying and retaining only the most informative visual tokens. To validate our approach, we compared the performance of the LLaVA-NeXT model, using features selected by our method with randomly selected features. We found that on OCR-based tasks, more than 50% of the visual context can be removed with minimal performance loss, whereas randomly discarding the same proportion of features significantly affects the model capabilities. Furthermore, in general-domain tasks, even randomly retaining only 30% of tokens achieves performance comparable to using the full set of visual tokens. Our results highlight a promising direction towards adaptive and efficient multimodal pruning that facilitates scalable and low-overhead inference without compromising performance.
中文摘要
视觉编码器通常会产生大量视觉token,提供信息丰富的表示,但也显著增加了计算需求。这就提出了一个问题:所有生成的token是否同样有价值,是否可以丢弃其中一部分以降低计算成本而不损害质量。在本文中,我们提出了一种确定特征效用的新方法,其核心思想是:价值较低的特征可以由价值较高的特征重建出来。我们通过将自编码器与Gumbel-Softmax选择机制相结合来实现这一概念,从而仅识别并保留信息量最大的视觉token。为了验证我们的方法,我们比较了LLaVA-NeXT模型使用本方法选择的特征与随机选择的特征时的性能。我们发现,在基于OCR的任务上,可以在性能损失极小的情况下移除超过50%的视觉上下文,而随机丢弃相同比例的特征则会显著影响模型能力。此外,在通用领域任务中,即使随机只保留30%的token,也能达到与使用全部视觉token相当的性能。我们的结果指出了一个有前景的方向:自适应、高效的多模态剪枝,能够在不损害性能的前提下实现可扩展、低开销的推理。
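下面是摘要中"Gumbel-Softmax选择机制"的一个可微token选择示意(非官方实现,维度与token数均为假设);论文完整方法还会训练自编码器,用保留的token重建被丢弃的特征,这里只演示选择部分:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSelector(nn.Module):
    """示意:对每个视觉 token 做"丢弃/保留"的可微二选一(非官方实现)。"""
    def __init__(self, dim=1024):
        super().__init__()
        self.scorer = nn.Linear(dim, 2)   # 每个 token 输出 [丢弃, 保留] 两个 logit

    def forward(self, tokens, tau=1.0):
        logits = self.scorer(tokens)                                   # (B, N, 2)
        mask = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 1:]   # (B, N, 1),1 表示保留
        return tokens * mask, mask

selector = TokenSelector()
tokens = torch.randn(2, 576, 1024)        # 假设每张图有 576 个视觉 token
kept_tokens, mask = selector(tokens)
print(mask.sum(dim=1).squeeze(-1))        # 每张图实际保留的 token 数
```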
UI-R1:通过强化学习增强GUI智能体的动作预测
- 标题: UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
- 作者: Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, Hongsheng Li
- 日期: 2025-03-27
- ArXiv主页: https://arxiv.org/abs/2503.21620
- 论文链接: https://arxiv.org/pdf/2503.21620
- 项目链接: https://yxchai.com/UI-R1/
- gitHub仓库: https://github.com/lll6gg/UI-R1
英文摘要
The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Building on this idea, we are the first to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for graphic user interface (GUI) action prediction tasks. To this end, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. We also introduce a unified rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). Experimental results demonstrate that our proposed data-efficient model, UI-R1-3B, achieves substantial improvements on both in-domain (ID) and out-of-domain (OOD) tasks. Specifically, on the ID benchmark AndroidControl, the action type accuracy improves by 15%, while grounding accuracy increases by 10.3%, compared with the base model (i.e. Qwen2.5-VL-3B). On the OOD GUI grounding benchmark ScreenSpot-Pro, our model surpasses the base model by 6.0% and achieves competitive performance with larger models (e.g., OS-Atlas-7B), which are trained via supervised fine-tuning (SFT) on 76K data. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain.
中文摘要
最近的DeepSeek-R1展示了通过基于规则奖励的强化学习(RL)在LLM中激发推理能力。在这一思路的基础上,我们率先探索基于规则的RL如何增强多模态大语言模型(MLLM)在图形用户界面(GUI)动作预测任务上的推理能力。为此,我们整理了一个小而高质量的数据集,包含136个具有挑战性的任务,涵盖移动设备上的五种常见动作类型。我们还引入了统一的基于规则的动作奖励,使模型能够通过基于策略的算法(例如组相对策略优化,GRPO)进行优化。实验结果表明,我们提出的数据高效模型UI-R1-3B在域内(ID)和域外(OOD)任务上都取得了显著提升。具体而言,在ID基准AndroidControl上,与基础模型(即Qwen2.5-VL-3B)相比,动作类型准确率提高了15%,grounding(定位)准确率提高了10.3%。在OOD GUI grounding基准ScreenSpot-Pro上,我们的模型超过基础模型6.0%,并达到了与在76K数据上通过监督微调(SFT)训练的更大模型(例如OS-Atlas-7B)相当的竞争力。这些结果凸显了基于规则的强化学习在推进GUI理解与控制方面的潜力,为该领域的未来研究铺平了道路。
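摘要中"统一的基于规则的动作奖励"大致可以理解为:动作类型是否匹配、点击坐标是否落在目标元素内等规则项之和。下面是一个简化示意(非论文的官方奖励定义,字段名均为假设):

```python
def gui_action_reward(pred, gold):
    """示意:GUI 动作奖励 = 动作类型匹配 + 点击坐标落在目标框内(规则与字段均为假设)。"""
    type_reward = 1.0 if pred["type"] == gold["type"] else 0.0
    coord_reward = 0.0
    if pred["type"] == "click" and gold.get("bbox"):
        x, y = pred["coord"]
        x1, y1, x2, y2 = gold["bbox"]
        coord_reward = 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0
    return type_reward + coord_reward

pred = {"type": "click", "coord": (105, 232)}
gold = {"type": "click", "bbox": (90, 210, 160, 250)}
print(gui_action_reward(pred, gold))   # 2.0
```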
立场论文:交互式生成视频作为下一代游戏引擎
- 标题: Position: Interactive Generative Video as Next-Generation Game Engine
- 作者: Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu
- 日期: 2025-03-21
- ArXiv主页: https://arxiv.org/abs/2503.17359
- 论文链接: https://arxiv.org/pdf/2503.17359
英文摘要
Modern game development faces significant challenges in creativity and cost due to predetermined content in traditional game engines. Recent breakthroughs in video generation models, capable of synthesizing realistic and interactive virtual environments, present an opportunity to revolutionize game creation. In this position paper, we propose Interactive Generative Video (IGV) as the foundation for Generative Game Engines (GGE), enabling unlimited novel content generation in next-generation gaming. GGE leverages IGV’s unique strengths in unlimited high-quality content synthesis, physics-aware world modeling, user-controlled interactivity, long-term memory capabilities, and causal reasoning. We present a comprehensive framework detailing GGE’s core modules and a hierarchical maturity roadmap (L0-L4) to guide its evolution. Our work charts a new course for game development in the AI era, envisioning a future where AI-powered generative systems fundamentally reshape how games are created and experienced.
中文摘要
由于传统游戏引擎依赖预先确定的内容,现代游戏开发在创造力和成本方面面临重大挑战。视频生成模型的最新突破能够合成逼真且可交互的虚拟环境,为变革游戏创作带来了机遇。在这篇立场论文中,我们提出交互式生成视频(IGV)作为生成式游戏引擎(GGE)的基础,从而在下一代游戏中实现无限的新内容生成。GGE利用IGV在无限高质量内容合成、物理感知的世界建模、用户可控的交互性、长期记忆能力和因果推理方面的独特优势。我们提出了一个详细阐述GGE核心模块的综合框架,以及指导其演进的分级成熟度路线图(L0-L4)。我们的工作为AI时代的游戏开发开辟了新方向,并展望了一个由AI驱动的生成系统从根本上重塑游戏创作与体验方式的未来。
Wan:开放且先进的大规模视频生成模型
- 标题: Wan: Open and Advanced Large-Scale Video Generative Models
- 作者: WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wang, Tianyi Gui, Tingyu Weng, Tong Shen, Wei Lin, Wei Wang, Wei Wang, Wenmeng Zhou, Wente Wang, Wenting Shen, Wenyuan Yu, Xianzhong Shi, Xiaoming Huang, Xin Xu, Yan Kou, Yangyu Lv, Yifei Li, Yijing Liu, Yiming Wang, Yingya Zhang, Yitong Huang, Yong Li, You Wu, Yu Liu, Yulin Pan, Yun Zheng, Yuntao Hong, Yupeng Shi, Yutong Feng, Zeyinzi Jiang, Zhen Han, Zhi-Fan Wu, Ziyu Liu
- 日期: 2025-03-26
- ArXiv主页: https://arxiv.org/abs/2503.20314
- 论文链接: https://arxiv.org/pdf/2503.20314
英文摘要
This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model’s performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at https://github.com/Wan-Video/Wan2.1.
中文摘要
本报告介绍了Wan,这是一套全面且开放的视频基础模型,旨在突破视频生成的边界。Wan建立在主流的扩散Transformer范式之上,通过一系列创新实现了生成能力的重大进步,包括我们新颖的VAE、可扩展的预训练策略、大规模数据整理以及自动化评估指标。这些贡献共同提升了模型的性能和通用性。具体而言,Wan具有四个关键特征。领先性能:Wan的14B模型在包含数十亿图像和视频的庞大数据集上训练,展示了视频生成在数据和模型规模两方面的缩放定律;它在多个内部和外部基准上始终优于现有开源模型以及最先进的商业方案,展现出明显且显著的性能优势。全面性:Wan提供1.3B和14B两个参数规模的模型,分别侧重效率与效果;它还覆盖多种下游应用,包括图像到视频、指令引导的视频编辑以及个性化视频生成,最多涵盖八项任务。消费级效率:1.3B模型展现出出色的资源效率,仅需8.19 GB显存,可兼容广泛的消费级GPU。开放性:我们开源了整个Wan系列,包括源代码和所有模型,以促进视频生成社区的发展。这种开放旨在显著扩展行业视频制作的创作可能性,并为学术界提供高质量的视频基础模型。所有代码和模型见 https://github.com/Wan-Video/Wan2.1。
Gemma 3技术报告
- 标题: Gemma 3 Technical Report
- 作者: Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin, Robert Busa-Fekete, Alex Feng, Noveen Sachdeva, Benjamin Coleman, Yi Gao, Basil Mustafa, Iain Barr, Emilio Parisotto, David Tian, Matan Eyal, Colin Cherry, Jan-Thorsten Peter, Danila Sinopalnikov, Surya Bhupatiraju, Rishabh Agarwal, Mehran Kazemi, Dan Malkin, Ravin Kumar, David Vilar, Idan Brusilovsky, Jiaming Luo, Andreas Steiner, Abe Friesen, Abhanshu Sharma, Abheesht Sharma, Adi Mayrav Gilady, Adrian Goedeckemeyer, Alaa Saade, Alex Feng, Alexander Kolesnikov, Alexei Bendebury, Alvin Abdagic, Amit Vadi, András György, André Susano Pinto, Anil Das, Ankur Bapna, Antoine Miech, Antoine Yang, Antonia Paterson, Ashish Shenoy, Ayan Chakrabarti, Bilal Piot, Bo Wu, Bobak Shahriari, Bryce Petrini, Charlie Chen, Charline Le Lan, Christopher A. Choquette-Choo, CJ Carey, Cormac Brick, Daniel Deutsch, Danielle Eisenbud, Dee Cattle, Derek Cheng, Dimitris Paparas, Divyashree Shivakumar Sreepathihalli, Doug Reid, Dustin Tran, Dustin Zelle, Eric Noland, Erwin Huizenga, Eugene Kharitonov, Frederick Liu, Gagik Amirkhanyan, Glenn Cameron, Hadi Hashemi, Hanna Klimczak-Plucińska, Harman Singh, Harsh Mehta, Harshal Tushar Lehri, Hussein Hazimeh, Ian Ballantyne, Idan Szpektor, Ivan Nardini, Jean Pouget-Abadie, Jetha Chan, Joe Stanton, John Wieting, Jonathan Lai, Jordi Orbay, Joseph Fernandez, Josh Newlan, Ju-yeong Ji, Jyotinder Singh, Kat Black, Kathy Yu, Kevin Hui, Kiran Vodrahalli, Klaus Greff, Linhai Qiu, Marcella Valentine, Marina Coelho, Marvin Ritter, Matt Hoffman, Matthew Watson, Mayank Chaturvedi, Michael Moynihan, Min Ma, Nabila Babar, Natasha Noy, Nathan Byrd, Nick Roy, Nikola Momchev, Nilay Chauhan, Noveen Sachdeva, Oskar Bunyan, Pankil Botarda, Paul Caron, Paul Kishan Rubenstein, Phil Culliton, Philipp Schmid, Pier Giuseppe Sessa, Pingmei Xu, Piotr Stanczyk, Pouya Tafti, Rakesh Shivanna, Renjie Wu, Renke Pan, Reza Rokni, Rob Willoughby, Rohith Vallu, Ryan Mullins, Sammy Jerome, Sara Smoot, Sertan Girgin, Shariq Iqbal, Shashir Reddy, Shruti Sheth, Siim Põder, Sijal Bhatnagar, Sindhu Raghuram Panyam, Sivan Eiger, Susan Zhang, Tianqi Liu, Trevor Yacovone, Tyler Liechty, Uday Kalra, Utku Evci, Vedant Misra, Vincent Roseberry, Vlad Feinberg, Vlad Kolesnikov, Woohyun Han, Woosuk Kwon, Xi Chen, Yinlam Chow, Yuvein Zhu, Zichuan Wei, Zoltan Egyed, Victor Cotruta, Minh Giang, Phoebe Kirk, Anand Rao, Kat Black, Nabila Babar, Jessica Lo, Erica Moreira, Luiz Gustavo Martins, Omar Sanseviero, Lucas Gonzalez, Zach Gleicher, Tris Warkentin, Vahab Mirrokni, Evan Senter, Eli Collins, Joelle Barral, Zoubin Ghahramani, Raia Hadsell, Yossi Matias, D. Sculley, Slav Petrov, Noah Fiedel, Noam Shazeer, Oriol Vinyals, Jeff Dean, Demis Hassabis, Koray Kavukcuoglu, Clement Farabet, Elena Buchatskaya, Jean-Baptiste Alayrac, Rohan Anil, Dmitry, Lepikhin, Sebastian Borgeaud, Olivier Bachem, Armand Joulin, Alek Andreev, Cassidy Hardin, Robert Dadashi, Léonard Hussenot
- 日期: 2025-03-25
- ArXiv主页: https://arxiv.org/abs/2503.19786
- 论文链接: https://arxiv.org/pdf/2503.19786
英文摘要
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
中文摘要
我们介绍Gemma 3,这是Gemma轻量级开放模型家族的多模态新成员,参数规模从1B到27B不等。此版本引入了视觉理解能力、更广的语言覆盖以及更长的上下文——至少128K token。我们还修改了模型架构,以减少长上下文下容易急剧膨胀的KV缓存内存:通过提高局部注意力层与全局注意力层的比例,并保持局部注意力的跨度较短来实现。Gemma 3模型通过蒸馏训练,在预训练和指令微调版本上均取得了优于Gemma 2的性能。特别地,我们新颖的后训练方案显著提升了数学、对话、指令遵循和多语言能力,使Gemma3-4B-IT可与Gemma2-27B-IT竞争,而Gemma3-27B-IT在各基准上可与Gemini-1.5-Pro相媲美。我们向社区发布了所有模型。
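摘要提到通过提高局部(滑动窗口)注意力层与全局注意力层的比例来控制长上下文下的KV缓存。下面用一个小计算示意这种做法对显存的影响(层数、头数、窗口大小、比例等均为示意用的假设值,并非Gemma 3的官方配置):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   local_ratio=5 / 6, window=1024, dtype_bytes=2):
    """估算 KV 缓存字节数:局部注意力层只需缓存窗口内的 KV(参数均为假设)。"""
    n_local = int(n_layers * local_ratio)
    n_global = n_layers - n_local
    per_token = 2 * n_kv_heads * head_dim * dtype_bytes        # K 和 V 各一份
    local_bytes = n_local * min(window, context_len) * per_token
    global_bytes = n_global * context_len * per_token
    return local_bytes + global_bytes

full = kv_cache_bytes(34, 8, 128, 128_000, local_ratio=0.0)    # 全部为全局注意力
mixed = kv_cache_bytes(34, 8, 128, 128_000)                    # 局部:全局约为 5:1
print(f"{full / 2**30:.1f} GiB -> {mixed / 2**30:.1f} GiB")
```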
MAPS:基于大七人格与苏格拉底式引导的多模态科学问题求解多智能体框架
- 标题: MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving
- 作者: Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Xinyu Zhang, Fangzhi Xu, Qika Lin, Rui Mao, Erik Cambria, Jun Liu
- 日期: 2025-03-21
- ArXiv主页: https://arxiv.org/abs/2503.16905
- 论文链接: https://arxiv.org/pdf/2503.16905
英文摘要
Multimodal scientific problems (MSPs) involve complex issues that require the integration of multiple modalities, such as text and diagrams, presenting a significant challenge in artificial intelligence. While progress has been made in addressing traditional scientific problems, MSPs still face two primary issues: the challenge of multi-modal comprehensive reasoning in scientific problem-solving and the lack of reflective and rethinking capabilities. To address these issues, we introduce a Multi-Agent framework based on the Big Seven Personality and Socratic guidance (MAPS). This framework employs seven distinct agents that leverage feedback mechanisms and the Socratic method to guide the resolution of MSPs. To tackle the first issue, we propose a progressive four-agent solving strategy, where each agent focuses on a specific stage of the problem-solving process. For the second issue, we introduce a Critic agent, inspired by Socratic questioning, which prompts critical thinking and stimulates autonomous learning. We conduct extensive experiments on the EMMA, Olympiad, and MathVista datasets, achieving promising results that outperform the current SOTA model by 15.84% across all tasks. Meanwhile, the additional analytical experiments also verify the model’s progress as well as generalization ability.
中文摘要
多模态科学问题(MSP)涉及需要整合文本和图表等多种模态的复杂问题,对人工智能提出了重大挑战。尽管在解决传统科学问题方面已取得进展,MSP仍面临两个主要问题:科学问题求解中多模态综合推理的挑战,以及缺乏反思与重新思考的能力。为了解决这些问题,我们提出了一个基于大七人格和苏格拉底式引导的多智能体框架(MAPS)。该框架采用七个不同的智能体,利用反馈机制和苏格拉底方法来指导MSP的求解。针对第一个问题,我们提出了渐进式的四智能体求解策略,每个智能体专注于问题求解过程的特定阶段。针对第二个问题,我们引入了一个受苏格拉底提问启发的Critic智能体,以激发批判性思维和自主学习。我们在EMMA、Olympiad和MathVista数据集上进行了大量实验,取得了令人鼓舞的结果,在所有任务上比当前SOTA模型高出15.84%。同时,额外的分析实验也验证了模型的先进性和泛化能力。
Dita:面向通用视觉-语言-动作策略的可扩展扩散Transformer
- 标题: Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
- 作者: Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, Yuntao Chen
- 日期: 2025-03-25
- ArXiv主页: https://arxiv.org/abs/2503.19757
- 论文链接: https://arxiv.org/pdf/2503.19757
- 项目链接: http://robodita.github.io/
- gitHub仓库: https://github.com/RoboDita/Dita
英文摘要
While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. Departing from prior methods that condition denoising on fused embeddings via shallow networks, Dita employs in-context conditioning – enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. This design explicitly models action deltas and environmental nuances. By scaling the diffusion action denoiser alongside the Transformer’s scalability, Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces. Such synergy enhances robustness against various variances and facilitates the successful execution of long-horizon tasks. Evaluations across extensive benchmarks demonstrate state-of-the-art or comparative performance in simulation. Notably, Dita achieves robust real-world adaptation to environmental variances and complex long-horizon tasks through 10-shot finetuning, using only third-person camera inputs. The architecture establishes a versatile, lightweight and open-source baseline for generalist robot policy learning. Project Page: https://robodita.github.io.
中文摘要
尽管近期在多样化机器人数据集上训练的视觉-语言-动作模型在域内数据有限的情况下展现出可观的泛化能力,但它们依赖紧凑的动作头来预测离散或连续动作,限制了对异构动作空间的适应性。我们提出了Dita,这是一个可扩展的框架,利用Transformer架构,通过统一的多模态扩散过程直接对连续动作序列去噪。与先前通过浅层网络以融合嵌入为条件进行去噪的方法不同,Dita采用上下文内条件(in-context conditioning),实现去噪动作与历史观测中原始视觉token之间的细粒度对齐。该设计显式建模了动作增量和环境细微差别。通过将扩散动作去噪器与Transformer的可扩展性一同扩展,Dita有效地整合了跨不同相机视角、观测场景、任务和动作空间的跨具身数据集。这种协同增强了对各种差异的鲁棒性,并促进了长时程任务的成功执行。在大量基准上的评估表明其在仿真中达到了最先进或可比的性能。值得注意的是,仅使用第三人称相机输入,Dita通过10-shot微调即可稳健地适应真实世界中的环境差异和复杂的长时程任务。该架构为通用机器人策略学习建立了一个多功能、轻量且开源的基线。项目页面:https://robodita.github.io。
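下面用一个高度简化的训练片段示意"以多模态条件对连续动作序列做扩散式去噪"的思路(非Dita官方实现:网络结构、加噪方式和条件注入都是简化假设,真实方法采用上下文内条件与更大的Transformer去噪器):

```python
import torch
import torch.nn as nn

class ActionDenoiser(nn.Module):
    """简化示意:给定条件特征与时间步,预测加到动作块上的噪声(非官方实现)。"""
    def __init__(self, action_dim=7, cond_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, noisy_actions, cond, t):
        # noisy_actions: (B, H, A);cond: (B, cond_dim) 来自观测/语言的条件;t: (B,) 归一化时间步
        B, H, _ = noisy_actions.shape
        c = cond.unsqueeze(1).expand(B, H, -1)
        tt = t.view(B, 1, 1).expand(B, H, 1)
        return self.net(torch.cat([noisy_actions, c, tt], dim=-1))

model = ActionDenoiser()
actions = torch.randn(4, 16, 7)            # 干净的动作块(假设 horizon=16,7 维动作)
cond = torch.randn(4, 512)                 # 假设的多模态条件特征
t = torch.rand(4)
noise = torch.randn_like(actions)
noisy = (1 - t.view(-1, 1, 1)) * actions + t.view(-1, 1, 1) * noise   # 简化的线性加噪
loss = ((model(noisy, cond, t) - noise) ** 2).mean()                  # 预测噪声的 MSE
loss.backward()
print(loss.item())
```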
长上下文语言建模全面综述
- 标题: A Comprehensive Survey on Long Context Language Modeling
- 作者: Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, Tianyu Liu, Fanyu Meng, Wenbo Su, Yingshui Tan, Zili Wang, Jian Yang, Wei Ye, Bo Zheng, Wangchunshu Zhou, Wenhao Huang, Sujian Li, Zhaoxiang Zhang
- 日期: 2025-03-20
- ArXiv主页: https://arxiv.org/abs/2503.17407
- 论文链接: https://arxiv.org/pdf/2503.17407
- 项目链接: https://github.com/LCLM-Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling
- gitHub仓库: https://github.com/LCLM-Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling
英文摘要
Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-context modeling for large language models. Our survey is structured around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy LCLMs efficiently, and how to evaluate and analyze LCLMs comprehensively. For the first aspect, we discuss data strategies, architectural designs, and workflow approaches oriented with long context processing. For the second aspect, we provide a detailed examination of the infrastructure required for LCLM training and inference. For the third aspect, we present evaluation paradigms for long-context comprehension and long-form generation, as well as behavioral analysis and mechanism interpretability of LCLMs. Beyond these three key aspects, we thoroughly explore the diverse application scenarios where existing LCLMs have been deployed and outline promising future development directions. This survey provides an up-to-date review of the literature on long-context LLMs, which we wish to serve as a valuable resource for both researchers and engineers. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/LCLM-Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling.
中文摘要
高效处理长上下文一直是自然语言处理领域的持续追求。随着长文档、对话和其他文本数据数量的不断增长,开发能够有效且高效地处理和分析大量输入的长上下文语言模型(LCLM)变得十分重要。在本文中,我们对大语言模型长上下文建模的最新进展进行了全面综述。我们的综述围绕三个关键方面展开:如何获得有效且高效的LCLM、如何高效地训练和部署LCLM,以及如何全面地评估和分析LCLM。对于第一个方面,我们讨论了面向长上下文处理的数据策略、架构设计和工作流方法。对于第二个方面,我们详细考察了LCLM训练和推理所需的基础设施。对于第三个方面,我们给出了长上下文理解与长文本生成的评估范式,以及LCLM的行为分析和机制可解释性。除这三个关键方面之外,我们还深入探讨了现有LCLM已部署的多样应用场景,并勾勒出有前景的未来发展方向。这项综述提供了关于长上下文LLM文献的最新回顾,我们希望它能成为研究人员和工程师的宝贵资源。相关的GitHub仓库收集了最新的论文和代码库:https://github.com/LCLM-Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling
Open Deep Search:通过开源推理智能体实现搜索民主化
- 标题: Open Deep Search: Democratizing Search with Open-source Reasoning Agents
- 作者: Salaheddin Alzubi, Creston Brooks, Purva Chiniya, Edoardo Contente, Chiara von Gerlach, Lucas Irwin, Yihan Jiang, Arda Kaz, Windsor Nguyen, Sewoong Oh, Himanshu Tyagi, Pramod Viswanath
- 日期: 2025-03-26
- ArXiv主页: https://arxiv.org/abs/2503.20201
- 论文链接: https://arxiv.org/pdf/2503.20201
- gitHub仓库: https://github.com/sentient-agi/OpenDeepSearch
英文摘要
We introduce Open Deep Search (ODS) to close the increasing gap between the proprietary search AI solutions, such as Perplexity’s Sonar Reasoning Pro and OpenAI’s GPT-4o Search Preview, and their open-source counterparts. The main innovation introduced in ODS is to augment the reasoning capabilities of the latest open-source LLMs with reasoning agents that can judiciously use web search tools to answer queries. Concretely, ODS consists of two components that work with a base LLM chosen by the user: Open Search Tool and Open Reasoning Agent. Open Reasoning Agent interprets the given task and completes it by orchestrating a sequence of actions that includes calling tools, one of which is the Open Search Tool. Open Search Tool is a novel web search tool that outperforms proprietary counterparts. Together with powerful open-source reasoning LLMs, such as DeepSeek-R1, ODS nearly matches and sometimes surpasses the existing state-of-the-art baselines on two benchmarks: SimpleQA and FRAMES. For example, on the FRAMES evaluation benchmark, ODS improves the best existing baseline of the recently released GPT-4o Search Preview by 9.7% in accuracy. ODS is a general framework for seamlessly augmenting any LLMs – for example, DeepSeek-R1 that achieves 82.4% on SimpleQA and 30.1% on FRAMES – with search and reasoning capabilities to achieve state-of-the-art performance: 88.3% on SimpleQA and 75.3% on FRAMES.
中文摘要
我们提出开放深度搜索(Open Deep Search, ODS),以缩小专有搜索AI解决方案(如Perplexity的Sonar Reasoning Pro和OpenAI的GPT-4o Search Preview)与开源方案之间日益扩大的差距。ODS的主要创新在于用推理代理来增强最新开源LLM的推理能力,这些代理能够审慎地使用网络搜索工具来回答查询。具体而言,ODS由两个组件组成,它们与用户选定的基础LLM协同工作:开放搜索工具(Open Search Tool)和开放推理代理(Open Reasoning Agent)。开放推理代理解释给定任务,并通过编排一系列动作(其中包括调用工具)来完成任务,开放搜索工具就是其中之一。开放搜索工具是一种新颖的网络搜索工具,性能优于专有工具。与DeepSeek-R1等强大的开源推理LLM结合,ODS在SimpleQA和FRAMES两个基准上几乎追平、有时甚至超越现有最先进的基线。例如,在FRAMES评测基准上,ODS比最近发布的GPT-4o Search Preview这一最佳现有基线在准确率上高出9.7%。ODS是一个可以无缝增强任何LLM的通用框架——例如在SimpleQA上取得82.4%、在FRAMES上取得30.1%的DeepSeek-R1——通过搜索和推理能力达到最先进的性能:SimpleQA上88.3%,FRAMES上75.3%。
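摘要中"开放推理代理编排一系列动作并调用搜索工具"对应一个典型的工具调用循环。下面是该控制流的示意(llm、search_tool 等接口均为假设的桩函数,与ODS的实际实现无关):

```python
def open_search_agent(query, llm, search_tool, max_steps=4):
    """示意:推理代理循环决定"继续搜索"还是"给出答案"(接口均为假设)。"""
    context = f"问题: {query}\n"
    for _ in range(max_steps):
        action = llm(context)                       # 假设 LLM 返回 {"type": "search"/"answer", ...}
        if action["type"] == "search":
            results = search_tool(action["query"])
            context += f"搜索[{action['query']}] -> {results}\n"
        else:
            return action["answer"]
    return llm(context + "请直接给出最终答案。")["answer"]

# 用桩函数演示控制流
fake_llm_calls = iter([
    {"type": "search", "query": "FRAMES benchmark"},
    {"type": "answer", "answer": "FRAMES 是一个多跳检索问答基准。"},
])
answer = open_search_agent(
    "什么是 FRAMES?",
    llm=lambda ctx: next(fake_llm_calls),
    search_tool=lambda q: "FRAMES: 多跳事实检索与推理评测集",
)
print(answer)
```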
MARS:一个融合苏格拉底式引导的自动提示优化多智能体框架
- 标题: MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization
- 作者: Jian Zhang, Zhangqi Wang, Haiping Zhu, Jun Liu, Qika Lin, Erik Cambria
- 日期: 2025-03-21
- ArXiv主页: https://arxiv.org/abs/2503.16874
- 论文链接: https://arxiv.org/pdf/2503.16874
- gitHub仓库: https://github.com/exoskeletonzj/MARS
英文摘要
The basic question-answering format of large language models involves inputting a prompt and receiving a response, and the quality of the prompt directly impacts the effectiveness of the response. Automated Prompt Optimization (APO) aims to break free from the cognitive biases of manually designed prompts and explores a broader design space for prompts. However, existing APO methods suffer from limited flexibility of fixed templates and inefficient search in prompt spaces as key issues. To this end, we propose a Multi-Agent framework Incorporating Socratic guidance (MARS), which utilizes multi-agent fusion technology for automatic planning, with gradual continuous optimization and evaluation. Specifically, MARS comprises seven agents, each with distinct functionalities, which autonomously use the Planner to devise an optimization path that ensures flexibility. Additionally, it employs a Teacher-Critic-Student Socratic dialogue pattern to iteratively optimize the prompts while conducting effective search. We conduct extensive experiments on various datasets to validate the effectiveness of our method, and perform additional analytical experiments to assess the model’s advancement as well as the interpretability.
中文摘要
大语言模型的基本问答形式是输入提示并接收回答,而提示的质量直接影响回答的效果。自动提示优化(APO)旨在摆脱人工设计提示的认知偏差,为提示探索更广阔的设计空间。然而,现有APO方法存在两个关键问题:固定模板的灵活性有限,以及在提示空间中的搜索效率低下。为此,我们提出了一个融合苏格拉底式引导的多智能体框架(MARS),它利用多智能体融合技术进行自动规划,并进行逐步的持续优化与评估。具体而言,MARS由七个功能各异的智能体组成,它们自主地使用Planner来设计确保灵活性的优化路径。此外,它采用教师-评论家-学生(Teacher-Critic-Student)的苏格拉底式对话模式,在进行有效搜索的同时迭代优化提示。我们在多个数据集上进行了大量实验以验证方法的有效性,并通过额外的分析实验评估了模型的先进性和可解释性。
将视觉预训练扩展到4K分辨率
- 标题: Scaling Vision Pre-Training to 4K Resolution
- 作者: Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, Hongxu Yin
- 日期: 2025-03-25
- ArXiv主页: https://arxiv.org/abs/2503.19903
- 论文链接: https://arxiv.org/pdf/2503.19903
- 项目链接: https://nvlabs.github.io/PS3/
- gitHub仓库: https://github.com/NVlabs/PS3
英文摘要
High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378 x 378 pixels) due to the quadratic cost of processing larger images. We introduce PS3 that scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of contrastive learning on global image representation, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 is able to both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or relevance to a text prompt. When applying PS3 to multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training such as AnyRes and S^2 while using up to 4.3x fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to state of the arts, VILA-HD outperforms previous MLLMs such as NVILA and Qwen2-VL across multiple benchmarks and achieves better efficiency than latest token pruning approaches. Finally, we find current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 14.5% improvement over GPT-4o, and a 3.2% improvement and 2.96x speedup over Qwen2-VL.
中文摘要
对视觉细节的高分辨率感知对于日常任务至关重要。然而,由于处理更大图像的二次方成本,当前的视觉预训练仍局限于低分辨率(例如378 x 378像素)。我们提出了PS3,以接近恒定的成本将CLIP式视觉预训练扩展到4K分辨率。PS3不是在全局图像表示上做对比学习,而是通过选择性地处理局部区域并与局部细粒度描述做对比来进行预训练,从而在大幅降低计算开销的同时实现高分辨率表示学习。预训练好的PS3既能以低分辨率编码全局图像,又能根据显著性或与文本提示的相关性,选择性地处理局部高分辨率区域。将PS3应用于多模态LLM(MLLM)后,所得模型(称为VILA-HD)与没有高分辨率视觉预训练的基线(如AnyRes和S^2)相比显著提升了高分辨率视觉感知,同时最多减少4.3倍的token用量。PS3还为VILA-HD解锁了颇具吸引力的扩展特性,包括近乎免费地提升分辨率,以及通过增加测试时计算获得更好的性能。与现有方法相比,VILA-HD在多个基准上优于之前的MLLM(如NVILA和Qwen2-VL),并且比最新的token剪枝方法效率更高。最后,我们发现当前的基准测试并不要求4K分辨率的感知能力,这促使我们提出了4KPro,一个全新的4K分辨率图像问答基准测试。在4KPro上,VILA-HD的表现优于所有以往的MLLM,包括比GPT-4o提高了14.5%,比Qwen2-VL提高了3.2%,速度提升了2.96倍。
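摘要中"根据显著性选择性处理局部高分辨率区域"可以粗略地理解为按分数取top-k个局部patch送入编码器。下面是一个示意(非官方实现,patch数量、尺寸与分数来源均为假设):

```python
import torch

def select_highres_patches(saliency, patches, k=256):
    """示意:按显著性分数选出少量高分辨率局部 patch,其余跳过(非官方实现)。"""
    # saliency: (N,) 每个局部 patch 的显著性;patches: (N, C, H, W)
    topk = torch.topk(saliency, k=min(k, saliency.numel())).indices
    return patches[topk], topk

saliency = torch.rand(4096)                    # 假设 4K 图被切成 4096 个局部 patch
patches = torch.randn(4096, 3, 56, 56)
selected, idx = select_highres_patches(saliency, patches, k=256)
print(selected.shape)                          # torch.Size([256, 3, 56, 56])
```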
RoboFactory:探索具有组合约束的具身智能体协作
- 标题: RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
- 作者: Yiran Qin, Li Kang, Xiufeng Song, Zhenfei Yin, Xiaohong Liu, Xihui Liu, Ruimao Zhang, Lei Bai
- 日期: 2025-03-20
- ArXiv主页: https://arxiv.org/abs/2503.16408
- 论文链接: https://arxiv.org/pdf/2503.16408
- 项目链接: https://iranqin.github.io/robofactory/
英文摘要
Designing effective embodied multi-agent systems is critical for solving complex real-world tasks across domains. Due to the complexity of multi-agent embodied systems, existing methods fail to automatically generate safe and efficient training data for such systems. To this end, we propose the concept of compositional constraints for embodied multi-agent systems, addressing the challenges arising from collaboration among embodied agents. We design various interfaces tailored to different types of constraints, enabling seamless interaction with the physical world. Leveraging compositional constraints and specifically designed interfaces, we develop an automated data collection framework for embodied multi-agent systems and introduce the first benchmark for embodied multi-agent manipulation, RoboFactory. Based on RoboFactory benchmark, we adapt and evaluate the method of imitation learning and analyzed its performance in different difficulty agent tasks. Furthermore, we explore the architectures and training strategies for multi-agent imitation learning, aiming to build safe and efficient embodied multi-agent systems.
中文摘要
设计有效的具身多智能体系统对于解决跨领域的复杂现实任务至关重要。由于具身多智能体系统的复杂性,现有方法无法为此类系统自动生成安全且高效的训练数据。为此,我们提出了具身多智能体系统的组合约束概念,以应对具身智能体之间协作带来的挑战。我们针对不同类型的约束设计了相应的接口,实现与物理世界的无缝交互。利用组合约束和专门设计的接口,我们为具身多智能体系统开发了一个自动数据收集框架,并提出了首个具身多智能体操作基准RoboFactory。基于RoboFactory基准,我们适配并评估了模仿学习方法,分析了其在不同难度智能体任务中的表现。此外,我们还探索了多智能体模仿学习的架构与训练策略,旨在构建安全高效的具身多智能体系统。
挑战推理的界限:大型语言模型的奥林匹克级数学基准
- 标题: Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
- 作者: Haoxiang Sun, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, Lei Fang, Ji-Rong Wen
- 日期: 2025-03-27
- ArXiv主页: https://arxiv.org/abs/2503.21380
- 论文链接: https://arxiv.org/pdf/2503.21380
- gitHub仓库: https://github.com/RUCAIBox/OlymMATH
英文摘要
In recent years, the rapid development of large reasoning models has resulted in the saturation of existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark, designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two distinct difficulty tiers: (1) AIME-level problems (easy) that establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard) designed to push the boundaries of current state-of-the-art models. In our benchmark, these problems span four core mathematical fields, each including a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge presented by OlymMATH, with state-of-the-art models including DeepSeek-R1 and OpenAI’s o3-mini demonstrating notably limited accuracy on the hard subset. Furthermore, the benchmark facilitates comprehensive bilingual assessment of mathematical reasoning abilities-a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the OlymMATH benchmark at the STILL project: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.
中文摘要
近年来,大型推理模型的快速发展导致现有数学推理评测基准趋于饱和,凸显出对更具挑战性和严格性的评估框架的迫切需求。为填补这一空白,我们提出了OlymMATH,一个新颖的奥林匹克级数学基准,旨在严格测试LLM的复杂推理能力。OlymMATH包含200个精心整理的问题,每个问题均经过人工验证,并提供平行的英文和中文版本。这些问题被系统地划分为两个不同的难度层级:(1)AIME级问题(简单),为数学推理评估建立基线;(2)难度显著更高的问题(困难),旨在挑战当前最先进模型的极限。在我们的基准中,这些问题涵盖四个核心数学领域,每个问题都包含可验证的数值答案,以支持客观的、基于规则的评估。实证结果凸显了OlymMATH带来的巨大挑战:包括DeepSeek-R1和OpenAI的o3-mini在内的最先进模型在困难子集上的准确率明显有限。此外,该基准支持对数学推理能力的全面双语评估——这是主流数学推理基准中基本未被覆盖的关键维度。我们在STILL项目中发布了OlymMATH基准:https://github.com/RUCAIBox/Slow_Thinking_with_LLMs。
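摘要强调每道题都有可验证的数值答案,便于客观的、基于规则的判分。下面是数值答案核验的一个简化示意(非官方评测脚本,容差等均为假设):

```python
from fractions import Fraction

def check_numeric_answer(pred: str, gold: str, tol: float = 1e-6) -> bool:
    """示意:解析预测与标准答案为数值并在容差内比较(非官方评测脚本)。"""
    def parse(s: str) -> float:
        s = s.strip().replace(",", "")
        try:
            return float(Fraction(s))     # 支持 "3/4"、"0.75" 等写法
        except ValueError:
            return float(s)
    try:
        return abs(parse(pred) - parse(gold)) <= tol
    except ValueError:
        return pred.strip() == gold.strip()   # 无法解析为数值时退化为字符串比较

print(check_numeric_answer("0.75", "3/4"))       # True
print(check_numeric_answer(" 2,025 ", "2025"))   # True
```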
调整大型语言模型后训练以实现多样化的创意写作
- 标题: Modifying Large Language Model Post-Training for Diverse Creative Writing
- 作者: John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yuqian Sun, Max Kreminski
- 日期: 2025-03-21
- ArXiv主页: https://arxiv.org/abs/2503.17126
- 论文链接: https://arxiv.org/pdf/2503.17126
英文摘要
As creative writing tasks do not have singular correct answers, large language models (LLMs) trained to perform these tasks should be able to generate diverse valid outputs. However, LLM post-training often focuses on improving generation quality but neglects to facilitate output diversity. Hence, in creative writing generation, we investigate post-training approaches to promote both output diversity and quality. Our core idea is to include deviation – the degree of difference between a training sample and all other samples with the same prompt – in the training objective to facilitate learning from rare high-quality instances. By adopting our approach to direct preference optimization (DPO) and odds ratio preference optimization (ORPO), we demonstrate that we can promote the output diversity of trained models while minimally decreasing quality. Our best model with 8B parameters could achieve on-par diversity as a human-created dataset while having output quality similar to the best instruction-tuned models we examined, GPT-4o and DeepSeek-R1. We further validate our approaches with a human evaluation, an ablation, and a comparison to an existing diversification approach, DivPO.
中文摘要
由于创意写作任务没有唯一正确答案,为执行这些任务而训练的大型语言模型(LLM)应当能够生成多样且有效的输出。然而,LLM的后训练通常侧重于提升生成质量,却忽视了促进输出多样性。因此,在创意写作生成中,我们研究能同时促进输出多样性和质量的后训练方法。我们的核心思想是在训练目标中引入偏差(deviation)——即训练样本与同一提示下所有其他样本之间的差异程度——以促进从稀有的高质量样本中学习。通过将该方法应用于直接偏好优化(DPO)和比值比偏好优化(ORPO),我们证明可以在几乎不降低质量的情况下提升训练模型的输出多样性。我们最好的8B参数模型能够达到与人工创建数据集相当的多样性,同时输出质量接近我们考察过的最佳指令微调模型GPT-4o和DeepSeek-R1。我们还通过人工评估、消融实验以及与现有多样化方法DivPO的比较进一步验证了我们的方法。
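摘要中的"偏差"(deviation)指训练样本与同一提示下其他样本之间的差异程度。一个常见的近似是用嵌入空间中与其他回答的平均余弦距离来刻画(论文的具体定义以原文为准)。下面是这一度量的简单示意:

```python
import numpy as np

def deviation_scores(embeddings):
    """示意:同一提示下各回答的偏差 = 与其他回答的平均余弦距离(具体定义以论文为准)。"""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sim = e @ e.T                              # 余弦相似度矩阵(对角线为 1)
    n = len(e)
    return (1 - sim).sum(axis=1) / (n - 1)     # 对其他样本取平均(自身距离为 0,不计入分母)

rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 8))                 # 假设同一提示下 4 个回答的嵌入
print(deviation_scores(embs))                  # 偏差越大的高质量样本在训练目标中权重越高
```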
LEGO-Puzzles:MLLM在多步空间推理上表现如何?
- 标题: LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
- 作者: Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan Sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, Kai Chen
- 日期: 2025-03-25
- ArXiv主页: https://arxiv.org/abs/2503.19990
- 论文链接: https://arxiv.org/pdf/2503.19990
- 项目链接: https://tangkexian.github.io/LEGO-Puzzles/
- gitHub仓库: https://github.com/Tangkexian/LEGO-Puzzles
英文摘要
Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce LEGO-Puzzles, a scalable benchmark designed to evaluate both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. In addition to VQA tasks, we evaluate MLLMs’ abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs’ spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.
中文摘要
多步空间推理需要对跨多个连续步骤的空间关系进行理解和推理,这对于处理复杂的现实应用(例如机器人操作、自主导航和自动装配)至关重要。为了评估当前多模态大语言模型(MLLM)对这一基本能力的掌握程度,我们提出了LEGO-Puzzles,一个可扩展的基准,通过基于乐高的任务评估MLLM的空间理解与顺序推理能力。LEGO-Puzzles由1,100个精心整理的视觉问答(VQA)样本组成,涵盖11个不同任务,从基本的空间理解到复杂的多步推理。基于LEGO-Puzzles,我们对最先进的MLLM进行了全面评估,发现它们在空间推理能力上存在显著局限:即使是最强大的MLLM也只能答对大约一半的测试用例,而人类参与者的准确率超过90%。除了VQA任务之外,我们还评估了MLLM按照组装示意图生成乐高图像的能力。我们的实验表明,只有Gemini-2.0-Flash和GPT-4o表现出有限的指令遵循能力,其他MLLM要么复制输入图像,要么生成完全无关的输出。总体而言,LEGO-Puzzles暴露了现有MLLM在空间理解和顺序推理能力上的关键缺陷,并强调了多模态空间推理需要进一步发展。
通过随机生成与滚动预算强制实现流模型的推理时扩展
- 标题: Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing
- 作者: Jaihoon Kim, Taehoon Yoon, Jisung Hwang, Minhyuk Sung
- 日期: 2025-03-25
- ArXiv主页: https://arxiv.org/abs/2503.19385
- 论文链接: https://arxiv.org/pdf/2503.19385
- 项目链接: https://flow-inference-time-scaling.github.io/
英文摘要
We propose an inference-time scaling approach for pretrained flow models. Recently, inference-time scaling has gained significant attention in LLMs and diffusion models, improving sample quality or better aligning outputs with user preferences by leveraging additional computation. For diffusion models, particle sampling has allowed more efficient scaling due to the stochasticity at intermediate denoising steps. On the contrary, while flow models have gained popularity as an alternative to diffusion models–offering faster generation and high-quality outputs in state-of-the-art image and video generative models–efficient inference-time scaling methods used for diffusion models cannot be directly applied due to their deterministic generative process. To enable efficient inference-time scaling for flow models, we propose three key ideas: 1) SDE-based generation, enabling particle sampling in flow models, 2) Interpolant conversion, broadening the search space and enhancing sample diversity, and 3) Rollover Budget Forcing (RBF), an adaptive allocation of computational resources across timesteps to maximize budget utilization. Our experiments show that SDE-based generation, particularly variance-preserving (VP) interpolant-based generation, improves the performance of particle sampling methods for inference-time scaling in flow models. Additionally, we demonstrate that RBF with VP-SDE achieves the best performance, outperforming all previous inference-time scaling approaches.
中文摘要
我们为预训练的流模型提出了一种推理时扩展方法。最近,推理时扩展在LLM和扩散模型中引起了极大关注,它利用额外计算来提升样本质量,或使输出更好地对齐用户偏好。对于扩散模型,由于中间去噪步骤具有随机性,粒子采样可以实现更高效的扩展。相反,尽管流模型作为扩散模型的替代方案越来越流行——在最先进的图像和视频生成模型中提供更快的生成和高质量输出——但由于其生成过程是确定性的,用于扩散模型的高效推理时扩展方法无法直接应用。为了实现流模型的高效推理时扩展,我们提出了三个关键思路:1)基于SDE的生成,使粒子采样可用于流模型;2)插值(interpolant)转换,扩大搜索空间并增强样本多样性;3)滚动预算强制(RBF),在时间步之间自适应分配计算资源以最大化预算利用率。我们的实验表明,基于SDE的生成,尤其是基于保方差(VP)插值的生成,提升了流模型推理时扩展中粒子采样方法的性能。此外,我们证明带有VP-SDE的RBF取得了最佳性能,优于所有先前的推理时扩展方法。
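摘要中的"粒子采样"依赖中间步骤的随机性:维护多个候选(粒子),在中间步骤用验证器筛选并补齐,从而把更多推理时计算用在有希望的轨迹上。下面是这一流程的抽象示意(step_fn、verifier 均为桩函数,与论文的SDE转换和RBF细节无关):

```python
import random

def particle_sampling(init_particles, step_fn, verifier, n_steps, keep_top=2):
    """示意:每一步推进所有粒子,按验证器分数保留前 keep_top 个并复制补齐(非官方实现)。"""
    particles = list(init_particles)
    for t in range(n_steps):
        particles = [step_fn(p, t) for p in particles]       # 随机生成(SDE)让副本逐步分化
        survivors = sorted(particles, key=verifier, reverse=True)[:keep_top]
        particles = [survivors[i % keep_top] for i in range(len(particles))]
    return max(particles, key=verifier)

rng = random.Random(0)
best = particle_sampling(
    init_particles=[0.0] * 8,
    step_fn=lambda x, t: x + rng.gauss(0, 1),   # 桩:用随机游走代替真实的 SDE 生成步
    verifier=lambda x: -abs(x - 3.0),           # 桩:离 3 越近分数越高
    n_steps=20,
)
print(round(best, 2))
```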
桥接连续和离散令牌以进行自回归视觉生成
- 标题: Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
- 作者: Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, Xihui Liu
- 日期: 2025-03-20
- ArXiv主页: https://arxiv.org/abs/2503.16430
- 论文链接: https://arxiv.org/pdf/2503.16430
- 项目链接: https://yuqingwang1029.github.io/TokenBridge/
- gitHub仓库: https://github.com/YuqingWang1029/TokenBridge
英文摘要
Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that directly obtains discrete tokens from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently model the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling. Project page: https://yuqingwang1029.github.io/TokenBridge.
中文摘要
自回归视觉生成模型通常依赖令牌器(tokenizer)把图像压缩为可逐个预测的令牌。令牌表示存在一个根本性的两难:离散令牌可以用标准交叉熵损失直接建模,但存在信息损失且令牌器训练不稳定;连续令牌能更好地保留视觉细节,却需要复杂的分布建模,使生成流程变得复杂。本文提出 TokenBridge,在保留离散令牌建模简单性的同时保持连续令牌的强大表示能力,从而弥合这一差距。为此,我们通过训练后量化(post-training quantization)把离散化与令牌器的训练过程解耦,直接从连续表示中获得离散令牌。具体而言,我们引入一种逐维度量化策略,对每个特征维度独立离散化,并配合一个轻量级的自回归预测机制,高效地对由此产生的巨大令牌空间建模。大量实验表明,我们的方法在只使用标准分类预测的情况下,取得了与连续方法相当的重建与生成质量。这项工作表明,桥接离散与连续两种范式能够有效结合二者的优势,为基于简单自回归建模的高质量视觉生成提供了有前景的方向。项目页面:https://yuqingwang1029.github.io/TokenBridge。
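下面用一个小的 NumPy 草图示意“逐维度训练后量化”的基本想法(仅为基于摘要的假设性示例,分箱数、取值范围与函数名均非论文实现):对连续令牌的每个特征维度独立地均匀分箱,得到可用标准交叉熵建模的小词表离散索引,再按分箱中心反量化近似恢复连续表示。

```python
import numpy as np

def dimwise_quantize(z, num_bins=16, lo=-1.0, hi=1.0):
    """逐维度训练后量化的示意:对连续令牌 z 的每个特征维度独立地均匀分箱,
    得到可以用交叉熵建模的离散索引。z 形状为 (num_tokens, dim),返回同形状整数索引。"""
    z_clipped = np.clip(z, lo, hi)
    scale = (num_bins - 1) / (hi - lo)
    return np.round((z_clipped - lo) * scale).astype(np.int64)

def dimwise_dequantize(idx, num_bins=16, lo=-1.0, hi=1.0):
    """把离散索引映射回各分箱中心值,近似恢复连续表示。"""
    scale = (hi - lo) / (num_bins - 1)
    return idx.astype(np.float32) * scale + lo

if __name__ == "__main__":
    z = np.tanh(np.random.randn(256, 16))   # 假想的连续令牌
    idx = dimwise_quantize(z)               # 每个特征维度各得到一个离散索引
    z_rec = dimwise_dequantize(idx)
    print(np.abs(z - z_rec).max())          # 量化误差随分箱数增多而减小
```

这样,自回归模型只需在每个维度的小词表上做分类预测,而不必面对一个巨大的联合词表,这与摘要中“轻量级自回归预测机制”的描述相对应(具体预测机制以论文为准)。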
VBench-2.0:推进面向内在忠实度的视频生成基准套件
- 标题: VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
- 作者: Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, Ziwei Liu
- 日期: 2025-03-27
- ArXiv主页: https://arxiv.org/abs/2503.21755
- 论文链接: https://arxiv.org/pdf/2503.21755
英文摘要
Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focus on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real “world models” through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored for individual dimensions, our evaluation framework integrates generalists such as state-of-the-art VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive annotations to ensure alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models in pursuit of intrinsic faithfulness.
中文摘要
视频生成已取得显著进展,从生成不真实的结果发展到能生成视觉上令人信服且时间上连贯的视频。为评估这些视频生成模型,人们开发了 VBench 等基准来衡量其忠实度,考察逐帧美学、时间一致性和基本的提示遵循等因素。然而,这些方面主要反映的是表面忠实度,关注的是视频看起来是否可信,而非是否符合现实世界的规律。尽管近期模型在这些指标上表现越来越好,它们仍然难以生成不仅视觉上可信、而且在本质上符合现实的视频。要通过视频生成实现真正的“世界模型”,下一个前沿在于内在忠实度,即确保生成的视频遵守物理定律、常识推理、解剖学正确性和构图完整性。达到这种程度的真实感,对 AI 辅助电影制作和模拟世界建模等应用至关重要。为弥合这一差距,我们提出 VBench-2.0,一个旨在自动评估视频生成模型内在忠实度的下一代基准。VBench-2.0 评估五个关键维度:人类保真度、可控性、创造力、物理和常识,每个维度又进一步细分为细粒度能力。我们的评估框架针对各个维度量身定制,既整合了最先进的 VLM 和 LLM 等通才模型,也整合了专门方法,包括为视频生成提出的异常检测方法。我们进行了大量标注,以确保与人类判断保持一致。通过从表面忠实度迈向内在忠实度,VBench-2.0 旨在为追求内在忠实度的下一代视频生成模型树立新的标准。
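下面的小例子示意这种“按维度分别评测、再汇总”的组织方式(假设性草图,维度名与函数接口均为示例,并非 VBench-2.0 官方代码):为每个维度挂接一个评测器(可以是 VLM/LLM 通才模型,也可以是异常检测等专门方法),对同一批生成视频分别打分后取平均。

```python
from typing import Callable, Dict, List

# 每个维度的评测器:输入视频文件路径与提示词,输出 0~1 的得分(接口为假设)
Evaluator = Callable[[str, str], float]

def evaluate_model(videos: List[Dict[str, str]],
                   evaluators: Dict[str, Evaluator]) -> Dict[str, float]:
    """按维度分别打分再取平均,得到各维度得分与总分的示意实现。
    videos 中每项形如 {"path": ..., "prompt": ...}。"""
    per_dim = {}
    for dim, judge in evaluators.items():
        scores = [judge(v["path"], v["prompt"]) for v in videos]
        per_dim[dim] = sum(scores) / max(len(scores), 1)
    per_dim["overall"] = sum(per_dim.values()) / max(len(per_dim), 1)
    return per_dim

if __name__ == "__main__":
    # 用常数打分器演示接口;实际应替换为 VLM/LLM 评审或专用异常检测器
    dummy = lambda path, prompt: 0.5
    dims = ["human_fidelity", "controllability", "creativity", "physics", "commonsense"]
    result = evaluate_model([{"path": "a.mp4", "prompt": "a person runs"}],
                            {d: dummy for d in dims})
    print(result)
```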
探索大型多模态模型在视频理解中的幻觉:基准、分析与缓解
- 标题: Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
- 作者: Hongcheng Gao, Jiashu Qu, Jingyi Tang, Baolong Bi, Yue Liu, Hongyu Chen, Li Liang, Li Su, Qingming Huang
- 日期: 2025-03-25
- ArXiv主页: https://arxiv.org/abs/2503.19622
- 论文链接: https://arxiv.org/pdf/2503.19622
- gitHub仓库: https://github.com/Hongcheng-Gao/HAVEN
英文摘要
The hallucination of large multimodal models (LMMs), providing responses that appear correct but are actually incorrect, limits their reliability and applicability. This paper aims to study the hallucination problem of LMMs in video modality, which is dynamic and more challenging compared to static modalities like images and text. From this motivation, we first present a comprehensive benchmark termed HAVEN for evaluating hallucinations of LMMs in video understanding tasks. It is built upon three dimensions, i.e., hallucination causes, hallucination aspects, and question formats, resulting in 6K questions. Then, we quantitatively study 7 influential factors on hallucinations, e.g., duration time of videos, model sizes, and model reasoning, via experiments of 16 LMMs on the presented benchmark. In addition, inspired by recent thinking models like OpenAI o1, we propose a video-thinking model to mitigate the hallucinations of LMMs via supervised reasoning fine-tuning (SRFT) and direct preference optimization (TDPO)-- where SRFT enhances reasoning capabilities while TDPO reduces hallucinations in the thinking process. Extensive experiments and analyses demonstrate the effectiveness. Remarkably, it improves the baseline by 7.65% in accuracy on hallucination evaluation and reduces the bias score by 4.5%. The code and data are public at https://github.com/Hongcheng-Gao/HAVEN.
中文摘要
大型多模态模型(LMM)的幻觉指其给出看似正确、实则错误的回答,限制了模型的可靠性与适用性。本文旨在研究 LMM 在视频模态中的幻觉问题;与图像、文本等静态模态相比,视频是动态的,也更具挑战性。基于这一动机,我们首先提出了一个名为 HAVEN 的综合基准,用于评估 LMM 在视频理解任务中的幻觉。该基准沿三个维度构建,即幻觉成因、幻觉方面和问题格式,共包含 6K 个问题。随后,我们通过 16 个 LMM 在该基准上的实验,定量研究了影响幻觉的 7 个因素,例如视频时长、模型规模和模型推理方式。此外,受 OpenAI o1 等近期思考型模型的启发,我们提出一种视频思考模型,通过监督推理微调(SRFT)和直接偏好优化(TDPO)来缓解 LMM 的幻觉:SRFT 增强推理能力,TDPO 则减少思考过程中的幻觉。大量实验与分析验证了其有效性。值得注意的是,它将幻觉评估的准确率较基线提升了 7.65%,并将偏差得分降低了 4.5%。代码与数据公开于 https://github.com/Hongcheng-Gao/HAVEN。
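摘要提到 HAVEN 沿幻觉成因、幻觉方面和问题格式三个维度组织 6K 个问题。下面是一个按这三个维度分组统计准确率的假设性小草图(字段名与数据格式均为示例,并非 HAVEN 官方数据格式):

```python
from collections import defaultdict
from typing import Dict, List

def grouped_accuracy(records: List[Dict]) -> Dict[str, Dict[str, float]]:
    """按三个划分维度(幻觉成因、幻觉方面、问题格式)分别统计准确率的示意。
    records 每项形如 {"cause": ..., "aspect": ..., "format": ..., "correct": bool}。"""
    result = {}
    for axis in ("cause", "aspect", "format"):
        hit, total = defaultdict(int), defaultdict(int)
        for r in records:
            total[r[axis]] += 1
            hit[r[axis]] += int(r["correct"])
        result[axis] = {k: hit[k] / total[k] for k in total}
    return result

if __name__ == "__main__":
    demo = [
        {"cause": "prior", "aspect": "object", "format": "binary", "correct": True},
        {"cause": "prior", "aspect": "event", "format": "open", "correct": False},
    ]
    print(grouped_accuracy(demo))
```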
SimpleRL-Zoo:研究与驯服开放基础模型上的零强化学习
- 标题: SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
- 作者: Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, Junxian He
- 日期: 2025-03-24
- ArXiv主页: https://arxiv.org/abs/2503.18892
- 论文链接: https://arxiv.org/pdf/2503.18892
- gitHub仓库: https://github.com/hkust-nlp/simpleRL-reason
英文摘要
DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where the training may directly start from the base models-a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representative as we find the base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models, spanning different families and sizes including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies-such as adjusting format reward and controlling query difficulty-we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, the increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the “aha moment”). Notably, we observe the “aha moment” for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.
中文摘要
DeepSeek-R1 表明,长链思维(CoT)推理可以在一个简单的、采用基于规则奖励的强化学习(RL)框架中自然涌现,且训练可以直接从基础模型开始,这一范式被称为零 RL 训练。近期复现零 RL 训练的工作主要集中在 Qwen2.5 系列模型上,这可能不具有代表性,因为我们发现这些基础模型本身已经表现出较强的指令遵循和自我反思能力。在这项工作中,我们在 10 个不同的基础模型上研究零 RL 训练,涵盖不同系列与规模,包括 Llama3-8B、Mistral-7B/24B、DeepSeek-Math-7B、Qwen2.5-Math-7B,以及从 0.5B 到 32B 的全部 Qwen2.5 模型。借助若干关键设计策略(例如调整格式奖励和控制查询难度),我们在大多数设置下都显著提升了推理准确率和回答长度。然而,通过仔细监测训练动态,我们观察到不同基础模型在训练过程中表现出不同的模式;例如,回答长度的增加并不总是伴随某些认知行为(如验证,即“顿悟时刻”)的出现。值得注意的是,我们首次在非 Qwen 系列的小模型中观察到“顿悟时刻”。我们分享了实现成功零 RL 训练的关键设计,以及我们的发现与实践。为促进后续研究,我们开源了代码、模型和分析工具。
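摘要强调零 RL 训练依赖基于规则的奖励,并把“调整格式奖励”列为关键设计之一。下面给出一个假设性的规则奖励函数草图(解析规则、权重与 \boxed{} 格式要求均为常见做法的示例,具体以论文开源代码为准):检查回答是否按要求给出最终答案(格式奖励),并与标准答案比对(正确性奖励)。

```python
import re

def rule_based_reward(response: str, gold_answer: str,
                      format_weight: float = 0.5) -> float:
    """零 RL 训练中常见的规则奖励示意:格式奖励 + 正确性奖励。
    仅为基于摘要的假设性实现,权重与解析规则均为示例。"""
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    format_ok = match is not None                   # 是否按要求给出最终答案
    answer = match.group(1).strip() if match else ""
    correct = answer == gold_answer.strip()         # 简单的字符串比对
    reward = 0.0
    if format_ok:
        reward += format_weight                     # 格式奖励(可按实验需要调整)
    if correct:
        reward += 1.0                               # 正确性奖励
    return reward

if __name__ == "__main__":
    print(rule_based_reward(r"推理过程…… 最终答案 \boxed{42}", "42"))  # 1.5
    print(rule_based_reward("答案是 42", "42"))                         # 0.0
```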