
[Paper Digest] 2025 Week 17 (Apr 20-26) (Robotics/Embodied AI/LLM)

news 2025/9/22 13:56:29

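The Chinese titles and abstracts on the original page were machine-translated with googletrans; wherever a translation is wrong, the English text is authoritative.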

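Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Kuwain 1.5B: An Arabic SLM via Language Injection
TTRL: Test-Time Reinforcement Learning
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Step1X-Edit: A Practical Framework for General Image Editing
Learning to Reason under Off-Policy Guidance
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
Describe Anything: Detailed Localized Image and Video Captioning
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
Tina: Tiny Reasoning Models via LoRA
DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning
FlowReasoner: Reinforcing Query-Level Meta-Agents
ToolRL: Reward is All Tool Learning Needs
Learning Adaptive Parallel Reasoning with Language Models
NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
7B Technical Report
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
OTC: Optimal Tool Calls via Reinforcement Learning
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
I-Con: A Unifying Framework for Representation Learning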

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Title: Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Authors: Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang
Date: 2025-04-18
arXiv page: https://arxiv.org/abs/2504.13837
Paper link: https://arxiv.org/pdf/2504.13837
Project page: https://limit-of-rlvr.github.io/
GitHub repository: https://github.com/LeapLabTHU/limit-of-RLVR

English abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the corresponding base models' capacity. In this study, however, we critically re-examine this assumption by measuring the pass@k metric with large values of k to explore the reasoning capability boundary of the models across a wide range of model families and benchmarks. Surprisingly, RL does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at smaller values of k (e.g., k=1), base models can achieve a comparable or even higher pass@k score compared to their RL counterparts at large k values. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already obtained by base models. Further analysis shows that RL training boosts performance by biasing the model's output distribution toward paths that are more likely to yield rewards, thereby sampling correct responses more efficiently. But this also results in a narrower reasoning capability boundary compared to base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that distillation can genuinely introduce new knowledge into the model, unlike RLVR. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, which requires us to fundamentally rethink the impact of RL training in reasoning LLMs and the need for a better paradigm. Project Page: https://limit-of-RLVR.github.io
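
The pass@k comparison in this abstract can be made concrete with a small calculation. The sketch below uses the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The abstract does not say which estimator the authors use, so this choice, the function name `pass_at_k`, and the sample counts in the usage lines are assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled answers, c of which are correct.

    Assumption: this is the common unbiased estimator
    1 - C(n - c, k) / C(n, k); the abstract does not name an estimator.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one problem: a model that is right only
# 2 times out of 256 looks weak at k=1 but still covers the problem
# once k is large enough.
low_k = pass_at_k(n=256, c=2, k=1)
high_k = pass_at_k(n=256, c=2, k=256)
```

Under these made-up counts, a rarely correct model has a low pass@1 yet reaches a pass@256 of 1.0, which is the kind of gap between small-k and large-k scores that the abstract discusses.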


Kuwain 1.5B: An Arabic SLM via Language Injection

Title: Kuwain 1.5B: An Arabic SLM via Language Injection
Authors: Khalil Hennara, Sara Chrouf, Mohamed Motaism Hamed, Zeina Aldallal, Omar Hadid, Safwan AlModhayan
Date: 2025-04-21
arXiv page: https://arxiv.org/abs/2504.15120
Paper link: https://arxiv.org/pdf/2504.15120
GitHub repository: https://github.com/misraj-ai/Kuwain-Arabic-cleaner

English abstract

Enhancing existing models with new knowledge is a crucial aspect of AI development. This paper introduces a novel method for integrating a new language into a large language model (LLM). Our approach successfully incorporates a previously unseen target language into an existing LLM without compromising its prior knowledge. We trained a tiny model with 1.5 billion parameters named Kuwain by injecting the Arabic language into a small open-source model mainly trained in English. Our method demonstrates significant improvements in Arabic language performance, with an average 8% improvement across various benchmarks, while retaining the model’s existing knowledge with a minimum amount of the original model’s data. This offers a cost-effective alternative to training a comprehensive model in both English and Arabic. The results highlight the potential for efficient, targeted language model expansion without extensive retraining or resource-intensive processes.


TTRL: Test-Time Reinforcement Learning

Title: TTRL: Test-Time Reinforcement Learning
Authors: Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, Bowen Zhou
Date: 2025-04-22
arXiv page: https://arxiv.org/abs/2504.16084
Paper link: https://arxiv.org/pdf/2504.16084
GitHub repository: https://github.com/PRIME-RL/TTRL

English abstract

This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the Maj@N metric, TTRL has demonstrated performance to consistently surpass the upper limit of the initial model, and approach the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks, and highlight TTRL’s potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
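
The abstract says that majority voting over sampled answers can supply a reward signal without ground-truth labels. A minimal sketch of that idea follows: the most frequent final answer becomes a pseudo-label, and each sample is rewarded for agreeing with it. The function name `majority_vote_rewards`, the 0/1 reward values, and the tie-breaking rule are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote_rewards(answers: List[str]) -> Tuple[str, List[float]]:
    """Turn sampled final answers for one prompt into label-free rewards.

    Assumption: reward is 1.0 when a sample matches the majority answer
    and 0.0 otherwise. Ties go to the answer that appeared first, which
    is how Counter.most_common orders equal counts.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    pseudo_label, _count = Counter(answers).most_common(1)[0]
    rewards = [1.0 if answer == pseudo_label else 0.0 for answer in answers]
    return pseudo_label, rewards


# Hypothetical final answers extracted from five rollouts of one question.
label, rewards = majority_vote_rewards(["42", "41", "42", "7", "42"])
```

The rewards produced this way could then be passed to any RL update that expects one scalar per rollout; how the paper does that step is not described in the abstract.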


Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

Title: Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Authors: Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang
Date: 2025-04-24
arXiv page: https://arxiv.org/abs/2504.17192
Paper link: https://arxiv.org/pdf/2504.17192
GitHub repository: https://github.com/going-doer/Paper2Code

English abstract

Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, specifically from the original paper authors, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins.


Step1X-Edit: A Practical Framework for General Image Editing

Title: Step1X-Edit: A Practical Framework for General Image Editing
Authors: Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang
Date: 2025-04-24
arXiv page: https://arxiv.org/abs/2504.17761
Paper link: https://arxiv.org/pdf/2504.17761
GitHub repository: https://github.com/stepfun-ai/Step1X-Edit

English abstract

In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide performance comparable to closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt a multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.


Learning to Reason under Off-Policy Guidance

Title: Learning to Reason under Off-Policy Guidance
Authors: Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
Date: 2025-04-21
arXiv page: https://arxiv.org/abs/2504.14945
Paper link: https://arxiv.org/pdf/2504.14945
GitHub repository: https://github.com/ElliottYan/LUFFY

English abstract

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an over +7.0 average gain across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to train generalizable reasoning models with off-policy guidance.


VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Title: VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
Authors: Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, Jinguo Zhu
Date: 2025-04-21
arXiv page: https://arxiv.org/abs/2504.15279
Paper link: https://arxiv.org/pdf/2504.15279
Project page: https://visulogic-benchmark.github.io/VisuLogic/
GitHub repository: https://github.com/VisuLogic-Benchmark/VisuLogic-Eval

English abstract

Visual reasoning is a core component of human intelligence and a critical capability for advanced multimodal models. Yet current reasoning evaluations of multimodal large language models (MLLMs) often rely on text descriptions and allow language-based reasoning shortcuts, failing to measure genuine vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories (e.g., quantitative shifts, spatial relations, attribute comparisons). These various types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives. We evaluate leading MLLMs on this benchmark and analyze their results to identify common failure modes. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans, revealing significant gaps in visual reasoning. Furthermore, we provide a supplementary training dataset and a reinforcement-learning baseline to support further progress.


Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

Title: Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
Authors: Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu
Date: 2025-04-21
arXiv page: https://arxiv.org/abs/2504.15271
Paper link: https://arxiv.org/pdf/2504.15271
Project page: https://nvlabs.github.io/EAGLE/
GitHub repository: https://github.com/NVlabs/EAGLE

English abstract

We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model, Eagle 2.5-8B, achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.


The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks

Title: The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
Authors: Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
Date: 2025-04-22
arXiv page: https://arxiv.org/abs/2504.15521
Paper link: https://arxiv.org/pdf/2504.15521

English abstract

As large language models (LLMs) continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress. This position paper examines over 2,000 multilingual (non-English) benchmarks from 148 countries, published between 2021 and 2024, to evaluate past, present, and future practices in multilingual benchmarking. Our findings reveal that, despite significant investments amounting to tens of millions of dollars, English remains significantly overrepresented in these benchmarks. Additionally, most benchmarks rely on original language content rather than translations, with the majority sourced from high-resource countries such as China, India, Germany, the UK, and the USA. Furthermore, a comparison of benchmark performance with human judgments highlights notable disparities. STEM-related tasks exhibit strong correlations with human evaluations (0.70 to 0.85), while traditional NLP tasks like question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30). Moreover, translating English benchmarks into other languages proves insufficient, as localized benchmarks demonstrate significantly higher alignment with local human judgments (0.68) than their translated counterparts (0.47). This underscores the importance of creating culturally and linguistically tailored benchmarks rather than relying solely on translations. Through this comprehensive analysis, we highlight six key limitations in current multilingual evaluation practices, propose the guiding principles accordingly for effective multilingual benchmarking, and outline five critical research directions to drive progress in the field. Finally, we call for a global collaborative effort to develop human-aligned benchmarks that prioritize real-world applications.


Describe Anything: Detailed Localized Image and Video Captioning

Title: Describe Anything: Detailed Localized Image and Video Captioning
Authors: Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui
Date: 2025-04-22
arXiv page: https://arxiv.org/abs/2504.16072
Paper link: https://arxiv.org/pdf/2504.16072
Project page: https://describe-anything.github.io
GitHub repository: https://github.com/NVlabs/describe-anything

English abstract

Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.


RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

Title: RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
Authors: Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor
Date: 2025-04-24
arXiv page: https://arxiv.org/abs/2504.17502
Paper link: https://arxiv.org/pdf/2504.17502

English abstract

Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability – ranging from enhanced personalization in image generation to consistent character representation in video rendering – progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.


Tina: Tiny Reasoning Models via LoRA

Title: Tina: Tiny Reasoning Models via LoRA
Authors: Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Willie Neiswanger
Date: 2025-04-22
arXiv page: https://arxiv.org/abs/2504.15777
Paper link: https://arxiv.org/pdf/2504.15777
Project page: https://shangshangwang.notion.site/tina
GitHub repository: https://github.com/shangshang-wang/Tina

English abstract

How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this fundamental question, we present Tina, a family of tiny reasoning models achieved with high cost-efficiency. Notably, Tina demonstrates that substantial reasoning performance can be developed using only minimal resources, by applying parameter-efficient updates during reinforcement learning (RL), using low-rank adaptation (LoRA), to an already tiny 1.5B parameter base model. This minimalist approach produces models that achieve reasoning performance which is competitive with, and sometimes surpasses, SOTA RL reasoning models built upon the same base model. Crucially, this is achieved at a tiny fraction of the computational post-training cost employed by existing SOTA models. In fact, the best Tina model achieves a >20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA. We validate this across multiple open-source reasoning datasets and various ablation settings starting with a single, fixed set of hyperparameters. Furthermore, we hypothesize that this effectiveness and efficiency stem from LoRA rapidly adapting the model to the structural format of reasoning rewarded by RL, while largely preserving the base model’s underlying knowledge. In service of accessibility and open research, we fully open-source all code, training logs, and model weights & checkpoints.
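
To illustrate why low-rank adaptation (LoRA) keeps updates parameter-efficient, the sketch below shows the standard LoRA form, y = W x + (alpha / r) B (A x), where W stays frozen and only the small matrices A and B are trained. It is written in plain Python for readability. The layer size 1536 and rank 16 in the usage lines are hypothetical values chosen for the arithmetic and are not taken from the paper.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector,
                 alpha: float, rank: int) -> Vector:
    """Compute y = W x + (alpha / rank) * B (A x).

    w is the frozen d_out x d_in weight, a is rank x d_in and
    b is d_out x rank; only a and b would receive gradient updates.
    """
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * d for y, d in zip(base, delta)]


def lora_param_counts(d_in: int, d_out: int, rank: int) -> Tuple[int, int]:
    """Return (full-matrix parameters, LoRA adapter parameters) for one layer."""
    return d_in * d_out, rank * (d_in + d_out)


# Hypothetical layer: a 1536 x 1536 projection adapted at rank 16.
full, adapter = lora_param_counts(d_in=1536, d_out=1536, rank=16)
ratio = full / adapter  # how many times fewer trainable parameters
```

When B is initialised to zeros, `lora_forward` returns exactly W x, so training starts from the unmodified base model. That property is consistent with the abstract's hypothesis that LoRA adapts output format while largely preserving base knowledge, although the sketch itself does not test that hypothesis.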


DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning

Title: DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning
Authors: Fulong Ye, Miao Hua, Pengze Zhang, Xinghui Li, Qichao Sun, Songtao Zhao, Qian He, Xinglong Wu
Date: 2025-04-20
arXiv page: https://arxiv.org/abs/2504.14509
Paper link: https://arxiv.org/pdf/2504.14509
Project page: https://superhero-7.github.io/DreamID/
GitHub repository: https://github.com/superhero-7/DreamID

English abstract

In this paper, we introduce DreamID, a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. Unlike the typical face swapping training process, which often relies on implicit supervision and struggles to achieve satisfactory results, DreamID establishes explicit supervision for face swapping by constructing Triplet ID Group data, significantly enhancing identity similarity and attribute preservation. The iterative nature of diffusion models poses challenges for utilizing efficient image-space loss functions, as performing time-consuming multi-step sampling to obtain the generated image during training is impractical. To address this issue, we leverage the accelerated diffusion model SD Turbo, reducing the inference steps to a single iteration and enabling efficient pixel-level end-to-end training with explicit Triplet ID Group supervision. Additionally, we propose an improved diffusion-based model architecture comprising SwapNet, FaceNet, and ID Adapter. This robust architecture fully unlocks the power of the Triplet ID Group explicit supervision. Finally, to further extend our method, we explicitly modify the Triplet ID Group data during training to fine-tune and preserve specific attributes, such as glasses and face shape. Extensive experiments demonstrate that DreamID outperforms state-of-the-art methods in terms of identity similarity, pose and expression preservation, and image fidelity. Overall, DreamID achieves high-quality face swapping results at 512×512 resolution in just 0.6 seconds and performs exceptionally well in challenging scenarios such as complex lighting, large angles, and occlusions.


FlowReasoner: Reinforcing Query-Level Meta-Agents

Title: FlowReasoner: Reinforcing Query-Level Meta-Agents
Authors: Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, Tianyu Pang
Date: 2025-04-21
arXiv page: https://arxiv.org/abs/2504.15257
Paper link: https://arxiv.org/pdf/2504.15257

English abstract

This paper proposes a query-level meta-agent named FlowReasoner to automate the design of query-level multi-agent systems, i.e., one system per user query. Our core idea is to incentivize a reasoning-based meta-agent via external execution feedback. Concretely, by distilling DeepSeek R1, we first endow the basic reasoning ability regarding the generation of multi-agent systems to FlowReasoner. Then, we further enhance it via reinforcement learning (RL) with external execution feedback. A multi-purpose reward is designed to guide the RL training from aspects of performance, complexity, and efficiency. In this manner, FlowReasoner is enabled to generate a personalized multi-agent system for each user query via deliberative reasoning. Experiments on both engineering and competition code benchmarks demonstrate the superiority of FlowReasoner. Remarkably, it surpasses o1-mini by 10.52% accuracy across three benchmarks. The code is available at https://github.com/sail-sg/FlowReasoner.


ToolRL: Reward is All Tool Learning Needs

Title: ToolRL: Reward is All Tool Learning Needs
Authors: Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji
Date: 2025-04-16
arXiv page: https://arxiv.org/abs/2504.13958
Paper link: https://arxiv.org/pdf/2504.13958
GitHub repository: https://github.com/qiancheng0/ToolRL

English abstract

Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All code is released to facilitate future research.
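
The abstract names Group Relative Policy Optimization (GRPO) as the training algorithm. The core of GRPO's advantage computation can be sketched briefly: several rollouts are sampled for the same prompt, and each rollout's reward is normalised against the mean and standard deviation of its own group. The sketch below assumes a population standard deviation and a small epsilon for numerical stability. The reward values in the usage lines are invented, and the paper's actual tool-use reward design is not reproduced here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalise rewards within one group of rollouts for the same prompt.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), using the
    population standard deviation; implementations may differ in detail.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four tool-call rollouts on one query.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is the group mean, a group in which every rollout earns the same reward yields zero advantage everywhere. This is one reason coarse rewards such as plain answer matching can give weak learning signal, which is the problem the abstract's finer-grained reward design targets.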

中文摘要

当前的大语言模型（LLM）通常通过监督微调（SFT）来获得工具使用能力。然而，SFT难以泛化到不熟悉或复杂的工具使用场景。强化学习（RL）的最新进展，尤其是类R1模型，已展现出良好的推理和泛化能力。但是，面向工具使用的奖励设计带来了独特的挑战：可能需要以不同的参数调用多个工具，而粗粒度的奖励信号（例如答案匹配）无法提供有效学习所需的细粒度反馈。在这项工作中，我们首次在RL范式下对工具选择与应用任务的奖励设计进行了全面研究。我们系统地探索了多种奖励策略，分析其类型、尺度、粒度和时间动态。基于这些见解，我们提出了一种针对工具使用任务的有原则的奖励设计，并将其用于以组相对策略优化（GRPO）训练LLM。在多个基准上的实证评估表明，我们的方法能够实现稳健、可扩展且稳定的训练，相比基础模型提升17％，相比SFT模型提升15％。这些结果凸显了精心设计的奖励在提升LLM工具使用能力和泛化性能方面的关键作用。所有代码均已开源，以促进未来的研究。
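The abstract names GRPO as the training algorithm. As a rough illustration only (this is not the ToolRL code), GRPO-style methods score each sampled response relative to the other responses drawn for the same prompt, so the reward design directly shapes the learning signal. The function name, the example rewards, and the choice of population standard deviation are assumptions made for this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts answering the same tool-use query.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Under this normalization a coarse 0/1 reward gives every partially correct rollout the same advantage, which is one way to see why the abstract argues for finer-grained reward signals.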

用语言模型学习自适应并行推理

标题: Learning Adaptive Parallel Reasoning with Language Models
作者: Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr
日期: 2025-04-21
ArXiv主页: https://arxiv.org/abs/2504.15466
论文链接: https://arxiv.org/pdf/2504.15466
gitHub仓库: https://github.com/Parallel-Reasoning/APR

英文摘要

Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computations and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.

中文摘要

扩展推理时计算显著提升了语言模型的推理能力。然而，现有方法存在明显局限：串行的思维链方法会生成过长的输出，导致延迟增加并耗尽上下文窗口；而自洽性（self-consistency）等并行方法缺乏充分的协调，导致冗余计算且性能提升有限。为了解决这些不足，我们提出了自适应并行推理（APR），这是一个新颖的推理框架，使语言模型能够端到端地编排串行与并行计算。APR通过spawn()和join()操作实现自适应的多线程推理，从而推广了现有的推理方法。一个关键创新是我们的端到端强化学习策略：同时优化父推理线程和子推理线程，在无需预定义推理结构的情况下提高任务成功率。在Countdown推理任务上的实验表明APR具有显著优势：（1）在相同上下文窗口下性能更高（4k上下文时为83.4％对60.0％）；（2）随着计算量增加具有更好的可扩展性（总token数为20k时为80.1％对66.6％）；（3）在同等延迟下准确率更高（约5,000ms时为75.2％对57.3％）。APR朝着让语言模型通过自适应分配计算来自主优化其推理过程迈出了一步。
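The spawn()/join() control flow described in the abstract can be pictured with ordinary threads. The sketch below is only an analogy under stated assumptions: in APR the threads are language-model inference threads, whereas here they are Python worker threads running a brute-force search on a Countdown-style puzzle. The function names, the left-to-right expression search, and the example numbers are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import permutations


def child_search(first, rest, target):
    """Child thread: explore expressions that start from one fixed number."""
    for order in permutations(rest):
        frontier = [(first, str(first))]
        for n in order:
            nxt = []
            for value, expr in frontier:
                nxt.append((value + n, f"({expr} + {n})"))
                nxt.append((value - n, f"({expr} - {n})"))
                nxt.append((value * n, f"({expr} * {n})"))
                if n != 0 and value % n == 0:
                    nxt.append((value // n, f"({expr} / {n})"))
            frontier = nxt
        for value, expr in frontier:
            if value == target:
                return expr  # short summary handed back to the parent
    return None


def parent_search(numbers, target):
    """Parent thread: spawn one child per starting number, then join."""
    with ThreadPoolExecutor(max_workers=len(numbers)) as pool:
        # spawn(): each child receives a distinct sub-problem
        futures = [
            pool.submit(child_search, n, numbers[:i] + numbers[i + 1:], target)
            for i, n in enumerate(numbers)
        ]
        # join(): the parent only sees each child's final result
        results = [f.result() for f in futures]
    return next((r for r in results if r is not None), None)


solution = parent_search([3, 5, 7, 2], 24)
```

The point of the analogy is that the parent never holds the children's full search traces, only their returned summaries, which is how parallel exploration can avoid exhausting a single context window.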

NodeRAG：用异构节点构建基于图的RAG

标题: NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes
作者: Tianyang Xu, Haojie Zheng, Chengze Li, Haoxiang Chen, Yixin Liu, Ruoxi Chen, Lichao Sun
日期: 2025-04-15
ArXiv主页: https://arxiv.org/abs/2504.11544
论文链接: https://arxiv.org/pdf/2504.11544
项目链接: https://terry-xu-666.github.io/NodeRAG_web/
gitHub仓库: https://github.com/Terry-Xu-666/NodeRAG

英文摘要

Retrieval-augmented generation (RAG) empowers large language models to access external and private corpora, enabling factually consistent responses in specific domains. By exploiting the inherent structure of the corpus, graph-based RAG methods further enrich this process by building a knowledge graph index and leveraging the structural nature of graphs. However, current graph-based RAG approaches seldom prioritize the design of graph structures. Inadequately designed graphs not only impede the seamless integration of diverse graph algorithms but also result in workflow inconsistencies and degraded performance. To further unleash the potential of graphs for RAG, we propose NodeRAG, a graph-centric framework introducing heterogeneous graph structures that enable the seamless and holistic integration of graph-based methodologies into the RAG workflow. By aligning closely with the capabilities of LLMs, this framework ensures a fully cohesive and efficient end-to-end process. Through extensive experiments, we demonstrate that NodeRAG exhibits performance advantages over previous methods, including GraphRAG and LightRAG, not only in indexing time, query time, and storage efficiency but also in delivering superior question-answering performance on multi-hop benchmarks and open-ended head-to-head evaluations with minimal retrieval tokens. Our GitHub repository is available at https://github.com/Terry-Xu-666/NodeRAG.

中文摘要

检索增强生成（RAG）使大语言模型能够访问外部和私有语料库，从而在特定领域给出事实一致的回答。基于图的RAG方法利用语料库的固有结构，通过构建知识图谱索引并利用图的结构特性进一步丰富了这一过程。然而，当前基于图的RAG方法很少重视图结构的设计。设计不当的图不仅妨碍各种图算法的无缝集成，还会导致工作流不一致和性能下降。为了进一步释放图在RAG中的潜力，我们提出了NodeRAG，这是一个以图为中心的框架，它引入异构图结构，使基于图的方法能够无缝且整体地融入RAG工作流。通过与LLM的能力紧密对齐，该框架确保了完全连贯且高效的端到端流程。通过大量实验，我们证明NodeRAG相比GraphRAG和LightRAG等已有方法具有性能优势：不仅在索引时间、查询时间和存储效率上更优，而且能以最少的检索token在多跳基准和开放式一对一评估中取得更好的问答性能。我们的GitHub仓库见https://github.com/Terry-Xu-666/NodeRAG。

打破模态壁垒：基于多模态LLM的通用嵌入学习

标题: Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
作者: Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng
日期: 2025-04-24
ArXiv主页: https://arxiv.org/abs/2504.17432
论文链接: https://arxiv.org/pdf/2504.17432
项目链接: https://garygutc.github.io/UniME/
gitHub仓库: https://github.com/deepglint/UniME

英文摘要

The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM’s language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.

中文摘要

对比语言-图像预训练（CLIP）框架已成为多模态表示学习中广泛使用的方法，尤其是在图文检索和聚类中。然而，其效果受到三个关键局限的制约：（1）文本token截断，（2）图像与文本的孤立编码，以及（3）由词袋行为导致的组合性不足。尽管近期的多模态大语言模型（MLLM）在通用视觉-语言理解方面取得了显著进展，但它们在学习可迁移多模态表示方面的潜力仍未得到充分探索。在这项工作中，我们提出了UniME（Universal Multimodal Embedding，通用多模态嵌入），这是一个新颖的两阶段框架，利用MLLM为多种下游任务学习判别性表示。在第一阶段，我们从一个强大的基于LLM的教师模型进行文本判别性知识蒸馏，以增强MLLM语言组件的嵌入能力。在第二阶段，我们引入难负样本增强的指令微调，以进一步推进判别性表示学习。具体而言，我们首先缓解假负样本污染，然后在每个批次内为每个实例采样多个难负样本，迫使模型关注有挑战性的样本。这种方法不仅提高了判别能力，还增强了下游任务中的指令跟随能力。我们在MMEB基准和多个检索任务上进行了大量实验，包括短文本和长文本描述检索以及组合检索。结果表明，UniME在所有任务上都取得了一致的性能提升，展现出更强的判别能力和组合能力。

LiveCC：利用大规模流式语音转录学习视频LLM

标题: LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
作者: Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, Mike Zheng Shou
日期: 2025-04-22
ArXiv主页: https://arxiv.org/abs/2504.16030
论文链接: https://arxiv.org/pdf/2504.16030
项目链接: https://showlab.github.io/livecc
gitHub仓库: https://github.com/showlab/livecc

英文摘要

Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary model APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training for Video LLM with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves the ASR words and video frames according to their timestamps. Compared to previous studies in vision-language representation with ASR, our method naturally fits the streaming characteristics of ASR, thus enabling the model to learn temporally-aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data production pipeline to process YouTube videos and their closed captions (CC, same as ASR), resulting in Live-CC-5M dataset for pre-training and Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT, the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general video QA performance and exhibits a new capability in real-time video commentary. To evaluate this, we carefully design a new LiveSports-3K benchmark, using LLM-as-a-judge to measure the free-form commentary. Experiments show our final LiveCC-7B-Instruct model can surpass advanced 72B models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality even working in a real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of our approach. All resources of this paper have been released at https://showlab.github.io/livecc.

中文摘要

近期的视频大语言模型（Video LLM）通常依赖昂贵的人工标注或专有模型API（例如GPT-4o）来生成训练数据，这限制了它们的大规模训练。在本文中，我们探索利用廉价的自动语音识别（ASR）转录文本对Video LLM进行大规模训练。具体来说，我们提出了一种新颖的流式训练方法，根据时间戳将ASR词语与视频帧密集交错。与以往利用ASR进行视觉-语言表示的研究相比，我们的方法天然契合ASR的流式特性，使模型能够学习时间对齐的细粒度视觉-语言建模。为了支持该训练算法，我们引入了一条数据生产流水线来处理YouTube视频及其隐藏式字幕（CC，等同于ASR），得到用于预训练的Live-CC-5M数据集和用于高质量监督微调（SFT）的Live-WhisperX-526K数据集。值得注意的是，即使不经过SFT，仅用ASR预训练的LiveCC-7B-Base模型也展现出有竞争力的通用视频问答性能，并具备实时视频解说这一新能力。为了评估这一能力，我们精心设计了新的LiveSports-3K基准，使用LLM-as-a-judge来衡量自由形式的解说。实验表明，我们最终的LiveCC-7B-Instruct模型即使在实时模式下工作，其解说质量也能超过先进的72B模型（Qwen2.5-VL-72B-Instruct、LLaVA-Video-72B）。同时，它在VideoMME和OVOBench等流行的视频问答基准上取得了7B/8B规模下的最先进结果，证明了我们方法的广泛泛化能力。本文的所有资源均已发布在https://showlab.github.io/livecc。
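The core data idea in the abstract, interleaving ASR words and video frames by timestamp, amounts to merging two time-sorted streams. The sketch below shows that merge in isolation; it is not the LiveCC pipeline. The `(timestamp, item)` tuples, the placeholder frame tokens, and the tie rule (a frame precedes a word with the same timestamp) are assumptions for illustration.

```python
import heapq


def interleave_by_timestamp(frames, words):
    """Merge two time-sorted streams into one training sequence."""
    # The second tuple field breaks timestamp ties: frames (0) before words (1).
    tagged_frames = ((t, 0, "frame", f) for t, f in frames)
    tagged_words = ((t, 1, "word", w) for t, w in words)
    merged = heapq.merge(tagged_frames, tagged_words)
    return [(kind, item) for _, _, kind, item in merged]


# Hypothetical inputs: frames sampled at 2 FPS and ASR words with start times.
frames = [(0.0, "<frame_0>"), (0.5, "<frame_1>"), (1.0, "<frame_2>")]
words = [(0.2, "he"), (0.4, "shoots"), (0.5, "and"), (0.9, "scores")]
sequence = interleave_by_timestamp(frames, words)
```

Placing each word after the most recent frame means a model trained on such a sequence predicts speech conditioned only on what has already been seen, which matches the streaming setting the abstract describes.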

Trillion 7B技术报告

标题: Trillion 7B Technical Report
作者: Sungjun Han, Juyoung Suk, Suyeong An, Hyungguk Kim, Kyuseok Kim, Wonsuk Yang, Seungtaek Choi, Jamin Shin
日期: 2025-04-21
ArXiv主页: https://arxiv.org/abs/2504.15431
论文链接: https://arxiv.org/pdf/2504.15431

英文摘要

We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and Japanese. Combined with optimized data mixtures, language-specific filtering, and tailored tokenizer construction, Trillion-7B achieves competitive performance while dedicating only 10% of its 2T training tokens to multilingual data and requiring just 59.4K H100 GPU hours ($148K) for full training. Comprehensive evaluations across 27 benchmarks in four languages demonstrate Trillion-7B’s robust multilingual performance and exceptional cross-lingual consistency.

中文摘要

我们介绍Trillion-7B，这是目前token效率最高的以韩语为中心的多语言LLM。我们新颖的跨语言文档注意力（XLDA）机制能够实现从英语到韩语、日语等目标语言的高效且有效的知识迁移。结合优化的数据配比、针对特定语言的过滤以及定制的分词器构建，Trillion-7B取得了有竞争力的性能，同时仅将其2T训练token中的10％用于多语言数据，完整训练仅需59.4K H100 GPU小时（$148K）。在四种语言的27个基准上的全面评估表明，Trillion-7B具有稳健的多语言性能和出色的跨语言一致性。

MIG：通过在语义空间中最大化信息增益实现指令微调的自动数据选择

标题: MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
作者: Yicheng Chen, Yining Li, Kai Hu, Zerun Ma, Haochen Ye, Kai Chen
日期: 2025-04-18
ArXiv主页: https://arxiv.org/abs/2504.13835
论文链接: https://arxiv.org/pdf/2504.13835
项目链接: https://yichengchen24.github.io/projects/mig
gitHub仓库: https://github.com/yichengchen24/MIG

英文摘要

Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.

中文摘要

数据质量和多样性是构建有效指令微调数据集的关键。随着开源指令微调数据集日益增多，从海量数据中自动选择高质量且多样的子集十分有益。现有方法通常优先考虑样本质量，并使用启发式规则来维持多样性。然而，由于缺乏对整个数据集合的全局视角，这往往导致次优结果。此外，启发式规则通常只关注嵌入空间中的距离或聚类，无法准确捕捉语义空间中复杂指令的意图。为了弥合这一差距，我们提出了一种量化数据集信息含量的统一方法。该方法通过构建标签图对语义空间建模，并根据图中信息的分布来量化多样性。基于这一度量，我们进一步提出了一种高效的采样方法，迭代地选择数据样本，以在语义空间中最大化信息增益（MIG）。在多种数据集和基础模型上的实验表明，MIG始终优于最先进的方法。值得注意的是，使用MIG采样的5％ Tulu3数据微调的模型，取得了与在完整数据集上训练的官方SFT模型相当的性能，在AlpacaEval上提升+5.73％，在Wildbench上提升+6.89％。

PHYBench：大语言模型物理感知与推理的整体评估

标题: PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
作者: Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Qi Liu, Ziheng Zhou, Tianyu Zhang, Jingtian Zhang, Zhangyi Liu, Minghao Li, Yuku Zhang, Boxuan Jing, Xianqi Yin, Yutong Ren, Zizhuo Fu, Weike Wang, Xudong Tian, Anqi Lv, Laifu Man, Jianxiang Li, Feiyu Tao, Qihua Sun, Zhou Liang, Yushu Mu, Zhongxuan Li, Jing-Jun Zhang, Shutao Zhang, Xiaotian Li, Xingqi Xia, Jiawei Lin, Zheyu Shen, Jiahang Chen, Qiuhao Xiong, Binran Wang, Fengyuan Wang, Ziyang Ni, Bohan Zhang, Fan Cui, Changkun Shao, Qing-Hong Cao, Ming-xing Luo, Muhan Zhang, Hua Xing Zhu
日期: 2025-04-22
ArXiv主页: https://arxiv.org/abs/2504.16074
论文链接: https://arxiv.org/pdf/2504.16074
项目链接: https://phybench-official.github.io/phybench-demo/
gitHub仓库: https://github.com/phybench-official/phybench

英文摘要

We introduce PHYBench, a novel, high-quality benchmark designed for evaluating reasoning capabilities of large language models (LLMs) in physical contexts. PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess the ability of models to understand and reason about realistic physical processes. Covering mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, the benchmark spans difficulty levels from high school exercises to undergraduate problems and Physics Olympiad challenges. Additionally, we propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions, which effectively captures differences in model reasoning processes and results beyond traditional binary scoring methods. We evaluate various LLMs on PHYBench and compare their performance with human experts. Our results reveal that even state-of-the-art reasoning models significantly lag behind human experts, highlighting their limitations and the need for improvement in complex physical reasoning scenarios. Our benchmark results and dataset are publicly available at https://phybench-official.github.io/phybench-demo/.

中文摘要

我们介绍PHYBench，这是一个新颖的高质量基准，用于评估大语言模型（LLM）在物理情境中的推理能力。PHYBench包含500道基于真实物理场景精心整理的物理题，旨在评估模型理解和推理现实物理过程的能力。该基准涵盖力学、电磁学、热力学、光学、近代物理和高等物理，难度从高中练习题到本科题目以及物理奥林匹克竞赛题。此外，我们提出了表达式编辑距离（EED）分数，这是一种基于数学表达式之间编辑距离的新型评估指标，能够有效捕捉模型推理过程和结果上的差异，超越传统的二元评分方法。我们在PHYBench上评估了多种LLM，并将其表现与人类专家进行比较。结果表明，即使是最先进的推理模型也显著落后于人类专家，凸显了它们的局限性以及在复杂物理推理场景中改进的必要性。我们的基准结果和数据集公开于https://phybench-official.github.io/phybench-demo/。
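To make the idea of edit-distance-based partial credit concrete, here is a deliberately simplified sketch. It is not the EED Score from the paper: it compares flat token sequences with a plain Levenshtein distance and uses a linear score, both of which are assumptions chosen for brevity. The function names and the example expressions are hypothetical.

```python
import re


def tokenize(expr):
    """Split an expression string into numbers, names, and operators."""
    return re.findall(r"\d+\.?\d*|[A-Za-z_]\w*|\*\*|[-+*/^()=]", expr)


def edit_distance(a, b):
    """Classic Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, start=1):
        curr = [i]
        for j, tb in enumerate(b, start=1):
            cost = 0 if ta == tb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def partial_credit(answer, reference):
    """Score in [0, 1]: 1.0 for identical token sequences, lower as edits grow."""
    ref = tokenize(reference)
    dist = edit_distance(tokenize(answer), ref)
    return max(0.0, 1.0 - dist / max(len(ref), 1))


# An answer with a spurious factor of 2 still earns partial credit.
near_miss = partial_credit("2*m*g*h", "m*g*h")
```

A binary scorer would give the near-miss answer zero, while a distance-based score separates it from an unrelated answer, which is the motivation the abstract gives for moving beyond binary scoring.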

OTC：通过强化学习实现最优工具调用

标题: OTC: Optimal Tool Calls via Reinforcement Learning
作者: Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, Heng Ji
日期: 2025-04-21
ArXiv主页: https://arxiv.org/abs/2504.14870
论文链接: https://arxiv.org/pdf/2504.14870

英文摘要

Tool-integrated reasoning (TIR) augments large language models (LLMs) with the ability to invoke external tools, such as search engines and code interpreters, to solve tasks beyond the capabilities of language-only reasoning. While reinforcement learning (RL) has shown promise in improving TIR by optimizing final answer correctness, existing approaches often overlook the efficiency and cost associated with tool usage. This can lead to suboptimal behavior, including excessive tool calls that increase computational and financial overhead, or insufficient tool use that compromises answer quality. In this work, we propose Optimal Tool Call-controlled Policy Optimization (OTC-PO), a simple yet effective RL-based framework that encourages models to produce accurate answers with minimal tool calls. Our method introduces a tool-integrated reward that jointly considers correctness and tool efficiency, promoting high tool productivity. We instantiate this framework within both Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), resulting in OTC-PPO and OTC-GRPO. Experiments with Qwen-2.5 and Qwen-Math across multiple QA benchmarks show that our approach reduces tool calls by up to 73.1% and improves tool productivity by up to 229.4%, while maintaining comparable answer accuracy. To the best of our knowledge, this is the first RL-based framework that explicitly optimizes tool-use efficiency in TIR.

中文摘要

工具集成推理（TIR）使大语言模型（LLM）能够调用搜索引擎、代码解释器等外部工具，以解决超出纯语言推理能力的任务。尽管强化学习（RL）通过优化最终答案的正确性在改进TIR方面展现出潜力，但现有方法往往忽视工具使用的效率和成本。这可能导致次优行为，包括增加计算和资金开销的过度工具调用，或损害答案质量的工具使用不足。在这项工作中，我们提出了最优工具调用控制的策略优化（OTC-PO），这是一个简单而有效的基于RL的框架，鼓励模型以最少的工具调用给出准确答案。我们的方法引入了一种工具集成奖励，同时考虑正确性和工具效率，从而提升工具生产率。我们在近端策略优化（PPO）和组相对策略优化（GRPO）中实例化了该框架，得到OTC-PPO和OTC-GRPO。使用Qwen-2.5和Qwen-Math在多个问答基准上的实验表明，我们的方法最多可减少73.1％的工具调用，并将工具生产率最多提高229.4％，同时保持相当的答案准确率。据我们所知，这是第一个显式优化TIR中工具使用效率的基于RL的框架。
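The abstract describes a reward that jointly considers correctness and tool efficiency but does not give its formula. The sketch below shows one hypothetical way such a reward could be shaped; the multiplicative form, the `1 / (1 + extra)` decay, the `min_calls` parameter, and the reading of "tool productivity" as correct answers per tool call are all assumptions, not the OTC-PO definitions.

```python
def tool_efficiency(num_calls, min_calls):
    """1.0 at the assumed minimum number of calls, decaying with extra calls."""
    extra = max(num_calls - min_calls, 0)
    return 1.0 / (1.0 + extra)


def tool_aware_reward(is_correct, num_calls, min_calls):
    """Hypothetical reward: correctness scaled by tool efficiency."""
    correctness = 1.0 if is_correct else 0.0
    return correctness * tool_efficiency(num_calls, min_calls)


def tool_productivity(num_correct, total_calls):
    """Assumed reading: correct answers per tool call (guards zero calls)."""
    return num_correct / max(total_calls, 1)


# Two correct rollouts for the same question: the frugal one scores higher.
frugal = tool_aware_reward(True, num_calls=1, min_calls=1)
wasteful = tool_aware_reward(True, num_calls=4, min_calls=1)
```

Multiplying rather than adding the two terms means a wrong answer earns nothing no matter how few tools it used, so efficiency never trades off against correctness in this sketch.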

X-Teaming：基于自适应多智能体的多轮越狱与防御

标题: X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
作者: Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel
日期: 2025-04-15
ArXiv主页: https://arxiv.org/abs/2504.13203
论文链接: https://arxiv.org/pdf/2504.13203
项目链接: https://x-teaming.github.io
gitHub仓库: https://github.com/salman-lui/x-teaming

英文摘要

Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.

中文摘要

与语言模型（LM）的多轮交互会带来严重的安全风险，因为有害意图可以被有策略地分散在多轮对话中。然而，此前的绝大多数工作都集中在单轮安全上，而适应性和多样性仍是多轮红队测试的关键挑战。为了应对这些挑战，我们提出了X-Teaming，这是一个可扩展的框架，系统地探索看似无害的交互如何升级为有害结果，并生成相应的攻击场景。X-Teaming采用协作智能体进行规划、攻击优化和验证，实现了最先进的多轮越狱有效性和多样性，在具有代表性的领先开放权重和闭源模型上成功率最高可达98.1％。特别是，X-Teaming对最新的Claude 3.7 Sonnet模型实现了96.2％的攻击成功率，而该模型此前被认为对单轮攻击几乎免疫。在X-Teaming的基础上，我们推出了XGuard-Train，这是一个开源的多轮安全训练数据集，规模是此前最佳资源的20倍，包含3万条交互式越狱数据，旨在为LM实现稳健的多轮安全对齐。我们的工作为缓解复杂的对话式攻击提供了重要的工具和见解，推进了LM的多轮安全性。

通过心理意象模拟实现视觉-语言模型中的视角感知推理

标题: Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
作者: Phillip Y. Lee, Jihyeon Je, Chanho Park, Mikaela Angelina Uy, Leonidas Guibas, Minhyuk Sung
日期: 2025-04-24
ArXiv主页: https://arxiv.org/abs/2504.17207
论文链接: https://arxiv.org/pdf/2504.17207
项目链接: https://apc-vlm.github.io/

英文摘要

We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking, the ability to perceive an environment or situation from an alternative viewpoint, is a key benchmark for human-level visual understanding, essential for environmental interaction and collaboration with autonomous agents. Despite advancements in spatial reasoning within VLMs, recent research has shown that modern VLMs significantly lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge the gap between VLMs and human perception, we focus on the role of mental imagery, where humans perceive the world through abstracted representations that facilitate perspective shifts. Motivated by this, we propose a framework for perspective-aware reasoning, named Abstract Perspective Change (APC), that effectively leverages vision foundation models, such as object detection, segmentation, and orientation estimation, to construct scene abstractions and enable perspective transformations. Our experiments on synthetic and real-image benchmarks, compared with various VLMs, demonstrate significant improvements in perspective-aware reasoning with our framework, further outperforming fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.

中文摘要

我们提出了一个通过心理意象模拟在视觉-语言模型（VLM）中实现视角感知推理的框架。视角采择，即从另一视点感知环境或情境的能力，是人类水平视觉理解的关键标志，对于与环境交互以及与自主智能体协作至关重要。尽管VLM在空间推理方面取得了进展，但近期研究表明，现代VLM明显缺乏视角感知推理能力，并表现出对自我中心解释的强烈偏向。为了弥合VLM与人类感知之间的差距，我们关注心理意象的作用：人类通过有助于视角转换的抽象表示来感知世界。受此启发，我们提出了一个名为抽象视角变换（APC）的视角感知推理框架，它有效利用目标检测、分割和朝向估计等视觉基础模型来构建场景抽象并实现视角变换。在合成图像和真实图像基准上与多种VLM的对比实验表明，我们的框架显著提升了视角感知推理能力，并且优于经过微调的空间推理模型和基于新视角合成的方法。

I-Con：表示学习的统一框架

标题: I-Con: A Unifying Framework for Representation Learning
作者: Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton
日期: 2025-04-23
ArXiv主页: https://arxiv.org/abs/2504.16929
论文链接: https://arxiv.org/pdf/2504.16929

英文摘要

As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework that shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. This framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but we also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state-of-the-art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.

中文摘要

随着表示学习领域的发展，为解决不同类别的问题而提出的损失函数层出不穷。我们提出了一个单一的信息论方程，它推广了机器学习中大量的现代损失函数。具体而言，我们提出了一个框架，表明若干大类机器学习方法恰好是在最小化两个条件分布（监督表示和学习到的表示）之间的积分KL散度。这一视角揭示了聚类、谱方法、降维、对比学习和监督学习背后隐藏的信息几何。该框架使得人们可以结合文献中各种成功的技术来开发新的损失函数。我们不仅给出了大量证明，将超过23种不同的方法联系起来，还利用这些理论结果构建了最先进的无监督图像分类器，在ImageNet-1K无监督分类上比此前的最先进方法提升了8％。我们还证明，I-Con可用于推导有原则的去偏方法，从而改进对比表示学习器。
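The abstract's central object is a KL divergence between a supervisory and a learned conditional distribution, integrated over data points. On a finite dataset that integral can be read as an average over anchor points, which the sketch below computes. The direction KL(supervisory || learned), the neighbourhood matrices, and the function names are assumptions for illustration rather than the paper's exact formulation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pj * math.log(pj / max(qj, eps)) for pj, qj in zip(p, q) if pj > 0
    )


def average_conditional_kl(P, Q):
    """Mean over anchors i of D_KL(P[i] || Q[i])."""
    return sum(kl_divergence(p, q) for p, q in zip(P, Q)) / len(P)


# Row i is a distribution over the neighbours j of anchor point i.
supervisory = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.1], [0.5, 0.5, 0.0]]
learned = [[0.0, 0.8, 0.2], [0.7, 0.0, 0.3], [0.5, 0.5, 0.0]]
loss = average_conditional_kl(supervisory, learned)
```

Different choices for how the two neighbourhood distributions are built (for example from class labels, augmentations, or embedding distances) are what would distinguish one method from another under this view.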
