【论文速递】2025年第15周(Apr-06-12)(Robotics/Embodied AI/LLM)

中文使用 googletrans 翻译,翻译不对的地方以英文为准

目录

  • SmolVLM:重新定义小而高效的多模态模型
    • 英文摘要
    • 中文摘要
  • OmniSVG:统一的可缩放矢量图形生成模型
    • 英文摘要
    • 中文摘要
  • Kimi-VL技术报告
    • 英文摘要
    • 中文摘要
  • Hogwild!推理:基于并发注意力的并行LLM生成
    • 英文摘要
    • 中文摘要
  • 基于测试时训练的一分钟视频生成
    • 英文摘要
    • 中文摘要
  • DeepSeek-R1思维学(Thoughtology):让我们<think>一下LLM推理
    • 英文摘要
    • 中文摘要
  • Skywork R1V:以思维链开拓多模态推理
    • 英文摘要
    • 中文摘要
  • 重新思考预训练中的反思
    • 英文摘要
    • 中文摘要
  • OLMoTrace:将语言模型输出追溯到数万亿训练词元
    • 英文摘要
    • 中文摘要
  • DDT:解耦扩散Transformer
    • 英文摘要
    • 中文摘要
  • GPT-4o图像生成能力的实证研究
    • 英文摘要
    • 中文摘要
  • C3PO:用于测试时专家重混合的关键层、核心专家、协作路径优化
    • 英文摘要
    • 中文摘要
  • VisualCloze:基于视觉上下文学习的通用图像生成框架
    • 英文摘要
    • 中文摘要
  • VCR-Bench:视频思维链推理的综合评估框架
    • 英文摘要
    • 中文摘要
  • Multi-SWE-bench:面向问题解决的多语言基准
    • 英文摘要
    • 中文摘要
  • COIG-P:用于对齐人类价值观的高质量大规模中文偏好数据集
    • 英文摘要
    • 中文摘要
  • T1:小语言模型中用于测试时计算扩展的工具集成自我验证
    • 英文摘要
    • 中文摘要
  • 缺少前提加剧了过度思考:推理模型是否失去了批判性思维技能?
    • 英文摘要
    • 中文摘要
  • Less-to-More泛化:通过上下文内生成解锁更多可控性
    • 英文摘要
    • 中文摘要
  • MM-IFEngine:迈向多模态指令遵循
    • 英文摘要
    • 中文摘要
  • URECA:任意区域的独特描述
    • 英文摘要
    • 中文摘要
  • FantasyTalking:通过连贯运动合成生成逼真的说话人肖像
    • 英文摘要
    • 中文摘要
  • MegaMath:突破开放数学语料库的极限
    • 英文摘要
    • 中文摘要
  • 量化会损害推理吗?一项关于量化推理模型的实证研究
    • 英文摘要
    • 中文摘要
  • 用于评估条件图像生成的统一代理框架
    • 英文摘要
    • 中文摘要

SmolVLM:重新定义小而高效的多模态模型

  • 标题: SmolVLM: Redefining small and efficient multimodal models
  • 作者: Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, Thomas Wolf
  • 日期: 2025-04-07
  • ArXiv主页: https://arxiv.org/abs/2504.05299
  • 论文链接: https://arxiv.org/pdf/2504.05299
  • 项目链接: https://huggingface.co/collections/HuggingFaceTB/smolvlm2-smallest-video-lm-ever-67ab6b5e84bf8aaa60cb17c7
  • gitHub仓库: https://github.com/huggingface/smollm

英文摘要

Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.

中文摘要

大型视觉-语言模型(VLM)性能出色,但需要大量计算资源,限制了其在移动和边缘设备上的部署。较小的VLM通常照搬大模型的设计选择(例如大量的图像token化),导致GPU显存使用效率低下,难以在端侧应用中落地。我们提出SmolVLM,这是一系列专为资源高效推理而设计的紧凑多模态模型。我们系统地探索了面向低计算开销的架构配置、token化策略和数据筛选方案,从中确定了能以极小显存占用在图像和视频任务上带来显著性能提升的关键设计选择。我们最小的模型SmolVLM-256M在推理时占用不到1GB GPU显存,却超越了体量是其300倍的Idefics-80B模型(两者相隔18个月)。我们最大的2.2B参数模型可与显存占用两倍于它的最先进VLM相媲美。SmolVLM不仅限于静态图像,还展现出强大的视频理解能力。我们的结果表明,有策略的架构优化、激进而高效的token化以及精心筛选的训练数据能显著提升多模态性能,使得在小得多的规模上实现实用、节能的部署成为可能。
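下面给出一个极简的示意代码(并非论文的官方实现),用来说明摘要中"激进而高效的token化"在小型VLM中常见的一种做法:通过像素重排(pixel shuffle / space-to-depth)把视觉特征图压缩成更少的图像token。函数名与压缩比例均为演示用的假设。

```python
import numpy as np

def pixel_shuffle_compress(features: np.ndarray, r: int = 2) -> np.ndarray:
    """把 (H, W, C) 的视觉特征图压缩成 (H/r, W/r, C*r*r)。

    这是一种常见的减少图像 token 数量的技巧(space-to-depth):
    token 数变为原来的 1/r^2,信息被折叠进通道维。
    """
    H, W, C = features.shape
    assert H % r == 0 and W % r == 0, "特征图边长需能被 r 整除"
    x = features.reshape(H // r, r, W // r, r, C)      # 拆出 r×r 的局部块
    x = x.transpose(0, 2, 1, 3, 4)                     # (H/r, W/r, r, r, C)
    return x.reshape(H // r, W // r, C * r * r)        # 局部块并入通道维

# 演示:28×28 patch、维度 64 的特征图(假设值)压缩后 token 数减少到原来的 1/4
feats = np.random.randn(28, 28, 64).astype(np.float32)
compressed = pixel_shuffle_compress(feats, r=2)
print(feats.shape[0] * feats.shape[1], "->", compressed.shape[0] * compressed.shape[1])
```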


OmniSVG:统一的可缩放矢量图形生成模型

  • 标题: OmniSVG: A Unified Scalable Vector Graphics Generation Model
  • 作者: Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, Yu-Gang Jiang
  • 日期: 2025-04-08
  • ArXiv主页: https://arxiv.org/abs/2504.06263
  • 论文链接: https://arxiv.org/pdf/2504.06263
  • 项目链接: https://omnisvg.github.io/
  • gitHub仓库: https://github.com/OmniSVG/OmniSVG

英文摘要

Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of their resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produces unstructured outputs with huge computational cost or is limited to generating monochrome icons of over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structure. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.

中文摘要

可缩放矢量图形(SVG)因其分辨率无关性和可编辑性,是图形设计中广泛采用的重要图像格式。高质量SVG生成的研究持续受到AIGC社区中设计师与研究者的关注。然而,现有方法要么以巨大的计算开销产生非结构化的输出,要么只能生成结构过度简化的单色图标。为了生成高质量且复杂的SVG,我们提出OmniSVG,一个利用预训练视觉-语言模型(VLM)进行端到端多模态SVG生成的统一框架。通过把SVG命令和坐标参数化为离散token,OmniSVG将结构逻辑与底层几何解耦,从而在保持复杂SVG结构表达力的同时实现高效训练。为进一步推动SVG合成的发展,我们还提出了MMSVG-2M,一个包含200万条带丰富标注SVG素材的多模态数据集,并为条件SVG生成任务给出了标准化评估协议。大量实验表明,OmniSVG优于现有方法,展现出融入专业SVG设计工作流的潜力。
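摘要提到"将SVG命令和坐标参数化为离散token"。下面按这一思路给出一个极简示意(词表与量化网格均为演示用假设,并非论文官方的编码方式):为常见路径命令分配命令token,并把坐标量化到固定网格后映射为坐标token。

```python
# 极简示意:把 SVG 路径命令序列离散化成 token 序列(词表设计为假设)
CMD_TOKENS = {"M": 0, "L": 1, "C": 2, "Z": 3}   # 命令 token
COORD_OFFSET = len(CMD_TOKENS)                   # 坐标 token 紧跟在命令 token 之后
GRID = 200                                       # 把画布坐标量化到 200 个桶(假设)

def quantize(v: float, size: float = 200.0) -> int:
    """把 [0, size] 范围内的坐标量化为 [0, GRID-1] 的整数桶。"""
    return min(GRID - 1, max(0, int(v / size * GRID)))

def tokenize_path(commands):
    """commands 形如 [("M", [10, 20]), ("L", [50, 80]), ("Z", [])],返回 token id 列表。"""
    tokens = []
    for cmd, coords in commands:
        tokens.append(CMD_TOKENS[cmd])
        for v in coords:
            tokens.append(COORD_OFFSET + quantize(v))
    return tokens

path = [("M", [10, 20]), ("C", [30, 40, 60, 40, 90, 20]), ("L", [120, 150]), ("Z", [])]
print(tokenize_path(path))
```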


Kimi-VL技术报告

  • 标题: Kimi-VL Technical Report

  • 作者: Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Shaowei Liu, Sihan Cao, Tao Yu, Tianhui Song, Tongtong Bai, Wei Song, Weiran He, Weixiao Huang, Weixin Xu, Xiaokun Yuan, Xingcheng Yao, Xingzhe Wu, Xinxing Zu, Xinyu Zhou, Xinyuan Wang, Y. Charles, Yan Zhong, Yang Li, Yangyang Hu, Yanru Chen, Yejie Wang, Yibo Liu, Yibo Miao, Yidao Qin, Yimin Chen, Yiping Bao, Yiqin Wang, Yongsheng Kang, Yuanxin Liu, Yulun Du, Yuxin Wu, Yuzhi Wang, Yuzi Yan, Zaida Zhou, Zhaowei Li, Zhejun Jiang, Zheng Zhang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Zijia Zhao, Ziwei Chen

  • 日期: 2025-04-10

  • ArXiv主页: https://arxiv.org/abs/2504.07491

  • 论文链接: https://arxiv.org/pdf/2504.07491

  • gitHub仓库: https://github.com/MoonshotAI/Kimi-VL

英文摘要

We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.

中文摘要

我们提出Kimi-VL,这是一种高效的开源专家混合(MoE)视觉-语言模型(VLM),具备先进的多模态推理、长上下文理解和强大的智能体能力,而其语言解码器仅激活2.8B参数(Kimi-VL-A3B)。Kimi-VL在多个具有挑战性的领域表现出色:作为通用VLM,它在多轮智能体任务(如OSWorld)中表现突出,可与旗舰模型相匹敌;同时在大学水平的图像与视频理解、OCR、数学推理、多图理解等多种高难度视觉语言任务上展现出卓越能力。在对比评测中,它能与GPT-4o-mini、Qwen2.5-VL-7B、Gemma-3-12B-IT等前沿高效VLM有效竞争,并在若干关键领域超越GPT-4o。Kimi-VL在长上下文处理和精细感知方面同样出色:凭借128K扩展上下文窗口,它可以处理多种长输入,在LongVideoBench上取得64.5分、在MMLongBench-Doc上取得35.1分;其原生分辨率视觉编码器MoonViT使其能够看懂超高分辨率视觉输入,在InfoVQA上达到83.2、在ScreenSpot-Pro上达到34.5,同时在常规任务上保持较低的计算成本。在Kimi-VL的基础上,我们进一步推出长思考变体Kimi-VL-Thinking。该模型通过长思维链(CoT)监督微调(SFT)和强化学习(RL)训练而成,具备强大的长程推理能力,在MMMU上取得61.7分、MathVision上36.8分、MathVista上71.3分,同时保持2.8B的激活LLM参数量,为高效多模态思考模型树立了新标准。代码和模型已在 https://github.com/MoonshotAI/Kimi-VL 公开。


Hogwild!推理:基于并发注意力的并行LLM生成

  • 标题: Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
  • 作者: Gleb Rodionov, Roman Garipov, Alina Shutova, George Yakushev, Vage Egiazarian, Anton Sinitsin, Denis Kuznedelev, Dan Alistarh
  • 日期: 2025-04-08
  • ArXiv主页: https://arxiv.org/abs/2504.06261
  • 论文链接: https://arxiv.org/pdf/2504.06261
  • 项目链接: https://eqimp.github.io/hogwild_llm/
  • gitHub仓库: https://gist.github.com/justheuristic/dd6032ad2dcbb6d4c97ab58e289592e7

英文摘要

Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: by dividing the problem into sub-tasks, exploring different strategies concurrently, etc. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, each of these frameworks may not be suitable for all types of tasks, which can hinder their applicability. In this work, we propose a different design approach: we run LLM “workers” in parallel , allowing them to synchronize via a concurrently-updated attention cache and prompt these workers to decide how best to collaborate. Our approach allows the instances to come up with their own collaboration strategy for the problem at hand, all the while “seeing” each other’s partial progress in the concurrent cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with “instant” access to each other’s generated tokens. Hogwild! inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with shared Key-Value cache out of the box, without additional fine-tuning.

中文摘要

大型语言模型(LLM)已展现出通过高级推理、长文本生成和工具使用来解决日益复杂任务的能力,而求解这些任务往往需要很长的推理时计算。在人类解决问题时,加快工作的常见策略是协作:把问题拆分成子任务、同时探索不同思路等。最近的研究表明,LLM也可以通过显式的协作框架并行工作,例如投票机制,或显式划分出可并行执行的独立子任务。然而,这些框架各自都未必适用于所有类型的任务,限制了其适用范围。在这项工作中,我们提出了一种不同的设计思路:让多个LLM“工作者”并行运行,通过一个并发更新的注意力缓存相互同步,并提示这些工作者自行决定如何最好地协作。我们的方法让各个实例能针对手头问题自行形成协作策略,同时在共享缓存中“看到”彼此的部分进展。我们通过Hogwild!推理来实现这一方法:这是一个并行LLM推理引擎,同一LLM的多个实例共享同一注意力缓存并行运行,可“即时”访问彼此已生成的token。Hogwild!推理利用旋转位置编码(RoPE)避免重复计算,并提升并行硬件利用率。我们发现,具备推理能力的现代LLM无需额外微调即可开箱即用地基于共享KV缓存进行推理。
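下面是一个玩具级示意(与官方实现无关),用来说明"多个LLM工作者即时看到彼此已生成token"的基本流程:两个工作者交替生成,每一步都读取对方已写入共享缓存的内容。真实系统中共享的是注意力KV缓存并真正并行执行,这里用共享的token列表近似,`fake_next_token` 是演示用的占位函数。

```python
import itertools

# 共享"缓存":两个工作者各自的已生成 token,彼此可见
shared = {"worker_0": [], "worker_1": []}

def fake_next_token(worker_id: str, visible_context: list) -> str:
    """占位的"模型一步解码":真实场景中这里是带共享 KV 缓存的一次前向计算。"""
    return f"{worker_id}-tok{len(visible_context)}"

def step(worker_id: str):
    # 每一步都把两个工作者的产出拼成可见上下文(即"即时看到对方的进展")
    visible = shared["worker_0"] + shared["worker_1"]
    shared[worker_id].append(fake_next_token(worker_id, visible))

# 交替推进两个工作者各 4 步;真实实现中两者并行执行并复用同一注意力缓存
for worker_id in itertools.islice(itertools.cycle(["worker_0", "worker_1"]), 8):
    step(worker_id)

print(shared)
```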


基于测试时训练的一分钟视频生成

  • 标题: One-Minute Video Generation with Test-Time Training
  • 作者: Karan Dalal, Daniel Koceja, Gashon Hussein, Jiarui Xu, Yue Zhao, Youjin Song, Shihao Han, Ka Chun Cheung, Jan Kautz, Carlos Guestrin, Tatsunori Hashimoto, Sanmi Koyejo, Yejin Choi, Yu Sun, Xiaolong Wang
  • 日期: 2025-04-07
  • ArXiv主页: https://arxiv.org/abs/2504.05298
  • 论文链接: https://arxiv.org/pdf/2504.05298
  • 项目链接: https://test-time-training.github.io/video-dit/
  • gitHub仓库: https://github.com/test-time-training/ttt-video-dit

英文摘要

Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks, therefore more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. For proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba~2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code and annotations are available at: https://test-time-training.github.io/video-dit

中文摘要

如今的Transformer仍然难以生成一分钟长的视频,因为自注意力层在长上下文下效率低下。Mamba层等替代方案的隐藏状态表达能力有限,难以处理复杂的多场景故事。我们尝试了测试时训练(Test-Time Training, TTT)层,其隐藏状态本身可以是神经网络,因此表达能力更强。把TTT层加入预训练Transformer后,模型能够根据文本故事板生成一分钟长的视频。作为概念验证,我们整理了一个基于《猫和老鼠》动画的数据集。与Mamba 2、Gated DeltaNet和滑动窗口注意力等基线相比,TTT层生成的视频在讲述复杂故事时连贯得多:在每种方法100个视频的人工评测中领先34个Elo分。尽管结果令人鼓舞,生成视频仍存在伪影,这很可能源于预训练5B模型的能力限制;我们实现的效率也还有提升空间。受资源限制,我们只实验了一分钟长的视频,但该方法可以扩展到更长的视频和更复杂的故事。示例视频、代码和标注见:https://test-time-training.github.io/video-dit
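下面给出TTT层思想的一个最小数值示意(并非论文实现):隐藏状态本身是一个小的线性模型 W,在处理每个输入token时先用自监督重构损失对 W 做一步梯度更新("测试时训练"),再用更新后的 W 产生输出。维度、学习率和内层损失均为随意设定的演示假设。

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                       # token 维度(演示值)
W = np.zeros((d, d))        # 隐藏状态:一个会在测试时被更新的线性模型
lr = 0.1                    # 内层(测试时)学习率,演示值

def ttt_step(W, x):
    """对单个 token x 做一步测试时训练:
    内层目标取最简单的自监督重构损失 0.5 * ||W x - x||^2(论文中的目标更复杂)。"""
    pred = W @ x
    grad = np.outer(pred - x, x)        # 损失对 W 的梯度
    W = W - lr * grad                   # 隐藏状态更新 = 一步梯度下降
    return W, W @ x                     # 输出用更新后的 W 计算

tokens = rng.standard_normal((16, d))
outputs = []
for x in tokens:
    W, y = ttt_step(W, x)
    outputs.append(y)

# 隐藏状态(W)随序列不断"学习",因此比固定大小的向量状态更有表达力
print(np.linalg.norm(W), len(outputs))
```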


DeepSeek-R1思维学(Thoughtology):让我们<think>一下LLM推理

  • 标题: DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
  • 作者: Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Siva Reddy
  • 日期: 2025-04-02
  • ArXiv主页: https://arxiv.org/abs/2504.07128
  • 论文链接: https://arxiv.org/pdf/2504.07128
  • 项目链接: https://mcgill-nlp.github.io/thoughtology/
  • gitHub仓库: https://github.com/mcgill-NLP/thoughtology

英文摘要

Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly “thinking” about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1’s basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-`a-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a ‘sweet spot’ of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.

中文摘要

DeepSeek-R1这类大型推理模型标志着LLM处理复杂问题方式的根本转变:它不再直接针对给定输入给出答案,而是先构建详细的多步推理链,仿佛在给出答案之前对问题进行“思考”。这一推理过程对用户公开,为研究模型的推理行为创造了无穷的机会,也开辟了“思维学”(Thoughtology)这一领域。我们从DeepSeek-R1推理基本构件的分类体系出发,分析了思考长度的影响及其可控性、对长上下文或易混淆上下文的处理、文化与安全问题,以及DeepSeek-R1相对于类人语言处理、世界建模等认知现象的表现。我们的发现呈现出一幅细致入微的图景。值得注意的是,我们发现DeepSeek-R1存在一个推理的“甜点区”,超出该区间的额外推理时间反而会损害模型性能。此外,DeepSeek-R1倾向于对先前已探索过的问题表述反复纠缠,从而阻碍进一步探索。我们还注意到,与不做显式推理的对应模型相比,DeepSeek-R1存在明显的安全漏洞,这也可能危及经过安全对齐的LLM。


Skywork R1V:以思维链开拓多模态推理

  • 标题: Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

  • 作者: Yi Peng, Chris, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou

  • 日期: 2025-04-08

  • ArXiv主页: https://arxiv.org/abs/2504.05599

  • 论文链接: https://arxiv.org/pdf/2504.05599

  • gitHub仓库: https://github.com/SkyworkAI/Skywork-R1V

英文摘要

We introduce Skywork R1V, a multimodal reasoning model extending the an R1-series Large language models (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly enhancing cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby enhancing inference efficiency and preventing excessive reasoning overthinking. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.

中文摘要

我们介绍了Skywork R1V,这是一种多模式推理模型,通过有效的多模式传输方法将R1系列大型语言模型(LLM)扩展到视觉方式。Skywork R1V利用轻巧的视觉投影仪,促进了无缝的多模式适应,而无需重新训练基础语言模型或视觉编码器。为了加强视觉文本对齐,我们提出了一种混合优化策略,将迭代监督的微调(SFT)与小组相对策略优化(GRPO)相结合,从而显着提高了交叉模式的整合效率。此外,我们引入了一种自适应长度链条蒸馏方法,用于推理数据的生成。这种方法动态优化了推理链长度,从而提高了推理效率并防止过多的推理过度思考。经验评估表明,Skywork R1V只有38B参数,提供了竞争性能,在MMMU基准测试中获得了69.0的得分,而在Mathvista上的得分为67.5。同时,它保持了强大的文本推理性能,这在AIME上的72.0分数和Math500上的94.0得分证明了。Skywork R1V模型权重已公开发布,以促进开放性和可重复性。


重新思考预训练中的反思

  • 标题: Rethinking Reflection in Pre-Training
  • 作者: Essential AI, Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Anthony Polloreno, Ashish Tanwer, Burhan Drak Sibai, Divya S Mansingka, Divya Shivaprasad, Ishaan Shah, Karl Stratos, Khoi Nguyen, Michael Callahan, Michael Pust, Mrinal Iyer, Philip Monk, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Tim Romanski
  • 日期: 2025-04-05
  • ArXiv主页: https://arxiv.org/abs/2504.04022
  • 论文链接: https://arxiv.org/pdf/2504.04022
  • 项目链接: https://www.essential.ai/blog/eval
  • gitHub仓库: https://github.com/Essential-AI/reflection

英文摘要

A language model’s ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier - during the model’s pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks.

中文摘要

语言模型反思自己推理的能力为解决复杂问题提供了关键优势。尽管最近的研究集中在强化学习过程中这种能力如何发展,但我们表明,在模型的预训练期间,它实际上开始更早地出现。为了研究这一点,我们将故意的错误引入了思想链中,并测试该模型是否仍然可以通过识别和纠正这些错误来得出正确的答案。通过跟踪跨预训练阶段的性能,我们观察到这种自我校正能力出现早期,并且随着时间的流逝而稳步改善。例如,在4万亿代币上预先训练的OLMO2-7B模型在我们的六个自我反射任务上显示自校正。


OLMoTrace:将语言模型输出追溯到数万亿训练词元

  • 标题: OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

  • 作者: Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Cheng, Karen Farley, Sruthi Sreeram, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Yuren Pang, Pang Wei Koh, Noah A. Smith, Sophie Lebrecht, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi, Jesse Dodge

  • 日期: 2025-04-09

  • ArXiv主页: https://arxiv.org/abs/2504.07096

  • 论文链接: https://arxiv.org/pdf/2504.07096

英文摘要

We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.

中文摘要

我们提出OLMoTrace,这是第一个能将语言模型的输出实时追溯到其完整的、数万亿token规模训练数据的系统。OLMoTrace可以在训练文本语料库的文档中找到并展示与语言模型输出片段逐字匹配的内容。借助infini-gram的扩展版本(Liu et al., 2024),我们的系统能在几秒内返回追溯结果。OLMoTrace可以帮助用户透过训练数据来理解语言模型的行为。我们展示了如何用它来探索事实核查、幻觉以及语言模型的创造力。OLMoTrace已公开可用且完全开源。
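下面是"在训练语料中查找与模型输出逐字匹配片段"这一思路的简化示意(真实系统基于infini-gram,能在数万亿token上秒级查询;这里仅用一个内存中的n-gram倒排索引演示,语料与输出均为虚构示例)。

```python
from collections import defaultdict

def build_ngram_index(corpus_docs, n=3):
    """为语料中的每个 n-gram 记录它出现在哪些文档里(简化版倒排索引)。"""
    index = defaultdict(set)
    for doc_id, doc in enumerate(corpus_docs):
        words = doc.split()
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(doc_id)
    return index

def trace_output(output_text, index, n=3):
    """返回模型输出中能在语料里找到逐字匹配的 n-gram 及其来源文档编号。"""
    words = output_text.split()
    hits = []
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        if gram in index:
            hits.append((" ".join(gram), sorted(index[gram])))
    return hits

corpus = ["the quick brown fox jumps over the lazy dog",
          "language models memorize long spans of training text"]
index = build_ngram_index(corpus, n=3)
print(trace_output("models memorize long spans of text about the quick brown fox", index, n=3))
```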


DDT:解耦扩散Transformer

  • 标题: DDT: Decoupled Diffusion Transformer

  • 作者: Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang

  • 日期: 2025-04-08

  • ArXiv主页: https://arxiv.org/abs/2504.05741

  • 论文链接: https://arxiv.org/pdf/2504.05741

  • gitHub仓库: https://github.com/MCG-NJU/DDT

英文摘要

Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher frequency with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new Decoupled Diffusion Transformer (DDT), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet 256×256, our DDT-XL/2 achieves a new state-of-the-art performance of 1.31 FID (nearly 4× faster training convergence compared to previous diffusion transformers). For ImageNet 512×512, our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling the sharing self-condition between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.

中文摘要

扩散Transformer已展现出出色的生成质量,但需要更长的训练迭代和大量推理步数。在每个去噪步骤中,扩散Transformer先对带噪输入进行编码以提取低频语义成分,再用同一批模块解码出高频成分。这种方案带来了一个固有的优化困境:编码低频语义需要削弱高频成分,使语义编码与高频解码之间产生矛盾。为解决这一挑战,我们提出了新的解耦扩散Transformer(Decoupled Diffusion Transformer, DDT),采用解耦设计:用专门的条件编码器提取语义,并配以专用的速度解码器。我们的实验表明,随着模型规模增大,更大的编码器能带来更高的性能。在ImageNet 256×256上,我们的DDT-XL/2取得了1.31 FID的新SOTA成绩(训练收敛速度比以往扩散Transformer快近4倍);在ImageNet 512×512上,DDT-XL/2取得了1.28的新SOTA FID。此外,作为有益的副产品,我们的解耦架构允许相邻去噪步骤之间共享自条件(self-condition),从而加快推理速度。为尽量减少性能损失,我们提出了一种新颖的统计动态规划方法来确定最优的共享策略。
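下面用几行PyTorch给出"条件编码器+速度解码器解耦"这一结构思想的最小示意(模块规模和具体结构均为演示用假设,并非论文配置):编码器从带噪输入提取语义自条件 z,解码器结合 z 预测速度;推理时 z 可以在相邻去噪步之间复用,从而省去部分编码计算。

```python
import torch
import torch.nn as nn

class TinyDDT(nn.Module):
    """极简示意:条件编码器负责语义,速度解码器负责细节。"""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim + 1, dim), nn.GELU(), nn.Linear(dim, dim))
        self.decoder = nn.Sequential(nn.Linear(dim * 2 + 1, dim), nn.GELU(), nn.Linear(dim, dim))

    def encode(self, x_t, t):
        return self.encoder(torch.cat([x_t, t], dim=-1))          # 语义自条件 z

    def decode(self, x_t, t, z):
        return self.decoder(torch.cat([x_t, z, t], dim=-1))       # 预测速度 v

model = TinyDDT()
x_t = torch.randn(4, 64)                     # 一批带噪样本(演示数据)
t1, t2 = torch.full((4, 1), 0.9), torch.full((4, 1), 0.8)

z = model.encode(x_t, t1)                    # 当前步计算一次语义条件
v1 = model.decode(x_t, t1, z)                # 当前步的速度
v2 = model.decode(x_t - 0.1 * v1, t2, z)     # 相邻步复用 z(共享自条件)以省去一次编码
print(v1.shape, v2.shape)
```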


GPT-4o图像生成能力的实证研究

  • 标题: An Empirical Study of GPT-4o Image Generation Capabilities
  • 作者: Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, Shilin Xu, Tao Zhang, Haobo Yuan, Yikang Zhou, Wei Chow, Linfeng Li, Xiangtai Li, Lei Zhu, Lu Qi
  • 日期: 2025-04-08
  • ArXiv主页: https://arxiv.org/abs/2504.05979
  • 论文链接: https://arxiv.org/pdf/2504.05979

英文摘要

The landscape of image generation has rapidly evolved, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Recent advances, especially the GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, their architectural design remains mysterious and unpublished. This prompts the question of whether image and text generation have already been successfully integrated into a unified framework for those methods. In this work, we conduct an empirical study of GPT-4o’s image generation capabilities, benchmarking it against leading open-source and commercial models. Our evaluation covers four main categories, including text-to-image, image-to-image, image-to-3D, and image-to-X generation, with more than 20 tasks. Our analysis highlights the strengths and limitations of GPT-4o under various settings, and situates it within the broader evolution of generative modeling. Through this investigation, we identify promising directions for future unified generative models, emphasizing the role of architectural design and data scaling.

中文摘要

图像生成领域发展迅速:从早期基于GAN的方法,到扩散模型,再到最近试图打通理解与生成任务的统一生成架构。近期的进展,尤其是GPT-4o,已经证明了高保真多模态生成的可行性,但其架构设计仍然成谜且未公开。这引出一个问题:对这些方法而言,图像与文本生成是否已被成功整合进一个统一框架。在这项工作中,我们对GPT-4o的图像生成能力进行了实证研究,并将其与领先的开源和商业模型进行对比评测。我们的评估涵盖文本生成图像、图像到图像、图像到3D以及图像到X四大类共20多项任务。我们的分析指出了GPT-4o在各种设置下的优势与局限,并将其置于生成建模的更广阔演进脉络中。通过这项研究,我们为未来的统一生成模型指出了有前景的方向,强调了架构设计与数据规模化的作用。


C3PO:用于测试时专家重混合的关键层、核心专家、协作路径优化

  • 标题: C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing

  • 作者: Zhongyang Li, Ziyue Li, Tianyi Zhou

  • 日期: 2025-04-10

  • ArXiv主页: https://arxiv.org/abs/2504.07964

  • 论文链接: https://arxiv.org/pdf/2504.07964

  • gitHub仓库: https://github.com/tianyi-lab/C3PO

英文摘要

Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways-our study reveals that naive expert selection learned from pretraining leaves a surprising 10-20% accuracy gap for improvement. Motivated by this observation, we develop a novel class of test-time optimization methods to re-weight or “re-mixing” the experts in different layers jointly for each test sample. Since the test sample’s ground truth is unknown, we propose to optimize a surrogate objective defined by the sample’s “successful neighbors” from a reference set of samples. We introduce three surrogates and algorithms based on mode-finding, kernel regression, and the average loss of similar reference samples/tasks. To reduce the cost of optimizing whole pathways, we apply our algorithms merely to the core experts’ mixing weights in critical layers, which enjoy similar performance but save significant computation. This leads to “Critical-Layer, Core-Expert, Collaborative Pathway Optimization (C3PO)”. We apply C3PO to two recent MoE LLMs and examine it on six widely-used benchmarks. It consistently improves the base model by 7-15% in accuracy and outperforms widely used test-time learning baselines, e.g., in-context learning and prompt/prefix tuning, by a large margin. Moreover, C3PO enables MoE LLMs with 1-3B active parameters to outperform LLMs of 7-9B parameters, hence improving MoE’s advantages on efficiency. Our thorough ablation study further sheds novel insights on achieving test-time improvement on MoE.

中文摘要

专家混合(MoE)大语言模型(LLM)的专家路由路径往往严重次优——我们的研究表明,预训练学到的朴素专家选择留下了高达10–20%的准确率提升空间。受此启发,我们提出了一类新的测试时优化方法,针对每个测试样本,联合地对不同层中的专家进行重新加权或“重混合”。由于测试样本的真实标签未知,我们建议优化一个由参考样本集中该样本的“成功邻居”所定义的替代目标。我们给出了基于众数查找、核回归以及相似参考样本/任务平均损失的三种替代目标与算法。为了降低优化整条路径的成本,我们只对关键层中核心专家的混合权重应用这些算法,在性能相近的同时显著节省计算。这就是“关键层、核心专家、协作路径优化”(C3PO)。我们将C3PO应用于两个最新的MoE LLM,并在六个常用基准上进行检验:它稳定地将基础模型的准确率提升7–15%,并大幅优于上下文学习、提示/前缀微调等常用的测试时学习基线。此外,C3PO使激活参数仅1–3B的MoE LLM能够超越7–9B参数的LLM,进一步扩大了MoE在效率上的优势。我们细致的消融研究也为如何在MoE上实现测试时提升提供了新的见解。
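摘要提到三种替代目标之一是核回归。下面给出该思路的一个极简示意(并非论文实现):用测试样本与参考样本的相似度作核权重,把参考样本上"成功的"专家混合权重加权平均,得到测试样本在关键层上使用的新混合权重。数据均为随机生成的演示值。

```python
import numpy as np

rng = np.random.default_rng(1)

def kernel_regression_remix(x, ref_feats, ref_mixes, tau=1.0):
    """x: 测试样本特征; ref_feats: 参考样本特征 (N, d);
    ref_mixes: 参考样本上成功的专家混合权重 (N, E)。
    返回对测试样本的重混合权重(各专家权重归一化)。"""
    sims = ref_feats @ x / (np.linalg.norm(ref_feats, axis=1) * np.linalg.norm(x) + 1e-8)
    k = np.exp(sims / tau)                      # 余弦相似度 -> 核权重
    k = k / k.sum()
    mix = k @ ref_mixes                         # 邻居混合权重的核加权平均
    return mix / mix.sum()

d, E, N = 16, 8, 32                             # 特征维度、专家数、参考样本数(演示值)
ref_feats = rng.standard_normal((N, d))
ref_mixes = rng.dirichlet(np.ones(E), size=N)   # 每个参考样本一组专家权重
x = rng.standard_normal(d)

new_mix = kernel_regression_remix(x, ref_feats, ref_mixes)
print(new_mix.round(3), new_mix.sum())          # 新的专家混合权重,和为 1
```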


VisualCloze:基于视觉上下文学习的通用图像生成框架

  • 标题: VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
  • 作者: Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, Ming-Ming Cheng
  • 日期: 2025-04-10
  • ArXiv主页: https://arxiv.org/abs/2504.07960
  • 论文链接: https://arxiv.org/pdf/2504.07960
  • 项目链接: https://visualcloze.github.io/
  • gitHub仓库: https://github.com/lzyhha/VisualCloze

英文摘要

Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework, which supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. Unlike existing methods that rely on language-based task instruction, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we uncover that our unified image generation formulation shared a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying the architectures.

中文摘要

扩散模型的最新进展显著推动了各类图像生成任务。然而,当前主流做法仍然是构建针对特定任务的模型,在支持广泛多样的需求时效率有限。通用模型试图解决这一局限,但面临关键挑战,包括可泛化的任务指令、合适的任务分布以及统一的架构设计。为应对这些挑战,我们提出VisualCloze,一个通用图像生成框架,支持广泛的域内任务、向未见任务的泛化、对多任务的未见统一以及逆向生成。现有方法依赖基于语言的任务指令,容易导致任务歧义和泛化能力不足;与之不同,我们引入视觉上下文学习(visual in-context learning),让模型从视觉示例中识别任务。同时,视觉任务分布的固有稀疏性阻碍了跨任务可迁移知识的学习。为此,我们构建了Graph200K,一个建立了多种相互关联任务的图结构数据集,提升了任务密度和可迁移知识。此外,我们发现统一的图像生成公式与图像补全(image infilling)共享一致的目标,这使我们无需修改架构就能利用预训练补全模型强大的生成先验。


VCR-Bench:视频思维链推理的综合评估框架

  • 标题: VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
  • 作者: Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, Feng Zhao
  • 日期: 2025-04-10
  • ArXiv主页: https://arxiv.org/abs/2504.07956
  • 论文链接: https://arxiv.org/pdf/2504.07956
  • 项目链接: https://vlm-reasoning.github.io/VCR-Bench/
  • gitHub仓库: https://github.com/zhishuifeiqian/VCR-Bench

英文摘要

The advancement of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). However, a rigorous evaluation framework for video CoT reasoning remains absent. Current video benchmarks fail to adequately assess the reasoning process and expose whether failures stem from deficiencies in perception or reasoning capabilities. Therefore, we introduce VCR-Bench, a novel benchmark designed to comprehensively evaluate LVLMs’ Video Chain-of-Thought Reasoning capabilities. VCR-Bench comprises 859 videos spanning a variety of video content and durations, along with 1,034 high-quality question-answer pairs. Each pair is manually annotated with a stepwise CoT rationale, where every step is tagged to indicate its association with the perception or reasoning capabilities. Furthermore, we design seven distinct task dimensions and propose the CoT score to assess the entire CoT process based on the stepwise tagged CoT rationals. Extensive experiments on VCR-Bench highlight substantial limitations in current LVLMs. Even the top-performing model, o1, only achieves a 62.8% CoT score and an 56.7% accuracy, while most models score below 40%. Experiments show most models score lower on perception than reasoning steps, revealing LVLMs’ key bottleneck in temporal-spatial information processing for complex video reasoning. A robust positive correlation between the CoT score and accuracy confirms the validity of our evaluation framework and underscores the critical role of CoT reasoning in solving complex video reasoning tasks. We hope VCR-Bench to serve as a standardized evaluation framework and expose the actual drawbacks in complex video reasoning task.

中文摘要

思维链(CoT)推理的进步显著增强了大语言模型(LLM)和大型视觉-语言模型(LVLM)的能力。然而,针对视频CoT推理的严格评估框架仍然缺失。现有的视频基准无法充分评估推理过程,也难以判断失败究竟源于感知能力还是推理能力的不足。为此,我们提出VCR-Bench,一个旨在全面评估LVLM视频思维链推理能力的新基准。VCR-Bench包含859个涵盖多种内容和时长的视频,以及1,034个高质量问答对。每个问答对都人工标注了分步的CoT推理依据,并为每一步打上标签以指明其关联的是感知能力还是推理能力。此外,我们设计了七个不同的任务维度,并基于分步标注的CoT依据提出CoT得分来评估整个CoT过程。在VCR-Bench上的大量实验揭示了当前LVLM的明显局限:即便是表现最好的模型o1,也只取得62.8%的CoT得分和56.7%的准确率,而大多数模型得分低于40%。实验还显示,多数模型在感知步骤上的得分低于推理步骤,表明时空信息处理是LVLM进行复杂视频推理的关键瓶颈。CoT得分与准确率之间稳健的正相关验证了我们评估框架的有效性,也印证了CoT推理在解决复杂视频推理任务中的关键作用。我们希望VCR-Bench能成为标准化的评估框架,并揭示复杂视频推理任务中的实际短板。


Multi-SWE-bench:面向问题解决的多语言基准

  • 标题: Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
  • 作者: Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang
  • 日期: 2025-04-03
  • ArXiv主页: https://arxiv.org/abs/2504.02605
  • 论文链接: https://arxiv.org/pdf/2504.02605
  • 项目链接: https://multi-swe-bench.github.io
  • gitHub仓库: https://github.com/multi-swe-bench/multi-swe-bench

英文摘要

The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.

中文摘要

问题解决(issue resolving)任务是指修改代码库、生成能够解决给定issue的补丁。然而,现有基准(如SWE-bench)几乎只关注Python,不足以评估大语言模型(LLM)在多样化软件生态中的能力。为此,我们提出了一个多语言的问题解决基准Multi-SWE-bench,覆盖Java、TypeScript、JavaScript、Go、Rust、C和C++。它共包含1,632个高质量实例,由68位专家标注者从2,456个候选中精心标注而来,确保基准能够提供准确可靠的评估。基于Multi-SWE-bench,我们使用三种代表性方法(Agentless、SWE-agent和OpenHands)评估了一系列最先进的模型,并给出了带有关键经验洞察的全面分析。此外,我们发起了Multi-SWE-RL开源社区,旨在为问题解决任务构建大规模强化学习(RL)训练数据集。作为初始贡献,我们发布了一组涵盖七种编程语言的4,723个结构良好的实例,为该领域的RL研究打下坚实基础。更重要的是,我们开源了完整的数据生产流水线和详细教程,鼓励开源社区持续贡献并扩展数据集。我们期望Multi-SWE-bench和不断壮大的Multi-SWE-RL社区能成为推动RL充分发挥潜力的催化剂,使我们离AGI的曙光更近一步。


COIG-P:用于对齐人类价值观的高质量大规模中文偏好数据集

  • 标题: COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values

  • 作者: M-A-P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, Wangchunshu Zhou, Shenzhi Wang, Xingyuan Bu, Minghao Liu, Guoyin Wang, Ge Zhang, Chenghua Lin

  • 日期: 2025-04-07

  • ArXiv主页: https://arxiv.org/abs/2504.05535

  • 论文链接: https://arxiv.org/pdf/2504.05535

  • gitHub仓库: https://github.com/multimodal-art-projection/COIG-P

英文摘要

Aligning large language models (LLMs) with human preferences has achieved remarkable success. However, existing Chinese preference datasets are limited by small scale, narrow domain coverage, and lack of rigorous data validation. Additionally, the reliance on human annotators for instruction and response labeling significantly constrains the scalability of human preference datasets. To address these challenges, we design an LLM-based Chinese preference dataset annotation pipeline with no human intervention. Specifically, we crawled and carefully filtered 92k high-quality Chinese queries and employed 15 mainstream LLMs to generate and score chosen-rejected response pairs. Based on it, we introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset, comprises 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we trained a 8B-sized Chinese Reward Model (CRM) and meticulously constructed a Chinese Reward Benchmark (CRBench). Evaluation results based on AlignBench liu2024alignbenchbenchmarkingchinesealignment show that that COIG-P significantly outperforms other Chinese preference datasets, and it brings significant performance improvements ranging from 2% to 12% for the Qwen2/2.5 and Infinity-Instruct-3M-0625 model series, respectively. The results on CRBench demonstrate that our CRM has a strong and robust scoring ability. We apply it to filter chosen-rejected response pairs in a test split of COIG-P, and our experiments show that it is comparable to GPT-4o in identifying low-quality samples while maintaining efficiency and cost-effectiveness. Our codes and data are released in https://github.com/multimodal-art-projection/COIG-P.

中文摘要

让大语言模型(LLM)与人类偏好对齐已取得显著成功。然而,现有的中文偏好数据集普遍规模较小、领域覆盖窄,且缺乏严格的数据验证。此外,依赖人工标注指令和回复也严重限制了人类偏好数据集的可扩展性。为应对这些挑战,我们设计了一个无需人工干预、基于LLM的中文偏好数据集标注流水线。具体而言,我们爬取并仔细过滤了9.2万条高质量中文查询,并使用15个主流LLM来生成并打分“选中-拒绝”回复对。在此基础上,我们推出了COIG-P(Chinese Open Instruction Generalist - Preference),这是一个高质量、大规模的中文偏好数据集,包含100.9万个中文偏好对,覆盖对话、代码、数学、逻辑、小说和角色扮演6个不同领域。基于COIG-P,为降低使用LLM打分的开销,我们训练了一个8B规模的中文奖励模型(CRM),并精心构建了中文奖励基准(CRBench)。基于AlignBench的评估结果表明,COIG-P显著优于其他中文偏好数据集,分别为Qwen2/2.5和Infinity-Instruct-3M-0625系列模型带来2%到12%的显著性能提升。CRBench上的结果表明我们的CRM具有强大而稳健的打分能力。我们将其用于在COIG-P的测试划分中过滤“选中-拒绝”回复对,实验显示其在识别低质量样本方面可与GPT-4o相当,同时兼具效率和性价比。代码和数据发布于 https://github.com/multimodal-art-projection/COIG-P。


T1:小语言模型中用于测试时计算扩展的工具集成自我验证

  • 标题: T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models
  • 作者: Minki Kang, Jongwon Jeong, Jaewoong Cho
  • 日期: 2025-04-07
  • ArXiv主页: https://arxiv.org/abs/2504.04718
  • 论文链接: https://arxiv.org/pdf/2504.04718

英文摘要

Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving self-verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably self-verify their outputs under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 generalizes effectively to both mathematical (MATH500) and multi-domain knowledge-intensive tasks (MMLU-Pro). Our findings highlight the potential of tool integration to substantially improve the self-verification abilities of sLMs.

中文摘要

近期研究表明,测试时计算扩展能有效提升小语言模型(sLM)的性能。然而,先前的研究主要考察的是借助额外的更大模型作为验证器的测试时计算扩展,sLM的自我验证则鲜有探索。在这项工作中,我们研究sLM在测试时扩展下能否可靠地自我验证其输出。我们发现,即便从更大的验证器蒸馏知识,sLM在需要记忆的验证任务(如数值计算和事实核查)上仍然表现不佳。为解决这一限制,我们提出了工具集成自我验证(T1),把依赖记忆的验证步骤委托给外部工具(如代码解释器)。我们的理论分析表明,工具集成降低了对记忆的需求并改善了测试时扩展的表现。在MATH基准上的实验表明,借助T1,经测试时扩展的Llama-3.2 1B模型的表现超过了规模大得多的Llama-3.1 8B模型。此外,T1能有效泛化到数学任务(MATH500)和多领域知识密集型任务(MMLU-Pro)。我们的发现凸显了工具集成在大幅提升sLM自我验证能力方面的潜力。
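下面是"把依赖记忆的验证步骤交给代码解释器"这一思路的玩具示意(并非论文实现):小模型给出的算术结论不靠模型自己复核,而是转成一小段可执行代码来验证。其中的题目、答案与 `verify_with_interpreter` 均为演示用假设。

```python
def verify_with_interpreter(expression: str, claimed_answer: float) -> bool:
    """用受限的 eval 充当"代码解释器",核对算术结论是否成立。
    仅允许数字和基本运算符,避免执行任意代码。"""
    allowed = set("0123456789+-*/(). ")
    if not set(expression) <= allowed:
        raise ValueError("expression contains disallowed characters")
    return abs(eval(expression) - claimed_answer) < 1e-9

# 假设小模型对题目 "23 * 47 + 19" 给出的答案是 1090(算错了,正确答案是 1100)
candidate = {"expression": "23 * 47 + 19", "answer": 1090}
ok = verify_with_interpreter(candidate["expression"], candidate["answer"])
print("验证通过" if ok else "验证失败,应重新采样或修正答案")   # 输出:验证失败
```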


缺少前提加剧了过度思考:推理模型是否失去了批判性思维技能?

  • 标题: Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?

  • 作者: Chenrui Fan, Ming Li, Lichao Sun, Tianyi Zhou

  • 日期: 2025-04-09

  • ArXiv主页: https://arxiv.org/abs/2504.06514

  • 论文链接: https://arxiv.org/pdf/2504.06514

  • gitHub仓库: https://github.com/tianyi-lab/MiP-Overthinking

英文摘要

We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, drastically increases for ill-posed questions with missing premises (MiP), ending up with redundant and ineffective thinking. This newly introduced scenario exacerbates the general overthinking issue to a large extent, which we name as the MiP-Overthinking. Such failures are against the ``test-time scaling law’’ but have been widely observed on multiple datasets we curated with MiP, indicating the harm of cheap overthinking and a lack of critical thinking. Surprisingly, LLMs not specifically trained for reasoning exhibit much better performance on the MiP scenario, producing much shorter responses that quickly identify ill-posed queries. This implies a critical flaw of the current training recipe for reasoning LLMs, which does not encourage efficient thinking adequately, leading to the abuse of thinking patterns. To further investigate the reasons behind such failures, we conduct fine-grained analyses of the reasoning length, overthinking patterns, and location of critical thinking on different types of LLMs. Moreover, our extended ablation study reveals that the overthinking is contagious through the distillation of reasoning models’ responses. These results improve the understanding of overthinking and shed novel insights into mitigating the problem.

中文摘要

我们发现,无论是通过强化学习还是监督学习训练的推理LLM,在面对缺失前提(Missing Premise, MiP)的不适定问题时,回复长度都会急剧增加,最终陷入冗余而无效的思考。这一新引入的场景在很大程度上加剧了普遍存在的过度思考问题,我们将其称为MiP过度思考(MiP-Overthinking)。这类失败与“测试时扩展定律”相悖,却在我们用MiP构建的多个数据集上被广泛观察到,表明了廉价过度思考的危害以及批判性思维的缺失。令人意外的是,并未针对推理专门训练的LLM在MiP场景下表现反而好得多,它们给出短得多的回复,并能很快识别出问题不适定。这暗示当前推理LLM的训练配方存在一个关键缺陷:它没有充分鼓励高效思考,导致思考模式被滥用。为进一步探究这些失败背后的原因,我们对不同类型LLM的推理长度、过度思考模式以及批判性思考出现的位置进行了细粒度分析。此外,我们扩展的消融研究表明,过度思考会通过对推理模型回复的蒸馏而“传染”。这些结果加深了对过度思考的理解,并为缓解该问题提供了新的启示。


Less-to-More泛化:通过上下文内生成解锁更多可控性

  • 标题: Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
  • 作者: Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, Qian He
  • 日期: 2025-04-02
  • ArXiv主页: https://arxiv.org/abs/2504.02160
  • 论文链接: https://arxiv.org/pdf/2504.02160
  • 项目链接: https://bytedance.github.io/UNO/
  • gitHub仓库: https://github.com/bytedance/UNO

英文摘要

Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle this challenge. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding. It is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.

中文摘要

尽管主体驱动生成因应用广泛而在图像生成中被深入研究,它在数据可扩展性和主体可扩展性方面仍面临挑战。就前者而言,从构建单主体数据集到构建多主体数据集并加以扩展尤其困难;就后者而言,最新方法大多聚焦于单主体生成,难以应用于多主体场景。在这项研究中,我们提出了一条高度一致的数据合成流水线来应对这一挑战。该流水线利用扩散Transformer固有的上下文内生成能力,生成高一致性的多主体配对数据。此外,我们提出了UNO,它由渐进式跨模态对齐和通用旋转位置编码组成,是一个从文本生成图像模型迭代训练而来的、以多张图像为条件的主体到图像模型。大量实验表明,我们的方法在单主体和多主体驱动生成中都能达到高度一致性,同时保证可控性。


MM-IFEngine:迈向多模态指令遵循

  • 标题: MM-IFEngine: Towards Multimodal Instruction Following
  • 作者: Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang
  • 日期: 2025-04-10
  • ArXiv主页: https://arxiv.org/abs/2504.07957
  • 论文链接: https://arxiv.org/pdf/2504.07957
  • 项目链接: https://syuan03.github.io/MM-IFEngine/
  • gitHub仓库: https://github.com/SYuan03/MM-IFEngine

英文摘要

The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both compose-level constraints for output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2%), MIA (+7.6%), and IFEval (+12.3%). The full data and evaluation code will be released on https://github.com/SYuan03/MM-IFEngine.

中文摘要

指令遵循(Instruction Following, IF)能力衡量多模态大语言模型(MLLM)是否准确理解用户的要求并正确执行。现有的多模态指令遵循训练数据稀缺,相关基准只包含简单的原子化指令,而且对于要求精确输出约束的任务,评估策略不够精细。为此,我们提出MM-IFEngine,一条用于生成高质量图像-指令对的有效流水线。MM-IFEngine产出了大规模、多样且高质量的训练数据MM-IFInstruct-23k,适用于监督微调(SFT),并扩展为用于直接偏好优化(DPO)的MM-IFDPO-23k。我们进一步提出MM-IFEval,这是一个具有挑战性且多样化的多模态指令遵循基准,包含(1)针对输出回复的组合级约束和与输入图像绑定的感知级约束,以及(2)融合基于规则的评估与评审模型的完整评估流水线。我们开展了SFT和DPO实验,结果表明在MM-IFInstruct-23k和MM-IFDPO-23k上微调MLLM能在多个指令遵循基准上取得显著增益,例如MM-IFEval(+10.2%)、MIA(+7.6%)和IFEval(+12.3%)。完整数据和评估代码将发布在 https://github.com/SYuan03/MM-IFEngine。


URECA:任意区域的独特描述

  • 标题: URECA: Unique Region Caption Anything
  • 作者: Sangbeom Lim, Junwan Kim, Heeji Yoon, Jaewoo Jung, Seungryong Kim
  • 日期: 2025-04-07
  • ArXiv主页: https://arxiv.org/abs/2504.05305
  • 论文链接: https://arxiv.org/pdf/2504.05305
  • 项目链接: https://cvlab-kaist.github.io/URECA/
  • gitHub仓库: https://github.com/cvlab-kaist/URECA

英文摘要

Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multi-granularity, limiting their real-world applicability. To address the need for detailed region-level understanding, we introduce URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA maintains essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on URECA dataset and generalizes well to existing region-level captioning benchmarks.

中文摘要

区域级描述(region-level captioning)旨在为图像中的特定区域生成自然语言描述,同时突出其区别性特征。然而,现有方法难以在多粒度下生成独特的描述,限制了其实际应用。为满足细粒度区域理解的需求,我们构建了URECA数据集,这是一个为多粒度区域描述量身定制的大规模数据集。不同于主要关注显著物体的已有数据集,URECA数据集通过纳入多样的物体、部件和背景元素,确保区域与描述之间唯一且一致的对应关系。其核心是一条分阶段的数据构建流水线,每个阶段逐步细化区域选择和描述生成。通过在每个阶段使用多模态大语言模型(MLLM),该流水线能生成独特且契合上下文的描述,并提升准确性和语义多样性。在此数据集的基础上,我们提出了URECA,一个旨在有效编码多粒度区域的新型描述模型。URECA通过对现有MLLM进行简单而有效的改动,保留了位置和形状等关键空间属性,从而生成细粒度且语义丰富的区域描述。我们的方法引入了动态掩码建模和高分辨率掩码编码器以增强描述的独特性。实验表明,URECA在URECA数据集上达到了最先进的性能,并能很好地泛化到现有的区域级描述基准。


FantasyTalking:通过连贯运动合成生成逼真的说话人肖像

  • 标题: FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis
  • 作者: Mengchao Wang, Qiang Wang, Fan Jiang, Yaqi Fan, Yunpeng Zhang, Yonggang Qi, Kun Zhao, Mu Xu
  • 日期: 2025-04-07
  • ArXiv主页: https://arxiv.org/abs/2504.04842
  • 论文链接: https://arxiv.org/pdf/2504.04842
  • 项目链接: https://fantasy-amap.github.io/fantasy-talking/
  • gitHub仓库: https://github.com/Fantasy-AMAP/fantasy-talking

英文摘要

Creating a realistic animatable avatar from a single static portrait remains challenging. Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. At the core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we employ a clip-level training scheme to establish coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals. To preserve identity without compromising motion flexibility, we replace the commonly used reference network with a facial-focused cross-attention module that effectively maintains facial consistency throughout the video. Furthermore, we integrate a motion intensity modulation module that explicitly controls expression and body motion intensity, enabling controllable manipulation of portrait movements beyond mere lip motion. Extensive experimental results show that our proposed approach achieves higher quality with better realism, coherence, motion intensity, and identity preservation. Ours project page: https://fantasy-amap.github.io/fantasy-talking/.

中文摘要

从单个静态肖像中创建一个可逼真的动画头像仍然具有挑战性。现有的方法通常难以捕获微妙的面部表情,相关的全球身体运动和动态背景。为了解决这些局限性,我们提出了一个新型框架,该框架利用了验证的视频扩散变压器模型来生成具有可控运动动力学的高保真性,连贯的说话肖像。我们工作的核心是双阶段的视听策略。在第一阶段,我们采用剪辑级训练方案来建立连贯的全局运动,通过整个场景中的音频驱动动态,包括参考肖像,上下文对象和背景。在第二阶段,我们使用唇部跟踪掩码在框架级别提炼唇部运动,从而确保与音频信号的精确同步。为了保持身份而不损害运动灵活性,我们用面部注重的跨意识模块代替了常用的参考网络,该模块在整个视频中有效地保持面部一致性。此外,我们集成了一个运动强度调制模块,该模块明确控制了表达和身体运动强度,从​​而可以控制对超出唇部运动的肖像运动的控制。广泛的实验结果表明,我们提出的方法可以通过更好的现实主义,连贯性,运动强度和身份保存来达到更高的质量。我们的项目页面:https://fantasy-amap.github.io/fantasy-talking/。


MegaMath:突破开放数学语料库的极限

  • 标题: MegaMath: Pushing the Limits of Open Math Corpora
  • 作者: Fan Zhou, Zengzhi Wang, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, Eric P. Xing
  • 日期: 2025-04-03
  • ArXiv主页: https://arxiv.org/abs/2504.02807
  • 论文链接: https://arxiv.org/pdf/2504.02807
  • 项目链接: https://huggingface.co/datasets/LLM360/MegaMath
  • gitHub仓库: https://github.com/LLM360/MegaMath

英文摘要

Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through following practices: (1) Revisiting web data: We re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fasttext-based filtering and deduplication, all for acquiring higher-quality data on the Internet. (2) Recalling Math-related code data: We identified high quality math-related code from large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring Synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens with the largest quantity and top quality among existing open math pre-training datasets.

中文摘要

数学推理是人类智能的基石,也是大型语言模型(LLMS)高级功能的关键基准。但是,研究社区仍然缺乏针对以数学为中心的LLM预培训的开放,高质量的语料库。我们提出了Megamath,这是一个通过以下实践策划的开放数据集,该数据集是通过以下实践策划的:(1)重新审视Web数据:我们从常见的爬网中重新提取了数学文档,并使用以数学为导向的HTML优化,基于FastText的过滤和删除来获取Internet上的较高质量数据。(2)召回与数学相关的代码数据:我们从大型代码培训语料库(Stack-V2)中确定了与数学相关的高质量代码,从而进一步增强了数据多样性。(3)探索综合数据:我们合成了QA风格的文本,与数学相关的代码以及从Web数据或代码数据中进行交织的文本代码块。通过整合这些策略并通过广泛的消融验证其有效性,Megamath在现有的开放数学预训练数据集中提供了371B代币,其数量和最高质量。


量化会损害推理吗?一项关于量化推理模型的实证研究

  • 标题: Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
  • 作者: Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, Lu Hou
  • 日期: 2025-04-07
  • ArXiv主页: https://arxiv.org/abs/2504.04823
  • 论文链接: https://arxiv.org/pdf/2504.04823

英文摘要

Recent advancements in reasoning language models have demonstrated remarkable performance in complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this study, we conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families ranging from 1.5B to 70B parameters, and QwQ-32B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling the model sizes or reasoning steps can effectively enhance the performance. All quantized models and codes will be open-sourced in https://github.com/ruikangliu/Quantized-Reasoning-Models.

中文摘要

推理语言模型的最新进展在复杂任务中表现出色,但其延长的思维链推理过程增加了推理开销。虽然量化已被广泛用于降低大语言模型的推理成本,但它对推理模型的影响仍缺乏研究。在这项研究中,我们对量化推理模型开展了首个系统性研究,评估对象包括开源的DeepSeek-R1-Distilled Qwen与LLaMA系列(参数规模从1.5B到70B)以及QwQ-32B。我们的研究涵盖使用最先进算法、在不同位宽下进行的权重量化、KV缓存量化和激活量化,并在数学(AIME、MATH-500)、科学(GPQA)和编程(LiveCodeBench)推理基准上进行了广泛评估。我们发现,W8A8或W4A16量化可以做到无损,而更低的位宽会带来显著的精度风险。我们进一步确定模型大小、模型来源和任务难度是影响性能的关键因素。与预期相反,量化模型并未表现出输出长度增加的现象。此外,有策略地扩大模型规模或推理步数可以有效提升性能。所有量化模型和代码将在 https://github.com/ruikangliu/Quantized-Reasoning-Models 开源。
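下面用NumPy给出权重量化(类似W4A16中"W4"部分)的一个最小示意:对权重矩阵做逐输出通道的对称round-to-nearest量化到4比特整数再反量化,并查看由此引入的误差。这只是通用的量化示例,并非论文所评测的具体算法。

```python
import numpy as np

def quantize_weights_per_channel(W: np.ndarray, bits: int = 4):
    """逐行(输出通道)对称量化:返回整数权重、缩放因子和反量化后的近似权重。"""
    qmax = 2 ** (bits - 1) - 1                       # int4 时为 7
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale, q * scale                       # 反量化 = 整数权重 × 缩放因子

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)  # 假设的一层权重
q, scale, W_hat = quantize_weights_per_channel(W, bits=4)

err = np.abs(W - W_hat).mean()
print("整数权重范围:", q.min(), q.max(), " 平均量化误差:", round(float(err), 4))
```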


用于评估条件图像生成的统一代理框架

  • 标题: A Unified Agentic Framework for Evaluating Conditional Image Generation
  • 作者: Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
  • 日期: 2025-04-09
  • ArXiv主页: https://arxiv.org/abs/2504.07046
  • 论文链接: https://arxiv.org/pdf/2504.07046
  • 项目链接: https://x.com/wangly0229/status/1910317936042295737?t=PR7EH5eB_NgTFgSUKZbSvA&s=19
  • gitHub仓库: https://github.com/HITsz-TMG/Agentic-CIGEval

英文摘要

Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval’s capability in identifying subtle issues related to subject consistency and adherence to control guidance, indicating its great potential for automating evaluation of image generation tasks with human-level reliability.

中文摘要

条件图像生成因其个性化内容的能力而备受关注,但该领域在构建与任务无关、可靠且可解释的评估指标方面仍面临挑战。本文提出CIGEval,一个用于全面评估条件图像生成任务的统一智能体框架。CIGEval以大型多模态模型(LMM)为核心,集成了多功能工具箱,并建立了细粒度的评估框架。此外,我们还合成了用于微调的评估轨迹,使较小的LMM也能自主选择合适的工具并基于工具输出进行细致分析。在七个主流条件图像生成任务上的实验表明,CIGEval(GPT-4o版本)与人工评估的相关性高达0.4625,接近标注者之间0.47的相关性。此外,仅用2.3K条训练轨迹、基于7B开源LMM实现时,CIGEval也超越了此前基于GPT-4o的最先进方法。针对GPT-4o图像生成的案例研究表明,CIGEval能够识别与主体一致性、是否遵循控制指引相关的细微问题,展现了其以接近人类的可靠性自动评估图像生成任务的巨大潜力。

