【论文速递】2025年第20周(May-11-17)(Robotics/Embodied AI/LLM)
中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- Seed1.5-VL技术报告
- 英文摘要
- 中文摘要
- MiniMax-Speech:带有可学习说话人编码器的内在零样本文本到语音
- 英文摘要
- 中文摘要
- 超越“啊哈!”:在大型推理模型中朝着系统的元能力对齐
- 英文摘要
- 中文摘要
- BLIP3-o:完全开放的统一多模态模型家族——架构、训练与数据集
- 英文摘要
- 中文摘要
- 语言模型的并行缩放定律
- 英文摘要
- 中文摘要
- MiMo:解锁语言模型的推理潜力——从预训练到后训练
- 英文摘要
- 中文摘要
- 系统提示通过元学习优化
- 英文摘要
- 中文摘要
- 对DeepSeek-V3的见解:对AI架构的硬件的扩展挑战和反思
- 英文摘要
- 中文摘要
- Bielik v3 Small:技术报告
- 英文摘要
- 中文摘要
- Step1X-3D:迈向高保真、可控的纹理化3D资产生成
- 英文摘要
- 中文摘要
- Bielik 11B V2技术报告
- 英文摘要
- 中文摘要
- MathCoder-VL:桥接视觉与代码以增强多模态数学推理
- 英文摘要
- 中文摘要
- 推理模型中的同伴学习
- 英文摘要
- 中文摘要
- DeCLIP:面向开放词汇密集感知的解耦学习
- 英文摘要
- 中文摘要
- 统一连续生成模型
- 英文摘要
- 中文摘要
- OpenThinkIMG:通过视觉工具强化学习,学会用图像思考
- 英文摘要
- 中文摘要
- Lightlab:通过扩散模型控制图像中的光源
- 英文摘要
- 中文摘要
- WorldPM:扩展人类偏好建模
- 英文摘要
- 中文摘要
- DanceGRPO:在视觉生成上释放GRPO
- 英文摘要
- 中文摘要
- Skywork-VL Reward:面向多模态理解与推理的有效奖励模型
- 英文摘要
- 中文摘要
- REFINE-AF:一个任务无关的框架,通过来自自动反馈的强化学习,以自生成指令对齐语言模型
- 英文摘要
- 中文摘要
- AttentionInfluence:利用注意力头影响进行弱到强的预训练数据选择
- 英文摘要
- 中文摘要
- COT百科全书:分析,预测和控制推理模型将如何思考
- 英文摘要
- 中文摘要
- Marigold:基于扩散的图像生成器在图像分析任务上的低成本适配
- 英文摘要
- 中文摘要
- UniVLA:以任务为中心的潜在动作,学会在任何地方行动
- 英文摘要
- 中文摘要
Seed1.5-VL技术报告
- 标题: Seed1.5-VL Technical Report
- 作者: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, Weiwei Liu, Wenqian Wang, Xianhan Zeng, Xiao Liu, Xiaobo Qin, Xiaohan Ding, Xiaojun Xiao, Xiaoying Zhang, Xuanwei Zhang, Xuehan Xiong, Yanghua Peng, Yangrui Chen, Yanwei Li, Yanxu Hu, Yi Lin, Yiyuan Hu, Yiyuan Zhang, Youbin Wu, Yu Li, Yudong Liu, Yue Ling, Yujia Qin, Zanbo Wang, Zhiwu He, Aoxue Zhang, Bairen Yi, Bencheng Liao, Can Huang, Can Zhang, Chaorui Deng, Chaoyi Deng, Cheng Lin, Cheng Yuan, Chenggang Li, Chenhui Gou, Chenwei Lou, Chengzhi Wei, Chundian Liu, Chunyuan Li, Deyao Zhu, Donghong Zhong, Feng Li, Feng Zhang, Gang Wu, Guodong Li, Guohong Xiao, Haibin Lin, Haihua Yang, Haoming Wang, Heng Ji, Hongxiang Hao, Hui Shen, Huixia Li, Jiahao Li, Jialong Wu, Jianhua Zhu, Jianpeng Jiao, Jiashi Feng, Jiaze Chen, Jianhui Duan, Jihao Liu, Jin Zeng, Jingqun Tang, Jingyu Sun, Joya Chen, Jun Long, Junda Feng, Junfeng Zhan, Junjie Fang, Junting Lu, Kai Hua, Kai Liu, Kai Shen, Kaiyuan Zhang, Ke Shen, Ke Wang, Keyu Pan, Kun Zhang, Kunchang Li, Lanxin Li, Lei Li, Lei Shi, Li Han, Liang Xiang, Liangqiang Chen, Lin Chen, Lin Li, Lin Yan, Liying Chi, Longxiang Liu, Mengfei Du, Mingxuan Wang, Ningxin Pan, Peibin Chen, Pengfei Chen, Pengfei Wu, Qingqing Yuan, Qingyao Shuai, Qiuyan Tao, Renjie Zheng, Renrui Zhang, Ru Zhang, Rui Wang, Rui Yang, Rui Zhao, Shaoqiang Xu, Shihao Liang, Shipeng Yan, Shu Zhong, Shuaishuai Cao, Shuangzhi Wu, Shufan Liu, Shuhan Chang, Songhua Cai, Tenglong Ao, Tianhao Yang, Tingting Zhang, Wanjun Zhong, Wei Jia, Wei Weng, Weihao Yu, Wenhao Huang, Wenjia Zhu, Wenli Yang, Wenzhi Wang, Xiang Long, XiangRui Yin, Xiao Li, Xiaolei Zhu, Xiaoying Jia, Xijin Zhang, Xin Liu, Xinchen Zhang, Xinyu Yang, Xiongcai Luo, Xiuli Chen, Xuantong Zhong, Xuefeng Xiao, Xujing Li, Yan Wu, Yawei Wen, Yifan Du, Yihao Zhang, Yining Ye, Yonghui Wu, Yu Liu, Yu Yue, Yufeng Zhou, Yufeng Yuan, Yuhang Xu, Yuhong Yang, Yun Zhang, Yunhao Fang, Yuntao Li, Yurui Ren, Yuwen Xiong, Zehua Hong, Zehua Wang, Zewei Sun, Zeyu Wang, Zhao Cai, Zhaoyue Zha, Zhecheng An, Zhehui Zhao, Zhengzhuo Xu, Zhipeng Chen, Zhiyong Wu, Zhuofan Zheng, Zihao Wang, Zilong Huang, Ziyu Zhu, Zuquan Song
- 日期: 2025-05-11
- ArXiv主页: https://arxiv.org/abs/2505.07062
- 论文链接: https://arxiv.org/pdf/2505.07062
- 项目链接: https://seed.bytedance.com/en/tech/seed1_5_vl
- gitHub仓库: https://github.com/ByteDance-Seed/Seed1.5-VL
英文摘要
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)
中文摘要
我们提出Seed1.5-VL,一个旨在推进通用多模态理解与推理的视觉-语言基础模型。Seed1.5-VL由一个5.32亿参数的视觉编码器和一个激活参数为200亿的混合专家(MoE)大语言模型组成。尽管架构相对紧凑,它在广泛的公开VLM基准和内部评估套件上表现强劲,在60个公开基准中的38个上取得了最先进的性能。此外,在GUI控制和游戏等以智能体为中心的任务中,Seed1.5-VL超越了包括OpenAI CUA和Claude 3.7在内的领先多模态系统。除了视觉与视频理解,它还展现出强大的推理能力,使其在视觉谜题等多模态推理挑战中尤为有效。我们相信这些能力将支撑各类任务上更广泛的应用。在本报告中,我们主要全面回顾了构建Seed1.5-VL在模型设计、数据构建与各阶段训练中的经验,希望能启发后续研究。Seed1.5-VL现已可在 https://www.volcengine.com/ 访问(火山引擎模型ID:doubao-1-5-thinking-vision-pro-250428)。
MiniMax-Speech:带有可学习说话人编码器的内在零样本文本到语音
- 标题: MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
- 作者: Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, Yucen He
- 日期: 2025-05-12
- ArXiv主页: https://arxiv.org/abs/2505.07916
- 论文链接: https://arxiv.org/pdf/2505.07916
- 项目链接: https://minimax-ai.github.io/tts_tech_report/
英文摘要
We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluations metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text to voice (T2V) by synthesizing timbre features directly from text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.
中文摘要
我们介绍MiniMax-Speech,一种基于自回归Transformer的文本到语音(TTS)模型,可生成高质量语音。其关键创新是可学习的说话人编码器,它能在无需转录文本的情况下从参考音频中提取音色特征。这使MiniMax-Speech能够以零样本方式生成音色与参考一致、表现力很强的语音,同时也支持与参考声音相似度极高的单样本(one-shot)语音克隆。此外,所提出的Flow-VAE进一步提升了合成音频的整体质量。我们的模型支持32种语言,并在多项客观与主观评估指标上表现出色。值得注意的是,它在客观语音克隆指标(词错误率和说话人相似度)上取得了最先进(SOTA)的结果,并在公开的TTS Arena排行榜上位居榜首。得益于说话人编码器给出的鲁棒且解耦的表示,MiniMax-Speech的另一关键优势是无需修改基础模型即可扩展,支持多种应用,例如:通过LoRA实现任意语音情感控制;通过直接从文本描述合成音色特征实现文本到声音(T2V);以及通过额外数据微调音色特征的专业语音克隆(PVC)。欢迎读者访问 https://minimax-ai.github.io/tts_tech_report 查看更多示例。
超越“啊哈!”:在大型推理模型中朝着系统的元能力对齐
- 标题: Beyond ‘Aha!’: Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
- 作者: Zhiyuan Hu, Yibo Wang, Hanze Dong, Yuhui Xu, Amrita Saha, Caiming Xiong, Bryan Hooi, Junnan Li
- 日期: 2025-05-15
- ArXiv主页: https://arxiv.org/abs/2505.10554
- 论文链接: https://arxiv.org/pdf/2505.10554
- gitHub仓库: https://github.com/zhiyuanhubj/Meta-Ability-Alignment
英文摘要
Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification phenomena often referred to as the model’s “aha moment”. However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs’ reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental “aha moments”. Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three stage-pipeline individual alignment, parameter-space merging, and domain-specific reinforcement learning, boosting performance by over 10% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional 2% average gain in the performance ceiling across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment
中文摘要
大型推理模型(LRM)已经具备长思维链推理的潜在能力。先前工作表明,基于结果的强化学习(RL)可以偶然引出自我纠正、回溯和验证等高级推理行为,这类现象常被称为模型的"啊哈时刻"。然而,这些涌现行为出现的时机和一致性仍不可预测、不可控,限制了LRM推理能力的可扩展性与可靠性。为克服这些限制,我们不再依赖提示词和偶然的"啊哈时刻",而是使用自动生成、可自我验证的任务,将模型与三种元能力显式对齐:演绎、归纳和溯因。我们的三阶段流水线——各元能力单独对齐、参数空间合并、领域特定强化学习——相对于指令微调基线将性能提升超过10%。此外,从对齐后的检查点出发进行领域特定RL,在数学、编程和科学基准上的性能上限平均再提升2%,表明显式的元能力对齐为推理提供了可扩展且可靠的基础。代码见:https://github.com/zhiyuanhubj/Meta-Ability-Alignment
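论文三阶段流水线中的"参数空间合并"一步,概念上就是对分别对齐了演绎、归纳、溯因能力的检查点做加权平均。下面给出一个最小示意(仅为基于摘要的假设性草图,等权重与数据组织方式均为示例,并非官方实现):

```python
import torch

def merge_checkpoints(state_dicts, weights):
    """按给定权重在参数空间对多个同构检查点做加权平均(示意实现)。"""
    assert len(state_dicts) == len(weights)
    total = sum(weights)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for sd, w in zip(state_dicts, weights)) / total
    return merged

# 用法示意:这里用三个随机的小 state_dict 代替"演绎/归纳/溯因"三个对齐后的检查点
sds = [{"w": torch.randn(4, 4)} for _ in range(3)]
merged = merge_checkpoints(sds, weights=[1.0, 1.0, 1.0])  # 等权合并,仅作演示
print(merged["w"].shape)
```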
BLIP3-o:完全开放的统一多模态模型家族——架构、训练与数据集
- 标题: BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
- 作者: Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu
- 日期: 2025-05-14
- ArXiv主页: https://arxiv.org/abs/2505.09568
- 论文链接: https://arxiv.org/pdf/2505.09568
- gitHub仓库: https://github.com/JiuhaiChen/BLIP3o
英文摘要
Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models-first training on image understanding and subsequently on image generation-offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset BLIP3o-60k for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.
中文摘要
在近期的多模态模型研究中,统一图像理解与图像生成日益受到关注。尽管图像理解的设计选择已被广泛研究,但带图像生成的统一框架的最优模型架构与训练方案仍缺乏深入探索。鉴于自回归和扩散模型在高质量生成与可扩展性方面的巨大潜力,我们对其在统一多模态场景下的使用进行了系统研究,重点关注图像表示、建模目标与训练策略。基于这些研究,我们提出一种新方法:不同于传统基于VAE的表示,采用扩散Transformer来生成语义丰富的CLIP图像特征。该设计既提升了训练效率,又改善了生成质量。此外,我们证明统一模型的顺序预训练策略——先训练图像理解、再训练图像生成——具有实际优势,能在发展出强图像生成能力的同时保持图像理解能力。最后,我们用覆盖各种场景、物体、人类手势等的多样化字幕提示GPT-4o,精心构建了高质量的图像生成指令微调数据集BLIP3o-60k。基于上述创新的模型设计、训练方案和数据集,我们开发了BLIP3-o,一套最先进的统一多模态模型。BLIP3-o在涵盖图像理解与图像生成任务的大多数主流基准上均取得了优越性能。为促进后续研究,我们完全开源了模型,包括代码、模型权重、训练脚本以及预训练和指令微调数据集。
语言模型的并行缩放定律
- 标题: Parallel Scaling Law for Language Models
- 作者: Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu
- 日期: 2025-05-15
- ArXiv主页: https://arxiv.org/abs/2505.10475
- 论文链接: https://arxiv.org/pdf/2505.10475
- gitHub仓库: https://github.com/QwenLM/ParScale
英文摘要
It is commonly believed that scaling language models should commit a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce the third and more inference-efficient scaling paradigm: increasing the model’s parallel computation during both training and inference time. We apply P diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the P outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with P parallel streams is similar to scaling the parameters by O(log P) while showing superior inference efficiency. For example, ParScale can use up to 22times less memory increase and 6times less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective for the role of computation in machine learning.
中文摘要
人们通常认为,扩展语言模型必须付出可观的空间或时间代价,即增加参数量(参数缩放)或输出token数(推理时缩放)。我们提出第三种、推理上更高效的缩放范式:在训练和推理阶段同时增加模型的并行计算量。我们对输入施加P个多样且可学习的变换,并行执行模型的前向传播,再对P个输出进行动态聚合。这种方法即并行缩放(ParScale),通过复用已有参数来扩展并行计算,可应用于任何模型结构、优化过程、数据或任务。我们在理论上提出了新的缩放定律,并通过大规模预训练加以验证:具有P条并行流的模型近似等价于将参数量按O(log P)缩放,同时推理效率更优。例如,在达到相同性能提升的前提下,与参数缩放相比,ParScale的内存增加最多可减少22倍、延迟增加减少6倍。它还可以在少量token上做后训练,把现成的预训练模型改造成并行缩放版本,进一步降低训练预算。我们发现的这一新缩放定律有望促进在低资源场景中部署更强的模型,并为计算在机器学习中的作用提供另一种视角。
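ParScale 的核心流程是:对输入施加 P 个多样且可学习的变换、并行前向传播、再动态聚合 P 个输出。下面是一个 PyTorch 草图,其中"用可学习前缀做输入变换、用可学习门控做 softmax 聚合"只是为说明思路所做的假设,并非论文的确切实现:

```python
import torch
import torch.nn as nn

class ParallelScaledLM(nn.Module):
    """ParScale 思想的极简示意:P 个可学习输入变换 + 并行前向 + 动态聚合。"""
    def __init__(self, base_lm, hidden_dim, num_streams=4, prefix_len=8):
        super().__init__()
        self.base_lm = base_lm          # 假设接口:输入嵌入序列 [B, L, H],返回 logits [B, L, V]
        self.P = num_streams
        # 每个流一个可学习前缀,充当"多样化且可学习的输入变换"(假设性设计)
        self.prefixes = nn.Parameter(torch.randn(num_streams, prefix_len, hidden_dim) * 0.02)
        self.gate = nn.Linear(hidden_dim, num_streams)   # 产生动态聚合权重

    def forward(self, token_embeds):                     # token_embeds: [B, T, H]
        outs = []
        for p in range(self.P):
            prefix = self.prefixes[p].unsqueeze(0).expand(token_embeds.size(0), -1, -1)
            x = torch.cat([prefix, token_embeds], dim=1)
            outs.append(self.base_lm(x)[:, prefix.size(1):])   # 丢弃前缀位置的输出
        stacked = torch.stack(outs, dim=0)                      # [P, B, T, V]
        w = torch.softmax(self.gate(token_embeds).mean(dim=1), dim=-1)  # [B, P]
        return torch.einsum("pbtv,bp->btv", stacked, w)

# 用法示意:用一个线性层充当"返回 logits 的骨干",仅验证张量形状
toy_lm = nn.Linear(32, 100)
model = ParallelScaledLM(toy_lm, hidden_dim=32, num_streams=4)
print(model(torch.randn(2, 10, 32)).shape)   # torch.Size([2, 10, 100])
```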
MiMo:解锁语言模型的推理潜力——从预训练到后训练
- 标题: MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining
- 作者: Xiaomi LLM-Core Team, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, QingKai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue
- 日期: 2025-05-12
- ArXiv主页: https://arxiv.org/abs/2505.07608
- 论文链接: https://arxiv.org/pdf/2505.07608
- gitHub仓库: https://github.com/XiaomiMiMo/MiMo
英文摘要
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model’s reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.
中文摘要
我们提出MiMo-7B,一个为推理任务而生的大语言模型,在预训练与后训练两个阶段均做了优化。在预训练阶段,我们改进数据预处理流程,并采用三阶段数据混合策略以增强基座模型的推理潜力。MiMo-7B-Base在25万亿token上预训练,并引入额外的多token预测(MTP)目标以提升性能并加速推理。在后训练阶段,我们构建了包含13万道可验证数学与编程题的强化学习数据集,引入由测试难度驱动的代码奖励方案以缓解稀疏奖励问题,并采用策略性数据重采样来稳定训练。大量评估表明,MiMo-7B-Base具有出色的推理潜力,甚至优于更大的32B模型。最终经RL调优的MiMo-7B-RL在数学、代码和通用推理任务上取得了优越性能,超过了OpenAI o1-mini。模型检查点见 https://github.com/xiaomimimo/MiMo。
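摘要提到 MiMo-7B-Base 在预训练中附加了多 token 预测(MTP)目标。下面用若干假设的额外线性预测头示意这一目标的损失形式(头的结构与数量只是示例,与 MiMo 的具体实现无关):

```python
import torch
import torch.nn.functional as F

def multi_token_prediction_loss(hidden, heads, input_ids, max_offset=2):
    """hidden: [B, T, H] 主干隐状态; heads[k-1]: 预测偏移 k 的线性头(假设,将 H 映射到词表)。
    对每个偏移 k,用位置 t 的隐状态预测位置 t+k 的 token,并累加交叉熵。"""
    total = 0.0
    B, T, _ = hidden.shape
    for k in range(1, max_offset + 1):
        logits = heads[k - 1](hidden[:, : T - k])            # [B, T-k, V]
        targets = input_ids[:, k:]                            # [B, T-k]
        total = total + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )
    return total / max_offset

# 用法示意(随机数据)
B, T, H, V = 2, 16, 32, 100
heads = [torch.nn.Linear(H, V) for _ in range(2)]
print(multi_token_prediction_loss(torch.randn(B, T, H), heads, torch.randint(0, V, (B, T))))
```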
系统提示通过元学习优化
- 标题: System Prompt Optimization with Meta-Learning
- 作者: Yumin Choi, Jinheon Baek, Sung Ju Hwang
- 日期: 2025-05-14
- ArXiv主页: https://arxiv.org/abs/2505.09666
- 论文链接: https://arxiv.org/pdf/2505.09666
- gitHub仓库: https://github.com/Dozi01/MetaSPO
英文摘要
Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.
中文摘要
大型语言模型(LLM)展现出卓越能力,而优化其输入提示对充分发挥性能至关重要。然而,LLM提示由任务无关的系统提示和任务相关的用户提示组成,现有的提示优化工作主要针对特定查询或任务的用户提示,很大程度上忽视了系统提示——它一旦优化好,便可跨不同任务和领域复用。受此启发,我们提出双层系统提示优化这一新问题,其目标是设计对多样用户提示鲁棒、且可迁移到未见任务的系统提示。为解决该问题,我们提出一个元学习框架:在多个数据集的各种用户提示上优化系统提示(即对系统提示进行元学习),同时以迭代方式更新用户提示,确保二者协同。我们在跨越5个不同领域的14个未见数据集上进行实验,结果表明我们的方法得到的系统提示能有效泛化到多样的用户提示。此外,我们发现优化后的系统提示即使面对未见任务也能快速适应,测试时的用户提示只需更少的优化步骤就能获得更好的性能。
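下面用一个高度简化的循环示意摘要所述的双层元学习:内层在固定系统提示下改进各任务自己的用户提示,外层汇总各任务反馈、更新共享的系统提示。其中 llm_rewrite 是假设的占位函数(实际应调用 LLM 重写提示),并非论文的官方接口:

```python
def llm_rewrite(prompt_to_improve, context):
    """占位实现:真实场景中应调用 LLM,依据 context 中的示例/反馈重写 prompt(假设)。"""
    return prompt_to_improve  # 仅作演示,原样返回

def meta_optimize_system_prompt(tasks, system_prompt, rounds=5):
    """双层优化示意:内层更新各任务的用户提示,外层更新共享的系统提示。"""
    for _ in range(rounds):
        for task in tasks:  # 内层:固定系统提示,改进每个任务自己的用户提示
            task["user_prompt"] = llm_rewrite(
                task["user_prompt"],
                {"system": system_prompt, "examples": task["examples"]},
            )
        # 外层:基于各任务的表现反馈,更新对多样用户提示都鲁棒的系统提示
        feedback = [{"task": t["name"], "prompt": t["user_prompt"]} for t in tasks]
        system_prompt = llm_rewrite(system_prompt, {"feedback": feedback})
    return system_prompt

# 用法示意
tasks = [{"name": "qa", "user_prompt": "Answer the question.", "examples": []}]
print(meta_optimize_system_prompt(tasks, "You are a helpful assistant.", rounds=2))
```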
对DeepSeek-V3的见解:对AI架构的硬件的扩展挑战和反思
- 标题: Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
- 作者: Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y. X. Wei
- 日期: 2025-05-14
- ArXiv主页: https://arxiv.org/abs/2505.09343
- 论文链接: https://arxiv.org/pdf/2505.09343
英文摘要
The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3’s development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.
中文摘要
大型语言模型(LLM)的快速扩展暴露了当前硬件架构的关键限制,包括内存容量、计算效率和互连带宽方面的约束。DeepSeek-V3在2048块NVIDIA H800 GPU上训练,展示了硬件感知的模型协同设计如何有效应对这些挑战,实现低成本的大规模训练与推理。本文深入分析了DeepSeek-V3/R1模型架构及其AI基础设施,重点介绍若干关键创新:提升内存效率的多头潜在注意力(MLA)、优化计算-通信权衡的专家混合(MoE)架构、充分释放硬件能力的FP8混合精度训练,以及最小化集群级网络开销的多平面网络拓扑。基于DeepSeek-V3研发过程中遇到的硬件瓶颈,我们与学界和业界同行更广泛地探讨了未来硬件的可能方向,包括精确的低精度计算单元、纵向扩展与横向扩展的融合,以及低延迟通信互连(fabric)方面的创新。这些洞见凸显了硬件与模型协同设计在满足AI工作负载日益增长的需求中的关键作用,为下一代AI系统的创新提供了切实可行的蓝图。
Bielik v3 Small:技术报告
- 标题: Bielik v3 Small: Technical Report
- 作者: Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej
- 日期: 2025-05-05
- ArXiv主页: https://arxiv.org/abs/2505.02550
- 论文链接: https://arxiv.org/pdf/2505.02550
- 项目链接: https://bielik.ai/
- gitHub仓库: https://github.com/speakleash/speakleash
英文摘要
We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts while requiring substantially fewer computational resources. Our approach incorporates several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and Adaptive Learning Rate that dynamically adjusts based on training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, these models excel across multiple benchmarks, including the Open PL LLM Leaderboard, Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and Polish Medical Leaderboard. The 4.5B parameter model achieves results competitive with models 2-3 times its size, while the 1.5B model delivers strong performance despite its extremely compact profile. These advances establish new benchmarks for parameter-efficient language modeling in less-represented languages, making high-quality Polish language AI more accessible for resource-constrained applications.
中文摘要
我们介绍Bielik v3,一系列针对波兰语处理优化的参数高效生成式文本模型(1.5B和4.5B)。这些模型表明,经过良好优化的较小架构可以达到与大得多的模型相当的性能,同时所需计算资源显著更少。我们的方法包含若干关键创新:显著提升token效率的定制波兰语分词器(APT4)、在不同指令类型间平衡学习的加权指令交叉熵损失,以及根据训练进度动态调整的自适应学习率。这些模型在精心整理、覆盖3.03亿份文档、共2920亿token的语料上训练,在多个基准上表现突出,包括Open PL LLM排行榜、复杂波兰语文本理解基准、波兰语EQ-Bench和波兰语医学排行榜。4.5B参数模型的结果可与其2到3倍规模的模型相竞争,而1.5B模型尽管极为紧凑,也有很强的表现。这些进展为代表性不足语言的参数高效语言建模树立了新基准,让高质量的波兰语AI更易用于资源受限的应用。
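Bielik v3(以及后文的 Bielik 11B v2)提到的"加权指令交叉熵损失",其思想是按样本质量给交叉熵加权。下面是一个按样本加权的最小示意(质量权重如何得到、加权粒度等细节均为假设):

```python
import torch
import torch.nn.functional as F

def weighted_instruction_ce(logits, labels, sample_weights, ignore_index=-100):
    """logits: [B, T, V], labels: [B, T], sample_weights: [B](每条样本的质量权重,假设已给定)。
    先求每条样本的平均 token 交叉熵,再按质量权重做加权平均(示意实现)。"""
    B, T, V = logits.shape
    tok_loss = F.cross_entropy(
        logits.reshape(-1, V), labels.reshape(-1),
        ignore_index=ignore_index, reduction="none",
    ).reshape(B, T)
    mask = (labels != ignore_index).float()
    per_sample = (tok_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return (per_sample * sample_weights).sum() / sample_weights.sum()

# 用法示意(随机数据,第二条样本质量权重较低)
logits = torch.randn(2, 8, 50)
labels = torch.randint(0, 50, (2, 8))
print(weighted_instruction_ce(logits, labels, torch.tensor([1.0, 0.3])))
```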
Step1X-3D:迈向高保真、可控的纹理化3D资产生成
- 标题: Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets
- 作者: Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, Xiao Chen, Feipeng Tian, Jianxiong Pan, Zeming Li, Gang Yu, Xiangyu Zhang, Daxin Jiang, Ping Tan
- 日期: 2025-05-12
- ArXiv主页: https://arxiv.org/abs/2505.07747
- 论文链接: https://arxiv.org/pdf/2505.07747
- 项目链接: https://stepfun-ai.github.io/Step1X-3D/
- gitHub仓库: https://github.com/stepfun-ai/Step1X-3D
英文摘要
While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with an diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques~(e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.
中文摘要
尽管生成人工智能在文本,图像,音频和视频域中都显着提高,但由于基本挑战,例如数据稀缺,算法限制和生态系统分散,3D一代仍然相对欠发达。为此,我们提出了STEP1X-3D,这是一个开放式框架,通过以下方式解决这些挑战:(1)严格的数据策划管道处理> 5M资产,以创建具有标准化的几何和纹理属性的2M高质量数据集;(2)将混合VAE-DIT几何发生器与基于扩散的纹理合成模块结合的两阶段3D本地结构;(3)模型,培训代码和适应模块的完整开源发布。对于几何产生,混合VAE-DIT组件通过使用基于感知器的潜在编码和尖锐的边缘采样来产生TSDF表示,以保存细节。然后,基于扩散的纹理合成模块通过几何条件和潜在空间同步确保了跨视图的一致性。基准结果表明,最先进的性能超过了现有的开源方法,同时还可以通过专有解决方案实现竞争性质量。值得注意的是,该框架通过支撑2D控制技术〜(例如Lora)的直接传输到3D合成来独特地桥接2D和3D代范式。通过同时提高数据质量,算法保真度和可重复性,Step1x-3D旨在为可控3D资产生成的开放研究建立新的标准。
Bielik 11B V2技术报告
- 标题: Bielik 11B v2 Technical Report
- 作者: Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej, Remigiusz Kinas
- 日期: 2025-05-05
- ArXiv主页: https://arxiv.org/abs/2505.02410
- 论文链接: https://arxiv.org/pdf/2505.02410
- 项目链接: https://bielik.ai/
- gitHub仓库: https://github.com/speakleash/speakleash
英文摘要
We present Bielik 11B v2, a state-of-the-art language model optimized for Polish text processing. Built on the Mistral 7B v0.2 architecture and scaled to 11B parameters using depth up-scaling, this model demonstrates exceptional performance across Polish language benchmarks while maintaining strong cross-lingual capabilities. We introduce two key technical innovations: Weighted Instruction Cross-Entropy Loss, which optimizes learning across diverse instruction types by assigning quality-based weights to training examples, and Adaptive Learning Rate, which dynamically adjusts based on context length. Comprehensive evaluation across multiple benchmarks demonstrates that Bielik 11B v2 outperforms many larger models, including those with 2-6 times more parameters, and significantly surpasses other specialized Polish language models on tasks ranging from linguistic understanding to complex reasoning. The model’s parameter efficiency and extensive quantization options enable deployment across various hardware configurations, advancing Polish language AI capabilities and establishing new benchmarks for resource-efficient language modeling in less-represented languages.
中文摘要
我们提出Bielik 11B v2,一个针对波兰语文本处理优化的最先进语言模型。该模型基于Mistral 7B v0.2架构,通过深度上扩(depth up-scaling)扩展到11B参数,在波兰语基准上表现出色,同时保持强大的跨语言能力。我们引入两项关键技术创新:加权指令交叉熵损失,通过为训练样本分配基于质量的权重来优化不同指令类型上的学习;以及根据上下文长度动态调整的自适应学习率。跨多个基准的全面评估表明,Bielik 11B v2优于许多更大的模型(包括参数量为其2到6倍的模型),并在从语言理解到复杂推理的各类任务上显著超越其他波兰语专用模型。该模型的参数效率和丰富的量化选项使其可部署在各种硬件配置上,推动了波兰语AI能力的发展,并为代表性不足语言的资源高效语言建模树立了新基准。
MathCoder-VL:桥接视觉与代码以增强多模态数学推理
- 标题: MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
- 作者: Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li
- 日期: 2025-05-15
- ArXiv主页: https://arxiv.org/abs/2505.10557
- 论文链接: https://arxiv.org/pdf/2505.10557
- 项目链接: https://mathllm.github.io/mathvision/
- gitHub仓库: https://github.com/mathllm/MathCoder
英文摘要
Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with model-in-the-loop approach, resulting in an image-to-code model, FigCodifier and ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%. The dataset and models will be released at https://github.com/mathllm/MathCoder.
中文摘要
广泛用于训练大型多模态模型的自然语言图像-字幕数据集主要关注自然场景,忽略了对解题至关重要的数学图形中的复杂细节,阻碍了现有LMM在多模态数学推理上的进步。为此,我们提出利用代码作为跨模态对齐的监督信号,因为代码天然编码了生成对应图形所需的全部信息,在两种模态之间建立了精确的联系。具体而言,我们以模型在环(model-in-the-loop)的方式协同开发图像到代码模型与数据集,得到图像到代码模型FigCodifier以及迄今最大的图像-代码数据集ImgCode-8.6M。进一步地,我们利用FigCodifier合成新的数学图形,构建出高质量的多模态数学指令微调数据集MM-MathInstruct-3M。最后,我们提出MathCoder-VL:先用ImgCode-8.6M进行跨模态对齐训练,再在MM-MathInstruct-3M上微调用于多模态数学解题。我们的模型在全部六项指标上取得了新的开源SOTA。值得注意的是,它在MathVista的几何解题子集上超过了GPT-4o和Claude 3.5 Sonnet,分别提升8.9%和9.2%。数据集和模型将发布于 https://github.com/mathllm/MathCoder。
推理模型中的同伴学习
- 标题: Learning from Peers in Reasoning Models
- 作者: Tongxu Luo, Wenyu Du, Jiaxi Bi, Stephen Chung, Zhengyang Tang, Hao Yang, Min Zhang, Benyou Wang
- 日期: 2025-05-12
- ArXiv主页: https://arxiv.org/abs/2505.07787
- 论文链接: https://arxiv.org/pdf/2505.07787
- 项目链接: https://learning-from-peers.github.io/
- gitHub仓库: https://github.com/tongxuluo/LeaP
英文摘要
Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the “Prefix Dominance Trap”. Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, every tokens, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP achieves nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals LeaP’s robust error correction by timely peer insights, showing strong error tolerance and handling varied task difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/ .
中文摘要
大型推理模型(LRM)即使在推理路径中犯错,也具备自我纠正的能力。然而,我们的研究发现,当推理过程以简短但糟糕的开头起步时,模型就很难恢复。我们称这一现象为"前缀主导陷阱"。受"同伴互动可以促进自我纠正、且不会对本已正确的个体产生负面影响"这一心理学发现的启发,我们提出**同伴学习**(Learning from Peers,LeaP)来应对这一现象。具体而言,每生成一定数量的token,每条推理路径就总结自己的中间推理,并通过路由机制分享给其他路径,使各路径在推理过程中吸收同伴见解。不过,我们观察到较小的模型有时无法很好地遵循总结与反思指令;为此,我们将其微调为**LeaP-T**模型系列。在AIME 2024、AIME 2025、AIMO 2025和GPQA Diamond上的实验表明LeaP带来显著提升。例如,加入LeaP的QwQ-32B平均比基线高出近5个绝对点,并在三个数学基准上以平均3.3分的优势超过DeepSeek-R1-671B。值得注意的是,微调后的LeaP-T-7B在AIME 2024上的表现与DeepSeek-R1-Distill-Qwen-14B相当。深入分析显示,LeaP借助及时的同伴见解实现了稳健的纠错,表现出很强的容错性,并能应对不同难度的任务。LeaP使LRM能够在推理过程中协作,是一个里程碑。我们的代码、数据集和模型见 https://learning-from-peers.github.io/ 。
DeCLIP:面向开放词汇密集感知的解耦学习
- 标题: DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
- 作者: Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, Zhuotao Tian
- 日期: 2025-05-07
- ArXiv主页: https://arxiv.org/abs/2505.04410
- 论文链接: https://arxiv.org/pdf/2505.04410
- gitHub仓库: https://github.com/xiaomoguhz/DeCLIP
英文摘要
Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP’s image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The "content" features are aligned with image crop representations to improve local discriminability, while "context" features learn to retain the spatial correlations under the guidance of vision foundation models, such as DINO. Extensive experiments demonstrate that DeCLIP significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks, including object detection and semantic segmentation. Code is available at https://github.com/xiaomoguhz/DeCLIP.
中文摘要
密集视觉预测任务一直受限于对预定义类别的依赖,使其难以应用于视觉概念无穷无尽的真实场景。尽管CLIP等视觉-语言模型(VLM)在开放词汇任务中展现了潜力,但由于局部特征表示的局限,将其直接用于密集预测往往性能欠佳。在这项工作中,我们观察到CLIP的图像token难以有效聚合来自空间或语义相关区域的信息,导致特征缺乏局部判别性和空间一致性。为解决这一问题,我们提出DeCLIP,一个通过解耦自注意力模块、分别获得"content"(内容)特征与"context"(上下文)特征来增强CLIP的新框架。"content"特征与图像裁剪表示对齐以提升局部判别性,而"context"特征则在DINO等视觉基础模型的指导下学习保留空间相关性。大量实验表明,在目标检测和语义分割等多个开放词汇密集预测任务上,DeCLIP显著优于现有方法。代码见 https://github.com/xiaomoguhz/DeCLIP。
统一连续生成模型
- 标题: Unified Continuous Generative Models
- 作者: Peng Sun, Yi Jiang, Tao Lin
- 日期: 2025-05-12
- ArXiv主页: https://arxiv.org/abs/2505.07447
- 论文链接: https://arxiv.org/pdf/2505.07447
- gitHub仓库: https://github.com/LINs-lab/UCGM
英文摘要
Recent advances in continuous generative models, including multi-step approaches like diffusion and flow-matching (typically requiring 8-1000 sampling steps) and few-step methods such as consistency models (typically 1-8 steps), have demonstrated impressive generative performance. However, existing work often treats these approaches as distinct paradigms, resulting in separate training and sampling methodologies. We introduce a unified framework for training, sampling, and analyzing these models. Our implementation, the Unified Continuous Generative Models Trainer and Sampler (UCGM-{T,S}), achieves state-of-the-art (SOTA) performance. For example, on ImageNet 256x256 using a 675M diffusion transformer, UCGM-T trains a multi-step model achieving 1.30 FID in 20 steps and a few-step model reaching 1.42 FID in just 2 steps. Additionally, applying UCGM-S to a pre-trained model (previously 1.26 FID at 250 steps) improves performance to 1.06 FID in only 40 steps. Code is available at: https://github.com/LINs-lab/UCGM.
中文摘要
连续生成模型近来进展迅速,既包括扩散和流匹配等多步方法(通常需要8-1000个采样步骤),也包括一致性模型等少步方法(通常1-8步),都展现了令人印象深刻的生成性能。然而,现有工作通常把这两类方法视为不同的范式,导致训练与采样方法彼此割裂。我们提出一个统一的框架来训练、采样和分析这些模型。我们的实现——统一连续生成模型训练器与采样器(UCGM-{T,S})——取得了最先进(SOTA)的性能。例如,在ImageNet 256x256上使用675M参数的扩散Transformer,UCGM-T训练的多步模型在20步内达到1.30 FID,少步模型仅用2步即达到1.42 FID。此外,将UCGM-S应用于一个预训练模型(原先在250步下为1.26 FID),仅用40步就将性能提升到1.06 FID。代码见:https://github.com/LINs-lab/UCGM。
OpenThinkIMG:通过视觉工具强化学习,学会用图像思考
- 标题: OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
- 作者: Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng
- 日期: 2025-05-13
- ArXiv主页: https://arxiv.org/abs/2505.08617
- 论文链接: https://arxiv.org/pdf/2505.08617
- gitHub仓库: https://github.com/zhaochen0110/OpenThinkIMG
英文摘要
While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effectively. To address these gaps, we introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. It features standardized vision tool interfaces, scalable trajectory generation for policy initialization, and a flexible training environment. Furthermore, considering supervised fine-tuning (SFT) on static demonstrations offers limited policy generalization for dynamic tool invocation, we propose a novel reinforcement learning (RL) framework V-ToolRL to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies by directly optimizing for task success using feedback from tool interactions. We empirically validate V-ToolRL on challenging chart reasoning tasks. Our RL-trained agent, built upon a Qwen2-VL-2B, significantly outperforms its SFT-initialized counterpart (+28.83 points) and surpasses established supervised tool-learning baselines like Taco and CogCom by an average of +12.7 points. Notably, it also surpasses prominent closed-source models like GPT-4.1 by +8.68 accuracy points. We hope OpenThinkIMG can serve as a foundational framework for advancing dynamic, tool-augmented visual reasoning, helping the community develop AI agents that can genuinely “think with images”.
中文摘要
尽管人类可以灵活地利用交互式的视觉认知来解决复杂的问题解决方案,从而使大型视觉模型(LVLMS)使用视觉工具学习类似的适应性行为仍然很具有挑战性。一个重大障碍是目前缺乏标准化的基础架构,这阻碍了整合多种工具,生成丰富的交互数据和有效培训稳定的代理。为了解决这些差距,我们介绍了OpenthInkimg,这是第一个开源,全面的端到端端到端框架,用于工具增强的LVLM。它具有标准化的视觉工具接口,可扩展的轨迹生成用于政策初始化以及灵活的培训环境。此外,在静态演示中考虑监督的微调(SFT)为动态工具调用提供了有限的政策概括,我们提出了一种新颖的加固学习(RL)框架V-Toolrl来培训LVLMS以学习自适应政策,以调用外部视觉工具。V-Toolrl使LVLM可以通过使用工具交互中的反馈直接优化任务成功来自主发现最佳的工具使用策略。我们从经验上验证了V-Toolrl在具有挑战性的图表推理任务上。我们的RL训练的代理建立在QWEN2-VL-2B上,显着优于其SFT启动的对应物(+28.83分),并超过了已建立的受监督的工具学习基线,例如Taco和Cogcom,平均得出+12.7分。值得注意的是,它还超过了诸如GPT-4.1(+8.68的精度点)(例如GPT-4.1)的突出封闭式模型。我们希望Openthinkimg可以作为推进动态,工具增强视觉推理的基础框架,帮助社区开发可以真正“使用图像思考”的AI代理。
Lightlab:通过扩散模型控制图像中的光源
- 标题: LightLab: Controlling Light Sources in Images with Diffusion Models
- 作者: Nadav Magar, Amir Hertz, Eric Tabellion, Yael Pritch, Alex Rav-Acha, Ariel Shamir, Yedid Hoshen
- 日期: 2025-05-14
- ArXiv主页: https://arxiv.org/abs/2505.09608
- 论文链接: https://arxiv.org/pdf/2505.09608
- 项目链接: https://nadmag.github.io/LightLab/
英文摘要
We present a simple, yet effective diffusion-based method for fine-grained, parametric control over light sources in an image. Existing relighting methods either rely on multiple input views to perform inverse rendering at inference time, or fail to provide explicit control over light changes. Our method fine-tunes a diffusion model on a small set of real raw photograph pairs, supplemented by synthetically rendered images at scale, to elicit its photorealistic prior for relighting. We leverage the linearity of light to synthesize image pairs depicting controlled light changes of either a target light source or ambient illumination. Using this data and an appropriate fine-tuning scheme, we train a model for precise illumination changes with explicit control over light intensity and color. Lastly, we show how our method can achieve compelling light editing results, and outperforms existing methods based on user preference.
中文摘要
我们提出一种简单而有效的基于扩散的方法,用于对图像中的光源进行细粒度的参数化控制。现有的重打光(relighting)方法要么依赖多个输入视角在推理时做逆渲染,要么无法对光照变化提供显式控制。我们的方法在一小组真实的原始照片对上微调扩散模型,并辅以大规模的合成渲染图像,以唤起其用于重打光的真实感先验。我们利用光的线性性质,合成描绘目标光源或环境光受控变化的图像对。借助这些数据和合适的微调方案,我们训练出能够精确改变光照、并显式控制光强与颜色的模型。最后,我们展示了该方法能得到令人信服的光照编辑结果,并在用户偏好上优于现有方法。
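摘要提到"利用光的线性"来合成受控光照变化的图像对:在线性(HDR)空间中,总图像可以写成环境光图像与目标光源贡献的线性组合。下面用一个小例子说明如何据此构造不同强度与颜色的训练对(数据与系数均为占位示例):

```python
import numpy as np

def compose_relit_image(ambient, light_only, intensity=0.7, color=(1.0, 0.9, 0.8)):
    """线性光照合成:I = I_ambient + intensity * color ⊙ I_light(线性/HDR 空间,示意)。
    ambient: 仅环境光的图像; light_only: 仅目标光源贡献的图像(均为 HWC 的 float 数组)。"""
    tint = np.asarray(color, dtype=np.float32).reshape(1, 1, 3)
    return ambient + intensity * tint * light_only

# 用法示意:由"灯关"与"灯开减灯关"两张线性空间图像,生成不同强度/颜色的训练对(占位数据)
ambient = np.zeros((480, 640, 3), dtype=np.float32)
light_only = np.ones((480, 640, 3), dtype=np.float32) * 0.2
dim_pair = compose_relit_image(ambient, light_only, intensity=0.3)
warm_pair = compose_relit_image(ambient, light_only, intensity=1.0, color=(1.0, 0.7, 0.4))
print(dim_pair.max(), warm_pair.max())
```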
WorldPM:扩展人类偏好建模
- 标题: WorldPM: Scaling Human Preference Modeling
- 作者: Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, An Yang, Binyuan Hui, Dayiheng Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Bowen Yu, Jingren Zhou, Junyang Lin
- 日期: 2025-05-15
- ArXiv主页: https://arxiv.org/abs/2505.10527
- 论文链接: https://arxiv.org/pdf/2505.10527
英文摘要
Motivated by scaling laws in language modeling that demonstrate how test loss scales as a power law with model and dataset sizes, we find that similar laws exist in preference modeling. We propose World Preference Modeling$ (WorldPM) to emphasize this scaling potential, where World Preference embodies a unified representation of human preferences. In this paper, we collect preference data from public forums covering diverse user communities, and conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters. We observe distinct patterns across different evaluation metrics: (1) Adversarial metrics (ability to identify deceptive features) consistently scale up with increased training data and base model size; (2) Objective metrics (objective knowledge with well-defined answers) show emergent behavior in larger language models, highlighting WorldPM’s scalability potential; (3) Subjective metrics (subjective preferences from a limited number of humans or AI) do not demonstrate scaling trends. Further experiments validate the effectiveness of WorldPM as a foundation for preference fine-tuning. Through evaluations on 7 benchmarks with 20 subtasks, we find that WorldPM broadly improves the generalization performance across human preference datasets of varying sizes (7K, 100K and 800K samples), with performance gains exceeding 5% on many key subtasks. Integrating WorldPM into our internal RLHF pipeline, we observe significant improvements on both in-house and public evaluation sets, with notable gains of 4% to 8% in our in-house evaluations.
中文摘要
语言建模中的缩放定律表明,测试损失随模型和数据规模按幂律变化;受此启发,我们发现偏好建模中也存在类似规律。我们提出世界偏好建模(WorldPM)来强调这一缩放潜力,其中"世界偏好"体现了对人类偏好的统一表示。本文从覆盖多样用户群体的公开论坛收集偏好数据,并在1.5B到72B参数的模型上使用1500万规模的数据进行大量训练。我们在不同评估指标上观察到不同的模式:(1)对抗性指标(识别欺骗性特征的能力)随训练数据和基座模型规模的增加而持续提升;(2)客观指标(答案明确的客观知识)在较大的语言模型中表现出涌现行为,凸显了WorldPM的可扩展潜力;(3)主观指标(来自少量人类或AI的主观偏好)没有表现出缩放趋势。进一步的实验验证了WorldPM作为偏好微调基础的有效性。在7个基准、20个子任务的评估中,我们发现WorldPM普遍提升了不同规模(7K、100K和800K样本)人类偏好数据集上的泛化性能,在许多关键子任务上提升超过5%。将WorldPM集成到我们内部的RLHF流水线后,内部和公开评测集上都有显著改进,内部评测提升达4%到8%。
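摘要指出偏好建模的测试损失随规模近似服从幂律。作为说明,下面在对数空间用最小二乘拟合一个纯幂律 L ≈ a·N^(−b)(其中的数据点为虚构示例,并非论文数值):

```python
import numpy as np

# 假设性示例数据:模型规模(参数量)与对应的偏好建模测试损失(非论文数值)
N = np.array([1.5e9, 7e9, 14e9, 32e9, 72e9])
L = np.array([0.72, 0.64, 0.60, 0.55, 0.52])

# 纯幂律 L ≈ a * N^(-b) 在对数空间是线性关系,可直接用最小二乘拟合
slope, log_a = np.polyfit(np.log(N), np.log(L), deg=1)   # log L = slope*log N + log a
a, b = np.exp(log_a), -slope
print(f"拟合结果: L ≈ {a:.3f} * N^(-{b:.3f})")

# 外推到更大规模(仅作演示,幂律是否持续成立需实验验证)
print("外推 200B 模型的损失:", a * (2e11) ** (-b))
```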
DanceGRPO:在视觉生成上释放GRPO
- 标题: DanceGRPO: Unleashing GRPO on Visual Generation
- 作者: Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo
- 日期: 2025-05-12
- ArXiv主页: https://arxiv.org/abs/2505.07818
- 论文链接: https://arxiv.org/pdf/2505.07818
- 项目链接: https://dancegrpo.github.io/
- gitHub仓库: https://github.com/XueZeyue/DanceGRPO
英文摘要
Recent breakthroughs in generative models-particularly diffusion models and rectified flows-have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. Existing reinforcement learning (RL)-based methods for visual generation face critical limitations: incompatibility with modern Ordinary Differential Equations (ODEs)-based sampling paradigms, instability in large-scale training, and lack of validation for video generation. This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReel-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward). To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundational models, and reward models. DanceGRPO demonstrates consistent and substantial improvements, which outperform baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Notably, DanceGRPO not only can stabilize policy optimization for complex video generation, but also enables generative policy to better capture denoising trajectories for Best-of-N inference scaling and learn from sparse binary feedback. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis. The code will be released.
中文摘要
生成模型——尤其是扩散模型与整流流(rectified flow)——的最新突破革新了视觉内容创作,但让模型输出与人类偏好对齐仍是关键挑战。现有基于强化学习(RL)的视觉生成方法存在严重局限:与基于常微分方程(ODE)的现代采样范式不兼容、在大规模训练中不稳定、缺少针对视频生成的验证。本文提出DanceGRPO,第一个将组相对策略优化(GRPO)适配到视觉生成范式的统一框架,用同一个RL算法覆盖两种生成范式(扩散模型与整流流)、三类任务(文本到图像、文本到视频、图像到视频)、四个基础模型(Stable Diffusion、HunyuanVideo、FLUX、SkyReel-I2V)和五种奖励模型(图像/视频美学、文图对齐、视频运动质量和二元奖励)。据我们所知,DanceGRPO是第一个能够在多种生成范式、任务、基础模型和奖励模型之间无缝适配的基于RL的统一框架。DanceGRPO带来一致且可观的提升,在HPS-v2.1、CLIP Score、VideoAlign和GenEval等基准上最高超出基线181%。值得注意的是,DanceGRPO不仅能稳定复杂视频生成的策略优化,还能让生成策略更好地捕捉去噪轨迹以进行Best-of-N推理缩放,并能从稀疏的二元反馈中学习。我们的结果表明,DanceGRPO是在视觉生成中扩展人类反馈强化学习(RLHF)任务的一个鲁棒而通用的解决方案,为协调强化学习与视觉合成提供了新的见解。代码将会发布。
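DanceGRPO 将 GRPO 引入视觉生成。GRPO 的关键一步是对同一提示下采样的一组输出,用组内奖励的均值与标准差做归一化,得到组相对优势。下面给出这一步的最小示意(与具体视觉任务和奖励模型无关,数值为示例):

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: [G] 同一提示下 G 个生成样本的标量奖励。
    GRPO 的组相对优势:A_i = (r_i - mean(r)) / (std(r) + eps)。"""
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# 用法示意:一组 8 个生成结果的奖励(例如图像美学与文图对齐的组合分,数值为示例)
rewards = torch.tensor([0.62, 0.71, 0.55, 0.80, 0.66, 0.59, 0.77, 0.70])
print(group_relative_advantages(rewards))
```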
Skywork-VL Reward:面向多模态理解与推理的有效奖励模型
- 标题: Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
- 作者: Xiaokun Wang, Chris, Jiangbo Pei, Wei Shen, Yi Peng, Yunzhuo Hao, Weijie Qiu, Ai Jian, Tianyidan Xie, Xuchen Song, Yang Liu, Yahui Zhou
- 日期: 2025-05-12
- ArXiv主页: https://arxiv.org/abs/2505.07263
- 论文链接: https://arxiv.org/pdf/2505.07263
英文摘要
We propose Skywork-VL Reward, a multimodal reward model that provides reward signals for both multimodal understanding and reasoning tasks. Our technical approach comprises two key components: First, we construct a large-scale multimodal preference dataset that covers a wide range of tasks and scenarios, with responses collected from both standard vision-language models (VLMs) and advanced VLM reasoners. Second, we design a reward model architecture based on Qwen2.5-VL-7B-Instruct, integrating a reward head and applying multi-stage fine-tuning using pairwise ranking loss on pairwise preference data. Experimental evaluations show that Skywork-VL Reward achieves state-of-the-art results on multimodal VL-RewardBench and exhibits competitive performance on the text-only RewardBench benchmark. Furthermore, preference data constructed based on our Skywork-VL Reward proves highly effective for training Mixed Preference Optimization (MPO), leading to significant improvements in multimodal reasoning capabilities. Our results underscore Skywork-VL Reward as a significant advancement toward general-purpose, reliable reward models for multimodal alignment. Our model has been publicly released to promote transparency and reproducibility.
中文摘要
我们提出Skywork-VL Reward,一个能同时为多模态理解与推理任务提供奖励信号的多模态奖励模型。我们的技术方案包含两个关键部分:第一,构建一个覆盖广泛任务和场景的大规模多模态偏好数据集,回复既来自标准视觉-语言模型(VLM),也来自先进的VLM推理器;第二,基于Qwen2.5-VL-7B-Instruct设计奖励模型架构,加入奖励头,并在成对偏好数据上使用成对排名损失进行多阶段微调。实验评估表明,Skywork-VL Reward在多模态VL-RewardBench上取得了最先进的结果,在纯文本的RewardBench基准上也具有竞争力。此外,基于Skywork-VL Reward构建的偏好数据在训练混合偏好优化(MPO)时非常有效,显著提升了多模态推理能力。我们的结果表明,Skywork-VL Reward是迈向通用、可靠的多模态对齐奖励模型的重要一步。我们的模型已公开发布,以促进透明性和可复现性。
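摘要提到在成对偏好数据上用成对排名损失训练奖励头,其常见形式是 Bradley-Terry 式的 −log σ(r_chosen − r_rejected)。下面给出该损失的最小示意(奖励头与批处理细节为假设):

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """reward_chosen / reward_rejected: [B],分别为偏好回复与被拒回复的标量奖励。
    Bradley-Terry 成对排名损失: -log sigmoid(r_chosen - r_rejected)。"""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# 用法示意(数值为随机示例)
r_c = torch.tensor([1.2, 0.4, 0.9])
r_r = torch.tensor([0.3, 0.6, -0.1])
print(pairwise_ranking_loss(r_c, r_r))
```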
REFINE-AF:一个任务无关的框架,通过来自自动反馈的强化学习,以自生成指令对齐语言模型
- 标题: REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback
- 作者: Aniruddha Roy, Pretam Ray, Abhilash Nandy, Somak Aditya, Pawan Goyal
- 日期: 2025-05-10
- ArXiv主页: https://arxiv.org/abs/2505.06548
- 论文链接: https://arxiv.org/pdf/2505.06548
英文摘要
Instruction-based Large Language Models (LLMs) have proven effective in numerous few-shot or zero-shot Natural Language Processing (NLP) tasks. However, creating human-annotated instruction data is time-consuming, expensive, and often limited in quantity and task diversity. Previous research endeavors have attempted to address this challenge by proposing frameworks capable of generating instructions in a semi-automated and task-agnostic manner directly from the model itself. Many of these efforts have relied on large API-only parameter-based models such as GPT-3.5 (175B), which are expensive, and subject to limits on a number of queries. This paper explores the performance of three open-source small LLMs such as LLaMA 2-7B, LLama 2-13B, and Mistral 7B, using a semi-automated framework, thereby reducing human intervention, effort, and cost required to generate an instruction dataset for fine-tuning LLMs. Furthermore, we demonstrate that incorporating a Reinforcement Learning (RL) based training algorithm into this LLMs-based framework leads to further enhancements. Our evaluation of the dataset reveals that these RL-based frameworks achieve a substantial improvements in 63-66% of the tasks compared to previous approaches.
中文摘要
基于指导的大语言模型(LLM)已被证明在许多少数或零弹性的自然语言处理(NLP)任务中有效。但是,创建人类注销的教学数据是耗时的,昂贵的,并且通常限制数量和任务多样性。以前的研究努力试图通过提出能够直接从模型本身以半自动和任务无关的方式生成指令的框架来应对这一挑战。这些努力中有许多依赖于仅基于参数的大型模型,例如GPT-3.5(175b),它们昂贵,并且受到许多查询的限制。本文探讨了三个开源小型LLM的性能,例如Llama 2-7b,Llama 2-13b和Mistral 7b,使用半自动化框架,从而减少人力干预,努力和成本来生成指令数据集以进行微调LLMS。此外,我们证明,将基于强化学习(RL)培训算法纳入基于LLMS的框架会导致进一步增强。我们对数据集的评估表明,与以前的方法相比,这些基于RL的框架在63-66%的任务方面取得了重大改进。
AttentionInfluence:利用注意力头影响进行弱到强的预训练数据选择
- 标题: AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
- 作者: Kai Hua, Steven Wu, Ge Zhang, Ke Shen
- 日期: 2025-05-12
- ArXiv主页: https://arxiv.org/abs/2505.07293
- 论文链接: https://arxiv.org/pdf/2505.07293
英文摘要
Recently, there has been growing interest in collecting reasoning-intensive pretraining data to improve LLMs’ complex reasoning ability. Prior approaches typically rely on supervised classifiers to identify such data, which requires labeling by humans or LLMs, often introducing domain-specific biases. Due to the attention heads being crucial to in-context reasoning, we propose AttentionInfluence, a simple yet effective, training-free method without supervision signal. Our approach enables a small pretrained language model to act as a strong data selector through a simple attention head masking operation. Specifically, we identify retrieval heads and compute the loss difference when masking these heads. We apply AttentionInfluence to a 1.3B-parameter dense model to conduct data selection on the SmolLM corpus of 241B tokens, and mix the SmolLM corpus with the selected subset comprising 73B tokens to pretrain a 7B-parameter dense model using 1T training tokens and WSD learning rate scheduling. Our experimental results demonstrate substantial improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive and reasoning-heavy benchmarks (i.e., MMLU, MMLU-Pro, AGIEval-en, GSM8K, and HumanEval). This demonstrates an effective weak-to-strong scaling property, with small models improving the final performance of larger models-offering a promising and scalable path for reasoning-centric data selection.
中文摘要
近来,收集推理密集型预训练数据以提升LLM复杂推理能力的做法日益受到关注。先前方法通常依赖有监督分类器来识别此类数据,这需要人类或LLM进行标注,往往会引入特定领域的偏差。鉴于注意力头对上下文推理至关重要,我们提出AttentionInfluence,一种简单有效、无需训练、不依赖监督信号的方法。我们的方法让一个小型预训练语言模型通过简单的注意力头屏蔽操作充当强大的数据选择器。具体而言,我们先识别检索头(retrieval heads),再计算屏蔽这些头前后的损失差。我们将AttentionInfluence应用于一个1.3B参数的稠密模型,在包含2410亿token的SmolLM语料上进行数据选择,并将SmolLM语料与选出的730亿token子集混合,使用1万亿训练token和WSD学习率调度预训练一个7B参数稠密模型。实验结果显示,在多个知识密集和推理密集的基准(MMLU、MMLU-Pro、AGIEval-en、GSM8K和HumanEval)上取得了1.4到3.5个百分点的显著提升。这展示了一种有效的弱到强(weak-to-strong)扩展性质:小模型改进了更大模型的最终性能,为以推理为中心的数据选择提供了有前景且可扩展的路径。
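该方法的核心打分方式是:屏蔽小模型中的检索头,比较同一文档在屏蔽前后的语言建模损失差。下面用 GPT-2 和 transformers 的 head_mask 参数给出一个概念性示意(论文使用的是 1.3B 自研模型,这里的模型选择与检索头索引纯属假设):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# 用 GPT-2 演示"屏蔽特定注意力头后损失差"的打分思路(仅作概念说明)
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# 假设的"检索头"集合:{层索引: [头索引]},实际需按论文方法先行检测
retrieval_heads = {3: [2, 7], 5: [0]}

def lm_loss(text, head_mask=None):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids, head_mask=head_mask)
    return out.loss.item()

def attention_influence_score(text):
    mask = torch.ones(model.config.n_layer, model.config.n_head)
    for layer, heads in retrieval_heads.items():
        mask[layer, heads] = 0.0                            # 屏蔽假设的检索头
    return lm_loss(text, head_mask=mask) - lm_loss(text)    # 损失差越大,越偏推理密集

print(attention_influence_score("To solve this equation, first isolate x on one side."))
```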
COT百科全书:分析,预测和控制推理模型将如何思考
- 标题: The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think
- 作者: Seongyun Lee, Seungone Kim, Minju Seo, Yongrae Jo, Dongyoung Go, Hyeonbin Hwang, Jinho Park, Xiang Yue, Sean Welleck, Graham Neubig, Moontae Lee, Minjoon Seo
- 日期: 2025-05-15
- ArXiv主页: https://arxiv.org/abs/2505.10185
- 论文链接: https://arxiv.org/pdf/2505.10185
英文摘要
Long chain-of-thought (CoT) is an essential ingredient in effective usage of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limited. While some prior works have attempted to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. In this work, we introduce the CoT Encyclopedia, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that this framework produces more interpretable and comprehensive analyses than existing methods. Moreover, we demonstrate that this understanding enables performance gains: we can predict which strategy a model is likely to use and guide it toward more effective alternatives. Finally, we provide practical insights, such as that training data format (e.g., free-form vs. multiple-choice) has a far greater impact on reasoning behavior than data domain, underscoring the importance of format-aware model design.
中文摘要
长思维链(CoT)是有效使用现代大语言模型的关键要素,但我们对支撑这些能力的推理策略仍了解有限。虽然一些先前工作尝试用预定义的策略类型对CoT进行分类,但这类方法受人类直觉约束,无法覆盖模型行为的全部多样性。在这项工作中,我们提出CoT百科全书,一个自下而上的框架,用于分析和引导模型推理。我们的方法自动从模型生成的CoT中抽取多样的推理标准,将其嵌入语义空间、聚类为代表性类别,并推导出对比性评判标准(rubrics)来解释推理行为。人工评估表明,与现有方法相比,该框架产生的分析更可解释、更全面。此外,我们证明这种理解可以带来性能收益:我们可以预测模型可能采用哪种策略,并引导其使用更有效的替代策略。最后,我们给出一些实用洞见,例如训练数据的格式(自由形式还是多项选择)对推理行为的影响远大于数据领域,凸显了格式感知模型设计的重要性。
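框架的中间步骤之一是把从 CoT 中抽取的推理标准嵌入语义空间并聚类成代表性类别。下面用 sentence-transformers 加 KMeans 给出这一步的通用示意(模型名与示例文本均为假设,并非论文所用配置):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# 假设已从模型生成的 CoT 中抽取出若干"推理标准"片段(示例文本)
criteria = [
    "先枚举所有边界情况再归纳通式",
    "把问题转化为等价的代数方程",
    "通过构造反例排除错误选项",
    "自顶向下分解为子问题后逐个验证",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 任一句向量模型即可
emb = embedder.encode(criteria, normalize_embeddings=True)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(emb)
for text, label in zip(criteria, kmeans.labels_):
    print(label, text)
```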
Marigold:基于扩散的图像生成器在图像分析任务上的低成本适配
- 标题: Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis
- 作者: Bingxin Ke, Kevin Qu, Tianfu Wang, Nando Metzger, Shengyu Huang, Bo Li, Anton Obukhov, Konrad Schindler
- 日期: 2025-05-14
- ArXiv主页: https://arxiv.org/abs/2505.09358
- 论文链接: https://arxiv.org/pdf/2505.09358
- 项目链接: https://marigoldcomputervision.github.io
- gitHub仓库: https://github.com/prs-eth/Marigold
英文摘要
The success of deep learning in computer vision over the past decade has hinged on large labeled datasets and strong pretrained models. In data-scarce settings, the quality of these pretrained models becomes crucial for effective transfer learning. Image classification and self-supervised learning have traditionally been the primary methods for pretraining CNNs and transformer-based architectures. Recently, the rise of text-to-image generative models, particularly those using denoising diffusion in a latent space, has introduced a new class of foundational models trained on massive, captioned image datasets. These models’ ability to generate realistic images of unseen content suggests they possess a deep understanding of the visual world. In this work, we present Marigold, a family of conditional generative models and a fine-tuning protocol that extracts the knowledge from pretrained latent diffusion models like Stable Diffusion and adapts them for dense image analysis tasks, including monocular depth estimation, surface normals prediction, and intrinsic decomposition. Marigold requires minimal modification of the pre-trained latent diffusion model’s architecture, trains with small synthetic datasets on a single GPU over a few days, and demonstrates state-of-the-art zero-shot generalization. Project page: https://marigoldcomputervision.github.io
中文摘要
过去十年深度学习在计算机视觉中的成功,依赖于大规模标注数据集和强大的预训练模型。在数据稀缺的场景下,这些预训练模型的质量对有效的迁移学习至关重要。图像分类和自监督学习一直是预训练CNN和基于Transformer架构的主要手段。近来,文本到图像生成模型——尤其是在潜空间中进行去噪扩散的模型——的兴起,带来了一类在海量带字幕图像数据上训练的新型基础模型。这些模型能够生成未见内容的逼真图像,表明它们对视觉世界有深刻的理解。在这项工作中,我们提出Marigold,一族条件生成模型及配套的微调协议,它从Stable Diffusion等预训练潜扩散模型中提取知识,并将其适配到稠密图像分析任务,包括单目深度估计、表面法线预测和本征分解。Marigold只需对预训练潜扩散模型的架构做极小修改,用少量合成数据在单张GPU上训练数天,便展现出最先进的零样本泛化能力。项目页面:https://marigoldcomputervision.github.io
UniVLA:以任务为中心的潜在动作,学会在任何地方行动
- 标题: UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
- 作者: Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li
- 日期: 2025-05-09
- ArXiv主页: https://arxiv.org/abs/2505.06111
- 论文链接: https://arxiv.org/pdf/2505.06111
- gitHub仓库: https://github.com/OpenDriveLab/UniVLA
英文摘要
A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA’s potential to facilitate scalable and efficient robot policy learning.
中文摘要
通用机器人应当能在各种环境中有效工作。然而,现有方法大多严重依赖扩充带动作标注的数据来提升能力,因而往往局限于单一的物理规格,难以学到可跨具身与环境迁移的知识。为突破这些限制,我们提出UniVLA,一个学习跨具身视觉-语言-动作(VLA)策略的新框架。我们的核心创新是用潜在动作模型从视频中推导以任务为中心的动作表示,从而能够利用来自广泛具身和视角的海量数据。为减轻与任务无关的动态的影响,我们引入语言指令,并在DINO特征空间中建立潜在动作模型。从互联网规模的视频中学到的通用策略,可以通过高效的潜在动作解码部署到各种机器人上。我们在多个操作和导航基准以及真实机器人部署中取得了最先进的结果。UniVLA用不到OpenVLA 1/20的预训练计算量和1/10的下游数据,取得了更优的性能。随着异构数据(甚至包括人类视频)被纳入训练流程,还能观察到持续的性能提升。这些结果凸显了UniVLA促进可扩展、高效机器人策略学习的潜力。