【论文速递】2025年第27周(Jun-29-Jul-05)(Robotics/Embodied AI/LLM)
中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- GLM-4.1V-Thinking:通过可扩展强化学习实现多功能多模态推理
- 英文摘要
- 中文摘要
- Kwai Keye-VL 技术报告
- 英文摘要
- 中文摘要
- WebSailor:为Web智能体导航超人推理
- 英文摘要
- 中文摘要
- 用图像思考的多模态推理:基础、方法与未来前沿
- 英文摘要
- 中文摘要
- LongAnimation:基于动态全局-局部记忆的长动画生成
- 英文摘要
- 中文摘要
- 数学推理能否提升LLM的通用能力?理解LLM推理的可迁移性
- 英文摘要
- 中文摘要
- 基于能量的Transformer是可扩展的学习者与思考者
- 英文摘要
- 中文摘要
- BlenderFusion:基于3D的视觉编辑与生成式合成
- 英文摘要
- 中文摘要
- Ovis-U1 技术报告
- 英文摘要
- 中文摘要
- LangScene-X:利用TriMap视频扩散重建可泛化的3D语言嵌入场景
- 英文摘要
- 中文摘要
- Skywork-Reward-V2:通过人机协同扩展偏好数据整理
- 英文摘要
- 中文摘要
- 任意条件下的Depth Anything
- 英文摘要
- 中文摘要
- SPIRAL:零和博弈上的自我对弈通过多智能体多轮强化学习激励推理
- 英文摘要
- 中文摘要
- SciArena:面向科学文献任务的基础模型开放评估平台
- 英文摘要
- 中文摘要
- 径向注意力:用于长视频生成的O(n log n)能量衰减稀疏注意力
- 英文摘要
- 中文摘要
- 倾听内在声音:通过中间特征反馈对齐ControlNet训练
- 英文摘要
- 中文摘要
- 视觉-语言-动作模型综述:动作令牌化的视角
- 英文摘要
- 中文摘要
- Calligrapher:自由风格的文本图像定制
- 英文摘要
- 中文摘要
- MoCa:模态感知的持续预训练让双向多模态嵌入更好
- 英文摘要
- 中文摘要
- LLaVA-Scissor:基于语义连通分量的令牌压缩,用于视频LLM
- 英文摘要
- 中文摘要
- IntFold:用于通用与专业生物分子结构预测的可控基础模型
- 英文摘要
- 中文摘要
- VMoBA:用于视频扩散模型的块混合注意力
- 英文摘要
- 中文摘要
- DiffuCoder:理解并改进用于代码生成的掩码扩散模型
- 英文摘要
- 中文摘要
- 视觉语言模型是否拥有内部世界模型?迈向原子化评估
- 英文摘要
- 中文摘要
- XVerse:通过DiT调制实现身份与语义属性的一致多主体控制
- 英文摘要
- 中文摘要
GLM-4.1V-Thinking:通过可扩展强化学习实现多功能多模态推理
- 标题: GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
- 作者: Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Leyi Pan, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianle Gong, Wenkai Li, Wei Jia, Xin Lyu, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuxuan Zhang, Zhanxiao Du, Zhenyu Hou, Zhao Xue, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
- 日期: 2025-07-01
- ArXiv主页: https://arxiv.org/abs/2507.01006
- 论文链接: https://arxiv.org/pdf/2507.01006
- gitHub仓库: https://github.com/THUDM/GLM-4.1V-Thinking
英文摘要
We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. Reinforcement Learning with Curriculum Sampling (RLCS) then unlocks the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding, among others. To facilitate research in this field, we open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.
中文摘要
我们提出GLM-4.1V-Thinking,这是一种旨在推进通用多模态推理的视觉语言模型(VLM)。在本报告中,我们分享了在以推理为中心的训练框架开发过程中的关键发现。我们首先通过大规模预训练构建了一个具有巨大潜力的视觉基础模型,这可以说决定了最终性能的上限。随后,带课程采样的强化学习(RLCS)释放了模型的全部潜力,在各类任务上带来全面的能力提升,包括STEM问题求解、视频理解、内容识别、编码、视觉定位(grounding)、基于GUI的智能体以及长文档理解等。为了促进该领域的研究,我们开源了GLM-4.1V-9B-Thinking,它在同等规模的模型中达到了最先进的性能。在覆盖28个公共基准的全面评估中,我们的模型在几乎所有任务上都优于Qwen2.5-VL-7B,并在其中18个基准上取得了与规模大得多的Qwen2.5-VL-72B相当甚至更优的表现。值得注意的是,在长文档理解和STEM推理等具有挑战性的任务上,GLM-4.1V-9B-Thinking相比GPT-4o等闭源模型也表现出有竞争力甚至更优的性能,进一步凸显了其强大能力。代码、模型及更多信息发布于 https://github.com/THUDM/GLM-4.1V-Thinking。
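摘要中提到的"带课程采样的强化学习(RLCS)"在摘要里没有给出具体做法,下面是一个按任务难度动态加权采样的极简示意(通过率、权重函数、参数名均为假设的示例,并非论文官方实现)。

```python
import random

def curriculum_weights(pass_rates, temperature=0.5):
    """按"当前通过率越接近0.5信息量越大"的假设给任务加权(示意,非官方实现)。"""
    weights = []
    for p in pass_rates:
        # 太容易(p≈1)或太难(p≈0)的任务权重低,处于能力边界的任务权重高
        w = (p * (1.0 - p)) ** (1.0 / max(temperature, 1e-6))
        weights.append(w + 1e-8)
    return weights

def sample_training_batch(tasks, pass_rates, batch_size=4):
    """根据课程权重为下一轮RL rollout采样任务。"""
    weights = curriculum_weights(pass_rates)
    return random.choices(tasks, weights=weights, k=batch_size)

if __name__ == "__main__":
    tasks = ["geometry_qa", "video_qa", "gui_agent", "long_doc"]
    pass_rates = [0.95, 0.50, 0.10, 0.40]   # 假设的历史通过率
    print(sample_training_batch(tasks, pass_rates))
```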
Kwai Keye-VL 技术报告
- 标题: Kwai Keye-VL Technical Report
- 作者: Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yang Zhou, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zhenhua Wu, Zhenyu Li, Zhixin Ling, Ziming Li, Dehua Ma, Di Xu, Haixuan Gao, Hang Li, Jiawei Guo, Jing Wang, Lejian Ren, Muhao Wei, Qianqian Wang, Qigen Hu, Shiyao Wang, Tao Yu, Xinchen Luo, Yan Li, Yiming Liang, Yuhang Hu, Zeyi Lu, Zhuoran Yang, Zixing Zhang
- 日期: 2025-07-02
- ArXiv主页: https://arxiv.org/abs/2507.01949
- 论文链接: https://arxiv.org/pdf/2507.01949
- 项目链接: https://kwai-keye.github.io/
- gitHub仓库: https://github.com/Kwai-Keye/Keye
英文摘要
While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today’s digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode cold-start data mixture, which includes thinking, non-thinking, auto-think, think with image, and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release the KC-MMBench, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.
中文摘要
虽然多模态大型语言模型(MLLM)在静态图像上表现出了显著的能力,但它们在理解动态、信息密集的短视频方面往往不足,而短视频是当今数字领域的主导媒介。为了弥合这一差距,我们引入了Kwai Keye-VL,这是一个80亿参数的多模态基础模型,专为短视频理解的前沿性能而设计,同时保持了强大的通用视觉语言能力。Keye-VL的开发基于两个核心支柱:一个超过6000亿token、以视频为重点的大规模高质量数据集,以及一套创新的训练配方。该配方包括一个四阶段的预训练过程,以实现扎实的视觉语言对齐,然后是一个细致的两阶段后训练过程。第一个后训练阶段增强了指令遵循等基础能力,而第二个阶段侧重于激发高级推理。在第二阶段,一个关键的创新是我们的五模式冷启动数据混合,其中包括思考、非思考、自动思考、用图像思考和高质量视频数据。这种混合教会了模型决定何时以及如何推理。后续的强化学习(RL)和对齐步骤进一步增强了这些推理能力,并纠正了异常的模型行为,如重复输出。为了验证我们的方法,我们进行了广泛的评估,结果表明Keye-VL在公共视频基准测试中取得了最先进的结果,并且在通用的基于图像的任务中仍然具有很强的竞争力(图1)。此外,我们开发并发布了KC-MMBench,这是一个专为现实世界短视频场景量身定制的新基准,Keye-VL在其中显示出显著的优势。
WebSailor:为Web智能体导航超人推理
- 标题: WebSailor: Navigating Super-human Reasoning for Web Agent
- 作者: Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou
- 日期: 2025-07-03
- ArXiv主页: https://arxiv.org/abs/2507.02592
- 论文链接: https://arxiv.org/pdf/2507.02592
- 项目链接: https://github.com/Alibaba-NLP/WebAgent
- gitHub仓库: https://github.com/Alibaba-NLP/WebAgent
英文摘要
Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all opensource agents in complex information-seeking tasks, matching proprietary agents’ performance and closing the capability gap.
中文摘要
超越人类认知局限是LLM训练中的一个关键前沿。像DeepResearch这样的专有智能体系统已经在BrowseComp等极其复杂的信息检索基准上展示了超人的能力,这是此前无法实现的壮举。我们认为,它们的成功取决于开源模型所缺乏的一种复杂推理模式:在浏览海量信息空间时系统性降低极端不确定性的能力。基于这一洞察,我们提出了WebSailor,一套旨在赋予模型这种关键能力的完整后训练方法。我们的方法包括:通过结构化采样和信息混淆来生成新颖的高不确定性任务、RFT冷启动,以及一种高效的智能体RL训练算法——重复采样策略优化(DUPO)。借助这一集成流程,WebSailor在复杂信息检索任务中显著优于所有开源智能体,达到了与专有智能体相当的性能,缩小了能力差距。
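摘要中提到通过"信息混淆"来构造高不确定性任务,但没有给出具体做法;下面是一个把问题中的精确实体替换为模糊描述、从而提高检索不确定性的玩具示意(实体表、替换规则均为假设,仅用于说明思路,并非论文实现)。

```python
import re

# 假设的"精确线索 -> 模糊描述"替换规则,仅作演示
OBFUSCATION_RULES = {
    r"\b2019\b": "a year in the late 2010s",
    r"\bthe Nobel Prize\b": "a major international prize",
    r"\bStanford University\b": "a research university on the U.S. West Coast",
}

def obfuscate_question(question: str) -> str:
    """将问题中的精确线索替换为模糊说法,迫使智能体进行多步检索与消歧。"""
    for pattern, vague in OBFUSCATION_RULES.items():
        question = re.sub(pattern, vague, question)
    return question

if __name__ == "__main__":
    q = "Which researcher at Stanford University won the Nobel Prize in 2019?"
    print(obfuscate_question(q))   # 精确实体被替换为模糊描述后的问题
```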
用图像思考的多模态推理:基础、方法与未来前沿
- 标题: Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
- 作者: Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, Yi R. Fung
- 日期: 2025-06-30
- ArXiv主页: https://arxiv.org/abs/2506.23918
- 论文链接: https://arxiv.org/pdf/2506.23918
- gitHub仓库: https://github.com/zhaochen0110/Awesome_Think_With_Images
英文摘要
Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental “semantic gap” between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the think with image paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.
中文摘要
多模态推理的最新进展在很大程度上由文本思维链(CoT)推动,即模型在语言内部进行推理的范式。然而,这种以文本为中心的方法将视觉视为静态的初始上下文,在丰富的感知数据与离散的符号化思维之间造成了根本性的"语义鸿沟"。人类认知往往超越语言,把视觉当作动态的思维画板来使用。类似的演进如今也正在AI中展开,标志着从"仅仅思考图像"的模型向"真正用图像思考"的模型的根本性范式转变。这一新兴范式的特征是,模型把视觉信息作为思维过程中的中间步骤,将视觉从被动输入转变为动态、可操纵的认知工作空间。在这篇综述中,我们沿着认知自主性不断提升的轨迹梳理这种智能的演进,它分为三个关键阶段:从外部工具探索,到程序化操纵,再到内在想象。为了给这个快速发展的领域建立结构,本综述做出四项关键贡献:(1)确立"用图像思考"范式的基本原则及其三阶段框架;(2)全面回顾表征该路线图各阶段的核心方法;(3)分析评测基准与变革性应用的关键图景;(4)指出重大挑战并勾勒有前景的未来方向。通过提供这一结构化概览,我们希望为未来研究指明清晰的路线图,迈向更强大、更与人类对齐的多模态AI。
LongAnimation:基于动态全局-局部记忆的长动画生成
- 标题: LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
- 作者: Nan Chen, Mengqi Huang, Yihao Meng, Zhendong Mao
- 日期: 2025-07-02
- ArXiv主页: https://arxiv.org/abs/2507.01945
- 论文链接: https://arxiv.org/pdf/2507.01945
- 项目链接: https://cn-makers.github.io/long_animation_web/
- gitHub仓库: https://github.com/VectorSpaceLab/Video-XL
英文摘要
Animation colorization is a crucial part of real animation industry production. Long animation colorization has high labor costs. Therefore, automated long animation colorization based on the video generation model has significant research value. Existing studies are limited to short-term colorization. These studies adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information, failing to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting global color-consistent features relevant to the current generation. Specifically, we propose LongAnimation, a novel framework, which mainly includes a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. The SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long video understanding model to dynamically compress global historical features and adaptively fuse them with the current generation features. To refine the color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth the video segment transition. Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations show the effectiveness of LongAnimation in maintaining short-term and long-term color consistency for open-domain animation colorization task. The code can be found at https://cn-makers.github.io/long_animation_web/.
中文摘要
动画上色是实际动画工业生产中的关键环节。长动画的上色人力成本很高,因此基于视频生成模型的长动画自动上色具有重要的研究价值。现有研究仅限于短期上色,它们采用局部范式,通过融合重叠特征来实现局部片段之间的平滑过渡。然而,局部范式忽略了全局信息,无法保持长期的颜色一致性。在这项研究中,我们认为理想的长期颜色一致性可以通过动态的全局-局部范式来实现,即动态提取与当前生成相关的全局颜色一致特征。具体来说,我们提出了LongAnimation这一新颖框架,主要包括SketchDiT、动态全局-局部记忆(DGLM)和颜色一致性奖励。SketchDiT捕获混合参考特征,为DGLM模块提供支持;DGLM模块利用长视频理解模型动态压缩全局历史特征,并将其与当前生成特征自适应融合。为了进一步改进颜色一致性,我们引入了颜色一致性奖励。在推理阶段,我们提出颜色一致性融合来平滑视频片段之间的过渡。在短期(14帧)和长期(平均500帧)动画上的大量实验表明,LongAnimation在开放域动画上色任务中能够有效保持短期和长期的颜色一致性。代码见 https://cn-makers.github.io/long_animation_web/。
数学推理能否提升LLM的通用能力?理解LLM推理的可迁移性
- 标题: Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
- 作者: Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue
- 日期: 2025-07-01
- ArXiv主页: https://arxiv.org/abs/2507.00432
- 论文链接: https://arxiv.org/pdf/2507.00432
- gitHub仓库: https://github.com/ReasoningTransfer/Transferability-of-LLM-Reasoning
英文摘要
Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.
中文摘要
数学推理已成为大语言模型(LLM)进步的代表性标志,新模型在MATH和AIME等基准上迅速超过了人类水平。但随着数学榜单一周一周地刷新,一个问题值得追问:这些提升反映的是更广泛的问题求解能力,还是只是狭隘的过拟合?为了回答这个问题,我们在一大批任务上评估了20多个以推理为目标微调的开源权重模型,涵盖数学、科学问答、智能体规划、代码以及标准指令遵循。我们意外地发现,大多数在数学上取得成功的模型,无法把这些收益迁移到其他领域。为了严格研究这一现象,我们在Qwen3-14B模型上进行了对照实验:只使用数学数据,但采用不同的微调方法。结果表明,经强化学习(RL)微调的模型在各领域间泛化良好,而经监督微调(SFT)的模型往往会遗忘通用能力。潜空间表示与token空间分布漂移的分析显示,SFT会引起明显的表示漂移和输出漂移,而RL则保留了通用领域的结构。我们的结果表明,有必要重新审视标准的后训练方案,特别是对依赖SFT蒸馏数据来推进推理模型的做法。
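摘要中提到的"token空间分布漂移分析"没有说明具体度量方式;下面用两组logits之间的平均KL散度给出一个最小化示意(基于PyTorch,度量选择与变量名均为假设,并非论文原始实现)。

```python
import torch
import torch.nn.functional as F

def token_distribution_shift(logits_base: torch.Tensor,
                             logits_tuned: torch.Tensor) -> float:
    """
    用平均KL(P_base || P_tuned)衡量微调前后下一token分布的漂移(示意)。
    logits_* 形状: [num_positions, vocab_size],来自同一批提示词。
    """
    log_p_base = F.log_softmax(logits_base, dim=-1)
    log_p_tuned = F.log_softmax(logits_tuned, dim=-1)
    # F.kl_div 的第一个参数是log概率(被比较的分布),第二个是目标分布
    kl = F.kl_div(log_p_tuned, log_p_base, log_target=True, reduction="batchmean")
    return kl.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    base = torch.randn(8, 32000)                   # 假设词表大小为32000
    tuned = base + 0.5 * torch.randn(8, 32000)     # 模拟微调带来的输出漂移
    print(f"token分布漂移(KL): {token_distribution_shift(base, tuned):.4f}")
```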
基于能量的Transformer是可扩展的学习者与思考者
- 标题: Energy-Based Transformers are Scalable Learners and Thinkers
- 作者: Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, Tariq Iqbal
- 日期: 2025-07-02
- ArXiv主页: https://arxiv.org/abs/2507.02092
- 论文链接: https://arxiv.org/pdf/2507.02092
- 项目链接: https://energy-based-transformers.github.io/
- gitHub仓库: https://github.com/amorehead/jvp_flash_attention
英文摘要
Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performances. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question “Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?” Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs) – a new class of Energy-Based Models (EBMs) – to assign an energy value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving an up to 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes. Further, we find that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a promising new paradigm for scaling both the learning and thinking capabilities of models.
中文摘要
推理时计算技术(类似于人类的"系统2思维")近来在提升模型性能方面变得流行。然而,大多数现有方法存在若干局限:它们要么只适用于特定模态(例如只在文本上工作),要么只适用于特定问题(例如数学、代码等可验证领域),要么需要在无监督预训练之外额外的监督或训练(例如验证器或可验证奖励)。在本文中,我们提出这样一个问题:"能否将这些系统2思维方法加以泛化,开发出仅凭无监督学习就学会思考的模型?"有趣的是,我们发现答案是肯定的:方法是让模型学会显式地验证输入与候选预测之间的兼容性,然后把预测问题重新表述为针对该验证器的优化问题。具体而言,我们训练基于能量的Transformer(EBT)——一类新的基于能量的模型(EBM)——为每个"输入-候选预测"对赋予一个能量值,从而通过基于梯度下降的能量最小化直至收敛来完成预测。在离散(文本)和连续(视觉)两种模态上,我们发现EBT在训练期间的扩展速度快于主流的Transformer++方案,在数据量、批大小、参数量、FLOPs和深度等方面的扩展率最高可提升35%。在推理阶段,EBT凭借系统2思维在语言任务上比Transformer++多提升29%的性能,并且在图像去噪上以更少的前向传播次数超越了扩散Transformer。此外,在预训练表现相同或更差的情况下,EBT在大多数下游任务上取得了比现有模型更好的结果,表明EBT的泛化能力优于现有方法。因此,EBT是同时扩展模型学习能力与思考能力的一个有前景的新范式。
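摘要里"通过基于梯度下降的能量最小化来完成预测"的思路可以用几行PyTorch来示意:下面用一个随意构造的小型能量网络演示推理时如何对候选预测做梯度下降(网络结构、步长、迭代次数均为假设,并非论文中的EBT实现)。

```python
import torch
import torch.nn as nn

class ToyEnergyModel(nn.Module):
    """把(输入, 候选预测)映射为一个标量能量值的玩具网络(示意)。"""
    def __init__(self, x_dim=16, y_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, 64), nn.SiLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def predict_by_energy_descent(model, x, y_dim=8, steps=50, lr=0.1):
    """推理即优化:固定输入x,对候选预测y做梯度下降以最小化能量。"""
    y = torch.randn(x.shape[0], y_dim, requires_grad=True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy = model(x, y).sum()   # 低能量 = 输入与预测更"兼容"
        energy.backward()
        opt.step()
    return y.detach()

if __name__ == "__main__":
    model = ToyEnergyModel()
    x = torch.randn(4, 16)
    y_hat = predict_by_energy_descent(model, x)
    print(y_hat.shape)   # torch.Size([4, 8])
```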
BlenderFusion:基于3D的视觉编辑与生成式合成
- 标题: BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
- 作者: Jiacheng Chen, Ramin Mehran, Xuhui Jia, Saining Xie, Sanghyun Woo
- 日期: 2025-06-20
- ArXiv主页: https://arxiv.org/abs/2506.17450
- 论文链接: https://arxiv.org/pdf/2506.17450
- 项目链接: https://blenderfusion.github.io/
英文摘要
We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting and converting visual inputs into editable 3D entities (layering), (ii) editing them in Blender with 3D-grounded control (editing), and (iii) fusing them into a coherent scene using a generative compositor (compositing). Our generative compositor extends a pre-trained diffusion model to process both the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, enabling flexible modifications like background replacement; (ii) simulated object jittering, facilitating disentangled control over objects and camera. BlenderFusion significantly outperforms prior methods in complex compositional scene editing tasks.
中文摘要
我们提出BlenderFusion,一个通过重新组合物体、相机和背景来合成新场景的生成式视觉合成框架。它遵循"分层-编辑-合成"的流水线:(i)将视觉输入分割并转换为可编辑的3D实体(分层);(ii)在Blender中以3D为基础进行控制编辑(编辑);(iii)使用生成式合成器将它们融合为连贯的场景(合成)。我们的生成式合成器扩展了预训练的扩散模型,可并行处理原始(源)场景与编辑后(目标)场景。它在视频帧上进行微调,并采用两个关键训练策略:(i)源遮挡,支持背景替换等灵活修改;(ii)模拟物体抖动,便于对物体和相机进行解耦控制。BlenderFusion在复杂的组合式场景编辑任务中显著优于先前方法。
Ovis-U1 技术报告
- 标题: Ovis-U1 Technical Report
- 作者: Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, Yang Li, Qing-Guo Chen
- 日期: 2025-06-29
- ArXiv主页: https://arxiv.org/abs/2506.23044
- 论文链接: https://arxiv.org/pdf/2506.23044
- gitHub仓库: https://github.com/AIDC-AI/Ovis-U1
英文摘要
In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.
中文摘要
在本报告中,我们介绍Ovis-U1,一个30亿参数的统一模型,集成了多模态理解、文本到图像生成和图像编辑能力。Ovis-U1在Ovis系列的基础上,引入了基于扩散的视觉解码器,并配以双向令牌精炼器(token refiner),使其图像生成能力可与GPT-4o等领先模型相当。与以往一些在生成任务中使用冻结MLLM的模型不同,Ovis-U1采用了从语言模型出发的全新统一训练方法。与只在理解或生成任务上单独训练相比,统一训练带来了更好的性能,证明了整合这两类任务所带来的增益。Ovis-U1在OpenCompass多模态学术基准上取得了69.6分,超过了Ristretto-3B和SAIL-VL-1.5-2B等近期最先进模型。在文本到图像生成方面,它在DPG-Bench和GenEval基准上分别取得83.72和0.89的成绩。在图像编辑方面,它在ImgEdit-Bench和GEdit-Bench-EN上分别达到4.00和6.42。作为Ovis统一模型系列的首个版本,Ovis-U1推动了多模态理解、生成和编辑的边界。
LangScene-X:利用TriMap视频扩散重建可泛化的3D语言嵌入场景
- 标题: LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
- 作者: Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, Yueqi Duan
- 日期: 2025-07-03
- ArXiv主页: https://arxiv.org/abs/2507.02813
- 论文链接: https://arxiv.org/pdf/2507.02813
- 项目链接: https://liuff19.github.io/LangScene-X/
- gitHub仓库: https://github.com/liuff19/LangScene-X
英文摘要
Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.
中文摘要
从2D图像中恢复带有开放词汇场景理解能力的3D结构,是一项基础而又艰巨的任务。近期工作通过结合嵌入的语言信息进行逐场景优化来实现这一目标,但它们严重依赖标定好的稠密视角重建范式,在可用视角有限时会出现严重的渲染伪影和不合理的语义合成。在本文中,我们提出一个新颖的生成式框架LangScene-X,用统一的方式生成3D一致的多模态信息,以用于重建和理解。得益于生成更一致新观测的能力,我们仅凭稀疏视角就能构建可泛化的3D语言嵌入场景。具体而言,我们首先训练一个TriMap视频扩散模型,通过渐进式知识注入,从稀疏输入生成外观(RGB)、几何(法线)和语义(分割图)。此外,我们提出在大规模图像数据集上训练的语言量化压缩器(LQC),用于高效编码语言嵌入,实现跨场景泛化而无需逐场景重新训练。最后,我们通过将语言信息对齐到3D场景表面来重建语言表面场,从而支持开放式的语言查询。在真实数据上的大量实验表明,LangScene-X在质量和泛化性上均优于最先进的方法。项目页面:https://liuff19.github.io/LangScene-X。
Skywork-Reward-V2:通过人机协同扩展偏好数据整理
- 标题: Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
- 作者: Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou
- 日期: 2025-07-02
- ArXiv主页: https://arxiv.org/abs/2507.01352
- 论文链接: https://arxiv.org/pdf/2507.01352
- gitHub仓库: https://github.com/SkyworkAI/Skywork-Reward-V2
英文摘要
Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.
中文摘要
尽管奖励模型(RM)在基于人类反馈的强化学习(RLHF)中起着关键作用,但当前最先进的开源RM在大多数现有评测基准上表现不佳,未能捕捉人类偏好中细微而复杂的层面。即便引入先进训练技术的方法,也没有带来有意义的性能提升。我们推测,这种脆弱性主要源于偏好数据集的局限:这些数据集往往覆盖面狭窄、采用合成标注,或缺乏严格的质量控制。为了应对这些挑战,我们构建了一个包含4000万个偏好对的大规模偏好数据集,命名为SynPref-40M。为了实现大规模的数据整理,我们设计了一条人机协同的两阶段流水线,充分发挥人工标注质量与AI可扩展性的互补优势:人类提供经过验证的标注,大语言模型则依据人类指导进行自动整理。基于这一偏好数据混合,我们推出了Skywork-Reward-V2,这是一套参数规模从0.6B到8B的八个奖励模型,在从SynPref-40M中精心挑选的2600万偏好对子集上训练。我们证明,Skywork-Reward-V2在广泛的能力维度上都表现出色,包括与人类偏好的对齐、客观正确性、安全性、对风格偏差的抵抗力以及best-of-N扩展能力,在七个主要奖励模型基准上取得了最先进的性能。消融研究证实,我们方法的有效性不仅来自数据规模,也来自高质量的整理。Skywork-Reward-V2系列代表了开源奖励模型的重大进展,凸显了现有偏好数据集尚未挖掘的潜力,并展示了人机协同整理如何解锁显著更高的数据质量。
任意条件下的Depth Anything
- 标题: Depth Anything at Any Condition
- 作者: Boyuan Sun, Modi Jin, Bowen Yin, Qibin Hou
- 日期: 2025-07-02
- ArXiv主页: https://arxiv.org/abs/2507.01634
- 论文链接: https://arxiv.org/pdf/2507.01634
- 项目链接: https://ghost233lism.github.io/depthanything-AC-page/
- gitHub仓库: https://github.com/HVision-NKU/DepthAnythingAC
英文摘要
We present Depth Anything at Any Condition (DepthAnything-AC), a foundation monocular depth estimation (MDE) model capable of handling diverse environmental conditions. Previous foundation MDE models achieve impressive performance across general scenes but not perform well in complex open-world environments that involve challenging conditions, such as illumination variations, adverse weather, and sensor-induced distortions. To overcome the challenges of data scarcity and the inability of generating high-quality pseudo-labels from corrupted images, we propose an unsupervised consistency regularization finetuning paradigm that requires only a relatively small amount of unlabeled data. Furthermore, we propose the Spatial Distance Constraint to explicitly enforce the model to learn patch-level relative relationships, resulting in clearer semantic boundaries and more accurate details. Experimental results demonstrate the zero-shot capabilities of DepthAnything-AC across diverse benchmarks, including real-world adverse weather benchmarks, synthetic corruption benchmarks, and general benchmarks. Project Page: https://ghost233lism.github.io/depthanything-AC-page Code: https://github.com/HVision-NKU/DepthAnythingAC
中文摘要
我们提出任意条件下的Depth Anything(DepthAnything-AC),这是一个能够应对多种环境条件的基础单目深度估计(MDE)模型。以往的基础MDE模型在一般场景中表现出色,但在涉及光照变化、恶劣天气和传感器失真等挑战性条件的复杂开放世界环境中表现不佳。为了克服数据稀缺以及无法从退化图像中生成高质量伪标签的问题,我们提出了一种只需相对少量无标注数据的无监督一致性正则化微调范式。此外,我们提出空间距离约束,显式地让模型学习图像块层面的相对关系,从而得到更清晰的语义边界和更精确的细节。实验结果展示了DepthAnything-AC在各种基准上的零样本能力,包括真实世界恶劣天气基准、合成退化基准和通用基准。项目页面:https://ghost233lism.github.io/depthanything-AC-page 代码:https://github.com/HVision-NKU/DepthAnythingAC
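摘要中"无监督一致性正则化微调"的核心思想是:对同一张图施加扰动后,深度预测应保持一致。下面是一个极简的损失函数示意(扰动方式、损失形式均为假设,并非官方实现)。

```python
import torch
import torch.nn.functional as F

def photometric_perturb(images: torch.Tensor) -> torch.Tensor:
    """模拟光照/天气退化的简单扰动:随机调节亮度与对比度并加噪(示意)。"""
    brightness = 0.2 * (torch.rand(images.size(0), 1, 1, 1) - 0.5)
    contrast = 1.0 + 0.4 * (torch.rand(images.size(0), 1, 1, 1) - 0.5)
    noisy = contrast * images + brightness + 0.02 * torch.randn_like(images)
    return noisy.clamp(0, 1)

def consistency_loss(depth_model, images: torch.Tensor) -> torch.Tensor:
    """无标注数据上的一致性正则:干净图像的深度作为扰动图像的伪标签。"""
    with torch.no_grad():
        pseudo_depth = depth_model(images)                    # 干净输入 -> 伪真值
    pred_depth = depth_model(photometric_perturb(images))     # 扰动输入
    return F.l1_loss(pred_depth, pseudo_depth)

if __name__ == "__main__":
    toy_model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU(),
        torch.nn.Conv2d(8, 1, 3, padding=1),
    )
    imgs = torch.rand(2, 3, 64, 64)
    print(consistency_loss(toy_model, imgs))
```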
SPIRAL:零和博弈上的自我对弈通过多智能体多轮强化学习激励推理
- 标题: SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
- 作者: Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques
- 日期: 2025-06-30
- ArXiv主页: https://arxiv.org/abs/2506.24119
- 论文链接: https://arxiv.org/pdf/2506.24119
- 项目链接: https://benjamin-eecs.github.io/blog/2025/spiral/
- gitHub仓库: https://github.com/spiral-rl/spiral
英文摘要
Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, We implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
中文摘要
强化学习的最新进展表明,语言模型可以通过在带可验证奖励的任务上训练来发展复杂的推理能力,但这些方法依赖人工整理的问题-答案对和针对特定领域的奖励工程。我们提出SPIRAL,一个自我对弈框架:模型通过与不断变强的自身版本进行多轮零和博弈来学习,从而摆脱对人类监督的依赖。通过自我对弈,SPIRAL生成了一个难度不断提升的无限课程,因为模型必须持续适应更强的对手。为了支撑这种大规模的自我对弈训练,我们为LLM实现了一个完全在线的多轮多智能体强化学习系统,并提出角色条件优势估计(RAE)来稳定多智能体训练。使用SPIRAL,在零和博弈上的自我对弈产生了可以广泛迁移的推理能力:仅在Kuhn扑克上训练Qwen3-4B-Base,就在数学上取得8.6%的提升,在通用推理上取得8.4%的提升,超过了在25,000条专家博弈轨迹上做SFT的效果。分析表明,这种迁移通过三种认知模式发生:系统性分解、期望值计算和分情况分析。多游戏训练(井字棋、Kuhn扑克、简单谈判)进一步提升了性能,因为每种游戏培养了不同的推理优势。把SPIRAL应用到一个很强的推理模型(DeepSeek-R1-Distill-Qwen-7B)上仍能带来平均2.0%的提升。这些结果表明,零和博弈能够自然地培养可迁移的推理能力,为自主推理能力的发展指明了一个有前景的方向。
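摘要提到"角色条件优势估计(RAE)"用于稳定多智能体训练,但未给出公式;下面按"为每个游戏角色分别维护基线,再用回报减去对应角色的基线"这一假设写出一个最小示意(滑动平均基线等细节均为猜测,并非论文实现)。

```python
from collections import defaultdict

class RoleConditionedAdvantage:
    """按角色(如先手/后手)分别维护回报基线的优势估计(示意实现)。"""

    def __init__(self, momentum: float = 0.95):
        self.momentum = momentum
        self.baselines = defaultdict(float)   # role -> 回报的滑动平均

    def update_and_estimate(self, role: str, episode_return: float) -> float:
        baseline = self.baselines[role]
        advantage = episode_return - baseline
        # 用滑动平均更新该角色的基线,缓解零和博弈中角色不对称带来的方差
        self.baselines[role] = self.momentum * baseline + (1 - self.momentum) * episode_return
        return advantage

if __name__ == "__main__":
    rae = RoleConditionedAdvantage()
    trajectories = [("player_0", 1.0), ("player_1", -1.0), ("player_0", -1.0), ("player_1", 1.0)]
    for role, ret in trajectories:
        print(role, round(rae.update_and_estimate(role, ret), 3))
```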
SciArena:面向科学文献任务的基础模型开放评估平台
- 标题: SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
- 作者: Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Charles McGrady, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan
- 日期: 2025-07-01
- ArXiv主页: https://arxiv.org/abs/2507.01001
- 论文链接: https://arxiv.org/pdf/2507.01001
- 项目链接: https://sciarena.allen.ai/
- gitHub仓库: https://github.com/yale-nlp/SciArena
英文摘要
We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse, aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark’s challenges and emphasize the need for more reliable automated evaluation methods.
中文摘要
我们提出SciArena,一个开放、协作的平台,用于在科学文献任务上评估基础模型。与传统的科学文献理解与综述基准不同,SciArena沿用Chatbot Arena的社区投票式模型对比评估方法,直接让研究社区参与其中。通过汇聚集体智慧,SciArena对需要长篇、有文献依据回答的开放式科学任务提供由社区驱动的模型性能评估。该平台目前支持23个开源和专有基础模型,并已从各科学领域受信任的研究人员那里收集了超过13,000张投票。我们分析了目前收集到的数据,确认提交的问题多样且贴合真实的文献需求,参与的研究人员在评估中表现出很强的自一致性和标注者间一致性。我们基于模型排名榜讨论了结果与洞见。为了进一步推动面向文献任务的基于模型的自动评估系统研究,我们发布了SciArena-Eval,一个基于我们所收集偏好数据的元评估基准。该基准通过将模型的成对判断与人类投票进行比较,衡量模型判断答案质量的准确性。我们的实验凸显了该基准的挑战性,并强调需要更可靠的自动评估方法。
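SciArena沿用Chatbot Arena式的成对投票来生成排行榜;摘要没有说明具体的排名算法,下面以常见的Elo更新为例给出一个示意(K值、初始分、模型名均为假设,实际平台可能采用Bradley-Terry等其他统计方法)。

```python
from collections import defaultdict

K = 32          # Elo更新步长(假设值)
INIT = 1000.0   # 初始分(假设值)

ratings = defaultdict(lambda: INIT)

def expected_score(r_a: float, r_b: float) -> float:
    """模型A战胜模型B的期望胜率。"""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_vote(model_a: str, model_b: str, winner: str) -> None:
    """根据一次社区投票更新两个模型的Elo分;winner为'a'、'b'或'tie'。"""
    ea = expected_score(ratings[model_a], ratings[model_b])
    sa = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] += K * (sa - ea)
    ratings[model_b] += K * ((1.0 - sa) - (1.0 - ea))

if __name__ == "__main__":
    votes = [("model-a", "model-b", "a"), ("model-b", "model-c", "tie"), ("model-c", "model-a", "b")]
    for a, b, w in votes:
        record_vote(a, b, w)
    print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```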
径向注意力:用于长视频生成的O(n log n)能量衰减稀疏注意力
- 标题: Radial Attention: O(nlog n) Sparse Attention with Energy Decay for Long Video Generation
- 作者: Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han
- 日期: 2025-06-24
- ArXiv主页: https://arxiv.org/abs/2506.19852
- 论文链接: https://arxiv.org/pdf/2506.19852
- 项目链接: https://hanlab.mit.edu/projects/radial-attention
- gitHub仓库: https://github.com/mit-han-lab/radial-attention
英文摘要
Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal distance between tokens increase, akin to the physical decay of signal or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with O(n log n) complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard O(n^2) dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9times speedup over the original dense attention. With minimal tuning, it enables video generation up to 4times longer while reducing training costs by up to 4.4times compared to direct fine-tuning and accelerating inference by up to 3.7times compared to dense attention inference.
中文摘要
扩散模型的最新进展使高质量视频生成成为可能,但额外的时间维度显著增加了计算成本,使得长视频的训练和推理变得异常昂贵。在本文中,我们发现了视频扩散模型中一种我们称之为"时空能量衰减"的现象:softmax之后的注意力分数会随着token之间空间和时间距离的增大而衰减,类似于自然界中信号或波随时空距离的物理衰减。受此启发,我们提出径向注意力(Radial Attention),一种复杂度为O(n log n)的可扩展稀疏注意力机制,它把能量衰减转化为按指数衰减的计算密度,比标准的O(n^2)稠密注意力高效得多,同时比线性注意力更具表达力。具体而言,径向注意力采用一个简单的静态注意力掩码:每个token只关注空间上邻近的token,且注意力窗口的大小随时间距离的增大而收缩。此外,它允许预训练的视频扩散模型通过高效的基于LoRA的微调来扩展生成长度。大量实验表明,径向注意力在Wan2.1-14B、HunyuanVideo和Mochi 1上保持了视频质量,相比原始稠密注意力最高获得1.9倍的加速。在少量微调下,它可以把视频生成长度最多延长到4倍,同时与直接微调相比将训练成本最多降低4.4倍,与稠密注意力推理相比将推理最多加速3.7倍。
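摘要描述的静态掩码规则是"只关注空间上邻近的token,且窗口随时间距离增大而收缩";下面用NumPy构造这样一个块状掩码作为示意(帧数、每帧token数、窗口收缩公式均为假设,并非论文的精确定义)。

```python
import numpy as np

def radial_attention_mask(num_frames=8, tokens_per_frame=16, base_window=8):
    """
    构造一个[总token数, 总token数]的布尔掩码(示意):
    同一帧内使用宽度为base_window的空间窗口,
    帧间距离每增加1,窗口宽度减半,体现"能量衰减"。
    """
    n = num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    for qf in range(num_frames):
        for kf in range(num_frames):
            dt = abs(qf - kf)
            window = base_window >> dt   # base_window / 2**dt,随时间距离指数收缩
            if window == 0:
                continue                 # 距离太远的帧不再关注
            for qi in range(tokens_per_frame):
                q = qf * tokens_per_frame + qi
                lo = max(0, qi - window)
                hi = min(tokens_per_frame, qi + window + 1)
                mask[q, kf * tokens_per_frame + lo: kf * tokens_per_frame + hi] = True
    return mask

if __name__ == "__main__":
    m = radial_attention_mask()
    print(m.shape, f"稀疏度: {1 - m.mean():.2%}")
```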
倾听内在声音:通过中间特征反馈对齐ControlNet训练
- 标题: Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback
- 作者: Nina Konovalova, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov
- 日期: 2025-07-03
- ArXiv主页: https://arxiv.org/abs/2507.02321
- 论文链接: https://arxiv.org/pdf/2507.02321
- 项目链接: https://controlgenai.github.io/InnerControl/
- gitHub仓库: https://github.com/ControlGenAI/InnerControl
英文摘要
Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).
中文摘要
尽管文本到图像扩散模型取得了重大进展,但要对生成结果实现精确的空间控制仍然具有挑战性。ControlNet通过引入辅助条件模块来解决这一问题,而ControlNet++则通过只作用于最后若干去噪步骤的循环一致性损失进一步改进对齐。然而,这种做法忽略了中间的生成阶段,限制了其有效性。我们提出InnerControl,一种在所有扩散步骤上强制空间一致性的训练策略。我们的方法训练轻量级卷积探针,在每个去噪步骤从UNet的中间特征重建输入控制信号(例如边缘、深度)。这些探针即使从噪声很大的潜变量中也能高效提取信号,从而为训练提供伪真值控制。通过在整个扩散过程中最小化预测条件与目标条件之间的差异,我们的对齐损失同时提升了控制保真度和生成质量。结合ControlNet++等已有技术,InnerControl在多种条件控制方式(例如边缘、深度)上实现了最先进的性能。
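"用轻量卷积探针从中间特征重建控制信号,再与目标条件对齐"这一训练信号可以用几行PyTorch来示意(探针结构、特征维度和损失形式均为假设,并非官方实现)。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ControlProbe(nn.Module):
    """轻量卷积探针:把UNet中间特征映射回控制信号(如单通道边缘图),示意用。"""
    def __init__(self, feat_channels=320, control_channels=1):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, control_channels, 3, padding=1),
        )

    def forward(self, feats):
        return torch.sigmoid(self.head(feats))

def alignment_loss(probe, intermediate_feats, target_control):
    """把探针重建的控制信号与目标控制图对齐(此处用L1,实际损失形式为假设)。"""
    pred = probe(intermediate_feats)
    # 中间特征分辨率通常低于控制图,这里把目标下采样到同一尺寸
    target = F.interpolate(target_control, size=pred.shape[-2:],
                           mode="bilinear", align_corners=False)
    return F.l1_loss(pred, target)

if __name__ == "__main__":
    probe = ControlProbe()
    feats = torch.randn(2, 320, 32, 32)   # 假设的UNet中间特征
    edges = torch.rand(2, 1, 256, 256)    # 目标边缘控制图
    print(alignment_loss(probe, feats, edges))
```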
视觉-语言-动作模型综述:动作令牌化的视角
- 标题: A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
- 作者: Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang
- 日期: 2025-07-02
- ArXiv主页: https://arxiv.org/abs/2507.01925
- 论文链接: https://arxiv.org/pdf/2507.01925
- gitHub仓库: https://github.com/Psi-Robot/Awesome-VLA-Papers
英文摘要
The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation has sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.
中文摘要
视觉与语言基础模型在多模态理解、推理和生成方面的显著进展,激发了将这种智能扩展到物理世界的日益增长的努力,推动了视觉-语言-动作(VLA)模型的蓬勃发展。尽管方法看似多样,我们观察到当前的VLA模型可以统一在一个框架之下:视觉和语言输入经由一系列VLA模块处理,产生一条动作令牌链,逐步编码越来越落地、可执行的信息,最终生成可执行的动作。我们进一步认为,区分各类VLA模型的首要设计选择在于动作令牌的表述方式,可分为语言描述、代码、可供性(affordance)、轨迹、目标状态、潜在表示、原始动作和推理等类别。然而,目前对动作令牌仍缺乏全面的理解,这严重阻碍了VLA的有效发展,也模糊了未来方向。因此,本综述旨在通过动作令牌化的视角来归类和解读现有VLA研究,提炼每种令牌类型的优势与局限,并指出有待改进之处。通过这一系统的回顾与分析,我们对VLA模型的整体演进给出综合展望,强调尚未被充分探索但有前景的方向,并为未来研究提供指导,希望推动该领域更接近通用智能。
Calligrapher:自由风格的文本图像定制
- 标题: Calligrapher: Freestyle Text Image Customization
- 作者: Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, Qifeng Chen
- 日期: 2025-06-30
- ArXiv主页: https://arxiv.org/abs/2506.24123
- 论文链接: https://arxiv.org/pdf/2506.24123
- 项目链接: https://calligrapher2025.github.io/Calligrapher/
- gitHub仓库: https://github.com/Calligrapher2025/Calligrapher
英文摘要
We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher’s accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.
中文摘要
我们提出Calligrapher,一个新颖的基于扩散的框架,创新性地将先进的文本定制与艺术字体设计相结合,面向数字书法与设计应用。为了解决字体定制中精确风格控制和数据依赖的难题,我们的框架包含三项关键技术贡献。首先,我们开发了一种自蒸馏机制,利用预训练的文生图生成模型本身与大语言模型,自动构建一个以风格为中心的字体排印基准。其次,我们通过一个可训练的风格编码器(由Q-Former和线性层组成)引入局部化风格注入框架,从参考图像中提取鲁棒的风格特征。我们还采用上下文内生成机制,将参考图像直接嵌入去噪过程,进一步增强对目标风格的精细对齐。在多种字体和设计场景上的大量定量与定性评估证实,Calligrapher能够准确再现复杂的风格细节并精确定位字形。通过自动化地生成高质量、视觉一致的字体排印效果,Calligrapher超越了传统模型,为数字艺术、品牌设计和上下文排印设计领域的创作者赋能。
MoCa:模态感知的持续预训练让双向多模态嵌入更好
- 标题: MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
- 作者: Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou
- 日期: 2025-06-29
- ArXiv主页: https://arxiv.org/abs/2506.23115
- 论文链接: https://arxiv.org/pdf/2506.23115
- 项目链接: https://haon-chen.github.io/MoCa/
- gitHub仓库: https://github.com/haon-chen/MoCa
英文摘要
Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into effective bidirectional multimodal embedding models. The first stage, Modality-aware Continual Pre-training, introduces a joint reconstruction objective that simultaneously denoises interleaved text and image inputs, enhancing bidirectional context-aware reasoning. The second stage, Heterogeneous Contrastive Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple image-caption pairs to enhance generalization and alignment. Our method addresses the stated limitations by introducing bidirectional attention through continual pre-training, scaling effectively with massive unlabeled datasets via joint reconstruction objectives, and utilizing diverse multimodal data for enhanced representation robustness. Experiments demonstrate that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results, and exhibits strong scalability with both model size and training data on MMEB.
中文摘要
建立在因果视觉语言模型(VLM)之上的多模态嵌入模型已在各类任务中展现出潜力。然而,当前方法面临三个关键局限:VLM骨干中的因果注意力对嵌入任务而言并非最优;由于依赖高质量标注的配对数据进行对比学习而带来可扩展性问题;以及训练目标和数据的多样性有限。为了解决这些问题,我们提出MoCa,一个将预训练VLM转化为高效双向多模态嵌入模型的两阶段框架。第一阶段"模态感知的持续预训练"引入联合重建目标,同时对交错的文本和图像输入去噪,增强双向的、上下文感知的推理能力。第二阶段"异构对比微调"利用超越简单图文对的多样化、语义丰富的多模态数据,提升泛化与对齐能力。我们的方法通过持续预训练引入双向注意力,借助联合重建目标在海量无标注数据上有效扩展,并利用多样化的多模态数据增强表示的鲁棒性,从而解决了上述局限。实验表明,MoCa在MMEB和ViDoRe-v2基准上持续提升性能,取得了新的最先进结果,并且在MMEB上随模型规模与训练数据都展现出很强的可扩展性。
LLaVA-Scissor:基于语义连通分量的令牌压缩,用于视频LLM
- 标题: LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
- 作者: Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou
- 日期: 2025-06-27
- ArXiv主页: https://arxiv.org/abs/2506.21862
- 论文链接: https://arxiv.org/pdf/2506.21862
- gitHub仓库: https://github.com/HumanMLLM/LLaVA-Scissor
英文摘要
In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. Differently, we propose to leverage the Semantic Connected Components (SCC) approach that assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multi-choices benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods, achieving superior performance in various video understanding benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
中文摘要
在本文中,我们提出LLaVA-Scissor,一种面向视频多模态大语言模型、无需训练的令牌压缩策略。以往的方法大多基于注意力分数来压缩令牌,但无法有效覆盖所有语义区域,而且往往造成令牌冗余。与之不同,我们提出利用语义连通分量(SCC)方法,把令牌划分到令牌集合中互不相同的语义区域,确保全面的语义覆盖。最终得到一种两步的时空令牌压缩策略,在空间和时间两个维度上都使用SCC。该策略用一组互不重叠的语义令牌来表示整段视频,从而有效压缩令牌。我们在多种视频理解基准上对LLaVA-Scissor的令牌压缩能力进行了广泛评估,包括视频问答、长视频理解以及综合性多选题基准。实验结果表明,LLaVA-Scissor优于其他令牌压缩方法,在各类视频理解基准上取得更优的性能,尤其是在令牌保留率较低的情况下。项目页面:https://github.com/HumanMLLM/LLaVA-Scissor。
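"语义连通分量(SCC)"的直观做法是:按相似度把令牌连成图,取连通分量,每个分量保留一个代表令牌。下面给出一个基于余弦相似度阈值和并查集的极简示意(阈值、代表令牌的取法均为假设,并非论文的具体实现)。

```python
import numpy as np

def find(parent, i):
    """并查集查找(带路径压缩)。"""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def compress_tokens(tokens: np.ndarray, sim_threshold: float = 0.8) -> np.ndarray:
    """
    tokens: [N, D] 的令牌特征。
    相似度超过阈值的令牌视为相连,取连通分量,每个分量用其均值作为代表(示意)。
    """
    n = tokens.shape[0]
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    parent = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= sim_threshold:
                parent[find(parent, i)] = find(parent, j)   # 合并两个连通分量
    groups = {}
    for i in range(n):
        groups.setdefault(find(parent, i), []).append(i)
    return np.stack([tokens[idx].mean(axis=0) for idx in groups.values()])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toks = np.concatenate([rng.normal(0, 0.05, (6, 16)) + 1.0,
                           rng.normal(0, 0.05, (4, 16)) - 1.0])
    print("压缩前:", toks.shape, "压缩后:", compress_tokens(toks).shape)
```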
IntFold:用于通用与专业生物分子结构预测的可控基础模型
- 标题: IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction
- 作者: The IntFold Team, Leon Qiao, Wayne Bai, He Yan, Gary Liu, Nova Xi, Xiang Zhang
- 日期: 2025-07-02
- ArXiv主页: https://arxiv.org/abs/2507.02025
- 论文链接: https://arxiv.org/pdf/2507.02025
- 项目链接: https://server.intfold.com/
- gitHub仓库: https://github.com/IntelliGen-AI/IntFold
英文摘要
We introduce IntFold, a controllable foundation model for both general and specialized biomolecular structure prediction. IntFold demonstrates predictive accuracy comparable to the state-of-the-art AlphaFold3, while utilizing a superior customized attention kernel. Beyond standard structure prediction, IntFold can be adapted to predict allosteric states, constrained structures, and binding affinity through the use of individual adapters. Furthermore, we introduce a novel confidence head to estimate docking quality, offering a more nuanced assessment for challenging targets such as antibody-antigen complexes. Finally, we share insights gained during the training process of this computationally intensive model.
中文摘要
我们提出IntFold,一个可控的基础模型,可用于通用及专业的生物分子结构预测。IntFold展现出与最先进的AlphaFold3相当的预测精度,同时使用了性能更优的定制注意力核。除标准结构预测外,IntFold还可以通过使用独立的适配器来预测变构状态、受约束的结构以及结合亲和力。此外,我们引入了一个新颖的置信度头来估计对接质量,为抗体-抗原复合物等具有挑战性的目标提供更细致的评估。最后,我们分享了在训练这一计算密集型模型过程中获得的经验与见解。
VMoBA:用于视频扩散模型的块混合注意力
- 标题: VMoBA: Mixture-of-Block Attention for Video Diffusion Models
- 作者: Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong
- 日期: 2025-06-30
- ArXiv主页: https://arxiv.org/abs/2506.23858
- 论文链接: https://arxiv.org/pdf/2506.23858
- 项目链接: https://github.com/KwaiVGI/VMoBA
- gitHub仓库: https://github.com/KwaiVGI/VMoBA
英文摘要
The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40x FLOPs and 1.35x latency speedup for high-res video generation.
中文摘要
完全注意力机制的二次复杂度是视频扩散模型(VDM)生成长时高分辨率视频的重要瓶颈。虽然已有多种稀疏注意力方法被提出,但其中许多是作为免训练的推理加速器设计的,或者在原生训练时无法最优地捕捉视频数据固有的时空特性。本文提出视频块混合注意力(VMoBA),一种专为VDM设计的新颖稀疏注意力机制。通过对预训练视频Transformer中注意力模式的深入分析(其揭示了强烈的时空局部性、不同查询的重要性差异以及各注意力头不同的聚集程度),VMoBA在原始MoBA框架上做了三项关键改进:(1)逐层循环的块划分方案(1D-2D-3D),以动态适应多样的时空注意力模式并提升效率;(2)全局块选择,在整个注意力头范围内优先选择最显著的查询-键块交互;(3)基于阈值的块选择,根据累积相似度动态确定所关注块的数量。大量实验表明,VMoBA显著加速了VDM在更长序列上的训练,实现2.92倍的FLOPs加速和1.48倍的延迟加速,同时取得与完全注意力相当甚至更优的生成质量。此外,VMoBA在免训练推理中也表现出有竞争力的性能,为高分辨率视频生成带来2.40倍的FLOPs加速和1.35倍的延迟加速。
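摘要中的"基于阈值的块选择"是按块相似度从高到低累积、直到覆盖足够比例为止;下面给出一个对单个查询选择关注块的极简示意(相似度计算方式与阈值均为假设,并非论文实现)。

```python
import numpy as np

def select_blocks(query: np.ndarray, key_blocks: np.ndarray, tau: float = 0.9):
    """
    query: [D] 单个查询向量;key_blocks: [num_blocks, D],每块用键的均值表示(假设)。
    按softmax相似度从高到低累加,直到累计质量达到tau,返回被选块的下标。
    """
    scores = key_blocks @ query                 # 每个块与查询的相似度
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, tau) + 1)]   # 动态数量的关注块
    return np.sort(keep)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=32)
    blocks = rng.normal(size=(12, 32))
    print("被选中的块:", select_blocks(q, blocks))
```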
DiffuCoder:理解并改进用于代码生成的掩码扩散模型
- 标题: DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
- 作者: Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang
- 日期: 2025-06-25
- ArXiv主页: https://arxiv.org/abs/2506.20639
- 论文链接: https://arxiv.org/pdf/2506.20639
- gitHub仓库: https://github.com/apple/ml-diffucoder
英文摘要
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior, revealing how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also their generation order. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates and maintain training efficiency, we propose coupled-GRPO, a novel sampling scheme that constructs complementary mask noise for completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder’s performance on code generation benchmarks (+4.4% on EvalPlus) and reduces reliance on AR causal during decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework. https://github.com/apple/ml-diffucoder.
中文摘要
扩散大语言模型(dLLM)是自回归(AR)模型的有力替代方案,因为其去噪模型作用于整个序列。dLLM的全局规划与迭代细化能力对代码生成尤其有用。然而,目前dLLM在代码领域的训练与推理机制仍未被充分探索。为了厘清dLLM的解码行为并释放其代码能力,我们系统研究了它们的去噪过程和强化学习(RL)方法。我们在130B代码token上训练了一个7B的dLLM——DiffuCoder。以该模型为试验平台,我们分析了它的解码行为,揭示了其与AR模型的不同之处:(1)dLLM可以在不依赖半自回归解码的情况下自行决定生成的因果程度;(2)提高采样温度不仅会使token选择更多样,还会使生成顺序更多样。这种多样性为RL rollout创造了丰富的搜索空间。针对RL训练,为降低token对数似然估计的方差并保持训练效率,我们提出耦合GRPO(coupled-GRPO),一种为训练中的补全构造互补掩码噪声的新颖采样方案。实验中,耦合GRPO显著提升了DiffuCoder在代码生成基准上的表现(EvalPlus上+4.4%),并降低了解码时对自回归因果性的依赖。我们的工作为dLLM的生成机制提供了更深入的认识,并给出了一个有效的、面向扩散模型的RL训练框架。https://github.com/apple/ml-diffucoder。
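"为补全构造互补掩码噪声"可以理解为让两次采样的掩码互为补集,使每个token都恰好被其中一次掩住;下面给出一个构造互补掩码对的极简示意(掩码比例与采样方式均为假设,并非coupled-GRPO的完整实现)。

```python
import torch

def coupled_masks(seq_len: int, batch: int = 2, mask_ratio: float = 0.5):
    """
    为同一条补全生成一对互补掩码:m 与 ~m 覆盖全部位置,
    这样每个token的对数似然都会在两次去噪中恰好被估计一次(示意)。
    """
    scores = torch.rand(batch, seq_len)
    m = scores < mask_ratio            # 第一份掩码
    m_complement = ~m                  # 互补掩码
    assert torch.all(m | m_complement)   # 两者合起来覆盖所有位置
    return m, m_complement

if __name__ == "__main__":
    m1, m2 = coupled_masks(seq_len=12)
    print(m1.int())
    print(m2.int())
```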
视觉语言模型是否拥有内部世界模型?迈向原子化评估
- 标题: Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation
- 作者: Qiyue Gao, Xinyu Pi, Kevin Liu, Junrong Chen, Ruolan Yang, Xinqi Huang, Xinyu Fang, Lu Sun, Gautham Kishore, Bo Ai, Stone Tao, Mengyang Liu, Jiaxi Yang, Chao-Jung Lai, Chuanyang Jin, Jiannan Xiang, Benhao Huang, Zeming Chen, David Danks, Hao Su, Tianmin Shu, Ziqiao Ma, Lianhui Qin, Zhiting Hu
- 日期: 2025-06-27
- ArXiv主页: https://arxiv.org/abs/2506.21876
- 论文链接: https://arxiv.org/pdf/2506.21876
- 项目链接: https://wm-abench.maitrix.org
英文摘要
Internal world models (WMs) enable agents to understand the world’s state and predict transitions, serving as the basis for advanced deliberative reasoning. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o and Gemini, exhibit potential as general-purpose WMs. While the latest studies have evaluated and shown limitations in specific capabilities such as visual understanding, a systematic evaluation of VLMs’ fundamental WM abilities remains absent. Drawing on comparative psychology and cognitive science, we propose a two-stage framework that assesses Perception (visual, spatial, temporal, quantitative, and motion) and Prediction (mechanistic simulation, transitive inference, compositional inference) to provide an atomic evaluation of VLMs as WMs. Guided by this framework, we introduce WM-ABench, a large-scale benchmark comprising 23 fine-grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations. Through 660 experiments on 15 latest commercial and open-source VLMs, we find that these models exhibit striking limitations in basic world modeling abilities. For instance, almost all models perform at near-random accuracy when distinguishing motion trajectories. Additionally, they lack disentangled understanding – e.g., some models tend to believe blue objects move faster than green ones. More rich results and analyses reveal significant gaps between VLMs and human-level world modeling.
中文摘要
内部世界模型(WM)使智能体能够理解世界状态并预测状态转移,是高级审议式推理的基础。近期的大型视觉语言模型(VLM),如OpenAI o3、GPT-4o和Gemini,展现出作为通用世界模型的潜力。尽管最新研究已经评估并指出了它们在视觉理解等特定能力上的局限,但对VLM基本世界建模能力的系统性评估仍然缺失。借鉴比较心理学与认知科学,我们提出一个两阶段框架,分别评估感知(视觉、空间、时间、数量和运动)与预测(机制模拟、传递推理、组合推理),从而对作为世界模型的VLM进行原子化评估。在该框架指导下,我们推出WM-ABench,一个大规模基准,涵盖6个多样化模拟环境中的23个细粒度评估维度,并配有受控的反事实模拟。通过在15个最新商业与开源VLM上进行660项实验,我们发现这些模型在基本世界建模能力上存在明显局限。例如,在区分运动轨迹时,几乎所有模型的准确率都接近随机水平。此外,它们缺乏解耦的理解——例如,一些模型倾向于认为蓝色物体比绿色物体移动得更快。更多丰富的结果和分析揭示了VLM与人类水平世界建模之间的显著差距。
XVerse:通过DiT调制实现身份与语义属性的一致多主体控制
- 标题: XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation
- 作者: Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu
- 日期: 2025-06-26
- ArXiv主页: https://arxiv.org/abs/2506.21416
- 论文链接: https://arxiv.org/pdf/2506.21416
- 项目链接: https://bytedance.github.io/XVerse/
- gitHub仓库: https://github.com/bytedance/XVerse
英文摘要
Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled generation model XVerse. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control for specific subject without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.
中文摘要
在文本到图像生成中,对主体身份和语义属性(姿态、风格、光照)实现细粒度控制,尤其是对多个主体同时控制,常常会削弱扩散Transformer(DiT)的可编辑性和一致性。许多方法会引入伪影,或受困于属性纠缠。为克服这些挑战,我们提出一种新颖的多主体可控生成模型XVerse。通过将参考图像转换为用于特定token文本流调制的偏移量,XVerse可以在不破坏图像潜变量或特征的前提下,对特定主体进行精确而独立的控制。因此,XVerse能够提供高保真、可编辑的多主体图像合成,并对各主体的特征和语义属性实现稳健控制。这一进展显著提升了个性化与复杂场景的生成能力。