ICCV2025接收论文速览(1)
最近收集了一批已被ICCV2025接收的论文,因为论文比较多,所以先做了一个论文速览,后面会持续更新并且做论文解读,内容有点多,一起开始研读起来吧。可以搜索并关注【AI启智汇】公众号,获取后续更新。
(一)
标题:RL-Selector: Reinforcement Learning-Guided Data Selection via Redundancy Assessment
中文标题:RL-Selector:基于冗余评估的强化学习引导数据选择
英文摘要:Modern deep architectures often rely on large-scale datasets, but training on these datasets incurs high computational and storage overhead. Real-world datasets often contain substantial redundancies, prompting the need for more data-efficient training paradigms. Data selection has shown promise to mitigate redundancy by identifying the most representative samples, thereby reducing training costs without compromising performance. Existing methods typically rely on static scoring metrics or pretrained models, overlooking the combined effect of selected samples and their evolving dynamics during training. We introduce the concept of epsilon-sample cover, which quantifies sample redundancy based on inter-sample relationships, capturing the intrinsic structure of the dataset. Based on this, we reformulate data selection as a reinforcement learning (RL) process and propose RL-Selector, where a lightweight RL agent optimizes the selection policy by leveraging epsilon-sample cover derived from evolving dataset distribution as a reward signal. Extensive experiments across benchmark datasets and diverse architectures demonstrate that our method consistently outperforms existing state-of-the-art baselines. Models trained with our selected datasets show enhanced generalization performance with improved training efficiency.
中文摘要:现代深度架构通常依赖于大规模数据集,但在这些数据集上训练会带来很高的计算和存储开销。真实世界的数据集往往包含大量冗余,这催生了对更具数据效率的训练范式的需求。数据选择通过识别最具代表性的样本来缓解冗余,已展现出在不影响性能的前提下降低训练成本的潜力。现有方法通常依赖静态评分指标或预训练模型,忽略了所选样本的组合效应及其在训练过程中不断演变的动态。我们引入了ε-样本覆盖的概念,它基于样本间关系量化样本冗余,刻画数据集的内在结构。在此基础上,我们将数据选择重新表述为强化学习(RL)过程,并提出了RL-Selector:一个轻量级RL智能体以从不断演变的数据分布中得到的ε-样本覆盖作为奖励信号来优化选择策略。在多个基准数据集和不同架构上的大量实验表明,我们的方法始终优于现有最先进的基线。使用所选数据集训练的模型在提升训练效率的同时表现出更强的泛化性能。
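下面给出一个与该摘要思路相关的极简示意代码(非官方实现):用贪心方式在特征空间中计算一个 ε-样本覆盖,直观体现"基于样本间距离量化冗余"的想法。其中特征、ε 取值与贪心准则均为假设;论文中该覆盖是作为 RL 智能体的奖励信号使用的,这里并未实现 RL 部分。

```python
import numpy as np

def greedy_epsilon_cover(features: np.ndarray, eps: float) -> list[int]:
    """贪心选取一组"覆盖样本":任一样本到所选集合的最近距离不超过 eps。
    这里仅示意 ε-样本覆盖的直观含义,并非论文的官方实现。"""
    n = features.shape[0]
    uncovered = np.ones(n, dtype=bool)
    selected = []
    while uncovered.any():
        # 在仍未被覆盖的样本中,选出能覆盖最多未覆盖样本的那个(假设性的贪心准则)
        idx_pool = np.where(uncovered)[0]
        dists = np.linalg.norm(features[idx_pool][:, None, :] - features[idx_pool][None, :, :], axis=-1)
        cover_counts = (dists <= eps).sum(axis=1)
        best = idx_pool[int(cover_counts.argmax())]
        selected.append(int(best))
        # 将与所选样本距离不超过 eps 的样本标记为已覆盖
        uncovered[np.linalg.norm(features - features[best], axis=-1) <= eps] = False
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 16))          # 假设的样本特征
    cover = greedy_epsilon_cover(feats, eps=4.0)
    print(f"200 个样本中选出 {len(cover)} 个覆盖样本,覆盖集大小可作为冗余度的粗略信号")
```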
方法框架:
(二)
标题:ResQ: A Novel Framework to Implement Residual Neural Networks on Analog Rydberg Atom Quantum Computers
中文标题:ResQ:在模拟里德堡原子量子计算机上实现残差神经网络的新框架
英文摘要:Research in quantum machine learning has recently proliferated due to the potential of quantum computing to accelerate machine learning. An area of machine learning that has not yet been explored is neural ordinary differential equation (neural ODE) based residual neural networks (ResNets), which aim to improve the effectiveness of neural networks using the principles of ordinary differential equations. In this work, we present our insights about why analog Rydberg atom quantum computers are especially well-suited for ResNets. We also introduce ResQ, a novel framework to optimize the dynamics of Rydberg atom quantum computers to solve classification problems in machine learning using analog quantum neural ODEs.
中文摘要:由于量子计算在加速机器学习方面具有潜力,量子机器学习的研究近来迅速发展。机器学习领域中尚未被探索的一个方向是基于神经常微分方程(神经ODE)的残差神经网络(ResNets),其旨在运用常微分方程原理提升神经网络的效能。在这项研究中,我们阐述了为何模拟里德堡原子量子计算机特别适合ResNets的见解。我们还介绍了ResQ,这是一种全新的框架,用于优化里德堡原子量子计算机的动力学过程,以通过模拟量子神经ODE解决机器学习中的分类问题。
方法框架:
(三)
标题:Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision
中文标题:医学视觉中像素级预训练的向量对比学习
英文摘要:Contrastive learning (CL) has become a cornerstone of self-supervised pretraining (SSP) in foundation models, however, extending CL to pixel-wise representation, crucial for medical vision, remains an open problem. Standard CL formulates SSP as a binary optimization problem (binary CL) where the excessive pursuit of feature dispersion leads to an over-dispersion problem, breaking pixel-wise feature correlation thus disrupting the intra-class distribution. Our vector CL reformulates CL as a vector regression problem, enabling dispersion quantification in pixel-wise pretraining via modeling feature distances in regressing displacement vectors. To implement this novel paradigm, we propose the COntrast in VEctor Regression (COVER) framework. COVER establishes an extendable vector-based self-learning, enforces a consistent optimization flow from vector regression to distance modeling, and leverages a vector pyramid architecture for granularity adaptation, thus preserving pixel-wise feature correlations in SSP. Extensive experiments across 8 tasks, spanning 2 dimensions and 4 modalities, show that COVER significantly improves pixel-wise SSP, advancing generalizable medical visual foundation models.
中文摘要:对比学习(CL)已成为基础模型中自监督预训练(SSP)的基石。然而,将CL扩展到对医学视觉至关重要的像素级表示仍然是一个悬而未决的问题。标准CL将SSP表述为二元优化问题(二元CL),其中对特征分散的过度追求会导致过度分散问题,破坏像素级特征相关性,进而扰乱类内分布。我们的向量CL将CL重新表述为向量回归问题,通过在位移向量回归中建模特征距离,实现像素级预训练中的分散量化。为了实现这一新范式,我们提出了COntrast in VEctor Regression(COVER)框架。COVER建立了可扩展的基于向量的自学习,保证从向量回归到距离建模的一致优化流程,并利用向量金字塔架构进行粒度自适应,从而在SSP中保留像素级特征相关性。在涵盖2种维度、4种模态的8个任务上的大量实验表明,COVER显著改进了像素级SSP,推动了可泛化的医学视觉基础模型的发展。
方法框架:
(四)
标题:Learning to See in the Extremely Dark
中文标题:学习在极暗环境中视物
英文摘要:Learning-based methods have made promising advances in low-light RAW image enhancement, while their capability to extremely dark scenes where the environmental illuminance drops as low as 0.0001 lux remains to be explored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references to comprise a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method.
中文摘要:基于学习的方法在弱光RAW图像增强方面取得了可喜的进展,但由于缺乏相应的数据集,它们在环境照度低至0.0001勒克斯的极暗场景中的能力仍有待探索。为此,我们提出了一种配对到配对的数据合成管道,能够在0.01-0.1勒克斯、0.001-0.01勒克斯和0.0001-0.001勒克斯三个精确照度范围内生成校准良好的极弱光RAW图像,并配以高质量的sRGB参考,构成名为See-in-the-Extremely-Dark(SIED)的大规模配对数据集,用于对弱光RAW图像增强方法进行基准测试。此外,我们提出了一个基于扩散的框架,利用扩散模型的生成能力和内在去噪特性,从极低信噪比的RAW输入中恢复出视觉效果良好的结果;其中引入了自适应照明校正模块(AICM)和颜色一致性损失,以确保准确的曝光校正和颜色恢复。在所提出的SIED和公开基准上的大量实验证明了我们方法的有效性。
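下面是一个基于摘要描述的简化示意(非论文的标定合成管线):在线性 RAW 域按照度比例压暗,并叠加泊松散粒噪声与高斯读出噪声,用来理解"在指定照度区间合成极弱光 RAW"的基本思路。满阱容量、读出噪声等相机参数均为假设值。

```python
import numpy as np

def synthesize_low_light_raw(raw: np.ndarray, lux_src: float, lux_tgt: float,
                             full_well: float = 2**12, read_noise_std: float = 2.0,
                             rng=np.random.default_rng(0)) -> np.ndarray:
    """将正常照度 RAW 按照度比例衰减,并叠加散粒噪声(泊松)与读出噪声(高斯)。
    满阱电子数、读出噪声标准差等参数均为示意性假设,并非论文标定的相机参数。"""
    ratio = lux_tgt / lux_src                      # 照度比例,例如 5e-4 / 100
    photons = raw.astype(np.float64) * ratio       # 在线性域按比例压暗
    shot = rng.poisson(np.clip(photons, 0, None))  # 光子散粒噪声
    read = rng.normal(0.0, read_noise_std, size=raw.shape)
    noisy = shot + read
    return np.clip(noisy, 0, full_well).astype(np.uint16)

# 用法示意:把一张 100 lux 下采集的 RAW 合成到 0.0001-0.001 lux 区间
dark = synthesize_low_light_raw(np.random.randint(0, 4096, (64, 64)), lux_src=100.0, lux_tgt=5e-4)
print(dark.mean())
```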
方法框架:
(五)
标题:CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization
中文标题:CL-Splats:基于局部优化的高斯溅射持续学习
英文摘要:In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods to incorporate changes are needed to maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. This paper introduces CL-Splats, which incrementally updates Gaussian splatting-based 3D representations from sparse scene captures. CL-Splats integrates a robust change-detection module that segments updated and static components within the scene, enabling focused, local optimization that avoids unnecessary re-computation. Moreover, CL-Splats supports storing and recovering previous scene states, facilitating temporal segmentation and new scene-analysis applications. Our extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art. This establishes a robust foundation for future real-time adaptation in 3D scene reconstruction tasks.
中文摘要:在动态3D环境中,随时间准确更新场景表示对于机器人技术、混合现实和具身智能中的应用至关重要。随着场景不断变化,需要高效的方法来整合这些变化,在不承担重新优化整个场景的计算开销的前提下保持最新的高质量重建。本文介绍了CL-Splats,它能从稀疏的场景采集中增量更新基于高斯溅射的3D表示。CL-Splats集成了一个鲁棒的变化检测模块,将场景中已更新的部分与静态部分分割开来,从而实现聚焦的局部优化,避免不必要的重新计算。此外,CL-Splats支持存储和恢复以前的场景状态,便于时间分割和新的场景分析应用。大量实验表明,CL-Splats在实现高效更新的同时,重建质量优于现有最先进方法,为未来3D场景重建任务中的实时适应奠定了坚实基础。
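下面用一个极简的 PyTorch 草图说明"局部优化"的核心思想(非官方实现):假设变化检测模块已给出每个高斯是否位于变化区域的掩码,则只让这些高斯接收梯度,静态部分保持冻结;真实的可微渲染损失在此用代理损失代替。

```python
import torch

# 极简示意:给定每个高斯是否落入"变化区域"的掩码 changed,
# 仅更新变化部分的参数,静态部分保持冻结(变化检测模块此处假设已给出掩码)。
num_gaussians = 10_000
means = torch.randn(num_gaussians, 3, requires_grad=True)      # 高斯中心
colors = torch.rand(num_gaussians, 3, requires_grad=True)      # 简化的颜色参数
changed = torch.zeros(num_gaussians, dtype=torch.bool)
changed[:500] = True                                            # 假设变化检测标出 500 个高斯

def mask_grad(mask):
    # 注册到参数上的梯度钩子:把静态高斯的梯度清零,等价于只做局部优化
    return lambda grad: grad * mask.unsqueeze(-1).to(grad.dtype)

means.register_hook(mask_grad(changed))
colors.register_hook(mask_grad(changed))

optimizer = torch.optim.Adam([means, colors], lr=1e-2)
for _ in range(10):
    optimizer.zero_grad()
    # 此处用假设的代理损失代替真实的可微渲染损失
    loss = (means.square().sum() + colors.square().sum()) * 1e-4
    loss.backward()
    optimizer.step()                                            # 只有 changed 对应的行被更新
```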
方法框架:
(六)
标题:OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography
中文标题:OracleFusion:利用结构约束语义排版辅助甲骨文破译
英文摘要:As one of the earliest ancient languages, Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address these challenges, this paper proposes a novel two-stage semantic typography framework, named OracleFusion. In the first stage, this approach leverages the Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of the OBS character and perform visual localization of key components. In the second stage, we introduce Oracle Structural Vector Fusion (OSVF), incorporating glyph structure constraints and glyph maintenance constraints to ensure the accurate generation of semantically enriched vector fonts. This approach preserves the objective integrity of the glyph structure, offering visually enhanced representations that assist experts in deciphering OBS. Extensive qualitative and quantitative experiments demonstrate that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing both readability and aesthetic quality. Furthermore, OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS.
中文摘要:作为最早的古代文字之一,甲骨文(OBS)承载着古代文明的文化记录和思想表达。尽管已发现约4,500个甲骨文单字,但其中只有约1,600个被破译。其余未破译的文字结构复杂、意象抽象,给释读带来了重大挑战。为应对这些挑战,本文提出了一种新颖的两阶段语义排版框架,命名为OracleFusion。在第一阶段,该方法利用具有增强空间感知推理(SAR)能力的多模态大语言模型(MLLM)来分析甲骨文字的字形结构,并对关键部件进行视觉定位。在第二阶段,我们引入了甲骨文结构矢量融合(OSVF),结合字形结构约束和字形保持约束,以确保准确生成语义丰富的矢量字体。这种方法保留了字形结构的客观完整性,提供视觉增强的表示,帮助专家破译甲骨文。大量定性和定量实验表明,OracleFusion在语义、视觉吸引力和字形保持方面优于最先进的基线模型,显著提高了可读性和美学质量。此外,OracleFusion能对未见过的甲骨文字给出类似专家的见解,是推进甲骨文破译的宝贵工具。
方法框架:
(七)
标题:EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception
中文标题:EgoAdapt:高效自我中心感知的自适应多感官蒸馏和策略学习
英文摘要:Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EgoAdapt, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks, including egocentric action recognition, active speaker localization, and behavior anticipation. Our proposed policy module is adaptable to task-specific action spaces, making it broadly applicable. Experimental results on three challenging egocentric datasets EPIC-Kitchens, EasyCom, and Aria Everyday Activities demonstrate that our method significantly enhances efficiency, reducing GMACs by up to 89.09%, parameters up to 82.02%, and energy up to 9.6x, while still on-par and in many cases outperforming, the performance of corresponding state-of-the-art models.
中文摘要:现代感知模型,特别是为多感官自我中心任务设计的模型,已取得卓越的性能,但通常伴随巨大的计算成本。这些高需求给现实世界的部署带来挑战,尤其是在资源受限的环境中。在本文中,我们介绍了EgoAdapt,一个自适应地执行跨模态蒸馏和策略学习的框架,以实现不同自我中心感知任务(包括自我中心动作识别、主动说话人定位和行为预测)上的高效推理。我们提出的策略模块可适配特定任务的动作空间,因而具有广泛的适用性。在EPIC-Kitchens、EasyCom和Aria Everyday Activities三个具有挑战性的自我中心数据集上的实验结果表明,我们的方法显著提高了效率,GMACs最多降低89.09%,参数量最多减少82.02%,能耗最多降低9.6倍,同时性能与相应的最先进模型持平,并在许多情况下表现更好。
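下面给出一个示意性的策略模块草图(非官方实现),用 Gumbel-Softmax 对"启用哪些模态"做可微的离散决策,并以一个简化的蒸馏项对齐教师特征;网络结构、维度与损失形式均为假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityPolicy(nn.Module):
    """示意性的策略模块:根据轻量特征为每个模态输出"使用/跳过"的离散决策,
    训练时用 Gumbel-Softmax 保持可微。模块结构与维度均为假设。"""
    def __init__(self, feat_dim: int, num_modalities: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_modalities * 2)  # 每个模态两个 logit:用 / 不用
        self.num_modalities = num_modalities

    def forward(self, cheap_feat: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.head(cheap_feat).view(-1, self.num_modalities, 2)
        gate = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 0]  # (B, M),1 表示启用该模态
        return gate

# 用法示意:学生只融合被策略选中的模态特征,并向多模态教师蒸馏
B, M, D = 4, 3, 128
policy = ModalityPolicy(feat_dim=32, num_modalities=M)
gate = policy(torch.randn(B, 32))                      # (B, 3)
student_feats = torch.randn(B, M, D)                   # 各模态学生特征(假设)
teacher_feat = torch.randn(B, D)                       # 多模态教师特征(假设,已冻结)
fused = (student_feats * gate.unsqueeze(-1)).sum(dim=1)
distill_loss = F.mse_loss(fused, teacher_feat)         # 跨模态蒸馏的简化形式
print(distill_loss.item())
```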
方法框架:
(八)
标题:Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features
中文标题:利用自监督视觉Transformer特征提升生成式对抗迁移性
英文摘要:The ability of deep neural networks (DNNs) come from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbation that generalize more effectively, boosting black-box transferability. These features ubiquitously come from supervised learning in previous work. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA -- a generative dual self-supervised ViT features attack, that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, boast great adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we are rewarded with remarkable black-box transferability to models of various architectures that outperform state-of-the-arts.
中文摘要:深度神经网络(DNN)的能力源于从所提供数据中提取和解读特征。通过利用DNN中的中间特征而非依赖硬标签,我们构造出泛化能力更强的对抗扰动,从而提升黑盒迁移能力。在以往工作中,这些特征普遍来自监督学习。受自监督学习与Transformer架构之间卓越协同效应的启发,本文探讨利用自监督视觉Transformer(ViT)表征能否提升对抗迁移能力。我们提出了dSVA,一种生成式双自监督ViT特征攻击,它同时利用对比学习(CL)中的全局结构特征和掩码图像建模(MIM)中的局部纹理特征,二者正是ViT的两大自监督学习范式。我们设计了一个新颖的生成式训练框架,其中包含一个用于生成黑盒对抗样本的生成器,以及利用自监督ViT的联合特征和注意力机制来训练该生成器的策略。我们的研究结果表明,CL和MIM使ViT关注不同的特征倾向,而将二者结合利用可带来出色的对抗泛化能力。通过破坏自监督ViT提炼出的双重深度特征,我们获得了对各种架构模型的出色黑盒迁移能力,超越了现有最先进方法。
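下面的草图仅用于说明"破坏双路自监督特征以获得可迁移扰动"的核心损失形式(非官方实现):用两个冻结的编码器分别代表 CL 与 MIM 范式,最小化对抗样本与干净样本特征的余弦相似度。实际方法使用预训练的自监督 ViT,并由生成器网络产生扰动,而非这里的逐样本迭代。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 示意:通过同时"破坏"两个冻结编码器(分别代表 CL 与 MIM 范式)的深层特征来构造扰动。
# 编码器这里用随机初始化的小网络占位,仅展示损失形式。
def tiny_encoder():
    return nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

enc_cl, enc_mim = tiny_encoder().eval(), tiny_encoder().eval()
for p in list(enc_cl.parameters()) + list(enc_mim.parameters()):
    p.requires_grad_(False)

x = torch.rand(2, 3, 64, 64)                 # 干净图像(假设)
delta = torch.zeros_like(x, requires_grad=True)
eps, alpha = 8 / 255, 2 / 255

with torch.no_grad():
    f_cl_clean, f_mim_clean = enc_cl(x), enc_mim(x)

for _ in range(10):
    f_cl, f_mim = enc_cl(x + delta), enc_mim(x + delta)
    # 同时拉远对抗样本与干净样本在两种自监督特征空间中的距离(双特征破坏)
    loss = F.cosine_similarity(f_cl, f_cl_clean).mean() + \
           F.cosine_similarity(f_mim, f_mim_clean).mean()
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()   # 最小化相似度,即最大化特征差异
        delta.clamp_(-eps, eps)
        delta.grad = None
x_adv = (x + delta).clamp(0, 1)
```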
方法框架:
(九)
标题:Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability
中文标题:利用扩散模型增强域泛化与域自适应检测:适配性、泛化性和可迁移性
英文摘要:Detectors often suffer from performance drop due to domain gap between training and testing data. Recent methods explore diffusion models applied to domain generalization (DG) and adaptation (DA) tasks, but still struggle with large inference costs and have not yet fully leveraged the capabilities of diffusion models. We propose to tackle these problems by extracting intermediate features from a single-step diffusion process, improving feature collection and fusion to reduce inference time by 75% while enhancing performance on source domains (i.e., Fitness). Then, we construct an object-centered auxiliary branch by applying box-masked images with class prompts to extract robust and domain-invariant features that focus on object. We also apply consistency loss to align the auxiliary and ordinary branch, balancing fitness and generalization while preventing overfitting and improving performance on target domains (i.e., Generalization). Furthermore, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains (for DG) and unlabeled target domains (for DA), thereby improving cross-domain detection performance (i.e., Transferability). Our method achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. Additionally, experiments on COCO generalization benchmark demonstrate that our method maintains significant advantages and show remarkable efficiency in large domain shifts and low-data scenarios. Our work shows the superiority of applying diffusion models to domain generalized and adaptive detection tasks and offers valuable insights for visual perception tasks across diverse domains.
中文摘要:由于训练数据和测试数据之间的域差距,检测器常常遭受性能下降。最近的方法探索了将扩散模型用于域泛化(DG)和域自适应(DA)任务,但仍面临较大的推理开销,且尚未充分发挥扩散模型的能力。我们提出从单步扩散过程中提取中间特征来解决这些问题,并改进特征收集与融合,在将推理时间减少75%的同时提升源域上的性能(即适配性,Fitness)。接着,我们将带有类别提示的框掩蔽图像输入网络,构建以目标为中心的辅助分支,提取聚焦于目标的鲁棒且域不变的特征;并施加一致性损失来对齐辅助分支和普通分支,在平衡适配性与泛化性的同时防止过拟合,提升目标域上的性能(即泛化性,Generalization)。此外,在统一框架内,标准检测器在源域(用于DG)和未标注目标域(用于DA)上通过特征级和目标级对齐接受扩散检测器的引导,从而提升跨域检测性能(即可迁移性,Transferability)。我们的方法在3个DA基准和5个DG基准上取得了有竞争力的结果。此外,在COCO泛化基准上的实验表明,我们的方法保持了显著优势,并在大域偏移和低数据场景中表现出出色的效率。我们的工作展示了将扩散模型应用于域泛化和域自适应检测任务的优越性,并为跨不同领域的视觉感知任务提供了有价值的见解。
方法框架:
(十)
标题:Multimodal Prompt Alignment for Facial Expression Recognition
中文标题:人脸表情识别的多模态提示对齐
英文摘要:Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.
中文摘要:提示学习已被广泛用于高效地使CLIP等视觉语言模型(VLM)适应各种下游任务。尽管取得了成功,但当前基于VLM的面部表情识别(FER)方法难以捕捉细粒度的文本-视觉关系,而这对于区分面部表情之间的细微差异至关重要。为了应对这一挑战,我们提出了一种用于FER的多模态提示对齐框架,称为MPA-FER,它为提示视觉特征的学习过程提供细粒度的语义指导,从而产生更精确和可解释的表示。具体来说,我们引入了一种多粒度硬提示生成策略,利用ChatGPT等大语言模型(LLM)为每种面部表情生成详细描述。通过最小化软提示和硬提示之间的特征差异,将基于LLM的外部知识注入软提示。为了保持预训练CLIP模型的泛化能力,我们的方法结合了原型引导的视觉特征对齐,确保来自冻结图像编码器的提示视觉特征与特定类别的原型紧密对齐。此外,我们提出了一个跨模态全局-局部对齐模块,聚焦于与表情相关的面部特征,进一步改善文本和视觉特征之间的对齐。大量实验表明,我们的框架在三个FER基准数据集上优于最先进的方法,同时保留了预训练模型的优势并最大限度地降低了计算成本。
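下面是"用 LLM 生成的硬提示特征约束可学习软提示"这一步的简化示意(非官方实现):文本编码器用随机初始化的嵌入加平均池化代替真实的 CLIP 文本编码器,描述文本用随机 token 代替,仅演示特征差异最小化这一对齐损失。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 示意:把 LLM 生成的"硬提示"文本特征作为外部知识,约束可学习的"软提示"特征向其对齐。
# 文本编码器在此用随机嵌入加平均池化代替真实的 CLIP 文本编码器,描述文本也是假设示例。
vocab, dim, num_classes, ctx_len = 1000, 64, 7, 8
token_embed = nn.Embedding(vocab, dim)
soft_ctx = nn.Parameter(torch.randn(num_classes, ctx_len, dim) * 0.02)   # 每类可学习软提示

def encode(seq_embeds):
    # 极简"文本编码器":对序列嵌入做平均池化并归一化
    return F.normalize(seq_embeds.mean(dim=1), dim=-1)

# 假设每类有若干条 LLM 生成的细粒度描述,这里用随机 token 序列代替
hard_tokens = torch.randint(0, vocab, (num_classes, 3, 12))               # (类, 描述数, 长度)
with torch.no_grad():
    hard_feats = encode(token_embed(hard_tokens).mean(dim=1))             # (类, dim)

soft_feats = encode(soft_ctx)                                             # (类, dim)
align_loss = (1 - F.cosine_similarity(soft_feats, hard_feats, dim=-1)).mean()
align_loss.backward()   # 梯度只回传到软提示,即"注入外部知识"的简化形式
print(soft_ctx.grad.shape)
```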
方法框架:
(十一)
标题:Rethink Sparse Signals for Pose-guided Text-to-image Generation
中文标题:重新思考姿态引导文本到图像生成的稀疏信号
英文摘要:Recent works favored dense signals (e.g., depth, DensePose), as an alternative to sparse signals (e.g., OpenPose), to provide detailed spatial guidance for pose-guided text-to-image generation. However, dense representations raised new challenges, including editing difficulties and potential inconsistencies with textual prompts. This fact motivates us to revisit sparse signals for pose guidance, owing to their simplicity and shape-agnostic nature, which remains underexplored. This paper proposes a novel Spatial-Pose ControlNet(SP-Ctrl), equipping sparse signals with robust controllability for pose-guided image generation. Specifically, we extend OpenPose to a learnable spatial representation, making keypoint embeddings discriminative and expressive. Additionally, we introduce keypoint concept learning, which encourages keypoint tokens to attend to the spatial positions of each keypoint, thus improving pose alignment. Experiments on animal- and human-centric image generation tasks demonstrate that our method outperforms recent spatially controllable T2I generation approaches under sparse-pose guidance and even matches the performance of dense signal-based methods. Moreover, SP-Ctrl shows promising capabilities in diverse and cross-species generation through sparse signals.
中文摘要:最近的工作更倾向于用密集信号(例如深度、DensePose)替代稀疏信号(例如OpenPose),为姿态引导的文本到图像生成提供细致的空间引导。然而,密集表示带来了新的挑战,包括编辑困难以及与文本提示可能不一致。这促使我们重新审视用于姿态引导的稀疏信号:它们简单且与形状无关,而这一方向仍未得到充分探索。本文提出了一种新颖的空间-姿态控制网络(SP-Ctrl),为稀疏信号赋予强大的可控性,用于姿态引导的图像生成。具体来说,我们将OpenPose扩展为可学习的空间表示,使关键点嵌入具有判别性和表现力。此外,我们引入了关键点概念学习,鼓励关键点标记关注各关键点的空间位置,从而改善姿态对齐。在以动物和人为中心的图像生成任务上的实验表明,我们的方法在稀疏姿态引导下优于最近的空间可控T2I生成方法,甚至可与基于密集信号的方法相媲美。此外,SP-Ctrl在基于稀疏信号的多样化及跨物种生成方面展现出良好潜力。
方法框架:
(十二)
标题:PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling
中文标题:PhysRig:用于逼真关节物体建模的可微物理蒙皮与绑定框架
英文摘要:Skinning and rigging are fundamental components in animation, articulated object reconstruction, motion transfer, and 4D generation. Existing approaches predominantly rely on Linear Blend Skinning (LBS), due to its simplicity and differentiability. However, LBS introduces artifacts such as volume loss and unnatural deformations, and it fails to model elastic materials like soft tissues, fur, and flexible appendages (e.g., elephant trunks, ears, and fatty tissues). In this work, we propose PhysRig: a differentiable physics-based skinning and rigging framework that overcomes these limitations by embedding the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh), which is simulated as a deformable soft-body structure driven by the animated skeleton. Our method leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion. Additionally, we introduce material prototypes, significantly reducing the learning space while maintaining high expressiveness. To evaluate our framework, we construct a comprehensive synthetic dataset using meshes from Objaverse, The Amazing Animals Zoo, and MixaMo, covering diverse object categories and motion patterns. Our method consistently outperforms traditional LBS-based approaches, generating more realistic and physically plausible results. Furthermore, we demonstrate the applicability of our framework in the pose transfer task highlighting its versatility for articulated object modeling.
中文摘要:蒙皮和绑定是动画、关节物体重建、运动迁移和4D生成中的基本组成部分。现有方法主要依赖线性混合蒙皮(LBS),因为它简单且可微。然而,LBS会引入体积损失和不自然形变等伪影,且无法对软组织、毛皮和柔性附属物(例如象鼻、耳朵和脂肪组织)等弹性材料建模。在这项工作中,我们提出了PhysRig:一种可微的基于物理的蒙皮与绑定框架,它通过将刚性骨架嵌入体积表示(例如四面体网格)来克服这些限制,该体积表示被模拟为由动画骨架驱动的可变形软体结构。我们的方法利用连续介质力学,将物体离散为嵌入欧拉背景网格中的粒子,以确保对材料属性和骨骼运动均可微。此外,我们引入了材料原型,在保持高表现力的同时显著缩小了学习空间。为了评估我们的框架,我们使用来自Objaverse、The Amazing Animals Zoo和Mixamo的网格构建了一个全面的合成数据集,涵盖多种物体类别和运动模式。我们的方法始终优于传统的基于LBS的方法,生成更逼真且物理上合理的结果。此外,我们展示了该框架在姿态迁移任务中的适用性,凸显了其在关节物体建模方面的多功能性。
方法框架:
(十三)
标题:M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization
中文标题:M2SFormer:基于边缘感知难度引导的多光谱多尺度注意力图像伪造定位
英文摘要:Image editing techniques have rapidly advanced, facilitating both innovative use cases and malicious manipulation of digital images. Deep learning-based methods have recently achieved high accuracy in pixel-level forgery localization, yet they frequently struggle with computational overhead and limited representation power, particularly for subtle or complex tampering. In this paper, we propose M2SFormer, a novel Transformer encoder-based framework designed to overcome these challenges. Unlike approaches that process spatial and frequency cues separately, M2SFormer unifies multi-frequency and multi-scale attentions in the skip connection, harnessing global context to better capture diverse forgery artifacts. Additionally, our framework addresses the loss of fine detail during upsampling by utilizing a global prior map, a curvature metric indicating the difficulty of forgery localization, which then guides a difficulty-guided attention module to preserve subtle manipulations more effectively. Extensive experiments on multiple benchmark datasets demonstrate that M2SFormer outperforms existing state-of-the-art models, offering superior generalization in detecting and localizing forgeries across unseen domains.
中文摘要:图像编辑技术发展迅速,既推动了创新应用,也为数字图像的恶意篡改提供了便利。基于深度学习的方法最近在像素级伪造定位方面取得了较高的准确率,但它们常常面临计算开销大以及表征能力有限的问题,尤其是在处理细微或复杂的篡改时。在本文中,我们提出了M2SFormer,这是一种基于Transformer编码器的全新框架,旨在克服这些挑战。与单独处理空间和频率线索的方法不同,M2SFormer在跳跃连接中统一了多频率和多尺度注意力,利用全局上下文来更好地捕捉各种伪造痕迹。此外,我们的框架通过使用全局先验图来解决上采样过程中细节丢失的问题,该全局先验图是一种表示伪造定位难度的曲率度量,然后它会引导一个难度引导注意力模块,更有效地保留细微的篡改痕迹。在多个基准数据集上进行的大量实验表明,M2SFormer优于现有的最先进模型,在检测和定位未知领域的伪造方面具有卓越的泛化能力。
方法框架:
(十四)
标题:SAM4D: Segment Anything in Camera and LiDAR Streams
中文标题:SAM4D:分割相机和激光雷达流中的任何内容
英文摘要:We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability and great potential in data annotation of proposed SAM4D.
中文摘要:我们提出了SAM4D,这是一个多模态、时序化的基础模型,旨在实现跨摄像头和激光雷达数据流的可提示分割。我们引入统一多模态位置编码(UMPE),在共享的3D空间中对齐摄像头和激光雷达特征,实现无缝的跨模态提示与交互。此外,我们提出了运动感知跨模态记忆注意力(MCMA),利用自车运动补偿来增强时间一致性和长时程特征检索,确保在动态变化的自动驾驶场景中进行稳健的分割。为避免标注瓶颈,我们开发了一个多模态自动数据引擎,融合VFM驱动的视频掩码片段、时空4D重建和跨模态掩码片段融合。该框架生成摄像头与激光雷达对齐的伪标签,速度比人工标注快几个数量级,同时在点云表示中保留源自VFM的语义保真度。我们在构建的Waymo-4DSeg上进行了大量实验,证明了所提出的SAM4D强大的跨模态分割能力及其在数据标注方面的巨大潜力。
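下面给出一个关于"在共享 3D 空间中对两种模态做统一位置编码"的假设性草图(非 UMPE 的官方实现):将相机像素用深度与内外参反投影到 3D,再对相机侧点与激光雷达点套用同一个正弦位置编码;编码形式、频率数与标定数值均为示意。

```python
import torch

def sinusoidal_pe_3d(xyz: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """对共享 3D 坐标做正弦位置编码:两种模态使用同一函数,即可落在同一编码空间。
    频率数量与编码形式均为假设,并非论文 UMPE 的具体实现。"""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=xyz.dtype)          # (F,)
    angles = xyz.unsqueeze(-1) * freqs                                # (..., 3, F)
    pe = torch.cat([angles.sin(), angles.cos()], dim=-1)              # (..., 3, 2F)
    return pe.flatten(-2)                                             # (..., 6F)

# 相机侧:利用深度与内参把像素反投影到同一坐标系(数值均为示意,外参假设为单位变换)
H, W = 4, 6
depth = torch.full((H, W), 10.0)
K = torch.tensor([[500., 0., W / 2], [0., 500., H / 2], [0., 0., 1.]])
uv = torch.stack(torch.meshgrid(torch.arange(W, dtype=torch.float32),
                                torch.arange(H, dtype=torch.float32), indexing="xy"), dim=-1)
uv1 = torch.cat([uv, torch.ones(H, W, 1)], dim=-1)                    # (H, W, 3) 齐次像素坐标
cam_xyz = (torch.linalg.inv(K) @ uv1.reshape(-1, 3).T).T * depth.reshape(-1, 1)

lidar_xyz = torch.randn(100, 3) * 5                                   # 激光雷达点(假设已在同一坐标系)
cam_pe, lidar_pe = sinusoidal_pe_3d(cam_xyz), sinusoidal_pe_3d(lidar_xyz)
print(cam_pe.shape, lidar_pe.shape)                                   # 两种模态得到同维度的位置编码
```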
方法框架:
(十五)
标题:StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning
中文标题:StruMamba3D:探索用于自监督点云表示学习的结构Mamba
英文摘要:Recently, Mamba-based methods have demonstrated impressive performance in point cloud representation learning by leveraging State Space Model (SSM) with the efficient context modeling ability and linear complexity. However, these methods still face two key issues that limit the potential of SSM: Destroying the adjacency of 3D points during SSM processing and failing to retain long-sequence memory as the input length increases in downstream tasks. To address these issues, we propose StruMamba3D, a novel paradigm for self-supervised point cloud representation learning. It enjoys several merits. First, we design spatial states and use them as proxies to preserve spatial dependencies among points. Second, we enhance the SSM with a state-wise update strategy and incorporate a lightweight convolution to facilitate interactions between spatial states for efficient structure modeling. Third, our method reduces the sensitivity of pre-trained Mamba-based models to varying input lengths by introducing a sequence length-adaptive strategy. Experimental results across four downstream tasks showcase the superior performance of our method. In addition, our method attains the SOTA 95.1% accuracy on ModelNet40 and 92.75% accuracy on the most challenging split of ScanObjectNN without voting strategy.
中文摘要:最近,基于Mamba的方法利用具有高效上下文建模能力和线性复杂度的状态空间模型(SSM),在点云表示学习中展现了令人印象深刻的性能。然而,这些方法仍面临限制SSM潜力的两个关键问题:SSM处理过程中破坏了3D点的邻接关系,以及在下游任务中随输入长度增加而无法保留长序列记忆。为了解决这些问题,我们提出了StruMamba3D,一种用于自监督点云表示学习的新范式。它具有几方面优点。首先,我们设计空间状态并将其作为代理,以保持点之间的空间依赖关系。其次,我们通过逐状态更新策略增强SSM,并结合轻量级卷积促进空间状态之间的交互,实现高效的结构建模。第三,我们引入序列长度自适应策略,降低预训练的基于Mamba的模型对不同输入长度的敏感性。在四个下游任务上的实验结果展示了我们方法的卓越性能。此外,在不使用投票策略的情况下,我们的方法在ModelNet40上取得了95.1%的SOTA准确率,在ScanObjectNN最具挑战性的划分上取得了92.75%的准确率。
方法框架:
(十六)
标题:G2D: Boosting Multimodal Learning with Gradient-Guided Distillation
中文标题:G2D:利用梯度引导蒸馏增强多模态学习
英文摘要:Multimodal learning aims to leverage information from diverse data modalities to achieve more comprehensive performance. However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representation and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation (G2D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. G2D further incorporates a dynamic sequential modality prioritization (SMP) technique in the learning process to ensure each modality leads the learning process, avoiding the pitfall of stronger modalities overshadowing weaker ones. We validate G2D on multiple real-world datasets and show that G2D amplifies the significance of weak modalities while training and outperforms state-of-the-art methods in classification and regression tasks.
中文摘要:多模态学习旨在利用来自不同数据模态的信息以获得更全面的性能。然而,传统的多模态模型常常存在模态不平衡问题:一个或少数几个模态主导模型优化,导致特征表示欠佳、弱模态利用不足。为应对这一挑战,我们引入了梯度引导蒸馏(G2D),这是一种知识蒸馏框架,它通过融合单模态和多模态目标的定制损失函数来优化多模态模型。G2D进一步在学习过程中引入动态的顺序模态优先(SMP)技术,确保每个模态都能主导学习过程,避免强模态掩盖弱模态的问题。我们在多个真实世界数据集上验证了G2D,结果表明G2D在训练时放大了弱模态的重要性,并在分类和回归任务中优于最先进的方法。
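下面是一个高度简化的示意(非官方实现),说明"用梯度信息引导模态加权"的一种可能做法:度量各单模态损失对共享参数的梯度范数,并给梯度较小(被压制)的弱模态更大的损失权重;具体加权公式、蒸馏形式与 SMP 调度均为假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 示意:测量各单模态目标对共享分类头的梯度范数,用其倒数重新加权单模态损失,
# 从而在训练中"放大"弱模态的作用。加权公式与网络结构均为假设的简化版本。
enc_a, enc_b = nn.Linear(32, 16), nn.Linear(48, 16)      # 两个模态编码器
head = nn.Linear(16, 10)                                  # 共享分类头
x_a, x_b = torch.randn(8, 32), torch.randn(8, 48)
y = torch.randint(0, 10, (8,))

def grad_norm_of(loss, params):
    grads = torch.autograd.grad(loss, params, retain_graph=True, create_graph=False)
    return torch.sqrt(sum(g.pow(2).sum() for g in grads))

loss_a = F.cross_entropy(head(enc_a(x_a)), y)             # 单模态目标
loss_b = F.cross_entropy(head(enc_b(x_b)), y)
shared = list(head.parameters())
g_a, g_b = grad_norm_of(loss_a, shared), grad_norm_of(loss_b, shared)

# 梯度越小的模态(优化中"被压制"的弱模态)获得越大的权重
w_a, w_b = (1.0 / (g_a + 1e-8)).detach(), (1.0 / (g_b + 1e-8)).detach()
w_a, w_b = w_a / (w_a + w_b), w_b / (w_a + w_b)
fused_logits = head(0.5 * (enc_a(x_a) + enc_b(x_b)))      # 多模态目标(简化的特征平均融合)
total_loss = F.cross_entropy(fused_logits, y) + w_a * loss_a + w_b * loss_b
total_loss.backward()
```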
方法框架:
(十七)
标题:GGTalker: Talking Head Systhesis with Generalizable Gaussian Priors and Identity-Specific Adaptation
中文标题:GGTalker:基于可泛化高斯先验与身份特定适配的说话人头像合成
英文摘要:Creating high-quality, generalizable speech-driven 3D talking heads remains a persistent challenge. Previous methods achieve satisfactory results for fixed viewpoints and small-scale audio variations, but they struggle with large head rotations and out-of-distribution (OOD) audio. Moreover, they are constrained by the need for time-consuming, identity-specific training. We believe the core issue lies in the lack of sufficient 3D priors, which limits the extrapolation capabilities of synthesized talking heads. To address this, we propose GGTalker, which synthesizes talking heads through a combination of generalizable priors and identity-specific adaptation. We introduce a two-stage Prior-Adaptation training strategy to learn Gaussian head priors and adapt to individual characteristics. We train Audio-Expression and Expression-Visual priors to capture the universal patterns of lip movements and the general distribution of head textures. During the Customized Adaptation, individual speaking styles and texture details are precisely modeled. Additionally, we introduce a color MLP to generate fine-grained, motion-aligned textures and a Body Inpainter to blend rendered results with the background, producing indistinguishable, photorealistic video frames. Comprehensive experiments show that GGTalker achieves state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency.
中文摘要:创建高质量、可泛化的语音驱动3D虚拟说话人头像是一个长期存在的挑战。以往的方法在固定视角和小规模音频变化的情况下能取得令人满意的结果,但在头部大幅转动和分布外(OOD)音频方面存在困难。此外,它们还受到耗时的、特定身份训练需求的限制。我们认为核心问题在于缺乏足够的3D先验知识,这限制了合成的虚拟说话人的外推能力。为了解决这一问题,我们提出了GGTalker,它通过结合可泛化先验和特定身份适配来合成虚拟说话人头像。我们引入了一个两阶段的先验适配训练策略,以学习高斯头部先验并适应个体特征。我们训练音频 - 表情和表情 - 视觉先验,以捕捉嘴唇运动的通用模式和头部纹理的一般分布。在定制适配过程中,精确模拟个体说话风格和纹理细节。此外,我们引入了一个颜色多层感知器(MLP)来生成细粒度、与运动对齐的纹理,并使用一个身体图像修复器将渲染结果与背景融合,生成难以区分的、逼真的视频帧。综合实验表明,GGTalker在渲染质量、3D一致性、唇形同步准确性和训练效率方面均达到了最先进的性能。
方法框架:
(十八)
标题:TITAN: Query-Token based Domain Adaptive Adversarial Learning
中文标题:TITAN:基于查询令牌的领域自适应对抗学习
英文摘要:We focus on the source-free domain adaptive object detection (SF-DAOD) problem when source data is unavailable during adaptation and the model must adapt to an unlabeled target domain. The majority of approaches for the problem employ a self-supervised approach using a student-teacher (ST) framework where pseudo-labels are generated via a source-pretrained model for further fine-tuning. We observe that the performance of a student model often degrades drastically, due to the collapse of the teacher model, primarily caused by high noise in pseudo-labels, resulting from domain bias, discrepancies, and a significant domain shift across domains. To obtain reliable pseudo-labels, we propose a Target-based Iterative Query-Token Adversarial Network (TITAN), which separates the target images into two subsets: those similar to the source (easy) and those dissimilar (hard). We propose a strategy to estimate variance to partition the target domain. This approach leverages the insight that higher detection variances correspond to higher recall and greater similarity to the source domain. Also, we incorporate query-token-based adversarial modules into a student-teacher baseline framework to reduce the domain gaps between two feature representations. Experiments conducted on four natural imaging datasets and two challenging medical datasets have substantiated the superior performance of TITAN compared to existing state-of-the-art (SOTA) methodologies. We report an mAP improvement of +22.7, +22.2, +21.1, and +3.7 percent over the current SOTA on C2F, C2B, S2C, and K2C benchmarks, respectively.
中文摘要:我们关注无源域自适应目标检测(SF-DAOD)问题,即在自适应过程中源数据不可用,模型必须适应无标签的目标域。针对该问题的大多数方法采用基于师生(ST)框架的自监督方法,通过源预训练模型生成伪标签以进行进一步微调。我们发现,由于教师模型的崩溃,学生模型的性能往往会急剧下降,这主要是由伪标签中的高噪声导致的,而噪声源于域偏差、差异以及跨域的显著域转移。为了获得可靠的伪标签,我们提出了一种基于目标的迭代查询令牌对抗网络(TITAN),它将目标图像分为两个子集:与源相似的(简单)和与源不相似的(困难)。我们提出一种估计方差的策略来划分目标域。这种方法利用了这样一种见解,即较高的检测方差对应于较高的召回率以及与源域更高的相似性。此外,我们将基于查询令牌的对抗模块纳入师生基线框架,以缩小两种特征表示之间的域差距。在四个自然图像数据集和两个具有挑战性的医学数据集上进行的实验证实了TITAN相较于现有最先进(SOTA)方法的卓越性能。我们报告称,在C2F、C2B、S2C和K2C基准测试中,相对于当前的SOTA方法,平均精度均值(mAP)分别提高了22.7%、22.2%、21.1%和3.7%。
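下面用一个小例子示意"基于检测方差划分目标域"的思路(非官方实现):对每张目标域图像做多次随机前向(不同增强或 Dropout),以匹配后检测置信度的方差作为不确定性代理,方差高的视为与源域更相似的"简单"样本;方差的具体定义与阈值选取均为假设。

```python
import numpy as np

def detection_variance(score_runs: np.ndarray) -> float:
    """score_runs: (runs, num_boxes) 同一张目标域图像在多次随机前向下、匹配后的检测置信度。
    用置信度方差的均值作为该图像的不确定性代理(示意性的度量)。"""
    return float(score_runs.var(axis=0).mean())

def split_easy_hard(per_image_runs: list[np.ndarray], quantile: float = 0.5):
    variances = np.array([detection_variance(r) for r in per_image_runs])
    thresh = np.quantile(variances, quantile)     # 阈值选取方式为假设
    easy = [i for i, v in enumerate(variances) if v >= thresh]   # 方差高:召回高、与源域更相似
    hard = [i for i, v in enumerate(variances) if v < thresh]
    return easy, hard

# 用法示意:5 张目标域图像,各做 4 次随机前向,每次得到 10 个框的置信度
rng = np.random.default_rng(0)
runs = [rng.random((4, 10)) for _ in range(5)]
easy_ids, hard_ids = split_easy_hard(runs)
print("easy:", easy_ids, "hard:", hard_ids)
```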
方法框架:
(十九)
标题:Global and Local Entailment Learning for Natural World Imagery
中文标题:自然世界图像的全局与局部蕴涵学习
英文摘要:Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge by employing entailment learning. However, these approaches fail to model the transitive nature of entailment explicitly, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our proposed framework optimizes for the partial order of concepts within vision-language models. By leveraging our framework, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks demonstrate the enhanced performance of our models compared to the existing state-of-the-art models.
中文摘要:学习视觉语言模型中数据的层次结构是一项重大挑战。以往的工作试图通过蕴涵学习来应对这一挑战。然而,这些方法未能显式建模蕴涵的传递性,而传递性正是在表示空间中建立顺序与语义之间关系的基础。在这项工作中,我们引入了径向跨模态嵌入(RCME),一个能够显式建模强制传递性蕴涵的框架。我们提出的框架针对视觉语言模型中概念的偏序关系进行优化。利用该框架,我们开发了一个能够表示生命之树层次结构的层次化视觉语言基础模型。我们在层次化物种分类和层次化检索任务上的实验表明,与现有最先进模型相比,我们的模型性能更优。
方法框架:
(二十)
标题:Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction
中文标题:用于三维参数曲线重建的曲线感知高斯溅射法
英文摘要:This paper presents an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Contrasting with existing two-stage methods that follow a sequential "edge point cloud reconstruction and parametric curve fitting" pipeline, our one-stage approach optimizes 3D parametric curves directly from 2D edge maps, eliminating error accumulation caused by the inherent optimization gap between disconnected stages. However, parametric curves inherently lack suitability for rendering-based multi-view optimization, necessitating a complementary representation that preserves their geometric properties while enabling differentiable rendering. We propose a novel bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. This tight correspondence formulates a curve-aware Gaussian representation, CurveGaussian, that enables differentiable rendering of 3D curves, allowing direct optimization guided by multi-view evidence. Furthermore, we introduce a dynamically adaptive topology optimization framework during training to refine curve structures through linearization, merging, splitting, and pruning operations. Comprehensive evaluations on the ABC dataset and real-world benchmarks demonstrate our one-stage method's superiority over two-stage alternatives, particularly in producing cleaner and more robust reconstructions. Additionally, by directly optimizing parametric curves, our method significantly reduces the parameter count during training, achieving both higher efficiency and superior performance compared to existing approaches.
中文摘要:本文提出了一种端到端的框架,可直接从多视图边缘图重建三维参数曲线。与现有的遵循“边缘点云重建和参数曲线拟合”顺序流水线的两阶段方法不同,我们的单阶段方法直接从二维边缘图优化三维参数曲线,消除了因不连续阶段之间固有的优化差距而导致的误差累积。然而,参数曲线本质上不太适合基于渲染的多视图优化,因此需要一种互补的表示形式,既能保留其几何特性,又能实现可微渲染。我们提出了一种参数曲线与面向边缘的高斯分量之间新颖的双向耦合机制。这种紧密的对应关系构建了一种曲线感知的高斯表示,即曲线高斯(CurveGaussian),它能够对三维曲线进行可微渲染,从而可以在多视图证据的指导下进行直接优化。此外,我们在训练过程中引入了一个动态自适应拓扑优化框架,通过线性化、合并、拆分和修剪操作来细化曲线结构。在ABC数据集和真实世界基准测试上的全面评估表明,我们的单阶段方法优于两阶段方法,特别是在生成更清晰、更稳健的重建结果方面。此外,通过直接优化参数曲线,我们的方法在训练过程中显著减少了参数数量,与现有方法相比,实现了更高的效率和更优的性能。
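下面的草图示意"参数曲线与面向边缘的高斯分量耦合"的基本思想(非官方实现):从一条可优化的三次贝塞尔曲线上采样出高斯中心与主轴方向,使高斯参数成为曲线控制点的可微函数,从而渲染梯度可以直接回传到曲线;尺度设置为假设值,可微渲染损失在此省略。

```python
import torch

def cubic_bezier(ctrl: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """ctrl: (4, 3) 控制点;t: (N,)。返回曲线上 N 个采样点 (N, 3)。"""
    t = t.unsqueeze(-1)
    basis = [(1 - t) ** 3, 3 * (1 - t) ** 2 * t, 3 * (1 - t) * t ** 2, t ** 3]
    return sum(w * p for w, p in zip(basis, ctrl))

# 示意:由一条可优化的三次贝塞尔曲线派生出一串"面向边缘"的高斯分量:
# 均值取相邻采样点的中点,主轴方向取切线方向;尺度为假设值。
ctrl = torch.tensor([[0., 0., 0.], [1., 2., 0.], [2., -1., 1.], [3., 0., 0.]], requires_grad=True)
t = torch.linspace(0, 1, 32)
pts = cubic_bezier(ctrl, t)                                          # (32, 3) 曲线采样点
midpoints = 0.5 * (pts[1:] + pts[:-1])                               # (31, 3) 高斯中心
dirs = torch.nn.functional.normalize(pts[1:] - pts[:-1], dim=-1)     # (31, 3) 主轴沿切线方向
scales = torch.tensor([0.10, 0.01, 0.01]).expand(31, 3)              # 沿切线拉长、横向压扁(假设值)

# 由于高斯参数是曲线控制点的可微函数,渲染损失的梯度可以直接回传到曲线本身
proxy_loss = midpoints.square().sum()                                # 用代理损失代替可微渲染损失
proxy_loss.backward()
print(ctrl.grad.shape)                                               # (4, 3):曲线控制点获得梯度
```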
方法框架:
(二十一)
标题:CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection
中文标题:CA-I2P:具有全局最优选择的通道自适应配准网络
英文摘要:Detection-free methods typically follow a coarse-to-fine pipeline, extracting image and point cloud features for patch-level matching and refining dense pixel-to-point correspondences. However, differences in feature channel attention between images and point clouds may lead to degraded matching results, ultimately impairing registration accuracy. Furthermore, similar structures in the scene could lead to redundant correspondences in cross-modal matching. To address these issues, we propose Channel Adaptive Adjustment Module (CAA) and Global Optimal Selection Module (GOS). CAA enhances intra-modal features and suppresses cross-modal sensitivity, while GOS replaces local selection with global optimization. Experiments on RGB-D Scenes V2 and 7-Scenes demonstrate the superiority of our method, achieving state-of-the-art performance in image-to-point cloud registration.
中文摘要:无检测方法通常遵循从粗到细的流程:提取图像和点云特征进行块级匹配,再细化密集的像素到点对应关系。然而,图像和点云之间特征通道注意力的差异可能导致匹配结果退化,最终影响配准精度。此外,场景中的相似结构可能在跨模态匹配中产生冗余的对应关系。为了解决这些问题,我们提出了通道自适应调整模块(CAA)和全局最优选择模块(GOS)。CAA增强模态内特征并抑制跨模态敏感性,而GOS用全局优化代替局部选择。在RGB-D Scenes V2和7-Scenes数据集上的实验证明了我们方法的优越性,在图像到点云配准任务上取得了最先进的性能。
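下面用匈牙利算法给出"用全局优化代替局部选择"的一个直观替代实现(非论文 GOS 模块本身):在跨模态相似度矩阵上做一对一最优指派,避免局部 argmax 在相似结构处产生的重复匹配。

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
img_feat = rng.normal(size=(40, 64))     # 图像块特征(假设)
pcd_feat = rng.normal(size=(40, 64))     # 点云块特征(假设)
img_feat /= np.linalg.norm(img_feat, axis=1, keepdims=True)
pcd_feat /= np.linalg.norm(pcd_feat, axis=1, keepdims=True)
sim = img_feat @ pcd_feat.T              # 跨模态相似度矩阵

# 局部选择:每个图像块各自取最相似的点云块,遇到重复结构容易产生冗余/冲突的对应
local_match = sim.argmax(axis=1)

# 全局最优选择的简化替代:在整个相似度矩阵上做一对一最优指派(匈牙利算法)。
# 论文中的 GOS 模块未必采用该算法,这里仅用来说明"用全局优化取代局部选择"的思想。
row, col = linear_sum_assignment(-sim)   # 取负号把最大化相似度转为最小化代价
print("局部选择中重复匹配的数量:", len(local_match) - len(np.unique(local_match)))
print("全局指派得到的一对一对应数:", len(row))
```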
方法框架:
(二十二)
标题:PanSt3R: Multi-view Consistent Panoptic Segmentation
中文标题:PanSt3R:多视图一致全景分割
英文摘要:Panoptic segmentation of 3D scenes, involving the segmentation and classification of object instances in a dense 3D reconstruction of a scene, is a challenging problem, especially when relying solely on unposed 2D images. Existing approaches typically leverage off-the-shelf models to extract per-frame 2D panoptic segmentations, before optimizing an implicit geometric representation (often based on NeRF) to integrate and fuse the 2D predictions. We argue that relying on 2D panoptic segmentation for a problem inherently 3D and multi-view is likely suboptimal as it fails to leverage the full potential of spatial relationships across views. In addition to requiring camera parameters, these approaches also necessitate computationally expensive test-time optimization for each scene. Instead, in this work, we propose a unified and integrated approach PanSt3R, which eliminates the need for test-time optimization by jointly predicting 3D geometry and multi-view panoptic segmentation in a single forward pass. Our approach builds upon recent advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view version of DUSt3R, and enhances it with semantic awareness and multi-view panoptic segmentation capabilities. We additionally revisit the standard post-processing mask merging procedure and introduce a more principled approach for multi-view segmentation. We also introduce a simple method for generating novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS. Overall, the proposed PanSt3R is conceptually simple, yet fast and scalable, and achieves state-of-the-art performance on several benchmarks, while being orders of magnitude faster than existing methods.
中文摘要:3D场景的全景分割,即在场景的密集3D重建中对物体实例进行分割和分类,是一个具有挑战性的问题,尤其是在仅依赖无姿态2D图像的情况下。现有方法通常利用现成的模型来提取每帧的2D全景分割,然后优化一种隐式几何表示(通常基于神经辐射场(Neural Radiance Field,NeRF)),以整合和融合这些2D预测结果。我们认为,对于一个本质上是3D且多视角的问题,依赖2D全景分割可能并非最优选择,因为它未能充分利用跨视角空间关系的全部潜力。除了需要相机参数外,这些方法还需要针对每个场景进行计算成本高昂的测试时优化。相反,在这项工作中,我们提出了一种统一且集成的方法PanSt3R,通过在单次前向传播中联合预测3D几何形状和多视角全景分割,从而无需进行测试时优化。我们的方法基于3D重建领域的最新进展,特别是基于MUSt3R(DUSt3R的可扩展多视角版本),并通过增强其语义感知和多视角全景分割能力对其进行改进。我们还重新审视了标准的后处理掩码合并过程,并引入了一种更具原则性的多视角分割方法。我们还基于PanSt3R和普通3DGS的预测结果,引入了一种生成新视角预测的简单方法。总体而言,所提出的PanSt3R概念简单,但速度快且具有可扩展性,在多个基准测试中取得了最先进的性能,同时比现有方法快几个数量级。
方法框架:
(二十三)
标题:DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic
中文标题:DuET:基于无示例任务算术的双增量目标检测
英文摘要:Real-world object detection systems, such as those in autonomous driving and surveillance, must continuously learn new object categories and simultaneously adapt to changing environmental conditions. Existing approaches, Class Incremental Object Detection (CIOD) and Domain Incremental Object Detection (DIOD) only address one aspect of this challenge. CIOD struggles in unseen domains, while DIOD suffers from catastrophic forgetting when learning new classes, limiting their real-world applicability. To overcome these limitations, we introduce Dual Incremental Object Detection (DuIOD), a more practical setting that simultaneously handles class and domain shifts in an exemplar-free manner. We propose DuET, a Task Arithmetic-based model merging framework that enables stable incremental learning while mitigating sign conflicts through a novel Directional Consistency Loss. Unlike prior methods, DuET is detector-agnostic, allowing models like YOLO11 and RT-DETR to function as real-time incremental object detectors. To comprehensively evaluate both retention and adaptation, we introduce the Retention-Adaptability Index (RAI), which combines the Average Retention Index (Avg RI) for catastrophic forgetting and the Average Generalization Index for domain adaptability into a common ground. Extensive experiments on the Pascal Series and Diverse Weather Series demonstrate DuET's effectiveness, achieving a +13.12% RAI improvement while preserving 89.3% Avg RI on the Pascal Series (4 tasks), as well as a +11.39% RAI improvement with 88.57% Avg RI on the Diverse Weather Series (3 tasks), outperforming existing methods.
中文摘要:现实世界中的目标检测系统,如自动驾驶和监控领域中的系统,必须持续学习新的目标类别,同时适应不断变化的环境条件。现有的方法,即类别增量目标检测(CIOD)和域增量目标检测(DIOD),仅解决了这一挑战的一个方面。CIOD在未见领域中表现不佳,而DIOD在学习新类别时会遭受灾难性遗忘,这限制了它们在现实世界中的适用性。为了克服这些限制,我们引入了双增量目标检测(DuIOD),这是一种更实际的设置,能够以无示例的方式同时处理类别和域的变化。我们提出了DuET,这是一种基于任务算术的模型融合框架,它能够实现稳定的增量学习,同时通过一种新颖的方向一致性损失来缓解符号冲突。与先前的方法不同,DuET与检测器无关,这使得诸如YOLO11和RT-DETR等模型能够作为实时增量目标检测器运行。为了全面评估保留能力和适应性,我们引入了保留-适应性指数(RAI),它将用于衡量灾难性遗忘的平均保留指数(Avg RI)和用于衡量域适应性的平均泛化指数结合到一个共同的基础上。在Pascal系列和多样天气系列上进行的大量实验证明了DuET的有效性:在Pascal系列(4个任务)上,RAI提升了13.12%,同时保留了89.3%的Avg RI;在多样天气系列(3个任务)上,RAI提升了11.39%,并保持了88.57%的Avg RI,优于现有方法。
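下面是一个关于"任务算术合并中处理符号冲突"的简化示意(非官方实现):对两个分别在新类别、新域上微调得到的任务向量,将符号不一致的参数位置置零后再合并。DuET 的方向一致性损失作用在训练阶段,这里只演示合并时的符号冲突思想;权重与缩放系数均为假设。

```python
import torch

def merge_task_vectors(base: dict, finetuned: list[dict], alpha: float = 1.0) -> dict:
    """任务算术式合并:task_vector = finetuned - base;对各任务向量符号不一致的参数位置
    (即"符号冲突")置零后再求和。仅为对"缓解符号冲突"思想的示意,并非 DuET 的
    方向一致性损失本身(后者作用在训练阶段)。"""
    merged = {}
    for name, w0 in base.items():
        tvs = torch.stack([ft[name] - w0 for ft in finetuned])          # (T, ...)
        signs = torch.sign(tvs)
        consistent = (signs == signs[0]).all(dim=0) | (tvs == 0).all(dim=0)
        merged[name] = w0 + alpha * torch.where(consistent, tvs.sum(dim=0), torch.zeros_like(w0))
    return merged

# 用法示意:一个"基础检测器"与两个分别在新类别 / 新域上微调的副本(这里用随机权重代替)
base = {"head.weight": torch.randn(4, 8)}
ft_class = {"head.weight": base["head.weight"] + 0.1 * torch.randn(4, 8)}
ft_domain = {"head.weight": base["head.weight"] + 0.1 * torch.randn(4, 8)}
merged = merge_task_vectors(base, [ft_class, ft_domain], alpha=0.5)
print(merged["head.weight"].shape)
```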
方法框架:
(二十四)
标题:Temporal Rate Reduction Clustering for Human Motion Segmentation
中文标题:用于人体运动分割的时间速率降低聚类
英文摘要:Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, the frames in video capturing complex human motions with cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering(TR2C), which jointly learns structured representations and affinity to segment the frame sequences in video. Specifically, the structured representations learned by (TR2C) maintain temporally consistent and align well with a UoS structure, which is favorable for the HMS task. We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performances with different feature extractors.
中文摘要:人体运动分割(HMS)旨在将视频划分为互不重叠的人体运动片段,近年来受到越来越多的研究关注。现有的HMS方法以子空间聚类为主,这些方法建立在高维时间数据服从子空间并集(UoS)分布的假设之上。然而,在杂乱背景下捕捉复杂人体运动的视频帧未必很好地符合UoS分布。在本文中,我们提出了一种新颖的HMS方法,称为时间速率降低聚类(TR2C),它联合学习结构化表示和亲和度来分割视频中的帧序列。具体来说,TR2C学习到的结构化表示保持时间一致性,并与UoS结构良好对齐,这有利于HMS任务。我们在五个基准HMS数据集上进行了大量实验,使用不同的特征提取器均取得了最先进的性能。
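"速率降低/约简"通常指 MCR² 一脉中的编码率目标。下面给出该目标的标准形式作为参考,帮助理解标题中的"Rate Reduction";TR2C 在此基础上联合学习表示、亲和度与时间一致性,其完整目标并未在此复现。

```python
import torch

def coding_rate(Z: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    """R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z Z^T),其中 Z 的形状为 (d, n)。"""
    d, n = Z.shape
    I = torch.eye(d, dtype=Z.dtype)
    return 0.5 * torch.logdet(I + (d / (n * eps ** 2)) * Z @ Z.T)

def rate_reduction(Z: torch.Tensor, labels: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    """ΔR = R(Z) - Σ_j (n_j / n) * R(Z_j):整体编码率减去各簇(各运动片段)编码率的加权和。
    这是 MCR² 一脉的标准形式,仅用于说明"速率约简"目标。"""
    n = Z.shape[1]
    total = coding_rate(Z, eps)
    parts = sum((labels == c).sum() / n * coding_rate(Z[:, labels == c], eps)
                for c in labels.unique())
    return total - parts

d, n = 16, 120
Z = torch.nn.functional.normalize(torch.randn(d, n), dim=0)   # 帧级特征按列归一化(假设)
labels = torch.randint(0, 3, (n,))                            # 假设的分段/聚类标签
print(rate_reduction(Z, labels))
```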
方法框架:
(二十五)
标题:ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation
中文标题:ReME:一种以数据为中心的无训练开放词汇分割框架
英文摘要:Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of reliant models or the suboptimal quality of reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training.
中文摘要:无训练的开放词汇语义分割(OVS)旨在在无需进行代价高昂的模型微调的情况下,根据一组任意的文本类别对图像进行分割。现有的解决方案通常会探索预训练模型(如CLIP)的注意力机制,或者生成合成数据并设计复杂的检索流程来执行OVS。然而,它们的性能受到所依赖模型的能力或参考集质量欠佳的限制。在这项工作中,我们针对这一具有挑战性的密集场景理解任务,研究了在很大程度上被忽视的数据质量问题,并发现高质量的参考集能够显著有益于无训练的OVS。基于这一观察,我们引入了一个以数据质量为导向的框架,该框架包括一个数据管道,用于构建具有良好匹配的分割-文本嵌入的参考集,以及一个简单的基于相似度的检索方法,以揭示数据的关键作用。值得注意的是,在十个基准数据集上进行的广泛评估表明,我们的方法优于所有现有的无训练OVS方法,凸显了以数据为中心的设计对于推动无训练的OVS发展的重要性。
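下面是"基于相似度的简单检索"一步的示意(非官方实现):每个查询片段嵌入在参考集中检索最相似的若干条分割-文本配对嵌入,并用其类别做多数表决;参考集的构建由论文的数据管道完成,这里用随机向量代替。

```python
import numpy as np

def retrieve_labels(query_seg: np.ndarray, ref_seg: np.ndarray, ref_labels: np.ndarray,
                    topk: int = 5) -> np.ndarray:
    """query_seg: (Q, D) 待分割片段的嵌入;ref_seg: (R, D) 参考集中片段的嵌入,
    ref_labels: (R,) 对应类别。对每个查询取余弦相似度最高的 topk 个参考项做多数表决。
    这只是"基于相似度的简单检索"的示意,参考集本身由论文的数据管道构建。"""
    q = query_seg / np.linalg.norm(query_seg, axis=1, keepdims=True)
    r = ref_seg / np.linalg.norm(ref_seg, axis=1, keepdims=True)
    sim = q @ r.T                                            # (Q, R)
    nn_idx = np.argsort(-sim, axis=1)[:, :topk]               # 每个查询的 topk 近邻
    votes = ref_labels[nn_idx]                                # (Q, topk)
    return np.array([np.bincount(v).argmax() for v in votes])

rng = np.random.default_rng(0)
ref_seg = rng.normal(size=(100, 32))
ref_labels = rng.integers(0, 6, size=100)                     # 6 个开放词汇类别(假设)
query = rng.normal(size=(10, 32))
print(retrieve_labels(query, ref_seg, ref_labels))
```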
方法框架:
(二十六)
标题:Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation
中文标题:解锁约束:无源遮挡感知无缝分割
英文摘要:Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method.
中文摘要:全景图像处理对于全场景感知至关重要,但面临诸如畸变、视角遮挡和标注有限等限制。以往的无监督域适应方法将知识从有标注的针孔数据转移到无标注的全景图像,但它们需要获取源针孔数据。为解决这些问题,我们引入一个更具实际意义的任务,即无源遮挡感知无缝分割(SFOASS),并提出其首个解决方案,称为无约束学习全场景知识(UNLOCK)。具体而言,UNLOCK 包含两个关键模块:全场景伪标签学习和非模态驱动的上下文学习。该框架在不依赖源数据或目标标签的情况下进行自适应,提升模型以实现 360°视角覆盖和遮挡感知推理的分割。此外,我们通过实到实和合成到实的适应设置对所提出的 SFOASS 任务进行基准测试。实验结果表明,我们的无源方法实现了与依赖源数据的方法相当的性能,在平均平均精度(mAAP)中达到 10.9 的领先分数,在平均精度(mAP)中达到 11.6,与仅使用源数据的方法相比,在平均精度质量(mAPQ)中绝对提升 +4.3。
方法框架:
以上就是最近收集的ICCV2025已录用论文,后续还会持续更新,感兴趣的可以微信搜索公众号【AI启智汇】进行关注,了解和获取最新论文。