
YOLO-UniOW概述 论文

论文:https://arxiv.org/abs/2412.20645
代码:https://github.com/THU-MIG/YOLO-UniOW

YOLO-UniOW: Efficient Universal Open-World Object Detection

YOLO-UniOW:高效的通用开放世界对象检测

Abstract 摘要

Traditional object detection models are constrained by the limitations of closed-set datasets, detecting only categories encountered during training. While multimodal models have extended category recognition by aligning text and image modalities, they introduce significant inference overhead due to cross-modality fusion and still remain restricted by predefined vocabulary, leaving them ineffective at handling unknown objects in open-world scenarios. In this work, we introduce Universal Open-World Object Detection (Uni-OWD), a new paradigm that unifies open-vocabulary and open-world object detection tasks. To address the challenges of this setting, we propose YOLO-UniOW, a novel model that advances the boundaries of efficiency, versatility, and performance. YOLO-UniOW incorporates Adaptive Decision Learning to replace computationally expensive cross-modality fusion with lightweight alignment in the CLIP latent space, achieving efficient detection without compromising generalization. Additionally, we design a Wildcard Learning strategy that detects out-of-distribution objects as “unknown” while enabling dynamic vocabulary expansion without the need for incremental learning. This design empowers YOLO-UniOW to seamlessly adapt to new categories in open-world environments. Extensive experiments validate the superiority of YOLO-UniOW, achieving 34.6 AP and 30.0 APr on LVIS with an inference speed of 69.6 FPS. The model also sets benchmarks on M-OWODB, S-OWODB, and nuScenes datasets, showcasing its unmatched performance in open-world object detection. Code and models are available at https://github.com/THU-MIG/YOLO-UniOW .
传统的目标检测模型受到闭集数据集的限制,只能检测训练过程中遇到的类别。虽然多模态模型通过对齐文本和图像模态扩展了类别识别能力,但跨模态融合带来了显著的推理开销,并且仍然受限于预定义词汇表,难以处理开放世界场景中的未知对象。在这项工作中,我们提出了通用开放世界目标检测(Uni-OWD),这是一种将开放词汇与开放世界目标检测任务统一起来的新范式。为了解决这一设定带来的挑战,我们提出了 YOLO-UniOW,一个在效率、通用性和性能上都推进了边界的新模型。YOLO-UniOW 采用自适应决策学习(Adaptive Decision Learning),用 CLIP 潜在空间中的轻量级对齐取代计算开销较大的跨模态融合,在不损失泛化能力的情况下实现高效检测。此外,我们设计了一种通配符学习(Wildcard Learning)策略,将分布外的对象检测为“未知”,同时支持动态词汇扩展而无需增量学习。这一设计使 YOLO-UniOW 能够无缝适应开放世界环境中的新类别。大量实验验证了 YOLO-UniOW 的优越性:在 LVIS 上取得 34.6 AP 和 30.0 APr,推理速度为 69.6 FPS。该模型还在 M-OWODB、S-OWODB 和 nuScenes 数据集上树立了基准,展示了其在开放世界目标检测中无可匹敌的性能。代码和模型可在 https://github.com/THU-MIG/YOLO-UniOW 获取。
图1

图 1:速度-精度权衡曲线。YOLO-UniOW 和最近的方法在 LVIS minival 数据集上的速度和准确性的比较。推理速度是在没有TensorRT的单个NVIDIA V100 GPU上测量的。圆圈大小表示模型大小

1. Introduction 引言

Object detection has long been one of the most fundamental and widely applied techniques in the field of computer vision, with extensive applications in security [46], autonomous driving [57], and medical imaging [13]. Many remarkable works have achieved breakthroughs for object detection, such as Faster R-CNN [41], SSD [30], RetinaNet [26], etc.
物体检测长期以来一直是计算机视觉领域最基本、应用最广泛的技术之一,在安防[46]、自动驾驶[57]和医学成像[13]等领域有着广泛的应用。许多杰出的工作在目标检测方面取得了突破,如Faster R-CNN[41]、SSD[30]、RetinaNet[26]等。
In recent years, the YOLO (You Only Look Once) [1, 20, 40, 51] series of models has gained widespread attention for its outstanding detection performance and real-time efficiency. The recent YOLOv10 [51] establishes a new standard for object detection by employing a consistent dual assignment strategy, achieving efficient NMS-free training and inference.
 近年来,YOLO (You Only Look Once) [1, 20, 40, 51] 系列模型因其出色的检测性能和实时性而受到广泛关注。最近的 YOLOv10 [51] 通过使用一致的双重分配策略为目标检测建立了一个新的标准,实现了高效的无 NMS 训练和推理。
 However, traditional YOLO-based object detection models are often confined to a closed set definition, where objects of interest belong to a predefined set of categories.
 然而,传统的基于YOLO的目标检测模型往往局限于闭集定义,其中感兴趣的对象属于一组预定义的类别。
In practical open-world scenarios, when encountering unknown categories that have not been seen in the training datasets, these objects are often misclassified as background. This inability of models to recognize novel objects can also negatively impact the accuracy of known categories, limiting their robust application in real-world scenarios.
  在实际的开放世界场景中,当遇到训练数据集中没有遇到的未知类别时,这些对象通常会被误分类为背景。模型无法识别新对象也会对已知类别的准确性产生负面影响,限制了它们在现实世界场景中的稳健应用。
图2

图 2. 检测框架的比较。(a) 具有跨模态融合的开放词汇检测器。(b) 我们具有自适应决策学习的高效开放词汇检测器。(c) 开放世界和开放词汇检测器。(d) 我们的用于开放词汇和开放世界任务的 Uni-OWD 检测器。

Thanks to the development of vision-language models, such as [3, 19, 39, 47], combining their open-vocabulary capabilities with the efficient object detection of YOLO presents an appealing and promising approach for real-time open-world object detection. YOLO-World [4] is a pioneering attempt, where YOLOv8 [20] is used as the object detector, and CLIP’s text encoder is integrated as an open-vocabulary classifier for region proposals (i.e., anchors in YOLOv8). The decision boundary for object recognition is derived from representations of class names generated by CLIP’s text encoder. Additionally, a vision-language path aggregation network (RepVL-PAN) using reparameterization [6, 50] is introduced to comprehensively aggregate text and image features for better cross-modality fusion.
得益于视觉语言模型的发展,例如 [3, 19, 39, 47],将它们的开放词汇能力与 YOLO 的高效目标检测相结合,为实时开放世界目标检测提供了一种很有吸引力且有前途的方法。YOLO-World [4] 是一个开创性的尝试,其中 YOLOv8 [20] 被用作目标检测器,CLIP 的文本编码器被集成为针对区域建议(即 YOLOv8 中的锚点)的开放词汇分类器。物体识别的决策边界来源于 CLIP 文本编码器生成的类名表示。此外,引入了使用重参数化 [6, 50] 的视觉-语言路径聚合网络(RepVL-PAN),以全面聚合文本和图像特征,实现更好的跨模态融合。
Although YOLO-World is effective for open-vocabulary object detection (OVD), it still relies on a predefined vocabulary of class names, which must include all categories that are expected to be detected. This reliance significantly limits its ability to dynamically adapt to newly emerging categories, as determining unseen class names in advance is inherently challenging, preventing it from being truly open-world. Moreover, the inclusion of RepVL-PAN introduces additional computational costs, especially with large vocabulary sizes, making it less efficient for real-world applications.
尽管 YOLO-World 对开放词汇目标检测(OVD)有效,但它仍然依赖于预定义的类名词汇表,该词汇表必须包含所有期望检测到的类别。这种依赖极大地限制了它动态适应新出现类别的能力,因为提前确定未见过的类名本身就具有挑战性,使其无法成为真正的开放世界检测器。此外,引入 RepVL-PAN 带来了额外的计算成本,尤其是在词汇量很大的情况下,使其在实际应用中效率较低。
In this work, we first advocate a new setting of Universal Open-World Object Detection (Uni-OWD), in which we encourage realizing open-world object detection (OWOD) and open-vocabulary object detection (OVD) with one unified model. Specifically, it emphasizes that the model can not only recognize categories unseen during training but also effectively classify unknown objects as “unknown”. Additionally, we call for an efficient solution following YOLO-World to meet the efficiency requirements of real-world applications. To achieve this, we propose the YOLO-UniOW model, which achieves effective universal open-world detection while also enjoying greater efficiency.
在这项工作中,我们首先提倡一种新的通用开放世界目标检测设置(Uni-OWD),鼓励用一个统一的模型实现开放世界目标检测(OWOD)和开放词汇目标检测(OVD)。具体来说,它强调模型不仅可以识别训练期间未见过的类别,还可以有效地将未知对象分类为“未知”。此外,我们呼吁沿用 YOLO-World 的思路给出高效的解决方案,以满足实际应用中的效率要求。为此,我们提出了 YOLO-UniOW 模型,在实现有效的通用开放世界检测的同时,也享有更高的效率。
Our YOLO-UniOW emphasizes several insights for efficient Uni-OWD. (1) Efficiency. Besides using the recent YOLOv10 [51] as a more efficient object detector, we introduce a novel adaptive decision learning strategy, dubbed AdaDL, to wipe out the expensive cross-modality vision-language aggregation in RepVL-PAN, as illustrated in Fig. 2 (b). The goal of AdaDL is to adaptively capture task-related decision representations for object detection without sacrificing the generalization ability of CLIP. Therefore, we can align the image features and class features directly in the latent CLIP space without any heavy cross-modality fusion operations, achieving efficient and outstanding detection performance (see Fig. 1). (2) Versatility. The challenge of open-world object detection (OWOD) lies in differentiating all unseen objects with only one “unknown” category, without any supervision about unknown objects. To solve this issue, we design a wildcard learning method that uses a wildcard embedding to unlock the generic power of the open-vocabulary model. This wildcard embedding is optimized through simple self-supervised learning, which seamlessly adapts to dynamic real-world scenarios. As shown in Fig. 2 (d), our YOLO-UniOW can not only benefit from the dynamic expansion of the known category set like YOLO-World, i.e., open-vocabulary detection, but can also highlight any out-of-distribution objects with the “unknown” category for open-world detection. (3) High performance. We evaluate the zero-shot open-vocabulary capability on LVIS [14], and the open-world approach on benchmarks such as M-OWODB [44], S-OWODB [16], and nuScenes [2]. Experimental results show that our method can significantly outperform existing state-of-the-art methods for efficient OVD, achieving 34.6 AP and 30.0 APr on the LVIS dataset at a speed of 69.6 FPS. Besides, YOLO-UniOW also performs well in both zero-shot and task-incremental learning for open-world evaluation. These results demonstrate the effectiveness of the proposed YOLO-UniOW.
我们的 YOLO-UniOW 强调了实现高效 Uni-OWD 的几个要点。(1) 效率。除了使用最新的 YOLOv10 [51] 作为更高效的目标检测器外,我们还引入了一种新的自适应决策学习策略 AdaDL,以消除 RepVL-PAN 中昂贵的跨模态视觉-语言聚合,如图 2 (b) 所示。AdaDL 的目标是在不牺牲 CLIP 泛化能力的情况下,自适应地捕获与目标检测任务相关的决策表示。因此,我们可以直接在 CLIP 潜在空间中对齐图像特征和类别特征,而不需要任何繁重的跨模态融合操作,实现高效且出色的检测性能(见图 1)。(2) 多功能性。开放世界目标检测(OWOD)的挑战在于,在没有任何未知对象监督信号的情况下,仅用一个“未知”类别区分所有未见过的对象。为了解决这个问题,我们设计了一种通配符学习方法,利用通配符嵌入来释放开放词汇模型的通用能力。该通配符嵌入通过简单的自监督学习进行优化,从而无缝适应动态的现实世界场景。如图 2 (d) 所示,我们的 YOLO-UniOW 不仅可以像 YOLO-World 一样受益于已知类别集的动态扩展(即开放词汇检测),还可以将任何分布外的对象以“未知”类别标出,用于开放世界检测。(3) 高性能。我们在 LVIS [14] 上评估了零样本开放词汇能力,并在 M-OWODB [44]、S-OWODB [16] 和 nuScenes [2] 等基准上评估了开放世界方法。实验结果表明,我们的方法可以显著优于现有最先进的高效 OVD 方法,在 LVIS 数据集上以 69.6 FPS 的速度实现了 34.6 AP 和 30.0 APr。此外,YOLO-UniOW 在开放世界评估的零样本和任务增量学习中也表现良好。这些结果很好地证明了所提出的 YOLO-UniOW 的有效性。
 The contributions of this work are as follows:
  这项工作的贡献如下:

• We advocate a new setting of Universal Open-World Object Detection, dubbed Uni-OWD, to solve the challenges of dynamic object categories and unknown target recognition with one unified model. We provide an efficient solution based on the YOLO detector, ending up with our YOLO-UniOW.
我们提倡一种新的通用开放世界目标检测设置,称为 Uni-OWD,用一个统一模型来解决动态对象类别和未知目标识别的挑战。我们提供了一个基于 YOLO 检测器的高效解决方案,最终得到我们的 YOLO-UniOW。
• We design a novel adaptive decision learning (AdaDL) strategy to adapt the representation of decision boundaries to the task of Uni-OWD without sacrificing the generalization ability of CLIP. Thanks to AdaDL, we can leave out the heavy computation of the cross-modality fusion operations used in previous works.
我们设计了一种新的自适应决策学习(AdaDL)策略,在不牺牲 CLIP 泛化能力的情况下,使决策边界的表示适应 Uni-OWD 任务。得益于 AdaDL,我们可以省去以前工作中跨模态融合操作的繁重计算。
• We introduce wildcard learning to detect unknown objects, enabling iterative vocabulary expansion and seamless adaptation to dynamic real-world scenarios. This strategy eliminates the reliance on incremental learning strategies.
 我们引入了通配符学习来检测未知对象,从而实现迭代词汇扩展和无缝适应动态现实场景。该策略消除了对增量学习策略的依赖。
• Extensive experiments across benchmarks for both open-vocabulary object detection and open-world object detection show that YOLO-UniOW can significantly outperform existing methods, well demonstrating its versatility and superiority.
在开放词汇目标检测和开放世界目标检测的基准上进行的大量实验表明,YOLO-UniOW 可以显著优于现有方法,很好地展示了其多功能性和优越性。

2. Related Work 相关工作

2.1. Open-Vocabulary Object Detection 开放词汇表对象检测

Open-Vocabulary Object Detection (OVD) has emerged as a prominent research direction in computer vision in recent years. Unlike traditional object detection, OVD enables the detector to dynamically expand its categories without relying heavily on the fixed set of categories defined in the training dataset. Several works have explored leveraging Vision-Language Models (VLMs) for enhancing object detection. For instance, [4, 24, 28, 32, 42, 59, 60, 65, 68] utilize large-scale, easily accessible text-image pairs for pretraining, resulting in more robust and generalizable detectors, which are subsequently fine-tuned on specific target datasets. In parallel, [12, 36, 38, 53] focus on distilling the alignment of visual-text knowledge from VLMs into object detection, emphasizing the design of distillation losses and the generation of object proposals. Additionally, [7, 11, 54] investigate various prompt modeling techniques to more effectively transfer VLM knowledge to the detector, enhancing its performance in open-vocabulary and unseen category tasks.
近年来,开放词汇目标检测(Open-Vocabulary Object Detection, OVD)已成为计算机视觉的一个重要研究方向。与传统目标检测不同,OVD 能够动态扩展检测类别,而不严重依赖训练数据集中定义的固定类别集。一些工作探索了利用视觉语言模型(VLM)来增强目标检测。例如,[4, 24, 28, 32, 42, 59, 60, 65, 68] 利用大规模、易于获取的图文对进行预训练,从而得到更鲁棒、泛化能力更强的检测器,随后在特定目标数据集上进行微调。同时,[12, 36, 38, 53] 专注于将 VLM 中视觉-文本对齐的知识蒸馏到目标检测中,强调蒸馏损失的设计和候选框的生成。此外,[7, 11, 54] 研究了各种提示建模技术,以更有效地将 VLM 知识迁移到检测器中,提升其在开放词汇和未见类别任务中的表现。

2.2. Open-World Object Detection 开放世界对象检测

 Open-World Object Detection (OWOD) is an emerging direction in object detection, aiming to address the challenge of dynamic category detection. The goal is to enable detection models to identify known categories while recognizing unknown categories, and to incrementally adapt to new categories over time. Through methods such as manual annotation or active learning [31, 43, 62], unknown categories can be progressively converted into known categories, facilitating continuous learning and adaptation.
 开放世界目标检测(OWOD)是目标检测的一个新兴方向,旨在解决动态类别检测的挑战。目标是使检测模型能够识别已知类别,同时识别未知类别,并随着时间的推移逐步适应新的类别。通过手动注释或主动学习等方法[31,43,62],可以将未知类别逐步转换为已知类别,促进持续学习和适应。
 The concept of OWOD was first introduced by Joseph et al. [21], whose framework relies on incremental learning. By incorporating an energy-based object recognizer into the detection head, the model gains the ability to identify unknown categories. However, this method depends on replay mechanisms, requiring access to historical task data to update the model. Additionally, it often exhibits a bias toward known categories when handling unknown objects, limiting its generalization capabilities. To address these limitations, many subsequent studies have been proposed. For instance, [35, 67] improved the experimental setup for OWOD by introducing more comprehensive benchmark datasets and stricter evaluation metrics, enhancing the robustness of unknown category detection. While these improvements achieved promising results in controlled experimental settings, their adaptability to complex scenarios and dynamic category changes remains inadequate. Recent research has shifted focus toward optimizing the feature space to better separate known and unknown categories. Methods such as [9, 48, 55, 61] propose advancements in feature space extraction, enabling models to more effectively extract feature information for the localization and identification of unknown objects. Recently, several methods [25, 34, 71] have emerged, leveraging pretrained models for open-world object detection and achieving significant improvements.
Joseph 等人[21]首先提出了 OWOD 的概念,其框架依赖于增量学习。通过将基于能量的对象识别器引入检测头,该模型获得了识别未知类别的能力。然而,这种方法依赖于回放机制,需要访问历史任务数据来更新模型。此外,在处理未知对象时,它往往偏向于已知类别,从而限制了其泛化能力。为了解决这些限制,后续提出了许多研究。例如,[35, 67] 通过引入更全面的基准数据集和更严格的评估指标改进了 OWOD 的实验设置,增强了未知类别检测的鲁棒性。虽然这些改进在受控实验设置中取得了有希望的结果,但它们对复杂场景和动态类别变化的适应能力仍然不足。最近的研究转向优化特征空间,以更好地分离已知和未知类别。[9, 48, 55, 61] 等方法在特征空间提取上取得了进展,使模型能够更有效地提取特征信息,用于未知对象的定位和识别。最近还出现了几种方法 [25, 34, 71],利用预训练模型进行开放世界目标检测并取得了显著的改进。

2.3. Parameter Efficient Learning 参数高效学习

 Prompt learning has emerged as a significant research direction in both natural language processing (NLP) and computer vision. By providing carefully designed prompts to pre-trained large models such as [39], prompt learning enables models to perform specific tasks in unsupervised or semi-supervised settings efficiently. Methods such as [17, 23, 56, 58, 69, 70] introduce learnable prompt embeddings, moving beyond fixed, handcrafted prompts to enhance flexibility across various visual downstream tasks. And DetPro [7] is the first to apply it to open-vocabulary object detection, achieving significant improvements using learnable prompts derived from text inputs.
提示学习已成为自然语言处理(NLP)和计算机视觉中的重要研究方向。通过为 [39] 等预训练大模型提供精心设计的提示,提示学习使模型能够在无监督或半监督设置下高效地执行特定任务。[17, 23, 56, 58, 69, 70] 等方法引入了可学习的提示嵌入,超越了固定的手工提示,以增强在各种视觉下游任务中的灵活性。DetPro [7] 率先将其应用于开放词汇目标检测,使用从文本输入派生的可学习提示取得了显著的改进。
Low-Rank Adaptation (LoRA) [18] and its derivatives [29, 63, 64], as parameter-efficient fine-tuning techniques, have demonstrated outstanding performance in adapting large models. By inserting trainable low-rank decomposition modules into the weight matrices of pre-trained models without altering the original weights, LoRA significantly reduces the number of trainable parameters. CLIP-LoRA [63] introduces LoRA into VLM models as a replacement for adapters and prompts, enabling fine-tuning for downstream tasks with faster training speeds and improved performance.
低秩自适应(LoRA)[18]及其衍生方法[29, 63, 64]作为参数高效的微调技术,在适配大模型方面表现出色。通过在不改变原始权重的情况下,将可训练的低秩分解模块插入预训练模型的权重矩阵中,LoRA 显著减少了可训练参数的数量。CLIP-LoRA [63] 将 LoRA 引入 VLM 模型,作为适配器和提示的替代,能够以更快的训练速度和更高的性能对下游任务进行微调。
图3

图 3. 我们提出的高效通用开放世界对象检测管道。Open-Vocabulary Pretraining(左):使用多模态双头匹配进行有效的端到端目标检测,文本编码器中的 AdaDL 进行自适应决策边界学习。开放世界微调(右):利用校准的文本嵌入和检测器在通配符的帮助下自适应地检测已知和未知对象。采用过滤策略去除重复的未知预测,确保高效有效的开放世界目标检测。

3. Efficient Universal Open-World Object Detection 高效的通用开放世界目标检测

3.1. Problem Definition 问题定义

Universal Open-World Object Detection (Uni-OWD) extends the challenges of Open-Vocabulary Detection (OVD) and Open-World Object Detection (OWOD), aiming to create a unified framework that not only detects known objects in the vocabulary but also dynamically adapts to unknown objects while maintaining scalability and efficiency in real-world scenarios.
 通用开放世界对象检测 (Uni-OWD) 扩展了开放词汇检测 (OVD) 和开放世界对象检测 (OWOD) 的挑战,旨在创建一个统一框架,不仅可以检测词汇表中的已知对象,还可以动态适应未知对象,同时保持现实场景的可扩展性和效率。
Define the object category set as $C = C_k \cup C_{unk}$, where $C_k$ represents the set of known categories, $C_{unk}$ represents the set of unknown categories, and $C_k \cap C_{unk} = \emptyset$. Given an input image $\mathcal{I}$ and a vocabulary $\mathcal{V}$, the goal of Uni-OWD is to design a detector $\mathcal{D}$ that satisfies the following objectives:
定义对象类别集为 $C = C_k \cup C_{unk}$,其中 $C_k$ 表示已知类别集,$C_{unk}$ 表示未知类别集,且 $C_k \cap C_{unk} = \emptyset$。给定输入图像 $\mathcal{I}$ 和词汇表 $\mathcal{V}$,Uni-OWD 的目标是设计一个满足以下目标的检测器 $\mathcal{D}$:

  1. For each category $c_k \in C_k$, represented by its text $\mathcal{T}_{c_k} \in \mathcal{V}$, the detector $\mathcal{D}$ should accurately predict the bounding boxes $\mathcal{B}_{c_k}$ and their associated category labels $c_k$: $\mathcal{D}(\mathcal{I}, \mathcal{V}) \rightarrow \{(b, c_k) \mid b \in \mathcal{B}_{c_k},\ c_k \in C_k\}$
    对于每个类别 $c_k \in C_k$,由其文本 $\mathcal{T}_{c_k} \in \mathcal{V}$ 表示,检测器 $\mathcal{D}$ 应准确预测边界框 $\mathcal{B}_{c_k}$ 及其对应的类别标签 $c_k$:$\mathcal{D}(\mathcal{I}, \mathcal{V}) \rightarrow \{(b, c_k) \mid b \in \mathcal{B}_{c_k},\ c_k \in C_k\}$
  2. For objects belonging to $C_{unk}$, the detector should identify their bounding boxes $\mathcal{B}_{unk}$ and assign them the generic label “unknown” with a wildcard $\mathcal{T}_w$, such that: $\mathcal{D}(\mathcal{I}, \mathcal{T}_w) \rightarrow \{(b, \text{unknown}) \mid b \in \mathcal{B}_{unk}\}$
    对于属于 $C_{unk}$ 的对象,检测器应识别其边界框 $\mathcal{B}_{unk}$,并借助通配符 $\mathcal{T}_w$ 为其分配通用标签“unknown”,即:$\mathcal{D}(\mathcal{I}, \mathcal{T}_w) \rightarrow \{(b, \text{unknown}) \mid b \in \mathcal{B}_{unk}\}$
  3. The detector can iteratively expand the known category set $C_k$ and vocabulary $\mathcal{V}$ by discovering new categories $C_{new}$ from $C_{unk}$, represented as $C_k^{t+1} = C_k^{t} \cup C_{new}$
    检测器可以通过从 $C_{unk}$ 中发现新的类别 $C_{new}$,迭代地扩展已知类别集 $C_k$ 和词汇表 $\mathcal{V}$,表示为 $C_k^{t+1} = C_k^{t} \cup C_{new}$
     The Uni-OWD framework is designed to develop a detector that leverages a textual vocabulary and a wildcard to identify both known and unknown object categories within an image, combining the strengths of open-vocabulary and open-world detection tasks. It ensures precise detection and classification for known categories while assigning a generic “unknown” label to unidentified objects. This design promotes adaptability and scalability, making it well-suited for dynamic and real-world applications.
     Uni-OWD框架旨在开发一种检测器,利用文本词汇表和通配符来识别图像中已知和未知的物体类别,结合了开放词汇和开放世界检测任务的优点。它确保对已知类别进行精确检测和分类,同时为未识别物体分配通用的"未知"标签。这种设计增强了适应性和可扩展性,使其非常适合动态的真实世界应用场景。
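
To make the three objectives concrete, here is a minimal Python sketch of the interface a Uni-OWD detector is expected to expose: known categories are predicted from a textual vocabulary, out-of-vocabulary objects fall back to a single wildcard label, and the vocabulary can grow over time. The class and method names (`UniOWDetector`, `detect`, `expand_vocabulary`) are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class UniOWDetector:
    vocabulary: List[str]     # known category names C_k, supplied as text T_ck in V
    wildcard: str = "object"  # wildcard prompt T_w used for out-of-vocabulary objects

    def detect(self, image) -> List[Tuple[BBox, str]]:
        """Objectives 1-2: return (box, label) pairs where the label is either a known
        class name from the vocabulary or the generic 'unknown'."""
        raise NotImplementedError  # stands in for the actual YOLO-UniOW forward pass

    def expand_vocabulary(self, new_classes: List[str]) -> None:
        """Objective 3: C_k^{t+1} = C_k^t ∪ C_new -- new class names are simply appended,
        with no incremental retraining implied."""
        self.vocabulary.extend(c for c in new_classes if c not in self.vocabulary)
```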

3.2. Efficient Adaptive Decision Learning 高效自适应决策学习

Designing a universal open-world object detection model suitable for deployment on edge and mobile devices demands a strong emphasis on efficiency. Traditional open-vocabulary detection models [4, 28, 42, 65] align text and image modalities by introducing fine-grained fusion operations in the early layers. They then rely on contrastive learning between both modalities to establish decision boundaries for object classification, enabling the model to adapt dynamically to novel classes during inference by leveraging new textual inputs.
 设计一个适用于边缘和移动设备的通用开放世界目标检测模型需要高度重视效率。传统的开放词汇检测模型 [4, 28, 42, 65] 通过在早期层引入细粒度融合操作来对齐文本和图像模态。然后它们依赖两种模态的对比学习来建立目标分类的决策边界,使模型能够通过利用新的文本输入在推理过程中动态适应新类别。
YOLO-World [4] proposed an efficient architecture, RepVL-PAN, to perform image-text fusion through reparameterization. Despite its advancements, the model’s inference speed is still heavily influenced by the number of textual class inputs. This poses a challenge for low-compute devices, where performance degrades sharply as the number of text inputs increases, making it unsuitable for real-time detection tasks in complex, multi-class scenarios. To address this, we propose an adaptive decision learning strategy (AdaDL) to eliminate the heavy early-layer fusion operations.
 YOLO-World [4]提出了一种高效架构RepVL-PAN,通过重参数化实现图文融合。尽管取得了进展,该模型的推理速度仍受文本类别输入数量的严重影响。这对低算力设备构成挑战:随着文本输入量增加,性能急剧下降,使其难以胜任复杂多类别场景的实时检测任务。为此,我们提出自适应决策学习策略(AdaDL)来消除繁重的早期层融合操作。
 During the construction of decision boundaries, most existing methods freeze the text encoder and rely on pretrained models, such as BERT[5] or CLIP[39], to extract textual features for interaction with visual features. Without a fusion structure, the text features struggle to capture image-related information dynamically, leading to suboptimal multimodal decision boundary construction when adjustments are made solely to the image features. To overcome this, our AdaDL strategy aims to enhance the decision representation during training for the Uni-OWD scenario. Specifically, during training, we introduce efficient parameters into the text encoder by incorporating Low-Rank Adaptation (LoRA) into all query, key, value and output projection layers, which can be described as:
 在构建决策边界的过程中,现有方法大多会冻结文本编码器并依赖预训练模型(如BERT[5]或CLIP[39])提取文本特征以与视觉特征交互。由于缺乏融合结构,文本特征难以动态捕捉图像相关信息,导致仅调整图像特征时多模态决策边界的构建效果欠佳。为解决这一问题,我们提出的AdaDL策略旨在增强Uni-OWD场景下训练时的决策表征。具体而言,在训练阶段,我们通过将低秩自适应(LoRA)引入所有查询、键、值和输出投影层,向文本编码器注入高效参数,该过程可表述为:

$$h = W'x = W_0 x + \Delta W x \tag{1}$$
where $W_0$ represents the pretrained weights of the CLIP text encoder, and $\Delta W$ is the product of two low-rank matrices. The model’s input and output are $x$ and $h$. The rank is set to a value much smaller than the model’s feature dimension. This strategy ensures that the pre-trained parameters of the text encoder remain unchanged, while the low-rank matrices dynamically store information related to cross-modality interactions during training. By continuously calibrating the outputs of the text encoder, this method allows the decision boundaries constructed by both modalities to adapt more effectively to each other. In practice, the calibrated text embeddings can be precomputed and stored offline, thereby avoiding the computational cost of the text encoder during inference.
其中,$W_0$ 表示 CLIP 文本编码器的预训练权重,$\Delta W$ 由两个低秩矩阵的乘积构成。模型的输入和输出分别为 $x$ 和 $h$。秩的取值远小于模型的特征维度。该策略确保文本编码器的预训练参数保持不变,同时低秩矩阵在训练期间动态存储跨模态交互相关的信息。通过持续校准文本编码器的输出,此方法使两种模态构建的决策边界能够更有效地相互适配。实际应用中,校准后的文本嵌入可以预先计算并离线存储,从而在推理时规避文本编码器的计算开销。
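
As a rough illustration of Eq. (1), the sketch below wraps a frozen linear projection with a low-rank update ΔW = BA, the way LoRA is commonly applied to the query, key, value, and output projections of a text encoder. It is a generic PyTorch LoRA layer written for this explanation (class names, the rank, and the scaling are assumptions), not the released training code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = W0 x + ΔW x, with W0 frozen and ΔW = (alpha / r) * B A  (cf. Eq. 1)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)            # keep the pretrained W0 fixed
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # A: d_in -> r
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # B: r -> d_out
        nn.init.zeros_(self.lora_b.weight)                # start training with ΔW = 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def add_lora_to_attention(attn_layer: nn.Module, r: int = 16) -> None:
    """Wrap the q/k/v/output projections of one attention block, if present.
    Attribute names follow the common CLIP convention and are an assumption here."""
    for name in ("q_proj", "k_proj", "v_proj", "out_proj"):
        if hasattr(attn_layer, name):
            setattr(attn_layer, name, LoRALinear(getattr(attn_layer, name), r=r))
```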
YOLOv10 as the efficient object detector. To improve efficiency, we integrate the proposed adaptive decision learning strategy into the recent YOLOv10 [51], which serves as the efficient object detector. We employ multimodal dual-head matching to adapt the decision boundary for both classification heads in YOLOv10. Specifically, during the region-text contrastive learning between the region anchors and the class texts, we refine the region embeddings from the two heads by aligning them with shared, semantically rich text representations, enabling seamless end-to-end training and inference. Furthermore, we integrate a consistent dual alignment strategy for region contrastive learning, where the dual-head matching process is formalized as:
 YOLOv10作为高效目标检测器。为提升效率,我们将提出的自适应决策学习策略融入当前先进的YOLOv10框架中作为高效目标检测器。我们采用多模态双头匹配机制来适配YOLOv10中两个分类头的决策边界。具体而言,在区域锚点与类别文本之间进行区域-文本对比学习时,通过将双头输出的区域嵌入与共享的语义丰富文本表征对齐,从而精炼区域嵌入表示,实现无缝的端到端训练与推理。此外,我们还整合了区域对比学习的双头一致性对齐策略,其双头匹配过程可形式化表示为:
$$m(\alpha, \beta) = s^{\alpha} \times u^{\beta} \tag{2}$$
where $u$ represents the IoU value between the predicted box and the ground-truth box, and $s$ is the classification score obtained from multi-modal information, which is derived as:
其中 $u$ 代表预测框与真实框之间的 IoU 值,$s$ 是由多模态信息获得的分类分数,其计算公式为:
$$s = \mathrm{sim}(I, T) \tag{3}$$
where $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity, $T$ is the embedding of the text $\mathcal{T} \in \mathcal{V}$, and $I$ is the pixel-level feature from the image $\mathcal{I}$. To ensure a minimal supervision gap between the two heads during multimodal dual-head matching, we adopt consistent settings, where $\alpha_{o2o} = \alpha_{o2m}$ and $\beta_{o2o} = \beta_{o2m}$. This allows the one-to-one head to effectively learn consistent supervisory signals from the one-to-many head.
其中 $\mathrm{sim}(\cdot, \cdot)$ 表示余弦相似度,$T$ 是文本 $\mathcal{T} \in \mathcal{V}$ 的嵌入向量,$I$ 是图像 $\mathcal{I}$ 的像素级特征。为确保多模态双头匹配过程中两个头之间的监督差距最小,我们采用一致的设置,令 $\alpha_{o2o} = \alpha_{o2m}$ 且 $\beta_{o2o} = \beta_{o2m}$。这使得一对一匹配头能够有效地学习与一对多匹配头相一致的监督信号。
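
A minimal sketch of Eqs. (2)–(3): the classification score is the cosine similarity between region embeddings and the text embeddings of the ground-truth classes, and the assignment metric weights it against the IoU. The α/β values and tensor shapes here are illustrative assumptions; the point is that the same (α, β) pair is shared by the one-to-one and one-to-many heads.

```python
import torch
import torch.nn.functional as F

def matching_metric(region_emb: torch.Tensor,   # (N, D) embeddings of predicted regions
                    gt_text_emb: torch.Tensor,  # (M, D) text embeddings of each GT box's class
                    iou: torch.Tensor,          # (N, M) IoU u between predictions and GT boxes
                    alpha: float = 0.5,
                    beta: float = 6.0) -> torch.Tensor:
    """m(alpha, beta) = s^alpha * u^beta (Eq. 2), with s the cosine similarity of Eq. (3)."""
    s = F.normalize(region_emb, dim=-1) @ F.normalize(gt_text_emb, dim=-1).T  # (N, M) scores
    s = s.clamp(min=0)                    # keep the base of the power non-negative
    return s.pow(alpha) * iou.pow(beta)   # (N, M) assignment metric shared by both heads
```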
图4

图 4. 已知/通配类别学习流程。先前已知类别的文本嵌入保持冻结,而当前已知类别的嵌入通过真实标签进行微调。“未知”通配类别由调优好的通配符预测生成的伪标签进行监督。图中标出了调优后的通配符预测分数;置信度分数较低或与已知类别真实框 IoU 较高(虚线框)的预测会被过滤掉。

 As a result, the calibrated text encoder and YOLO structure can operate entirely independently in the early stages, eliminating the need for fusion operations while efficiently adapting to better multimodal decision boundaries.
 因此,校准后的文本编码器和YOLO结构在早期阶段可以完全独立运行,无需融合操作,就能高效适应更优的多模态决策边界。
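
Because no early-layer fusion couples the two branches, the calibrated class embeddings can be computed once and cached, so deployment needs only the YOLO branch plus a similarity lookup. The sketch below illustrates this; `text_encoder` and the prompt template are assumptions standing in for the LoRA-calibrated CLIP text encoder.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_text_cache(text_encoder, class_names, template="a photo of a {}"):
    """Run the calibrated text encoder once, offline, and cache normalized embeddings."""
    prompts = [template.format(name) for name in class_names]
    emb = text_encoder(prompts)        # (C, D); assumed to return one embedding per prompt
    return F.normalize(emb, dim=-1)

def classify_regions(region_feats: torch.Tensor, text_cache: torch.Tensor):
    """Inference-time classification: no text encoder, no fusion, just cosine similarity."""
    sim = F.normalize(region_feats, dim=-1) @ text_cache.T   # (N, C) similarity scores
    scores, labels = sim.max(dim=-1)
    return scores, labels
```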

3.3. Open-World Wildcard Learning 开放世界通配符学习

In the previous section, we introduced AdaDL to improve the efficiency of open-vocabulary object detection, mitigating the impact of a large number of input class texts on inference latency while also improving performance. This strategy enables real-world applications to expand the vocabulary while maintaining high efficiency, covering as many objects as possible. However, open-vocabulary models inherently rely on predefined vocabularies to detect and classify objects, which limits their capability in real-world scenarios. Some objects are difficult to predict or describe using textual inputs, making it challenging for open-vocabulary models to detect these out-of-vocabulary instances.
在前一部分中,我们介绍了 AdaDL 来提高开放词汇目标检测的效率,在减轻大量输入类别文本对推理延迟影响的同时提升了性能。该策略使实际应用能够在保持高效率的同时扩展词汇量,尽可能覆盖更多目标。然而,开放词汇模型本质上依赖于预定义词汇来检测和分类目标,这限制了其在现实场景中的能力。某些目标难以通过文本输入进行预测或描述,使得开放词汇模型难以检测这些词汇表之外的实例。
To address this, we propose a wildcard learning approach that enables the model to detect objects not present in the vocabulary and label them as “unknown” rather than ignoring them. Specifically, we directly leverage a wildcard embedding to unlock the generic power of the open-vocabulary model. As shown in Tab. 4, after the decision adaptation, the wildcard $\mathcal{T}_w$ (e.g. “object”) demonstrates remarkable capability in capturing unknown objects within a scene in a zero-shot manner. To further enhance its effectiveness, we fine-tune its text embedding on the pretraining dataset for a few epochs. During this process, all ground-truth instances are treated as belonging to the same “object” class. This fine-tuning enables the embedding to capture richer semantics, empowering the model to identify objects that might have been overlooked by the predefined specific classes.
 为此,我们提出了一种通配符学习方法,使模型能够检测词汇表中未出现的对象,并将其标记为"未知"而非忽略。具体而言,我们直接利用通配符嵌入来释放开放词汇模型的泛化能力。如表4所示,经过决策适配后,通配符 T w \mathcal{T}_w Tw(如"物体")以零样本方式展现出捕捉场景中未知物体的卓越能力。为进一步提升效果,我们在预训练数据集上对其文本嵌入进行了少量轮次的微调。在此过程中,所有真实实例均被视为属于同一个"物体"类别。这种微调使嵌入能够捕获更丰富的语义,从而让模型可以识别那些可能被预定义特定类别所遗漏的物体。
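
A small sketch of the label collapsing described above: before fine-tuning the wildcard embedding, every annotated instance is mapped to the single “object” class so that the wildcard learns to cover all ground-truth objects. The dictionary layout and function name are assumptions for illustration.

```python
from typing import Dict, List

def collapse_to_wildcard(targets: List[Dict]) -> List[Dict]:
    """Map every ground-truth label to the single wildcard class 'object' (index 0);
    boxes are kept unchanged, only the class labels are collapsed."""
    return [{"boxes": t["boxes"], "labels": [0] * len(t["labels"])} for t in targets]
```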
To avoid duplicate predictions for known classes, we utilize this well-tuned wildcard embedding $T_{obj}$ to teach an “unknown” wildcard embedding $T_{unk}$. The “unknown” wildcard is trained in a self-supervised manner without relying on ground-truth labels of the “unknown” class. As shown in Fig. 4, predictions whose highest similarity score across all known class embeddings is achieved with $T_{obj}$ are used as pseudo-label candidates. To further refine these candidates, we introduce a simple selection process:
为了避免对已知类别产生重复预测,我们利用这个调优好的通配符嵌入 $T_{obj}$ 来指导“未知”通配符嵌入 $T_{unk}$ 的训练。该“未知”通配符以自监督方式训练,无需依赖“未知”类别的真实标注。如图 4 所示,在所有已知类别嵌入中与 $T_{obj}$ 相似度得分最高的预测被用作伪标签候选。为进一步优化这些候选,我们引入了一个简单的筛选流程:
$$\Phi(s, u)=\begin{cases} 1, & \text{if } (u < \sigma_1) \land (s > \sigma_2) \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
where $u$ is the maximum IoU between a prediction and the known-class ground-truth boxes. Predictions with $u$ below the threshold $\sigma_1$ and classification score $s$ above the threshold $\sigma_2$ are retained, and these retained predictions are assigned to $T_{unk}$ as target labels.
其中,$u$ 表示预测框与已知类别真实框之间的最大交并比(IoU)。当预测框的 $u$ 低于阈值 $\sigma_1$ 且分类分数 $s$ 高于阈值 $\sigma_2$ 时,这些预测框被保留,并作为 $T_{unk}$ 的目标标签。
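
Eq. (4) amounts to two scalar tests per prediction. A minimal sketch (threshold values and tensor layout are assumptions) of selecting the candidates that supervise the “unknown” wildcard embedding:

```python
import torch

def select_unknown_pseudo_labels(scores: torch.Tensor,    # (N,) similarity s with T_obj
                                 max_iou: torch.Tensor,   # (N,) max IoU u with known GT boxes
                                 sigma1: float = 0.5,
                                 sigma2: float = 0.05) -> torch.Tensor:
    """Phi(s, u) = 1 iff u < sigma1 and s > sigma2 (Eq. 4): keep confident predictions that
    do not overlap known ground truth and use them as pseudo labels for T_unk."""
    return (max_iou < sigma1) & (scores > sigma2)   # boolean mask over the N candidates
```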
