Learning multi-modal representations by watching hundreds of surgical video lectures | Literature Express: latest paper sharing

news 2025/7/4 11:44:21

Title

Learning multi-modal representations by watching hundreds of surgical video lectures

Literature Express: Introduction

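Recent advances in surgical computer vision have begun to pave the way for a new generation of AI-assisted support systems for the operating room (OR) (Maier-Hein et al., 2017, 2022; Ward et al., 2021; Mascagni et al., 2022; Madani et al., 2020; Yuan et al., 2021). The field has made significant progress, moving from coarse surgical workflow recognition (Blum et al., 2008, 2010; Padoy et al., 2012; Twinanda et al., 2016; Dergachyova et al., 2016) to fine-grained surgical scene understanding through surgical action triplets (Nwoye et al., 2022), pixel-level scene segmentation (Allan et al., 2019; Alapatt et al., 2021), and surgical scene reconstruction (Wang et al., 2022; Pfeiffer et al., 2019; Rivoir et al., 2021). However, current progress has three main limitations. First, these methods largely focus on building task-specific, fully supervised deep learning models, which require substantial effort from clinical experts to generate labeled ground-truth data. Second, their effectiveness has mainly been validated on a limited number of single-center, procedure-specific surgical video datasets, which are insufficient to cover the intricate details of the whole surgical workflow (Eisenmann et al., 2022). Third, these methods do not explicitly integrate the rich semantics of natural language text into their design. Natural language text, which covers a broad range of visual concepts, can serve as a natural supervisory signal for vision models and help ensure high generality and usability across many downstream tasks. Methods that can scale to multiple downstream tasks using natural language supervision with minimal annotation, while fully exploiting large-scale multi-procedure videos, will play an important role in broadening the adoption of these approaches.

In general computer vision, multi-modal representation learning that combines visual information with free-form natural text (Radford et al., 2021; Miech et al., 2020) is emerging as a viable alternative that avoids the need to collect labeled training data for each downstream task (Radford et al., 2021; Jia et al., 2021). These methods aim to learn a low-dimensional joint latent space by pre-training two parallel encoders, one for vision and one for text, on large-scale paired visual-textual inputs. The shared latent space of the two modalities supports zero-shot transfer learning: the pre-trained visual and text encoders can adapt to different downstream tasks without fine-tuning on task-specific labels. This breakthrough has produced impressive results in many general computer vision applications, including zero-shot image classification (Radford et al., 2021), image captioning (Nukrai et al., 2022), semantic image retrieval (Sain et al., 2023), and text-to-shape generation (Sanghi et al., 2022).

Given this remarkable progress in multi-modal representation learning, a natural question arises: can such high-level joint representations be learned for surgical computer vision? If so, this would be an important step for surgical data science (Maier-Hein et al., 2022). With such representations, we could not only perform existing surgical video analysis tasks without task-specific labels, such as coarse- to fine-grained surgical workflow recognition (Twinanda et al., 2016; Nwoye et al., 2022), but also open new avenues for scalable intelligent cognitive assistance in the operating room. This includes vision-language applications such as surgical visual question answering (Seenivasan et al., 2022), surgical report generation (Xu et al., 2021b), and interactive communication between clinicians and surgical devices.

This work introduces SurgVLP (Surgical Vision Language Pre-training), a deep learning method for large-scale multi-modal representation learning in surgical computer vision. Developing such a method comes with its own challenges. One of the main obstacles is the lack of large-scale multi-modal, multi-procedure surgical datasets, compared with the millions of multi-modal visual-text pairs available in general computer vision (Radford et al., 2021; Grauman et al., 2022; Miech et al., 2019). For example, the recently developed Ego4D dataset (Grauman et al., 2022) collected 3000 hours of activity videos and had them narrated manually. Because collecting and annotating surgical videos requires a great deal of human effort, this approach is difficult to replicate in the surgical domain.

As our first contribution, we propose to use surgical video lectures available through open surgical e-learning platforms, such as WebSurg (Websurg, 2023) and EAES (EAES, 2023), and online video sharing platforms, such as YouTube (YouTube, 2023), for visual-textual multi-modal learning. In contrast to manually annotated medical imaging reports (Chen et al., 2022a) or surgical instructions (Rojas-Muñoz et al., 2020), we propose to use unprocessed and potentially noisy audio as the main source of supervision for multi-modal representation learning. We leverage recent advances in automatic speech recognition (ASR) (Mehrish et al., 2023) to transcribe the lecture audio into sentences and associate them with the corresponding video clips, building a large number of video clip-text pairs, as shown in Fig. 1. The resulting Surgical Video Lecture (SVL) dataset contains diverse descriptions of surgical events, instrument usage, and anatomical status across various surgical procedures, and therefore provides sufficient supervision for surgical multi-modal representation learning.

However, multi-modal representation learning on the SVL dataset faces several linguistic challenges. First, the surgical concepts described in these videos rely on domain-specific knowledge and scientific terminology that is uncommon in general computer vision. For example, "grasp the neck of the gallbladder and retract it toward the left lower quadrant to open Calot's triangle" and "dissect above the imaginary line of safety joining Rouviere's sulcus and the base of liver segment IV" are typical procedure-specific descriptions in video lectures on laparoscopic cholecystectomy. Second, there can be semantic misalignment between a surgical video clip and its text description. A lecturer describing a procedure may digress from the current case and recall a similar case with a bleeding event, even though no bleeding is shown in the video. Third, these videos contain long-range dependencies. For example, a lecturer may comment on the importance of adequate dissection for a tension-free anastomosis even though the dissection step was shown at the beginning of the procedure or was edited out. Finally, although recent ASR models (Chen et al., 2022b; Radford et al., 2023) transcribe everyday speech effectively, their performance in surgical settings is not ideal because of these surgery-specific linguistic challenges. For example, the state-of-the-art Whisper ASR model (Radford et al., 2023) handles sentence structure and common vocabulary well but struggles with surgery-specific terms (e.g., transcribing "jejunostomy" as "egenostomy"). Commercial medical-specific solutions such as AWS (2023) transcribe medical terms better but often fail to capture the overall structure and boundaries of sentences.

We propose two key techniques for surgery-specific multi-modal representation learning. First, we use text transcriptions from two noisy but complementary ASR systems, Whisper (Radford et al., 2023) and AWS (2023), to obtain improved supervisory signals for learning, as shown in Fig. 1, which mitigates the individual limitations and inaccuracies of each system. Second, we propose a new contrastive learning objective that uses the dual text transcriptions from the ASR systems together with the corresponding video clips. The objective encourages the embeddings of a video clip and of its two transcriptions to lie close together in the joint latent space. In this way, the learned multi-modal representations retain the semantics shared by the noisy ASR transcriptions and fuse visual and textual information more effectively.

To demonstrate the representational capability of the learned joint latent space, we introduce several surgical vision-language tasks as a multi-modal evaluation benchmark: text-based video retrieval, temporal activity grounding, and video captioning. Text-based video retrieval matches a given text query against a set of video clips, whereas temporal activity grounding localizes a given text query to specific clips within a full video. These two tasks test how well the joint latent space captures the underlying relationships between surgical visual information and its textual description. Video captioning generates a caption for a given surgical video clip. Because it is a generative task, it requires a text decoder to produce coherent text. We propose a way to build a text decoder and attach it to our pre-trained encoders, so that the pre-trained model can be repurposed as a video captioner. The whole process requires only text data to train the text decoder. We show that our method significantly improves over the baselines on all vision-language tasks.

Next, we evaluate the robustness and adaptability of our method on unseen surgical datasets and tasks. Specifically, we examine its performance on traditional vision-only surgical tasks: surgical tool, phase, and action triplet recognition (Twinanda et al., 2016; Nwoye et al., 2022). We evaluate our method in the zero-shot transfer setting by converting class labels (tools, phases, or action triplets) into text and classifying video frames by the similarity between visual and textual latent vectors. The results show that the general surgical concepts learned from diverse procedures through joint multi-modal representations can benefit a specific procedure such as laparoscopic cholecystectomy. To our knowledge, this is the first work to show that surgical tools, phases, and action triplets can be recognized without annotations through self-supervised multi-modal pre-training. Although our zero-shot performance lags behind fully supervised baselines, particularly on tasks that require fine-grained anatomical reasoning, the results highlight the potential of SurgVLP as a foundational backbone that reduces annotation cost for downstream tasks. Finally, we conduct extensive ablation studies to clarify the different components of our method and their effect on the results.

The contributions of our work can be summarized in four key aspects:

- We propose to use the knowledge in surgical video lectures, available through open surgical e-learning platforms, for visual-textual multi-modal representation learning. To this end, we introduce a large-scale Surgical Video Lecture (SVL) dataset comprising 1.4k surgical procedure videos.
- We propose to use text transcriptions from two complementary ASR systems, Whisper and AWS, to strengthen representation learning by addressing the linguistically inaccurate sentences these systems produce.
- We propose a novel contrastive learning objective that uses the dual text transcriptions from the ASR systems and the corresponding video clips, and that encourages their embeddings to lie close together in the joint latent space.
- We demonstrate the zero-shot transfer capability of the proposed framework on multiple vision-language and vision-only tasks.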
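The pairing of ASR sentences with video clips described in the introduction can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Sentence` and `ClipTextPair` types, the `build_pairs` helper, and the 2-second minimum clip length are all hypothetical, and it assumes the ASR system returns a start and end timestamp for each sentence.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    """One ASR sentence with timestamps in seconds (assumed ASR output format)."""
    start: float
    end: float
    text: str


@dataclass
class ClipTextPair:
    """A video clip interval paired with the sentence spoken during it."""
    clip_start: float
    clip_end: float
    text: str


def build_pairs(sentences, video_duration, min_len=2.0):
    """Pair each sentence with the video interval in which it was spoken.

    Intervals shorter than `min_len` seconds (a hypothetical threshold) are
    widened around their midpoint, then clamped to the video boundaries.
    """
    pairs = []
    for s in sentences:
        start = max(0.0, s.start)
        end = min(video_duration, s.end)
        if end - start < min_len:
            mid = (start + end) / 2
            start = max(0.0, mid - min_len / 2)
            end = min(video_duration, start + min_len)
        pairs.append(ClipTextPair(start, end, s.text.strip()))
    return pairs


if __name__ == "__main__":
    demo = [Sentence(10.0, 14.5, "We grasp the neck of the gallbladder.")]
    print(build_pairs(demo, video_duration=60.0))
```

Running the same routine on the output of each ASR system would give two text views per lecture, which is the setting the dual-transcription objective assumes.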

Abstract

Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP (Surgical Vision Language Pre-training), for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective to align video clip embeddings with the corresponding multiple text embeddings by bringing them together within a joint latent space. To effectively demonstrate the representational capability of the learned joint latent space, we introduce several vision-and-language surgical tasks and evaluate various vision-only tasks specific to surgery, e.g., surgical tool, phase, and triplet recognition. Extensive experiments across diverse surgical procedures and tasks demonstrate that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis. Furthermore, our zero-shot evaluations highlight SurgVLP’s potential as a general-purpose foundation model for surgical workflow analysis, reducing the reliance on extensive manual annotations for downstream tasks, and facilitating adaptation methods such as few-shot learning to build a scalable and data-efficient solution for various downstream surgical applications.

Conclusion

The expensive and laborious process of creating manually annotated datasets has been a main hindrance in developing scalable surgical computer vision AI systems. In this work, we argue that surgical video lectures available through open surgical e-learning platforms can provide a wealth of multi-modal knowledge to train a scalable system for multi-modal representation learning. We have harnessed this knowledge by creating a multi-modal and multi-procedural dataset comprising 1.4k surgical video lectures. In order to derive effective supervisory signals without manual annotations, we utilize the recent advancements in automatic speech recognition (ASR) systems to transcribe the audio from these videos into textual descriptions. This automated process has resulted in a visual-textual multi-modal surgical dataset consisting of descriptions of surgical events, instrument usage, and anatomical status across various surgical procedures. In order to tackle the surgery-specific linguistic challenges inherently present in these videos, we utilize text transcriptions from two complementary ASR models, namely Whisper and AWS. The AWS model captures specific surgical terms, whereas the Whisper model captures the overall sentence structure. By combining the complementary knowledge of these two systems, we overcome inherent limitations and inaccuracies present in each ASR system. We then propose a novel contrastive learning objective for multi-modal representation learning. Our approach, called SurgVLP, learns effective multi-modal representations by bringing the embeddings of multiple text transcriptions and of the video clip into close proximity in the joint latent space. To demonstrate the efficacy of the learned joint latent space, we present a range of vision-and-language tasks tailored for surgical computer vision. These tasks include text-based video retrieval, temporal activity grounding, and video captioning, serving as benchmarks for evaluating the multi-modal representation capability of SurgVLP. We demonstrate that the learned multi-modal representations are not only useful for these vision-and-language tasks but can also be seamlessly applied to traditional vision-only surgical downstream tasks. We show promising results on these vision-only surgical tasks, namely surgical tool, phase, and triplet recognition, without using any manual annotations.

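6.1. Discussion

6.1.1. Future work

This work shows that the proposed SurgVLP outperforms the state-of-the-art method from general computer vision (Radford et al., 2021) in zero-shot performance. This strong performance is attributable to the large-scale surgical vision-language dataset we constructed and to the pre-training strategy that uses multiple text views. However, zero-shot adaptation of SurgVLP uses no supervision from labeled data, so its performance still falls short of fully supervised methods (Twinanda et al., 2016; Czempiel et al., 2020). For practical applications, one potential improvement is to fine-tune the multi-modal representations of the pre-trained SurgVLP with a small amount of labeled data so that it adapts to downstream tasks. Specifically, the two-branch architecture of SurgVLP can capture detailed visual patterns in surgical scenes while encoding domain-specific textual knowledge (Kan et al., 2023). The feature extractors of fully supervised methods (Twinanda et al., 2016; Czempiel et al., 2020) can therefore be enhanced with complementary information from the text side.

Another direction for future research is to explore "cheaper" self-supervisory signals in the visual and textual modalities. Representative work includes building external knowledge bases (Shen et al., 2022) and retrieval-augmented vision-language pre-training (Xie et al., 2023). Given the recent rise of large language models (Touvron et al., 2023) and the clinical knowledge they encode, mining the knowledge in these models could help bridge the domain gap. In addition, the current work ignores the hierarchical structure inherent in surgical videos. To address this, hierarchical multi-modal pre-training could be introduced to further improve performance on surgical downstream tasks that require long temporal context for prediction.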

Figure

Fig. 1. Examples of video clip-text pairs from SVL dataset. The video clip-text pairs are pairs of video clips and their corresponding transcripts. We generate transcripts for hundreds of surgical video lectures using two ASR systems, i.e., AWS Medical Transcribe (AWS, 2023) and Whisper (Radford et al., 2023). The transcripts usually illustrate the essential concept of surgical anatomies, instruments and events. We use large-scale video clip-text pairs to learn joint multi-modal representations.

Fig. 2. Pipeline of proposed SurgVLP. Figure (a) shows examples of video clip-text pairs and their construction process. We have two text views and we pair them to random lengths of video clips. Figure (b) presents the contrastive learning objective with AWS sentences and Whisper sentences. SurgVLP utilizes the Info-NCE and MIL-NCE losses for AWS and Whisper sentences, respectively. Figure (c) illustrates how to perform downstream tasks in the zero-shot setting. We show the vision-and-language tasks, e.g., text-based video retrieval and temporal activity grounding, at the top and the vision-only tasks at the bottom.
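The objective in panel (b) can be illustrated with a small pure-Python sketch. It assumes each clip has exactly one AWS sentence (scored with an InfoNCE term) and a bag of one or more Whisper sentences (scored with a MIL-NCE term), and that the two terms are summed with equal weight. The function names, the temperature `tau=0.1`, the equal weighting, and the use of only the video-to-text direction are assumptions made for illustration; the digest does not give the exact formulation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def info_nce(videos, texts, tau=0.1):
    """InfoNCE: videos[i] is matched with texts[i]; other texts are negatives."""
    total = 0.0
    for i, v in enumerate(videos):
        logits = [dot(v, t) / tau for t in texts]
        total += log_sum_exp(logits) - logits[i]
    return total / len(videos)


def mil_nce(videos, text_bags, tau=0.1):
    """MIL-NCE: text_bags[i] holds several candidate positives for videos[i];
    every text in the other bags is treated as a negative."""
    total = 0.0
    for i, v in enumerate(videos):
        pos = [dot(v, t) / tau for t in text_bags[i]]
        neg = [dot(v, t) / tau
               for j, bag in enumerate(text_bags) if j != i
               for t in bag]
        total += log_sum_exp(pos + neg) - log_sum_exp(pos)
    return total / len(videos)


def dual_view_loss(videos, aws_texts, whisper_bags, tau=0.1):
    """Sum of the two terms (equal weighting is an assumption)."""
    return info_nce(videos, aws_texts, tau) + mil_nce(videos, whisper_bags, tau)


if __name__ == "__main__":
    v = [[1.0, 0.0], [0.0, 1.0]]          # unit-norm clip embeddings
    aws = [[1.0, 0.0], [0.0, 1.0]]        # one AWS sentence per clip
    whisper = [[[1.0, 0.0]], [[0.0, 1.0]]]  # one bag of Whisper sentences per clip
    print(dual_view_loss(v, aws, whisper))
```

The MIL-NCE term only requires that at least one sentence in a bag matches the clip, which is one plausible reason to apply it to the view with looser sentence boundaries.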

Fig. 3. Text-only training for video captioning: We use the learned joint embedding space, where text is encoded in a representation close to those of its corresponding video clips. During training, we train the text decoder to generate captions from text embeddings. During inference, video clips are fed to the visual encoder, and the resulting visual embeddings are passed to the text decoder to generate the text captions.

Fig. 4. Qualitative results of text-based video retrieval on SVL-Retrieval dataset using SurgVLP’s learned joint multi-modal representations. For each language query, we retrieve 3 video clips from the repository. The ground truth video clip is framed in green. In these examples, it always appears in the top-3 results.
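Retrieval of this kind reduces to ranking clip embeddings by their similarity to the query embedding and checking whether the ground-truth clip lands in the top k. The sketch below shows that step under simple assumptions: embeddings are plain lists of floats, similarity is cosine similarity, and the helper names `rank_clips` and `hit_rate_at_k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def rank_clips(query_emb, clip_embs):
    """Return clip indices sorted from most to least similar to the query."""
    scores = [cosine(query_emb, c) for c in clip_embs]
    return sorted(range(len(clip_embs)), key=lambda i: scores[i], reverse=True)


def hit_rate_at_k(query_embs, clip_embs, gt_indices, k):
    """Fraction of queries whose ground-truth clip is among the top-k results."""
    hits = 0
    for q, gt in zip(query_embs, gt_indices):
        if gt in rank_clips(q, clip_embs)[:k]:
            hits += 1
    return hits / len(query_embs)


if __name__ == "__main__":
    clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    queries = [[0.9, 0.1], [0.1, 0.9]]
    print(hit_rate_at_k(queries, clips, gt_indices=[0, 1], k=1))
```

In practice the query and clip embeddings would come from the pre-trained text and visual encoders rather than being written by hand.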

Fig. 5. Textual-visual activation maps from different sentence queries. The first row shows the ground truth. The second row shows the predicted activation map along the time axis for the raw sentence. The third row shows the newly generated activation maps conditioned by modified sentences. When the whole sentence is decomposed into sub-sentences, the SurgVLP approach generates a focused textual-visual activation map for the sentence with clear and less ambiguous words. This shows that SurgVLP responds to specific surgical terms rather than general terminology.

Fig. 6. Textual-visual activation maps of the SurgVLP model, computed on two language queries from SVL-Retrieval testing set. The language queries are shown at the top of the figure, and the first row shows the ground truth activation map. The second and the third rows show the activation maps of SurgVLP trained with one text view, i.e., AWS texts and Whisper texts, respectively. The last row shows that when the SurgVLP model is trained on both AWS and Whisper texts, it generates more concrete activation maps with less noise.

Fig. 7. Qualitative results of temporal activity grounding. We show the grounding results of two videos with three language queries. Each set of images represents a video clip. We show top-2 grounded clips for given text queries. Video clips framed in green are the ground truth to the given text. #1: top-1 grounded result. #2: top-2 grounded result.

Fig. 8. Caption results from text-only training for video captioning. Random: randomly initialized SurgVLP. CLIP (Radford et al., 2021): publicly available joint embedding space from OpenAI pre-trained CLIP model. SurgVLP shows more reliable captioning results with more overlap to the ground truth sentence. Also, the SurgVLP approach can generate detailed captions with the surgical instrument mentioned, e.g., "pledgets" in the last column of the top row.

Fig. 9. Effect of our designed contextual prompts on the zero-shot transfer of vision-only downstream tasks. Our contextual prompts outperform their counterparts by encoding more specific action and anatomy information, thus boosting phase recognition and instrument-verb recognition.

Fig. 10. Text architecture selection. We calculate the cosine similarity score between the transcript texts from ASR and pre-segment texts from metadata to measure which text encoder retains the semantic information between these two texts.

Table

Table 1 Comparison of transcriptions generated by AWS and Whisper ASR systems.

Table 2 Manually designed contextual prompts for the class names of the surgical phase and tool recognition tasks. The main action of scissors is cutting, but this action can be performed by many other instruments, such as hook. Therefore, we use "I use scissors" as the context prompt for the "Scissors" class.
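Zero-shot recognition with such prompts amounts to encoding each class prompt as text, comparing it with the frame embedding, and picking the most similar class. The sketch below illustrates only that mechanism. The bag-of-words `toy_encode_text` function and its `VOCAB` are stand-ins for a real pre-trained text encoder, the frame embedding is made up, and apart from "I use scissors" the prompts are hypothetical examples, not the ones in the table.

```python
import math

# Toy vocabulary for the stand-in text encoder (hypothetical).
VOCAB = ["i", "use", "scissors", "hook", "grasper"]


def toy_encode_text(text):
    """Stand-in text encoder: word counts over VOCAB (not a real model)."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def zero_shot_classify(frame_emb, class_prompts, encode_text):
    """Return the class whose prompt embedding is most similar to the frame."""
    scores = {name: cosine(frame_emb, encode_text(prompt))
              for name, prompt in class_prompts.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    prompts = {
        "Scissors": "I use scissors",   # prompt quoted in the caption
        "Hook": "I use hook",           # hypothetical prompt
        "Grasper": "I use grasper",     # hypothetical prompt
    }
    # Made-up frame embedding living in the same toy space as the text.
    frame = [0.1, 0.1, 1.0, 0.0, 0.0]
    print(zero_shot_classify(frame, prompts, toy_encode_text)[0])
```

With a real model, `encode_text` would be the pre-trained text encoder and `frame_emb` the output of the visual encoder for a video frame.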

Table 3 Comparison of different datasets in this work. Human: if the dataset requires intervention by human annotators. SVL-Caption and SVL-Retrieval require partial intervention because texts are not annotated from scratch by human annotators.

Table 4 Ablation studies. We conduct three sets of experiments to demonstrate the effect of key designs in our approach: multiple text views, clips of random lengths, and frame sampling from video clips. The text-view rows compare a model trained with one AWS text view, a model trained with one Whisper text view, and a model trained with both text views. Random: selecting a video clip with a duration randomly chosen from the range of 2 to 10 s.
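The "Random" clip setting can be sketched as drawing a duration uniformly from 2 to 10 s and placing the clip around the sentence. Centering the clip on the sentence midpoint and shifting it to stay inside the video are assumptions of this sketch, and `sample_clip` is a hypothetical helper, not the authors' code.

```python
import random


def sample_clip(sentence_start, sentence_end, video_duration, rng,
                min_len=2.0, max_len=10.0):
    """Sample a clip of random duration around a sentence's midpoint.

    The duration is uniform in [min_len, max_len] seconds. The clip is
    shifted, if needed, so that it stays inside [0, video_duration]; it is
    only shortened when the video itself is shorter than the sampled length.
    """
    length = rng.uniform(min_len, max_len)
    center = (sentence_start + sentence_end) / 2
    latest_start = max(0.0, video_duration - length)
    start = min(max(0.0, center - length / 2), latest_start)
    end = min(video_duration, start + length)
    return start, end


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducibility
    print(sample_clip(48.0, 52.0, video_duration=120.0, rng=rng))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs, which is convenient when comparing ablation settings.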

Table 5 Comparison of different methods in text-based video retrieval and temporal activity grounding tasks.

Table 6 SVL-Retrieval dataset. We show the categorical tags of the videos in the SVL-Retrieval testing set. Each video can belong to multiple categories, reflecting the diverse range of surgical procedures included in the testing set.

Table 7 Quantitative results of text-only training for video captioning. We report 6 conventional metrics to measure the similarity between generated text and ground-truth text. Our proposed SurgVLP significantly outperforms previous work, especially for ROUGE, which requires an accurate representation of not only individual words but also their correct order.

Table 8 Zero-shot tool recognition on Cholec80. T1: grasper; T2: bipolar; T3: hook; T4: scissor; T5: clipper; T6: irrigator; T7: specimen bag. Fully-supervised: ResNet50 model with full supervision.

Table 9 Zero-shot phase recognition on Cholec80. P1: preparation; P2: Calot triangle dissection; P3: clipping and cutting; P4: gallbladder dissection; P5: gallbladder packing; P6: cleaning and coagulation; P7: gallbladder extraction. F1-Score is used as the evaluation metric. Fully-supervised: ResNet50 model with full supervision.

Table 10 Zero-shot triplet recognition results. We report the average precision for each component and the combination of the components. i: instrument, v: verb, t: target, iv: instrument-verb, it: instrument-target, ivt: instrument-verb-target triplet.
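For readers unfamiliar with the metric, average precision for one class can be computed by ranking frames by predicted score and averaging the precision at each rank where a positive frame occurs. The sketch below uses this common ranking-based definition; the benchmark's official evaluation code may differ in details such as interpolation or how classes without positives are handled, and the helper names are hypothetical.

```python
def average_precision(scores, labels):
    """Ranking-based AP for one class.

    `scores` are predicted confidences per frame, `labels` are 0/1 ground
    truth. Returns None when the class has no positive frame.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    if hits == 0:
        return None
    return sum(precisions) / hits


def mean_average_precision(scores_per_class, labels_per_class):
    """Mean of per-class APs, skipping classes without positives."""
    aps = [average_precision(s, l)
           for s, l in zip(scores_per_class, labels_per_class)]
    aps = [ap for ap in aps if ap is not None]
    return sum(aps) / len(aps) if aps else None


if __name__ == "__main__":
    print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

The same routine applies to each component (i, v, t) and to each combination (iv, it, ivt) once the scores and labels are arranged per class.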
