论文翻译:Falcon: A Remote Sensing Vision-Language Foundation Model
论文地址:https://arxiv.org/abs/2503.11070
0 Abstract
英文原文 | 中文翻译 |
---|---|
This paper introduces a holistic vision-language foundation model tailored for remote sensing, named Falcon. Falcon offers a unified, prompt-based paradigm that effectively executes comprehensive and complex remote sensing tasks. Falcon demonstrates powerful understanding and reasoning abilities at the image, region, and pixel levels. Specifically, given simple natural language instructions and remote sensing images, Falcon can produce impressive results in text form across 14 distinct tasks, i.e., image classification, object detection, segmentation, image captioning, etc. To facilitate Falcon’s training and empower its representation capacity to encode rich spatial and semantic information, we developed Falcon_SFT, a large-scale, multi-task, instruction-tuning dataset in the field of remote sensing. The Falcon_SFT dataset consists of approximately 78 million high-quality data samples, covering 5.6 million multi-spatial resolution and multi-view remote sensing images with diverse instructions. It features hierarchical annotations and undergoes manual sampling verification to ensure high data quality and reliability. Extensive comparative experiments verify that Falcon achieves remarkable performance over 67 datasets and 14 tasks, despite having only 0.7B parameters. We release the complete dataset, code, and model weights at https://github.com/TianHuiLab/Falcon, hoping to help further develop the open-source community. | 本文介绍了一个专为遥感领域量身定制的整体视觉语言基础模型——Falcon。Falcon 提供了一个统一的、基于提示的范式,能够高效地执行全面而复杂的遥感任务。Falcon 在图像、区域和像素级别展现出强大的理解和推理能力。具体而言,只需简单的自然语言指令和遥感图像,Falcon 就能在 14 个不同的任务(例如图像分类、目标检测、分割、图像字幕等)中以文本形式生成令人印象深刻的结果。为了促进 Falcon 的训练并增强其表征能力以编码丰富的空间和语义信息,我们开发了 Falcon_SFT,这是一个大规模、多任务的遥感指令微调数据集。Falcon_SFT 数据集包含约 7800 万个高质量数据样本,涵盖 560 万幅多空间分辨率、多视角的遥感图像,并包含多种指令。该数据集采用分层标注,并经过人工采样验证,以确保数据的高质量和可靠性。大量的对比实验表明,尽管参数规模仅有 7 亿(0.7B),Falcon 在 67 个数据集和 14 个任务上仍取得了优异的表现。我们将完整的数据集、代码和模型权重发布到 https://github.com/TianHuiLab/Falcon,希望能够助力开源社区的进一步发展。 |
1. Introduction
英文原文 | 中文翻译 |
---|---|
Large vision language models (LVLMs) have demonstrated remarkable success in various vision-language tasks on natural images [1,12,38,55,101]. However, due to the significant domain and embedded knowledge gap between natural images and remote sensing images, developing a remote sensing foundational vision-language model remains a substantial challenge. To this end, previous studies [21,27,37,51,96] usually focused on learning vision-language models that excel in specific remote sensing tasks, limiting their adaptability to more diverse and complex scenarios. With the ongoing advancement of Artificial General Intelligence (AGI) systems, creating a foundational remote sensing model with comprehensive understanding and reasoning capabilities is of significant value. However, attaining such a foundational remote sensing model still faces significant challenges, which we summarize as follows: 1) Existing models did not feature a universal representation for diverse remote sensing tasks, often failing to facilitate the learning of comprehensive perceptual and reasoning abilities; 2) The absence of a large-scale, high-quality, multi-task dataset for training also limits the ability of current remote sensing models to learn robust and generalized representations. | 大型视觉语言模型 (LVLM) 在自然图像的各种视觉语言任务中表现出色 [1,12,38,55,101]。然而,由于自然图像和遥感图像之间存在巨大的领域和嵌入式知识差距,开发遥感基础视觉语言模型仍然是一项艰巨的挑战。为此,先前的研究 [21,27,37,51,96] 通常侧重于学习在特定遥感任务中表现优异的视觉语言模型,限制了它们对更多样化和更复杂场景的适应性。随着通用人工智能 (AGI) 系统的不断发展,创建具有全面理解和推理能力的基础遥感模型具有重要意义。然而,实现这样的基础遥感模型仍然面临着重大挑战,我们总结如下:1)现有模型对各种遥感任务缺乏通用的表示,往往无法促进学习全面的感知和推理能力; 2)缺乏大规模、高质量、多任务的训练数据集也限制了当前遥感模型学习鲁棒且泛化的表征的能力。 |
To address the above challenges, we first propose Falcon, a versatile vision-language foundation model with comprehensive perceptual and reasoning abilities tailored for remote sensing. In particular, Falcon features a unified architecture for multitask learning, bridging image-level, region-level, and pixel-level reasoning and understanding abilities in one model. To the best of our knowledge, Falcon is the first remote sensing VLM capable of performing 14 diverse understanding and reasoning tasks across image, region, and pixel levels simultaneously. We hereby provide an ability comparison among various remote sensing VLMs and Falcon in Tab. 1. Compared with Falcon, previous models like GeoChat [27] and RSGPT [21] can only support a limited scope of remote sensing tasks, narrowing their application scenarios. The crucial challenge in designing Falcon is learning a universal representation for diverse remote sensing tasks. Inspired by the latest research in the natural image area [74,77,81,91], we utilize a unified network architecture to seamlessly integrate spatial hierarchy and semantic granularity information into a universal representation. The architecture consists of an image encoder and a multi-modality encoder-decoder. This design aligns the vision and language representations, and offers a unified framework for various remote sensing tasks without additional module designs. Besides, to further enhance the instruction understanding capability of Falcon, we propose a dynamic prompt training strategy that leverages multiple differently phrased versions of each instruction. In this way, given a user’s prompts and remote sensing images, Falcon can produce results in a unified textual form across a wide range of tasks, e.g., image classification, object detection, segmentation, image captioning, change detection, etc. | 为了应对上述挑战,我们首先提出了 Falcon,这是一个多功能的视觉语言基础模型,具有针对遥感应用的全面感知和推理能力。Falcon 具有统一的多任务学习架构,将图像级、区域级和像素级的推理和理解能力集成到一个模型中。据我们所知,Falcon 是第一个能够同时在图像、区域和像素级执行 14 种不同理解和推理任务的遥感 VLM。我们在表 1 中提供了各种遥感 VLM 与 Falcon 的能力比较。与 Falcon 相比,GeoChat[27] 和 RSGPT[21] 等先前的模型仅支持有限范围的遥感任务,从而限制了它们的应用场景。设计 Falcon 的关键挑战是学习针对各种遥感任务的通用表示。受自然图像领域最新研究[74, 77,81,91]的启发,我们利用统一的网络架构,将空间层次和语义粒度信息无缝集成到一个通用表示中。该架构由一个图像编码器和一个多模态编解码器组成。这种设计将视觉和语言表征统一起来,并为各种遥感任务提供了一个统一的框架,无需额外的模块设计。此外,为了进一步增强Falcon的指令理解能力,我们提出了一种动态提示训练策略,该策略利用每个指令的多个不同措辞版本。通过这种方式,根据用户的提示和遥感图像,Falcon可以在图像分类、目标检测、分割、图像字幕、变化检测等各种任务中以统一的文本形式生成结果。 |
Moreover, to facilitate Falcon’s training, we further develop Falcon_SFT, a large-scale, multi-task instruction-tuning dataset. Early remote sensing datasets [14, 43, 80] usually focused on a single or a few vision tasks. Recent studies proposed multimodal remote sensing datasets suitable for training vision-language models. However, these datasets often contain a limited number of image-text pairs, making them only useful for training models on specific tasks [21, 89, 96]. Therefore, we present Falcon_SFT, a large-scale multi-task instruction-tuning dataset. The Falcon_SFT dataset consists of approximately 78 million high-quality data samples, covering 5.6 million multi-spatial resolution and multi-view remote sensing images. Specifically, we uniformly standardize each sample in the Falcon_SFT dataset into a unified format, facilitating the training of our proposed Falcon. Please see Fig. 3 for data examples. | 此外,为了促进Falcon的训练,我们进一步开发了Falcon_SFT,这是一个大规模、多任务的指令调优数据集。早期的遥感数据集[14, 43, 80]通常侧重于单个或少数几个视觉任务。近期研究提出了适合训练视觉-语言模型的多模态遥感数据集。然而,这些数据集通常包含有限数量的图文对,使其仅适用于训练特定任务的模型[21, 89, 96]。因此,我们提出了Falcon_SFT,这是一个大规模、多任务的指令调优数据集。Falcon_SFT数据集包含约7800万个高质量数据样本,涵盖560万幅多空间分辨率和多视角遥感影像。具体而言,我们将Falcon_SFT数据集中的每个样本统一标准化为统一格式,以方便我们提出的Falcon的训练。数据示例请参见图3。 |
In experiments, we conduct a variety of evaluations of our proposed Falcon both qualitatively and quantitatively (see Fig. 1 for a quick preview). For qualitative evaluations, we visualize the prediction results of the 14 tasks individually and compare them with other state-of-the-art methods, in order to evaluate the performance of Falcon. For quantitative evaluations, we assess the performance of Falcon on each downstream task, along with its zero-shot performance on unseen data samples, highlighting the generalization ability of Falcon. Besides, we conduct detailed ablation studies for Falcon, showcasing the effectiveness of our training recipes. | 在实验中,我们对我们提出的 Falcon 进行了一系列定性和定量评估(参见图 1 快速预览)。在定性评估中,我们分别可视化了 14 项任务的预测结果,并与其他最先进的方法进行比较,以评估 Falcon 的性能。在定量评估中,我们评估了 Falcon 在每个下游任务上的表现,以及它在未见过的数据样本上的零样本性能,以突出 Falcon 的泛化能力。此外,我们还对 Falcon 进行了详细的消融研究,展示了我们训练方案的有效性。 |
Finally, to address the critical absence of a high-performance foundational model for remote sensing in the community, we will fully open-source our work with the complete dataset, code and model weights, aiming to bridge the gap between foundational models for remote sensing imagery and foundational models for natural imagery. Despite the substantial financial investment behind our proposed Falcon, we hope this effort will foster further research and development in the field, advancing the capabilities of remote sensing models and their real-world applications. | 最后,为了解决社区中高性能遥感基础模型严重缺失的问题,我们将完全开源我们的工作,包括完整的数据集、代码和模型权重,旨在弥合遥感影像基础模型与自然影像基础模型之间的差距。尽管我们提出的 Falcon 项目投入了大量资金,但我们希望这项工作能够促进该领域的进一步研究和开发,提升遥感模型的能力及其实际应用。 |
Contributions of this paper can be summarized as follows. 1) To the best of our knowledge, Falcon is the first remote sensing vision-language model to feature image, region, and pixel-level understanding and reasoning capabilities, supporting 14 tasks within a unified architecture. 2) As of March 2025, Falcon SFT stands as the largest and most comprehensive dataset for training vision-language models in the remote sensing field. 3) We have conducted extensive experiments to demonstrate the superiority of Falcon over previous VLMs, highlighting the effectiveness of Falcon and Falcon SFT in the field of remote sensing. The complete dataset, code, and model weights will be fully open-sourced to the community. | 本文的贡献可以概括如下:1)据我们所知,Falcon 是第一个具备图像、区域和像素级理解与推理能力的遥感视觉语言模型,在统一架构下支持 14 项任务。2)截至 2025 年 3 月,Falcon SFT 是遥感领域训练视觉语言模型的最大、最全面的数据集。3)我们进行了大量的实验,证明了 Falcon 相较于以往 VLM 的优势,凸显了 Falcon 和 Falcon SFT 在遥感领域的有效性。完整的数据集、代码和模型权重将完全开源给社区。 |
Figure 1. An overall performance comparison between Falcon and 10 state-of-the-art models across 14 remote sensing tasks at image, region, and pixel levels. Results demonstrate that Falcon outperformed existing models, showcasing superior and more comprehensive understanding and reasoning capabilities.
图 1. Falcon 与 10 个最先进模型在 14 个遥感任务(图像、区域和像素级别)上的整体性能比较。结果表明,Falcon 的表现优于现有模型,展现出更卓越、更全面的理解和推理能力。
Table 1. Comparisons of capabilities of different remote sensing vision-language models. Several representative models have been included in this table. Notably, Falcon exhibits the most comprehensive understanding and reasoning capabilities, covering image, region, and pixel levels comprehensively. For task abbreviations in the second row, please see Fig. 3 for details.
表1. 不同遥感视觉语言模型能力比较。本表收录了多个代表性模型。值得一提的是,Falcon展现了最全面的理解和推理能力,全面覆盖了图像、区域和像素级别。第二行中的任务缩写详见图3。
Table 2. Comparisons with VLMs’ remote sensing datasets.
表 2. 与 VLM 遥感数据集的比较。
Figure 3. An illustrative example of images, their corresponding instructions, and output format of different tasks in Falcon SFT dataset.
图 3. Falcon SFT 数据集中不同任务的图像、其对应指令和输出格式的说明性示例。
2. Related Work
2.1. Remote sensing datasets
英文原文 | 中文翻译 |
---|---|
The development of high-quality remote sensing datasets has attracted increasing attention in recent years. Previous studies in this field mainly focused on two perspectives. Some studies [14, 34, 67, 80] focused on image datasets each targeting a single or a few vision tasks. Long et al. [43] proposed Million-AID, a large-scale image dataset containing 51 categories and a million instances for remote sensing scene classification. G. Sumbul et al. [65] introduced BigEarthNet, comprising 590,326 images collected from Sentinel-1 and Sentinel-2 satellites, featuring several resolutions and image sizes. The DOTA series datasets [14, 80] were mainly sourced from Google Earth, the GF-2 satellite, and aerial images, which have greatly advanced the field of object detection. The latest version [14] featured 11,268 images, 18 categories, and an extensive set of annotations with oriented bounding boxes. Jacob Shermeyer et al. [61] proposed the RarePlanes dataset in order to improve the performance of detecting aircraft and their attributes in satellite imagery. GID [69], UAVid [45], and DLRSD [59] were commonly used datasets for the semantic segmentation task of RGB remote sensing images. | 近年来,高质量遥感数据集的开发越来越受到人们的关注。先前对该领域的研究主要集中在两个方面。一些研究 [14, 34, 67, 80] 侧重于针对单个或几个视觉任务的图像数据集。Long 等人 [43] 提出了 Million-AID,这是一个包含 51 个类别和一百万个实例的大规模图像数据集,用于遥感场景分类。G. Sumbul 等人 [65] 推出了 BigEarthNet,包含从 Sentinel-1 和 Sentinel-2 卫星收集的 590,326 幅图像,具有多种分辨率和图像尺寸。DOTA 系列数据集 [14, 80] 主要来源于谷歌地球、GF-2 卫星和航空图像,极大地推动了目标检测领域的发展。最新版本 [14] 包含 11,268 幅图像、18 个类别以及大量定向边界框注释。Jacob Shermeyer 等人 [61] 提出了 RarePlanes 数据集,以提高在卫星图像中检测飞机及其属性的性能。GID [69]、UAVid [45] 和 DLRSD [59] 是 RGB 遥感图像语义分割任务中常用的数据集。 |
Besides, several studies [40, 44, 86, 88] have developed multimodal datasets to support vision-language models in remote sensing. Dilxat Muhtar et al. [51] developed LHRS-Align, which included 0.9K samples for visual reasoning, 4K samples for detailed image descriptions, and 7K samples for conversational tasks. However, to use this dataset, users must download the original images from Google Earth imagery. RSICD [44], Sydney-Captions [54], UCM-Captions [54], NWPU-Captions [9] were datasets specifically created for remote sensing image caption generation tasks, containing 10921, 613, 2000, 31500 images, each accompanied by descriptions of varying lengths. | 此外,一些研究[40, 44, 86, 88]开发了多模态数据集,以支持遥感领域的视觉语言模型。Dilxat Muhtar等人[51]开发了LHRS-Align数据集,其中包含0.9K个用于视觉推理的样本、4K个用于详细图像描述的样本以及7K个用于对话任务的样本。然而,要使用此数据集,用户必须从谷歌地球影像中下载原始图像。RSICD[44]、Sydney-Captions[54]、UCM-Captions[54]和NWPU-Captions[9]是专门为遥感图像标题生成任务创建的数据集,包含10921、613、2000、31500幅图像,每幅图像都附有长度不一的标题。 |
Despite previous advancements, existing remote sensing datasets remained limited in terms of data scale, task diversity, hierarchical annotation, and annotation quality. The field still lacked a large-scale, multi-task dataset suitable for training foundational vision-language models, hindering their progress. To address this challenge, we present Falcon SFT in this paper, a comprehensive, large-scale, multitask instruction-tuning dataset for remote sensing. Specifically, we compiled 67 remote sensing datasets covering a variety of tasks, please refer to supplementary material for details. | 尽管先前取得了一些进展,但现有的遥感数据集在数据规模、任务多样性、层次化标注和标注质量方面仍然有限。该领域仍然缺乏适合训练基础视觉语言模型的大规模多任务数据集,阻碍了其发展。为了应对这一挑战,我们在本文中提出了 Falcon SFT,这是一个全面的、大规模的、多任务的遥感指令调优数据集。具体而言,我们整理了 67 个涵盖各种任务的遥感数据集,详情请参阅补充材料。 |
2.2. Remote Sensing Foundation Models
英文原文 | 中文翻译 |
---|---|
Recently, a considerable literature has grown up around the theme of developing remote sensing foundation models. These pre-trained foundation models can be categorized based on architectural design. The first category consists of ViT-based vision foundation models [35, 48, 50, 56]. For instance, Sun et al. proposed RingMo [66], a classic remote sensing vision model fine-tuned on 4 downstream tasks. These methods lacked reasoning abilities and could not be controlled via natural language instructions. The second category includes CLIP-based vision-language models [37, 75, 96]. For instance, Liu et al. proposed RemoteCLIP [37], the first vision-language foundation model for remote sensing that aligned text embeddings for downstream applications. However, these methods could not perform different tasks without designing additional modules. The third category comprises LLM-based vision-language models [27, 51, 92, 93]. Zhan et al. proposed SkyEyeGPT [89], specifically designed for remote sensing image understanding. Kartik Kuckreja et al. [27] introduced GeoChat, a versatile LLaVA-based remote sensing vision-language model, but it cannot perform complex pixel-level tasks such as segmentation or change detection. Similarly, LHRS-Bot [51] also lacked such capabilities. Furthermore, these methods often exceeded 7 billion parameters, leading to computational bottlenecks and low inference efficiency when deployed on edge devices. More importantly, we believe that an LLM module containing a significant number of parameters may not play an essential role in remote sensing, considering that this task still primarily focuses on the visual input. Therefore, in this paper, we propose a lightweight vision-language model to efficiently handle various remote sensing tasks in a unified paradigm. | 近年来,围绕开发遥感基础模型的主题涌现出了大量文献。这些预训练的基础模型可以根据架构设计进行分类。第一类是基于 ViT 的视觉基础模型 [35, 48, 50, 56]。例如,Sun 等人提出了 RingMo [66],这是一个经典的遥感视觉模型,可针对 4 个下游任务进行微调。这些方法缺乏推理能力,无法通过自然语言指令进行控制。第二类是基于 CLIP 的视觉语言模型 [37, 75, 96]。例如,刘等人提出了 RemoteCLIP [37],这是第一个用于遥感的视觉语言基础模型,它可以为下游应用对齐文本嵌入。然而,如果不设计额外的模块,这些方法就无法执行不同的任务。第三类是基于 LLM 的视觉语言模型 [27, 51, 92, 93]。Zhan 等人提出了 SkyEyeGPT [89],专为遥感图像理解而设计。 Kartik Kuckreja 等人 [27] 提出了 GeoChat,一种基于 LLaVA 的多功能遥感视觉语言模型,但它无法执行复杂的像素级任务,例如分割或变化检测。同样,LHRS-Bot [51] 也缺乏这样的能力。此外,这些方法的参数通常超过 70 亿,部署在边缘设备上时会导致计算瓶颈和推理效率低下。更重要的是,我们认为,考虑到遥感任务仍然主要侧重于视觉输入,包含大量参数的 LLM 模块可能并不会在遥感中发挥关键作用。因此,本文提出了一种轻量级的视觉语言模型,以便在统一的范式中有效地处理各种遥感任务。 |
3. Algorithm
英文原文 | 中文翻译 |
---|---|
In this section, we delve into the details of Falcon, introducing a simple yet effective way to address the challenge of unifying many complex remote sensing tasks in one model. Specifically, we will introduce the design of Falcon’s architecture and a multi-task learning paradigm that enables the unification of various vision-language tasks. | 本节将深入探讨 Falcon 的细节,介绍一种简单有效的方法,以应对将众多复杂的遥感任务统一起来的挑战。具体来说,我们将介绍 Falcon 的架构设计和多任务学习范式,从而实现各种视觉语言任务的统一。 |
Notation: Let $\mathcal{I}\in R^{H\times W\times 3}$ denote the input remote sensing image, with H and W denoting the height and width of the image. $\mathcal{T}$ denotes the input textual prompt. $y$ denotes the prediction target, i.e., the formulated visual annotations. $\mathcal{G}$ denotes the image encoder. $\mathcal{E}$ denotes the text token embedding function. $\mathcal{F}$ denotes the standard encoder-decoder network of the transformer architecture. | 符号:设 $\mathcal{I}\in R^{H\times W\times 3}$ 表示输入遥感图像,其中 H 和 W 表示图像的高度和宽度。$\mathcal{T}$ 表示输入文本提示。$y$ 表示预测目标,即公式化的视觉标注。$\mathcal{G}$ 表示图像编码器。$\mathcal{E}$ 表示文本标记嵌入函数。$\mathcal{F}$ 表示 Transformer 架构的标准编码器-解码器网络。 |
3.1 Architecture Overview
英文原文 | 中文翻译 |
---|---|
In Falcon, we employ a sequence-to-sequence framework that is capable of putting all distinct tasks into a unified format. As depicted in Fig. 2: | 在 Falcon 中,我们采用了一个序列到序列的框架,该框架能够将所有不同的任务放入统一的格式。如图 2 所示: |
1. Visual Processing: Given a remote sensing image $\mathcal{I}$ and a text prompt $\mathcal{T}$, we feed $\mathcal{I}$ to image encoder $\mathcal{G}$ to extract the visual token embedding $\mathcal{V}\in R^{N_{v}\times D_{v}}$, with $N_{v}$ and $D_{v}$ respectively representing the number and dimension of vision tokens. | 1. 视觉处理:给定一幅遥感图像 $\mathcal{I}$ 和一段文本提示 $\mathcal{T}$,我们将 $\mathcal{I}$ 馈送到图像编码器 $\mathcal{G}$ 以提取视觉 token 嵌入 $\mathcal{V}\in R^{N_{v}\times D_{v}}$,其中 $N_{v}$ 和 $D_{v}$ 分别表示视觉 token 的数量和维度。 |
2. Text Processing: Simultaneously, we leverage $\mathcal{E}$ to process $\mathcal{T}$ to obtain the text token embedding $\mathcal{E}(\mathcal{T})\in R^{N_{t}\times D}$. | 2. 文本处理:同时,我们利用 $\mathcal{E}$ 来处理 $\mathcal{T}$,以获得文本标记嵌入 $\mathcal{E}(\mathcal{T})\in R^{N_{t}\times D}$。 |
3. Multimodal Fusion: We combine the vision token embedding and the text token embedding to form a multi-modality embedding $\mathcal{X}=\left[\mathcal{V}^{\prime},\mathcal{E}(\mathcal{T})\right]$, where $\mathcal{V}^{\prime}\in R^{N_{t}\times D}$ is derived from $\mathcal{V}$ through a visual adapter [81], serving as the task-agnostic input to $\mathcal{F}$. | 3. 多模态融合:我们将视觉标记嵌入和文本标记嵌入结合起来,形成一个多模态嵌入 $\mathcal{X}=\left[\mathcal{V}^{\prime},\mathcal{E}(\mathcal{T})\right]$,其中 $\mathcal{V}^{\prime}\in R^{N_{t}\times D}$ 是通过视觉适配器 [81] 从 $\mathcal{V}$ 派生而来的,作为 $\mathcal{F}$ 的任务无关输入。 |
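To make the three steps above concrete, below is a minimal PyTorch-style sketch of the pipeline. The module names and calling convention are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class FalconPipelineSketch(nn.Module):
    """A minimal sketch of Falcon's visual processing, text processing,
    and multimodal fusion steps (illustrative only)."""

    def __init__(self, image_encoder, visual_adapter, text_embedding, encoder_decoder):
        super().__init__()
        self.G = image_encoder         # image encoder G
        self.adapter = visual_adapter  # visual adapter: V -> V' in the text embedding space
        self.E = text_embedding        # text token embedding function E
        self.F = encoder_decoder       # transformer encoder-decoder F

    def forward(self, image, prompt_ids):
        V = self.G(image)                    # visual token embedding V, shape (B, N_v, D_v)
        V_prime = self.adapter(V)            # adapted tokens V', projected to dimension D
        T = self.E(prompt_ids)               # text token embedding E(T), shape (B, N_t, D)
        X = torch.cat([V_prime, T], dim=1)   # multi-modality embedding X = [V', E(T)]
        return self.F(X)                     # task-agnostic input fed to the encoder-decoder
```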
Figure 2. The overview of Falcon model architecture. Given a single image or an image pair (for the task of change detection), Falcon can follow diverse multi-task instructions, generating a universal textual representation suitable for various remote sensing tasks. As shown in the figure, Falcon correctly distinguishes the category of the given image, provides the spatial bounding boxes/segmentation masks for the given objects and even detects subtle changes across images, highlighting its comprehensive capabilities for remote sensing.
图 2. Falcon 模型架构概览。给定单幅图像或图像对(用于变化检测任务),Falcon 可以执行多种多任务指令,生成适用于各种遥感任务的通用文本表示。如图所示,Falcon 能够正确区分给定图像的类别,为给定对象提供空间边界框/分割蒙版,甚至能够检测到图像间的细微变化,彰显了其全面的遥感能力。
3.2 Dynamic Prompt Training Strategy
英文原文 | 中文翻译 |
---|---|
Unlike the previous studies [27, 81], we propose a dynamic prompt training strategy to eliminate the reliance on task-specific tokens. Specifically: - Given a prompt $\mathcal{T}$, Falcon will dynamically sample several differently phrased versions $\left\{\mathcal{T}_{i}^{\prime}\right\}_{i=1}^{M}$ from a predefined prompt pool to form $\mathcal{X}=\left\{\left[\mathcal{V}^{\prime},\mathcal{E}\left(\mathcal{T}_{i}^{\prime}\right)\right]\right\}_{i=1}^{M}$ for the training process. - Note that $\left\{\mathcal{T}_{i}^{\prime}\right\}_{i=1}^{M}$ and $\mathcal{T}$ share similar semantic meanings. - This design further enhances Falcon’s understanding of natural language. | 与之前的研究 [27, 81] 不同,我们提出了一种动态提示训练策略,以消除对特定任务标记的依赖。具体来说: - 给定一个提示 $\mathcal{T}$,Falcon 会从预定义的提示池中动态采样若干不同措辞的版本 $\left\{\mathcal{T}_{i}^{\prime}\right\}_{i=1}^{M}$,形成 $\mathcal{X}=\left\{\left[\mathcal{V}^{\prime},\mathcal{E}\left(\mathcal{T}_{i}^{\prime}\right)\right]\right\}_{i=1}^{M}$ 并加入训练过程。 - 需要注意的是,$\left\{\mathcal{T}_{i}^{\prime}\right\}_{i=1}^{M}$ 与 $\mathcal{T}$ 具有相似的语义含义。 - 这种设计进一步增强了 Falcon 对自然语言的理解能力。 |
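A minimal sketch of this sampling step is shown below; the pool entries are the paraphrase examples given in Sec. 4.2, while the function itself is a hypothetical illustration.

```python
import random

# Hypothetical prompt pool: each canonical instruction maps to several
# differently phrased variants with similar semantics (examples from Sec. 4.2).
PROMPT_POOL = {
    "Describe the image.": [
        "Describe the contents of this image.",
        "Analyze the image and explain its visual content.",
        "Can you identify what this image shows?",
    ],
}

def sample_prompt_variants(prompt: str, m: int) -> list[str]:
    """Dynamically sample M differently phrased versions T'_1..T'_M of a prompt."""
    variants = PROMPT_POOL.get(prompt, [prompt])
    return random.choices(variants, k=m)  # sample with replacement
```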
3.3 Unified Task Representation
英文原文 | 中文翻译 |
---|---|
To ensure the input and output of distinct tasks follow a unified format, we treat each task as a sequence-to-sequence translation task. As shown in Fig. 3: - We regard images, prompts, and annotations as special languages. - For example, an instruction of unified format for the region caption task is as follows: "Describe the <region> in the image.", where <region> is <box><x1><y1><x2><y2></box>, representing location tokens. - The location tokens are the coordinates of the bounding box. - We add location tokens to the tokenizer’s vocabulary list, representing quantized coordinates. - We create 1000 bins which represent regions using formats tailored to task requirements. | 为了确保不同任务的输入和输出格式统一,我们将每个任务视为一个序列到序列的翻译任务。如图 3 所示: - 我们将图像、提示和标注视为特殊语言。 - 例如,区域描述任务的统一格式指令如下:"描述图像中的 <region>。",其中 <region> 为 <box><x1><y1><x2><y2></box>,表示位置标记。 - 位置标记是边界框的坐标。 - 我们将位置标记添加到分词器的词汇表中,表示量化的坐标。 - 我们创建了 1000 个表示区域的 bins,其格式根据任务需求量身定制。 |
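Below is a small sketch of the coordinate quantization described above, assuming each axis is discretized into the 1000 bins mentioned; the individual location-token spelling (`<loc_k>`) is an assumption, since the text only specifies the `<box><x1><y1><x2><y2></box>` wrapper.

```python
NUM_BINS = 1000  # number of quantization bins used for coordinates

def quantize(coord: float, extent: int) -> int:
    """Map an absolute pixel coordinate onto one of NUM_BINS discrete bins."""
    return min(int(coord / extent * NUM_BINS), NUM_BINS - 1)

def hbb_to_location_tokens(x1, y1, x2, y2, width, height) -> str:
    """Serialize a horizontal bounding box as quantized location tokens
    wrapped in <box>...</box> (token names are hypothetical)."""
    bins = [quantize(x1, width), quantize(y1, height),
            quantize(x2, width), quantize(y2, height)]
    return "<box>" + "".join(f"<loc_{b}>" for b in bins) + "</box>"
```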
3.4 Loss Function (交叉熵损失函数)
We utilize the cross-entropy loss to optimize Falcon across all 14 tasks, as in standard large language models.
$$\mathcal{L}=-\sum_{i=1}^{|y|}\sum_{x\in\mathcal{X}}\log P_{\theta}\left(y_{i}\mid y_{<i},\,x\right)\qquad(1)$$
where:
- $x\in\mathcal{X}$ is the input vector consisting of the image embedding output by the image encoder and the prompt embedding
- $y$ is the prediction target
- $|y|$ is the number of target tokens
- $\theta$ denotes Falcon’s parameters
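As a sanity check on Eq. (1), here is a minimal sketch of the token-level cross-entropy under teacher forcing, assuming the decoder emits one logit vector per target position:

```python
import torch
import torch.nn.functional as F

def falcon_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of Eq. (1): the sum of -log P_theta(y_i | y_<i, x).

    logits:  (B, L, vocab) decoder outputs; position i is conditioned on y_<i and x
    targets: (B, L) ground-truth token ids y_1 .. y_|y|
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and sequence dimensions
        targets.reshape(-1),
        reduction="sum",                      # Eq. (1) sums over all target tokens
    )
```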
4. Dataset
英文原文 | 中文翻译 |
---|---|
To equip Falcon with powerful image, region, and pixel-level understanding and reasoning capabilities, we introduce Falcon SFT, the first large-scale, multi-task remote sensing instruction-tuning dataset. It contains 78 million high-quality samples covering 5.6 million multi-resolution, multi-view remote sensing images. This section details its creation process, including data collection, preprocessing, and instruction generation. | 为了赋予 Falcon 强大的图像、区域和像素级理解和推理能力,我们推出了首个大规模多任务遥感指令调优数据集 Falcon SFT。它包含 7800 万个高质量样本,涵盖 560 万幅多分辨率、多视角遥感影像。本节详细介绍了其创建过程,包括数据收集、预处理和指令生成。 |
4.1. Data Collection and Preprocessing
英文原文 | 中文翻译 |
---|---|
Currently, no existing dataset can fully meet the training requirements of Falcon. To address this, we devised a simple and straightforward approach, i.e., curating and combining various open-source datasets in the remote sensing field. | 目前,尚无任何现有数据集能够完全满足 Falcon 的训练需求。为了解决这个问题,我们设计了一种简单直接的方法,即整理并整合遥感领域的各种开源数据集。 |
We collected 90 annotated task-specific RGB image datasets, such as Million-AID [43], RSICD [44], and DOTA [14, 80], encompassing nearly all publicly available datasets originating from satellites, airplanes, drones, etc. After manual screening, we refined the selection to 67 relevant datasets. The complete list is available in Sec. A of the supplementary material. Notably, we provide download links and metadata (image size, spatial resolution, and quantity) to help reduce data collection efforts for researchers. | 我们收集了 90 个带注释的任务专用 RGB 图像数据集,例如 Million-AID [43]、RSICD [44] 和 DOTA [14, 80],涵盖了几乎所有来自卫星、飞机、无人机等的公开数据集。经过人工筛选,我们将最终选择范围缩小至 67 个相关数据集。完整列表可在补充材料 A 部分找到。值得注意的是,我们提供了下载链接和元数据(图像大小、空间分辨率和数量),以帮助研究人员减少数据收集工作量。 |
Next, we integrate the 67 collected remote sensing datasets by establishing a unified and consistent annotation format. This standardization is necessary because different datasets use varying annotation formats (e.g., polygons vs. mask images), which can complicate data integration. Besides, to broaden application scenarios, we repurpose existing data structures to generate additional annotations, expanding the number of supported tasks to 14. These tasks are categorized into three levels, namely, Image-level: Image Classification, Image VQA, Counting, Image Captioning, and Image Detailed Captioning; Region-level: Region Classification-HBB, Region Classification-OBB, Region Detection-HBB, Region Detection-OBB, Visual Grounding, and Region Captioning; Pixel-level: Pixel Classification, Pixel Segmentation, and Change Detection. This categorization aligns with prior discussions in [77, 91]. For more detailed data collection and preprocessing procedures, please see Sec. A of the supplementary material. | 接下来,我们通过建立统一一致的注释格式,整合了收集到的 67 个遥感数据集。这种标准化是必要的,因为不同的数据集使用不同的注释格式(例如,多边形与掩模图像),这会使数据集成变得复杂。此外, 为了拓宽应用场景,我们重新利用现有的数据结构来生成更多注释,将支持的任务数量扩展到 14 个。这些任务分为三个级别:图像级:图像分类、图像 VQA、计数、图像字幕和图像详细字幕;区域级:区域分类-HBB、区域分类-OBB、区域检测-HBB、区域检测-OBB、视觉接地和区域字幕;像素级:像素分类、像素分割和变化检测。此分类与 [77, 91] 中的先前讨论一致。有关更详细的数据收集和预处理程序,请参阅补充材料 A 节。 |
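For quick reference, the resulting task taxonomy can be written out as a simple mapping (task names copied from the list above):

```python
# The 14 Falcon_SFT tasks grouped by spatial hierarchy (Sec. 4.1).
FALCON_TASKS = {
    "image-level": [
        "Image Classification", "Image VQA", "Counting",
        "Image Captioning", "Image Detailed Captioning",
    ],
    "region-level": [
        "Region Classification-HBB", "Region Classification-OBB",
        "Region Detection-HBB", "Region Detection-OBB",
        "Visual Grounding", "Region Captioning",
    ],
    "pixel-level": [
        "Pixel Classification", "Pixel Segmentation", "Change Detection",
    ],
}
assert sum(len(tasks) for tasks in FALCON_TASKS.values()) == 14
```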
4.2. Unified Instruction Generation
英文原文 | 中文翻译 |
---|---|
Next, we transform our integrated dataset into a multi-task instruction-tuning dataset for vision-language model training. We take the steps as follows. | 接下来,我们将集成的数据集转换为用于视觉语言模型训练的多任务指令调优数据集。具体步骤如下。 |
Define Instruction Templates. To facilitate the understanding and execution of specific tasks by VLMs, we design standardized instruction templates based on different remote sensing tasks. For example, for the object detection task, "Detect <class> in the image. Use rotated bounding boxes." is given. The rotated bounding box is represented as <quad> <x1> <y1> <x2> <y2> <x3> <y3> <x4> <y4> </quad>, specifying the coordinates of the four vertices, each expressed in thousandths. Please see Fig. 3 for instruction examples of all 14 tasks; a code sketch of filling this template is given after the three generation steps below. | 定义指令模板。为了方便 VLM 理解和执行特定任务,我们根据不同的遥感任务设计了标准化的指令模板。例如,对于目标检测任务,给出的指令模板是“检测图像中的 <class>。使用旋转的边界框。”。旋转后的边界框表示为 <quad> <x1> <y1> <x2> <y2> <x3> <y3> <x4> <y4> </quad>,指定四个顶点的坐标,每个顶点坐标以千分之一表示。所有 14 个任务的指令示例请参见图 3;填充该模板的代码示例见下文三个步骤之后。 |
Generate Image Instruction Pairs. To create image instruction pairs based on the defined templates, we first iterate over the dataset and generate specific instruction for each image based on its task type (e.g., detection, segmentation). We then combine the generated instruction with corresponding image and annotations into a structured pair. This enables the model to learn diverse task responses using different instruction-based prompts. | 生成图像指令对。为了基于定义的模板创建图像指令对,我们首先迭代数据集,并根据每幅图像的任务类型(例如,检测、分割)为其生成特定的指令。然后,我们将生成的指令与相应的图像和注释组合成一个结构化的指令对。这使得模型能够使用不同的指令提示来学习不同的任务响应。 |
Generate the Multi-instruction Pool. To enhance language understanding and reduce reliance on task-specific tokens, we diversify instruction patterns for each task using an LLM [2]. It generates multiple variations of the same instruction with different complexity levels. For instance, “Describe the image.” is expanded into “Describe the contents of this image.”, “Analyze the image and explain its visual content.”, and “Can you identify what this image shows?”. This approach enriches textual diversity in training data, helping VLMs to improve performance across various tasks. Please see Sec. B of the supplementary material for multi-instruction examples. | 生成多指令池。为了增强语言理解并减少对特定任务 token 的依赖,我们使用 LLM [2] 为每个任务提供多样化的指令模式。它会生成同一条指令的多个不同复杂度的变体。例如,“描述图像”会扩展为“描述图像的内容”、“分析图像并解释其视觉内容”以及“你能识别出图像中的内容吗?”。这种方法丰富了训练数据的文本多样性,帮助 VLM 提升在各种任务上的性能。有关多指令示例,请参阅补充材料 B 节。 |
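Putting these steps together, a hypothetical helper for producing one rotated-bbox instruction pair (the sketch referenced under "Define Instruction Templates" above) might look as follows; the `<quad>` serialization in thousandths follows the template, while the function name and structure are illustrative assumptions.

```python
def make_obb_instruction_pair(class_name, quad, width, height):
    """Build one (instruction, answer) pair for Region Detection-OBB.

    quad: eight absolute pixel coordinates (x1, y1, x2, y2, x3, y3, x4, y4),
    serialized in thousandths of the image size as in the paper's template.
    """
    instruction = f"Detect {class_name} in the image. Use rotated bounding boxes."
    bins = [min(int(v / extent * 1000), 999)
            for v, extent in zip(quad, [width, height] * 4)]
    answer = "<quad>" + "".join(f"<loc_{b}>" for b in bins) + "</quad>"
    return instruction, answer
```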
4.3. Falcon SFT Dataset
英文原文 | 中文翻译 |
---|---|
Following the above data processing steps, we finally constructed the large-scale remote sensing instruction-tuning dataset Falcon SFT. We compare Falcon SFT with various datasets used for remote sensing vision-language models in Tab. 2. The Falcon SFT dataset features the largest number of samples (78 million) and images (5.6 million), supporting the highest number of tasks (14). It is also more comprehensive by covering image, region, and pixel-level spatial hierarchies. For detailed statistics of the Falcon SFT dataset, please see Tab. II in Sec. A of the supplementary material. | 经过上述数据处理步骤,我们最终构建了大规模遥感指令调优数据集 Falcon SFT。我们将 Falcon SFT 与表 2 中用于遥感视觉语言模型的各种数据集进行了比较。Falcon SFT 数据集拥有最多的样本(7800 万个)和图像(560 万幅),支持的任务数量也最多(14 个)。此外,它还涵盖了图像、区域和像素级的空间层次结构,因此更加全面。有关 Falcon SFT 数据集的详细统计数据,请参阅补充材料 A 部分中的表 II。 |
5. Experiments
英文原文 | 中文翻译 |
---|---|
In this section, we present the experimental setup and results to evaluate Falcon’s performance, including: 1) both qualitative and quantitative performance evaluations on all 14 complex remote sensing tasks; 2) zero-shot performance of Falcon compared with previous methods. The results demonstrate Falcon’s ability to handle complex vision-language tasks and highlight its strengths in image, region, and pixel-level understanding and reasoning. To point out, due to the page limit, we provide additional experimental results in the supplementary material, including - qualitative performance evaluations of all 14 tasks in Sec. E, - quantitative performance evaluations for tasks not covered in the main paper in Sec. F, - qualitative performance evaluations on diversified instructions in Sec. G, - human evaluations on image captioning performance in Sec. H, - more ablation studies in Sec. I, and - the details of evaluation metrics for each task in Sec. J. | 在本节中,我们介绍了评估 Falcon 性能的实验设置和结果,包括: 1)对所有 14 项复杂遥感任务进行定性和定量性能评估; 2)与以前的方法相比,Falcon 的零样本性能。 结果证明了 Falcon 处理复杂视觉语言任务的能力,并突出了其在图像、区域和像素级理解和推理方面的优势。需要指出的是,由于页数限制,我们在补充材料中提供了额外的实验结果,包括 - E 节中所有 14 项任务的定性性能评估、 - F 节中主要论文未涉及的任务的定量性能评估、 - G 节中多样化指令的定性性能评估、 - H 节中对图像字幕性能的人工评估、 - I 节中的更多消融研究以及 - J 节中每个任务的评估指标的详细信息。 |
Implementation Details. Falcon consists of an image encoder and a transformer-based encoder-decoder, with a total of 0.7B parameters. The detailed architecture is illustrated in Fig. 2. We initialized the model’s parameters using the pre-trained weights provided by [81]. Unlike [81], we increased the output token length to 4096 in order to obtain more detailed representations. The training batch size for Falcon was 640, the learning rate was set to 1e−5, and the image size was 448 × 448. We trained the model for 4 days using 160 Nvidia A100 GPUs. | 实现细节。Falcon 由一个图像编码器和一个基于 Transformer 的编解码器组成,总共包含 7 亿(0.7B)个参数。其详细架构如图 2 所示。我们使用 [81] 提供的预训练权重初始化模型参数。与 [81] 不同,我们将输出 token 长度增加到 4096,以获得更详细的表示。Falcon 的训练批次大小为 640,学习率设置为 1e-5,图像大小为 448×448。我们使用 160 块 Nvidia A100 GPU 对该模型进行了 4 天的训练。 |
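For convenience, the reported training setup can be collected into one configuration block (values from the paragraph above; the dict layout itself is illustrative):

```python
# Training setup reported for Falcon (Sec. 5, Implementation Details).
FALCON_TRAIN_CONFIG = {
    "total_parameters": "0.7B",
    "init": "pre-trained weights from [81]",
    "max_output_tokens": 4096,      # increased from [81] for more detailed outputs
    "batch_size": 640,
    "learning_rate": 1e-5,
    "image_size": (448, 448),
    "gpus": "160 x Nvidia A100",
    "training_days": 4,
}
```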
5.1. Performance Evaluation across 14 tasks
英文原文 | 中文翻译 |
---|---|
Image-level Tasks. In this section, we present the performance of Falcon on image classification tasks (cf. Tab. 3), counting tasks (cf. Tab. 4) and VQA tasks (cf. Tab. 5). As shown in Tab. 3, generic VLMs, such as MiniGPTv2 [101] and Qwen chat [3], encountered obstacles in performing effectively on remote sensing data, since they usually lacked the expert knowledge of this domain. Meanwhile, compared with VLMs specialized in remote sensing [27, 37, 51], Falcon achieved better performance on all related datasets, with only 0.7B parameters. Besides, we also provide a detailed performance comparison of counting targets in Tab. 4. Such a task requires compositional perception and reasoning capabilities, presenting significant challenges to state-of-the-art VLMs. To this end, Falcon achieved superior performance in target counting, showcasing its sophisticated capabilities. Finally, we compared Falcon with previous VLMs on VQA tasks, which these models usually excelled in. As shown in Tab. 5, Falcon still surpassed previous VLMs with fewer model parameters, indicating its strong instruction following capabilities. | 图像级任务。在本节中,我们展示了 Falcon 在图像分类任务(参见表 3)、计数任务(参见表 4)和 VQA 任务(参见表 5)中的表现。如表 3 所示,诸如 MiniGPTv2 [101] 和 Qwen chat [3] 等通用 VLM 在遥感数据上表现不佳,因为它们通常缺乏该领域的专业知识。同时,与专门用于遥感数据的 VLM [27, 37, 51] 相比,Falcon 在所有相关数据集上都取得了更佳的性能,且参数量仅为 0.7B。此外,我们还在表 4 中提供了目标计数的详细性能比较。此类任务需要组合感知和推理能力,对最先进的 VLM 提出了重大挑战。为此,Falcon 在目标计数方面取得了卓越的性能,展示了其精湛的实力。最后,我们将 Falcon 与之前的 VLM 在 VQA 任务中进行了比较,这些模型通常在这些任务中表现优异。如表 5 所示,Falcon 仍然以更少的模型参数超越了之前的 VLM,表明其强大的指令遵循能力。 |
For image captioning tasks, we conduct human evaluations of Falcon’s responses. Specifically, captions were evaluated across three dimensions: detail, position, and hallucination, using a four-level rating system (i.e., A, B, C, D quantified as 4 to 1 points, where a higher score represents a better caption). The results in Tab. 6 showed that Falcon achieved the highest average scores across all three dimensions, compared with other VLMs. Please see Sec. H of the supplementary material for the detailed experimental setup. | 对于图像字幕任务,我们对 Falcon 的响应进行了人工评估。具体来说,我们使用四级评分系统(即 A、B、C、D,分别计为 4 至 1 分,分数越高表示字幕质量越好)对字幕进行了三个维度的评估:细节、位置和幻觉。表 6 中的结果表明,与其他 VLM 相比,Falcon 在这三个维度上均获得了最高的平均分数。有关详细的实验设置,请参阅补充材料 H 节。 |
Region-level Tasks. Beyond image-level tasks, our Falcon also supports fine-grained region-level tasks. To this end, we present the performance of Falcon on object detection (horizontal bounding box) in Tab. 7. It is noticeable that previous VLMs demonstrated limited performance in this task, exposing their limitations in localization capabilities. In contrast, Falcon outperformed previous methods, highlighting its ability to handle complex remote sensing tasks. | 区域级任务。除了图像级任务外,我们的 Falcon 还支持细粒度的区域级任务。为此,我们在表 7 中展示了 Falcon 在目标检测(水平边界框)方面的表现。值得注意的是,之前的 VLM 在该任务中表现有限,暴露出其在定位能力方面的局限性。相比之下,Falcon 的表现优于之前的方法,凸显了其处理复杂遥感任务的能力。 |
Pixel-level Tasks. Besides, we also present the evaluation results of Falcon on pixel-level tasks. To the best of our knowledge, Falcon is the first VLM capable of showing satisfactory performance on pixel-level tasks, such as segmentation and change detection. The qualitative results of Falcon are shown in Fig. 4. Falcon successfully segmented designated complex targets in images based on prompts and also identified changes between two similar images. | 像素级任务。此外,我们还展示了 Falcon 在像素级任务上的评估结果。据我们所知,Falcon 是第一个能够在像素级任务(例如分割和变化检测)上表现令人满意的 VLM。Falcon 的定性结果如图 4 所示。Falcon 成功地根据提示分割出图像中指定的复杂目标,并识别出两幅相似图像之间的变化。 |
5.2. Zero-shot Evaluation
英文原文 | 中文翻译 |
---|---|
Finally, we evaluate the capabilities of Falcon in terms of zero-shot evaluations. We present the detailed performance comparison in Tab. 8, where these evaluation datasets were not used during training. Compared with previous VLMs, Falcon achieved performance improvements over all three levels of tasks. For image-level tasks, Falcon established a new record on many datasets, such as UCM-Captions and MAR20 for image captioning and image counting. For region-level tasks and pixel-level tasks, Falcon demonstrated exceptional performance on many datasets, which required comprehensive localization and reasoning capabilities. In contrast, such capabilities were commonly missing or even not supported in prior VLMs. | 最后,我们评估了 Falcon 在零样本评估方面的能力。表 8 展示了详细的性能比较,其中这些评估数据集并未在训练过程中使用。与之前的 VLM 相比,Falcon 在三个级别的任务上都取得了性能提升。对于图像级任务,Falcon 在多个数据集上创造了新纪录,例如在 UCM-Captions 和 MAR20 数据集上,分别在图像字幕和图像计数方面取得了优异的成绩。对于区域级任务和像素级任务,Falcon 在许多数据集上表现出色,这需要全面的定位和推理能力。相比之下,之前的 VLM 通常缺乏这些能力,甚至不支持这些能力。 |
5.3. Ablation experiments
英文原文 | 中文翻译 |
---|---|
This section presents the ablation studies to analyze the effects of data scale, task granularity, and model size on performance, as summarized in Tab. 9. The results demonstrate a consistent performance improvement as the training data scale increases, for instance, from 10% to 50% and ultimately to 100% of the training samples. Furthermore, as the task granularity becomes more refined, the model not only handles more complex tasks effectively but also enhances performance on simpler ones. A comparison between the 0.3B and 0.7B parameter models reveals that a larger parameter count leads to better generalization performance. More ablation studies can be found in Sec. I of the supplementary material. | 本节介绍消融研究,以分析数据规模、任务粒度和模型大小对性能的影响,如表 9 所示。结果表明,随着训练数据规模的增加(例如,从 10% 的训练样本到 50% 的训练样本,最终到 100% 的训练样本),性能持续提升。此外,随着任务粒度的进一步细化,模型不仅能够有效地处理更复杂的任务,还能提升简单任务的性能。0.3B 和 0.7B 参数模型的比较表明,参数数量越多,泛化性能越好。更多消融研究请参见补充材料 I 节。 |
6. Conclusion
英文原文 | 中文翻译 |
---|---|
This paper develops Falcon, a holistic vision-language foundation model tailored for remote sensing with comprehensive perception and reasoning capabilities. To facilitate the training of Falcon, we further create the Falcon SFT dataset, which consists of approximately 78M high-quality data samples, covering 5.6M remote sensing images. Various qualitative and quantitative experiments have demonstrated that Falcon showcased remarkable zero-shot and in-dataset performance across 14 remote sensing vision-language tasks and more than 100 test datasets. We will release the complete dataset, code, and model weights, hoping to help further advance this research field. | 本文开发了 Falcon,一个专为遥感应用而打造的、具有全面感知和推理能力的视觉语言基础模型。为了方便 Falcon 的训练,我们进一步创建了 Falcon SFT 数据集,该数据集包含约 7800 万个高质量数据样本,涵盖 560 万幅遥感影像。各种定性和定量实验表明,Falcon 在 14 个遥感视觉语言任务和 100 多个测试数据集上展现了卓越的零样本和数据集内性能。我们将发布完整的数据集、代码和模型权重,希望能进一步推动该研究领域的发展。 |