
EM-VLM4AD: A Lightweight Multi-View Visual Question Answering Model for Autonomous Driving

news 2025/8/28 7:05:40

EM-VLM4AD

Paper

Item	Details
Paper title	Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving
Paper link	https://arxiv.org/abs/2403.19838
Source code	akshaygopalkr/EM-VLM4AD (github.com)
Venue	CVPR Workshop 2024

Abstract

Problems with previous methods
- Existing VLMs are too large to support real-time VQA for autonomous driving (AD)
- Most VLMs perform single-image VQA; few handle multiple images, especially in the AD domain
Main contributions
- Proposes EM-VLM4AD, an Efficient Multi-frame VLM for Autonomous Driving
  - The model needs 10 times less memory and FLOPs than existing AD VLMs
  - It supports VQA over multiple images
- Explores two different lightweight LM backbones for EM-VLM4AD:
  - a fine-tuned Text-to-Text Transfer Transformer (T5) Base LM
  - an 8-bit quantized T5-Large LM fine-tuned using low-rank adaptation (LoRA)
- Compares against the DriveLM dataset baseline on four metrics (BLEU-4, CIDEr, ROUGE-L, METEOR)
  - EM-VLM4AD performs better on ROUGE-L and CIDEr
Summary and outlook
- In future research, we aspire to evolve our model into a video-language model capable of generating responses from multi-view video inputs, thereby enhancing EM-VLM4AD’s ability to handle temporal-related inquiries

Methods

The model consists of two main parts:

Image Patch Encoder (image embedding network)
T5-Base/T5-Large (language model)


Image Embedding Network

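This section covers the Image Patch Encoder, Gated Pooling Attention, and Projection Layer, i.e. everything that happens to the images before they reach the language model (LM).

The Image Patch Encoder uses the weights of ViT-B/32 pretrained on ImageNet. It does not use the whole ViT, only its input embedding layer, the part that produces the patch embeddings.
To handle multiple views, the encoded embeddings of the individual views must be merged, which is done with Gated Pooling Attention and the Projection Layer.
The result is a single Multi-View Image Embedding.

In short, the inputs are six views (Front, Front-Left, Front-Right, Back, Back-Left, Back-Right). Each image passes through the ViT-B/32 embedding layer to give the Individual View Embeddings, which Gated Pooling Attention and the Projection Layer merge and map into one Multi-View Image Embedding that is fed into T5.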

The detailed steps are as follows:

Each input image lies in $\mathbb{R}^{3 \times H \times W}$. It is flattened and sliced into patches with a linear projection and positional embedding.
This gives $V_i \in \mathbb{R}^{S_I \times H_I}$, where $i$ indexes the $i$-th image:
- $S_I$ is the sequence length for the image embedding
- $H_I$ is the hidden dimension of the image embedding
Note: both steps above, going from $\mathbb{R}^{3 \times H \times W}$ to $V_i \in \mathbb{R}^{S_I \times H_I}$, are performed by the ViT input embedding layer; in practice each image is simply passed through that layer.
This yields six image embeddings, one per view, each of which is then flattened to one dimension.
The flattened embeddings are then combined with gated pooling attention, a technique taken from the MIVC paper.

Gated Pooling Attention first computes a weight $\alpha_i$ for each $V_i$ and then takes the weighted sum over the $N$ views:
$V = \sum_{i=1}^{N} \alpha_i V_i$
The weights satisfy $\sum_{i=1}^{N} \alpha_i = 1$ and are computed as:
$\alpha_i = \frac{\exp \left\{ w^T \left( \tanh (Z V_i^T) \otimes \sigma (G V_i^T) \right) \right\}}{\sum_{j=1}^{N} \exp \left\{ w^T \left( \tanh (Z V_j^T) \otimes \sigma (G V_j^T) \right) \right\}}$
where $w \in \mathbb{R}^{K}, \; Z \in \mathbb{R}^{K \times M}, \; G \in \mathbb{R}^{K \times M}, \; M = S_I H_I$, $\sigma$ is the sigmoid function, and $\otimes$ denotes element-wise multiplication.

$K$ is a hyperparameter, set to 128 in the paper.
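To make the weighting formula concrete, here is a minimal pure-Python sketch of gated pooling attention over flattened view embeddings. It is an illustration written for this article, not the authors' implementation: the function name `gated_pooling_attention`, the toy sizes, and the random initialisation of `Z`, `G`, and `w` are all assumptions (in the real model these are learned parameters and the tensors are far larger).

```python
import math
import random


def gated_pooling_attention(views, Z, G, w):
    """Combine N flattened view embeddings (each of length M) into one.

    Z and G are K x M matrices and w has length K, as in the formula above.
    Returns (alphas, pooled) with pooled = sum_i alphas[i] * views[i].
    """
    def matvec(A, x):
        return [sum(a * b for a, b in zip(row, x)) for row in A]

    def sigmoid(t):
        return 1.0 / (1.0 + math.exp(-t))

    scores = []
    for v in views:
        tanh_branch = [math.tanh(t) for t in matvec(Z, v)]
        gate_branch = [sigmoid(t) for t in matvec(G, v)]
        gated = [a * b for a, b in zip(tanh_branch, gate_branch)]  # element-wise product
        scores.append(sum(wk * gk for wk, gk in zip(w, gated)))    # w^T (...)

    # Softmax over the N views (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    # Weighted sum of the view embeddings.
    pooled = [sum(a * v[m] for a, v in zip(alphas, views))
              for m in range(len(views[0]))]
    return alphas, pooled


# Toy sizes for illustration only (the paper uses K = 128 and M = S_I * H_I).
rng = random.Random(0)
N, M, K = 6, 8, 4
views = [[rng.gauss(0, 1) for _ in range(M)] for _ in range(N)]
Z = [[rng.gauss(0, 0.5) for _ in range(M)] for _ in range(K)]
G = [[rng.gauss(0, 0.5) for _ in range(M)] for _ in range(K)]
w = [rng.gauss(0, 0.5) for _ in range(K)]

alphas, pooled = gated_pooling_attention(views, Z, G, w)
print("weights:", [round(a, 3) for a in alphas])
print("pooled length:", len(pooled))
```

Because the weights come from a softmax, they are positive and sum to 1, so the pooled embedding keeps the same size as a single view embedding no matter how many views are supplied.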
After Gated Pooling Attention, the pooled embedding is $V \in \mathbb{R}^{S_I \times H_I}$. The Projection Layer then maps $V$ to dimension $H_T$ so that it matches the text embedding dimension and can be concatenated with the text embedding, giving a sequence in $\mathbb{R}^{(S_T + S_I) \times H_T}$, where $S_T$ is the sequence length of the text embedding.
The final Multi-View Image Embedding therefore has shape $\mathbb{R}^{S_I \times H_T}$.
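The shape bookkeeping above can be traced with a few lines of arithmetic. The sketch below assumes a 224×224 input resolution and a text length of 32 tokens, neither of which is stated in this article, and it ignores any class token; the hidden sizes are the standard ones for ViT-B (768) and T5-Base (768). The helper `patch_sequence_length` is a hypothetical name used only here.

```python
def patch_sequence_length(height, width, patch_size):
    """Number of non-overlapping square patches a ViT-style embedder produces."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


# Assumed values (not taken from the article): input resolution and text length.
H, W, PATCH = 224, 224, 32   # ViT-B/32 uses 32x32 patches
S_T = 32                     # arbitrary text sequence length
H_I = 768                    # ViT-B hidden size
H_T = 768                    # T5-Base model dimension

S_I = patch_sequence_length(H, W, PATCH)

per_view = (S_I, H_I)          # V_i for each of the six views
flattened = S_I * H_I          # M, the length of each flattened V_i
pooled = (S_I, H_I)            # V after gated pooling attention
projected = (S_I, H_T)         # Multi-View Image Embedding
lm_input = (S_T + S_I, H_T)    # after concatenation with the text embedding

print("per view:", per_view)
print("M:", flattened)
print("projected:", projected)
print("LM input:", lm_input)
```

Under these assumptions the LM sees the same number of image tokens whether one view or six are used, because pooling merges the views before concatenation.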

Language Model

To reduce computation and inference latency, the paper uses LMs with fewer than one billion parameters, in two different pretrained T5 versions:

T5-Base, which contains around 223 million parameters
an 8-bit quantized version of T5-Large (≈ 750M parameters)
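A rough back-of-the-envelope calculation shows why an 8-bit T5-Large can sit in a similar memory range to T5-Base. The sketch below counts weight storage only, using the parameter counts quoted above; it assumes T5-Base is stored in 32-bit floats and ignores activations, the image embedding network, and any quantization overhead, so the numbers are indicative rather than measured.

```python
def weight_memory_mb(num_params, bits_per_param):
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


# Parameter counts as quoted in the text; storage precisions are assumptions.
t5_base_fp32 = weight_memory_mb(223e6, 32)
t5_large_int8 = weight_memory_mb(750e6, 8)

print(f"T5-Base  (32-bit): {t5_base_fp32:.0f} MB")
print(f"T5-Large (8-bit):  {t5_large_int8:.0f} MB")
```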

The Multi-View Image Embedding is concatenated with the text embedding and fed into the T5 model, which produces the output.

In their experiments the authors found that fine-tuning the whole model works best for T5-Base, while for the quantized T5-Large they use LoRA-Fine-Tuning-Aware Quantization.

Training Process

This section describes the training process. The dataset setup is:

DriveLM dataset
a 90%/5%/5% split of the traffic scenes
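Splitting by traffic scene rather than by individual question keeps every question about a given scene in the same partition. A minimal sketch of such a split is shown below; the function `split_scenes`, the scene identifiers, and the fixed seed are illustrative assumptions and do not reflect how the authors actually partitioned DriveLM.

```python
import random


def split_scenes(scene_ids, train_frac=0.90, val_frac=0.05, seed=0):
    """Shuffle scene ids deterministically and cut them into train/val/test."""
    ids = sorted(scene_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_frac)
    n_val = round(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]   # whatever remains goes to the test split
    return train, val, test


# Hypothetical scene identifiers, for illustration only.
scenes = [f"scene_{i:04d}" for i in range(200)]
train, val, test = split_scenes(scenes)
print(len(train), len(val), len(test))
```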


Training proceeds in two stages:

Stage 1: Freeze the Image Patch Encoder and the T5 LM, and train only Gated Pooling Attention and the Projection Layer. The reason given is: "This forces the multi-view image embeddings to align with the type of embeddings the LM expects." The LM was pretrained on text embeddings, so before it is asked to consume image input, the GPA and projection layers are first trained to produce embeddings of the kind it expects.
Stage 2: Keep only the Image Patch Encoder frozen, and train T5, Gated Pooling Attention, and the Projection Layer together.
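The two-stage schedule amounts to a table of which components receive gradient updates. The sketch below encodes that table in plain Python; the module names are hypothetical labels chosen for this example and are not identifiers from the EM-VLM4AD repository.

```python
# Hypothetical component names, used only for this illustration.
MODULES = (
    "image_patch_encoder",
    "gated_pooling_attention",
    "projection_layer",
    "t5_lm",
)


def trainable_modules(stage):
    """Return {module name: is_trainable} for the given training stage."""
    if stage == 1:
        frozen = {"image_patch_encoder", "t5_lm"}
    elif stage == 2:
        frozen = {"image_patch_encoder"}
    else:
        raise ValueError("stage must be 1 or 2")
    return {name: name not in frozen for name in MODULES}


for stage in (1, 2):
    flags = trainable_modules(stage)
    active = [name for name, on in flags.items() if on]
    print(f"stage {stage}: train {', '.join(active)}")
```

In a PyTorch training script, a table like this would typically be applied by toggling `requires_grad` on each component's parameters before building the optimizer. For the quantized T5-Large variant, training the LM in stage 2 would presumably mean updating only the LoRA adapter weights, though that detail is an assumption here.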
