[RAG Series] When RAG Meets Multimodality - Opening the Door to a New World
1. Cross-Modal Retrieval: Breaking Sensory Boundaries
1.1 A Text-to-Image Search Pipeline in Practice
CLIP encoding implementation
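import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()  # inference only, no gradients needed
def encode_text(text):
    # Tokenize the text and project it into CLIP's shared embedding space
    inputs = processor(text=text, return_tensors="pt", padding=True).to(device)
    return model.get_text_features(**inputs)

@torch.no_grad()
def encode_image(image_path):
    # Convert to RGB so grayscale and RGBA files are handled uniformly
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    return model.get_image_features(**inputs)

# Cross-modal retrieval example
text_vec = encode_text("a cartoon-style kitten")  # shape (1, 512)
# load_image_database_vectors is not defined here; it is assumed to return
# an (N, 512) tensor of precomputed image embeddings
image_vecs = load_image_database_vectors().to(device)
scores = torch.nn.functional.cosine_similarity(text_vec, image_vecs, dim=-1)
top_images = torch.topk(scores, k=min(5, scores.numel())).indices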
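The ranking step at the end of the pipeline can be tried without downloading CLIP. The following is a minimal sketch, assuming the image embeddings are already available as a NumPy array. `top_k_cosine` and the toy vectors are hypothetical names and data used only for illustration; they are not part of the original pipeline.

```python
import numpy as np

def top_k_cosine(query, database, k=5):
    """Return (indices, scores) of the k database rows most similar to query."""
    query = np.asarray(query, dtype=np.float32)
    database = np.asarray(database, dtype=np.float32)
    # L2-normalize so that a dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

# Toy 3-dimensional "embeddings" standing in for CLIP vectors (assumed data)
image_vecs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.7, 0.7, 0.0]])
text_vec = np.array([1.0, 0.1, 0.0])
indices, scores = top_k_cosine(text_vec, image_vecs, k=2)
```

Normalizing once when the database is built, rather than on every query, is the usual choice when the collection is large.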
2. 3D Model Retrieval: Demystifying IKEA's AR App
2.1 3D Model Encoding Architecture
3D feature extraction code
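import numpy as np
import open3d as o3d
import torch
from torch_geometric.nn import GCNConv

class MeshEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(3, 64)  # input features: xyz coordinates per vertex
        self.conv2 = GCNConv(64, 256)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)  # per-vertex features, shape (num_vertices, 256)
        # Mean-pool over vertices so each mesh yields a single 256-d vector for retrieval
        return x.mean(dim=0)

# Usage example
mesh = o3d.io.read_triangle_mesh("chair.ply")
# convert_mesh_to_graph is not defined here; it is assumed to return a
# torch_geometric Data object with x (vertex xyz) and edge_index (mesh edges)
graph = convert_mesh_to_graph(mesh)
encoder = MeshEncoder()
embedding = encoder(graph)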
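`convert_mesh_to_graph` is not defined in the snippet above. One plausible reading, sketched here under that assumption, is that mesh vertices become graph nodes and each triangle contributes its three sides as undirected edges. `triangles_to_edge_index` is a hypothetical helper; its output follows the 2 x E layout that `torch_geometric` uses for `edge_index`.

```python
def triangles_to_edge_index(triangles):
    """Build a 2 x E edge list (both directions) from triangle vertex indices."""
    edges = set()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            # Store both directions so the graph is undirected
            edges.add((u, v))
            edges.add((v, u))
    ordered = sorted(edges)
    sources = [u for u, _ in ordered]
    targets = [v for _, v in ordered]
    return [sources, targets]

# Two triangles sharing the edge 1-2 (a square split along a diagonal)
faces = [(0, 1, 2), (1, 3, 2)]
edge_index = triangles_to_edge_index(faces)
```

With Open3D, `np.asarray(mesh.triangles)` would supply the faces and `np.asarray(mesh.vertices)` the xyz node features; both can then be converted to tensors and wrapped in `torch_geometric.data.Data(x=..., edge_index=...)`.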
3. Audio Retrieval: Voiceprint Fingerprinting
3.1 Podcast Retrieval System Architecture
Multimodal audio processing
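import torch
from transformers import pipeline, Wav2Vec2FeatureExtractor, Wav2Vec2Model

class AudioRetriever:
    def __init__(self):
        # Whisper speech recognition via the transformers pipeline API;
        # chunking lets it handle recordings longer than 30 seconds
        self.asr = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-medium",
            chunk_length_s=30,
        )
        self.feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

    @torch.no_grad()
    def process_audio(self, audio_path):
        # Speech recognition
        text = self.asr(audio_path)["text"]
        # Voiceprint (acoustic) embedding. load_audio is not defined here; it is
        # assumed to return a 16 kHz mono waveform as a 1-D float array
        waveform = load_audio(audio_path)
        inputs = self.feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
        audio_features = self.audio_encoder(**inputs).last_hidden_state.mean(dim=1)
        return text, audio_features

# Cross-modal retrieval example
query_text = "a podcast discussing quantum computing"
# load_podcast_database, compute_text_similarity and compute_audio_similarity
# are not defined here; each similarity helper is assumed to return one score per podcast
audio_db = load_podcast_database()
text_scores = compute_text_similarity(query_text, audio_db.texts)
audio_scores = compute_audio_similarity(query_text, audio_db.embeddings)
final_scores = 0.6 * text_scores + 0.4 * audio_scores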
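The weighted sum above only behaves as intended when both score sets live on a comparable scale. The following is a minimal sketch of one way to achieve that, assuming min-max normalization is acceptable for the data. `min_max` and `fuse_scores` are hypothetical helpers; the 0.6 / 0.4 weights are taken from the example, and the raw scores are made-up values.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse_scores(text_scores, audio_scores, text_weight=0.6):
    """Weighted sum of two normalized score lists for the same candidates."""
    if len(text_scores) != len(audio_scores):
        raise ValueError("score lists must cover the same candidates")
    t, a = min_max(text_scores), min_max(audio_scores)
    return [text_weight * ts + (1.0 - text_weight) * au for ts, au in zip(t, a)]

# Assumed raw scores for three podcasts, on different scales
text_scores = [12.0, 7.0, 2.0]     # e.g. keyword-match scores
audio_scores = [0.10, 0.30, 0.20]  # e.g. cosine similarities
final_scores = fuse_scores(text_scores, audio_scores)
best = max(range(len(final_scores)), key=final_scores.__getitem__)
```

Without the normalization step, the text scores in this toy data would dominate the sum regardless of the chosen weights.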
Exercise: when handling bilingual podcasts, how would you balance retrieval quality across the two languages?
4. Advanced CLIP in Practice
4.1 Multimodal Index Design
Hybrid retrieval implementation
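import faiss
import numpy as np

class MultiModalIndex:
    def __init__(self, dim=512):
        # Inner product equals cosine similarity once vectors are L2-normalized
        self.index = faiss.IndexFlatIP(dim)
        self.metadata = []

    @staticmethod
    def _prepare(vec):
        # FAISS expects a float32 array of shape (n, dim); copy so the caller's data is untouched
        vec = np.array(vec, dtype=np.float32).reshape(1, -1)
        faiss.normalize_L2(vec)
        return vec

    def add_item(self, embedding, data_type, uri):
        self.index.add(self._prepare(embedding))
        self.metadata.append({"type": data_type, "uri": uri})

    def search(self, query_vec, filter_type=None, k=5):
        scores, indices = self.index.search(self._prepare(query_vec), k)
        results = []
        # Scores are ordered by rank, so pair them with indices position by position
        for score, idx in zip(scores[0], indices[0]):
            if idx < 0:  # FAISS pads with -1 when fewer than k items exist
                continue
            item = self.metadata[idx]
            if filter_type is None or item["type"] == filter_type:
                results.append((item, float(score)))
        return results

# Usage example (text_vec, image_vec and query_vec are assumed to be precomputed
# 512-d embeddings, e.g. CLIP features converted to NumPy arrays)
index = MultiModalIndex()
index.add_item(text_vec, "text", "article_001.txt")
index.add_item(image_vec, "image", "photo_123.jpg")
results = index.search(query_vec, filter_type="image")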
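Because `search` applies the type filter after taking the top k, a filtered query can return fewer than k results even when more matches exist in the index. One common workaround, sketched here as an assumption rather than as the article's design, is to fetch a longer ranked list and stop once k items of the requested type have been collected. `filter_ranked` is a hypothetical helper that works on plain Python lists, and the extra URIs are made-up sample data.

```python
def filter_ranked(ranked, metadata, filter_type=None, k=5):
    """Keep the first k hits whose metadata type matches filter_type.

    ranked: list of (index, score) pairs, best first, ideally longer than k.
    metadata: list of dicts with a "type" key, aligned with the index ids.
    """
    results = []
    for idx, score in ranked:
        item = metadata[idx]
        if filter_type is None or item["type"] == filter_type:
            results.append((item, score))
            if len(results) == k:
                break
    return results

metadata = [
    {"type": "text", "uri": "article_001.txt"},
    {"type": "image", "uri": "photo_123.jpg"},
    {"type": "text", "uri": "article_002.txt"},
    {"type": "image", "uri": "photo_456.jpg"},
]
# Assumed output of an over-fetched search (4 candidates for k=2)
ranked = [(0, 0.91), (2, 0.88), (1, 0.75), (3, 0.40)]
images = filter_ranked(ranked, metadata, filter_type="image", k=2)
```

An alternative design is to keep one index per modality, which avoids over-fetching at the cost of managing several indexes.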
5. Further Thoughts: Challenges and the Future of Multimodality
5.1 The Modality Alignment Problem
Contrastive loss function
\[
\mathcal{L} = \sum_{(i,j)\in P} \|v_i - v_j\|^2 + \sum_{(i,k)\in N} \max\left(0,\ \alpha - \|v_i - v_k\|^2\right)
\]
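where:
- \( v_i \): the embedding of sample \( i \)
- \( P \): positive pairs (cross-modal matches)
- \( N \): negative pairs (unrelated samples from different modalities)
- \( \alpha \): the margin beyond which a negative pair no longer contributes to the loss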
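To make the two terms concrete, here is a small worked example of the loss in plain Python. It is a sketch under stated assumptions: the 2-D vectors, the pair lists, and the margin of 1.0 are made-up values, and `contrastive_loss` is a hypothetical helper name.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def contrastive_loss(embeddings, positive_pairs, negative_pairs, margin=1.0):
    """Sum of squared distances over P plus hinge terms over N."""
    loss = 0.0
    for i, j in positive_pairs:
        # Positive term: pull matching pairs together
        loss += squared_distance(embeddings[i], embeddings[j])
    for i, k in negative_pairs:
        # Negative term: push non-matching pairs apart until the margin is met
        loss += max(0.0, margin - squared_distance(embeddings[i], embeddings[k]))
    return loss

# Toy embeddings: 0 = caption, 1 = matching image, 2 = unrelated image
vectors = [[0.0, 0.0], [0.3, 0.4], [0.6, 0.0]]
loss = contrastive_loss(vectors, positive_pairs=[(0, 1)], negative_pairs=[(0, 2)])
```

Here the positive pair contributes 0.3² + 0.4² = 0.25, and the negative pair contributes max(0, 1.0 - 0.36) = 0.64, for a total of 0.89. Once the unrelated image moves far enough away that its squared distance exceeds the margin, its term drops to zero.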
Exercise: when an infrared thermal imaging modality is added, how can you avoid retraining the entire multimodal model?
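Next in the series: "RAG Metamorphosis - A Panorama of Cutting-Edge Improvements"
- Adaptive retrieval: letting the AI decide for itself when to look things up
- Hypothetical Document Embedding (HyDE)
- Recursive retrieval: layer-by-layer questioning, like a detective working a case
- Retrieval-augmented fine-tuning: a breakdown of RA-DIT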
Further reading:
- "Learning Transferable Visual Models From Natural Language Supervision" (the CLIP paper)
- The PointNet paper: Charles R. Qi et al.
- Hugging Face audio processing guide: https://hf.co/docs/transformers