
Unlocking the "Visual Code" of Product Manuals: Deep Integration of Multimodal RAG and GPT-4 (AI Applications and Technology Series)

news 2025/9/4 11:08:18

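With artificial intelligence (AI) advancing rapidly, large language models (LLMs) such as GPT-4 have shown strong text understanding, generation, and reasoning capabilities. Yet when users work with complex technical products they often rely on the product manual, and a manual rarely consists of text alone: it also contains many diagrams, tables, flowcharts, and other visual information. Getting an AI model to "read" these documents that mix text and images, so that it can give more accurate and more actionable help, has become a key technical challenge.

This article explores Multimodal Retrieval Augmented Generation (Multimodal RAG) and focuses on its application to product manuals, a typical multimodal document scenario. We start from the basic principles of RAG, examine the challenges of multimodal information processing, describe the architecture and key techniques for implementing multimodal RAG, and use Python code examples to show how to build a question-answering system that understands both the text and the images in a product manual.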

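1. Introduction: The "Text-Image Gap" in Product Manuals

1.1 Limitations of LLMs when processing document information

1.2 The unique challenge of product manuals: the importance of visual information

1.3 The rise of Retrieval Augmented Generation (RAG) and its single-modality limits

1.4 Goal of this article: build a multimodal RAG system for deep understanding of product manuals

2. Retrieval Augmented Generation (RAG) Basics

2.1 Core components of RAG: Retriever and Generator

2.2 RAG workflow and value: fewer hallucinations, fresher knowledge

2.3 Challenges for RAG: handling unstructured, multimodal data

3. Why Multimodal RAG? Revealing the "Visual Language" of Product Manuals

3.1 Text and images working together: complementary information

3.2 Typical text-image interaction scenarios in product manuals

Installation walkthroughs (part identification, assembly steps)

Troubleshooting (diagnostic diagrams, connection schematics)

Feature introductions (button layout, interface navigation)

3.3 Shortcomings of single-modality RAG: lost context

4. Understanding Multimodal Data: Core Techniques and Models

4.1 Multimodal Embeddings:

4.1.1 Separate per-modality embeddings: text embeddings (BERT, SBERT) vs. image embeddings (ResNet, VGG)

4.1.2 Cross-modal Embeddings: CLIP, ALIGN, VL-BEiT, and similar models

4.1.3 Joint embedding space: the key to cross-modal retrieval

4.2 Multimodal data representation and indexing:

4.2.1 Chunking: how to handle document units that mix text and images

4.2.2 Index construction: vector databases (FAISS, ChromaDB, Pinecone)

4.3 Multimodal Retrieval:

4.3.1 Text query -> text/image retrieval

4.3.2 Image query -> text/image retrieval

4.3.3 Cross-modal semantic search

5. Building a Multimodal RAG System: Architecture and Implementation

5.1 Overall system architecture

5.2 Key modules in detail, with code examples:

5.2.1 Data preprocessing and multimodal information extraction:

Load the product manual (PDF/images)

Extract text with OCR (Optical Character Recognition)

Image processing and association (extract images, match descriptions)

Chunking and structuring (package text together with its associated images)

5.2.2 Multimodal embedding and vectorization:

Choose a suitable model (such as CLIP)

Generate joint embeddings for text fragments and their associated images

5.2.3 Vector indexing and storage:

Use a vector database such as ChromaDB to store embeddings and raw data

5.2.4 Multimodal retriever:

Accept a user query (text or image)

Query the embedding space and retrieve the most relevant {text, image} pairs

5.2.5 Context integration and Prompt Engineering:

Format the retrieved multimodal information into input an LLM can understand

In particular, how to describe images as text or pass images directly to an LLM that supports multimodal input

5.2.6 LLM (generator) integration:

Use GPT-4 (or another multimodal LLM) to generate the final answer

6. Practical Applications and Challenges

6.1 Case studies: intelligent customer service, product troubleshooting assistant

6.2 Current challenges and limitations

Compute requirements

Data processing complexity (OCR accuracy, image understanding precision)

Optimizing multimodal information fusion

Establishing evaluation criteria

7. Future Outlook

8. Conclusion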

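1. Introduction: The "Text-Image Gap" in Product Manuals

1.1 Limitations of LLMs when processing document information

The strength of current LLMs lies in their natural language understanding and generation. Given a pure-text knowledge base, FAQ, or technical document, they can quickly retrieve, summarize, and answer complex questions. However, many real-world knowledge carriers, especially technical product manuals, reports, and handbooks, are not pure text.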

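1.2 The unique challenge of product manuals: the importance of visual information

The core value of a product manual is guidance that can be applied immediately. A well-designed manual contains operating steps, technical specifications, and troubleshooting guides, and supports them with:

Diagrams: show component connections, internal structure, and working principles.

Assembly illustrations: guide the user through putting the parts together step by step.

Photographs: show the actual appearance of parts, button layouts, and port definitions.

Flowcharts: depict complex decision paths or operating procedures.

These visual elements carry information that text cannot replace: spatial relationships, component identification, the physical order of operations, and intuitive examples. Without these visual cues, a text description alone is often fragmented, abstract, and hard to follow. Take the question "how do I connect the wires?": the text "connect port A to port B" is far less effective than a figure that clearly marks which port is A and how it plugs into port B.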

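1.3 The rise of Retrieval Augmented Generation (RAG) and its single-modality limits

Retrieval Augmented Generation (RAG) emerged to address the "outdated knowledge" and "hallucination" problems of LLMs. The basic idea: before the LLM generates an answer, retrieve relevant information from an external knowledge base (such as product documentation), then feed the retrieved information to the LLM as context, steering it toward more accurate, fact-based answers.

Standard RAG systems, however, are designed mainly for pure-text knowledge bases. When a product manual is rich in images, a text-based RAG system runs into an "information gap": it can retrieve the text description, but it cannot understand or use the image associated with that text, and so loses the visual cues that the user's query may implicitly depend on.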

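1.4 Goal of this article: build a multimodal RAG system for deep understanding of product manuals

This article aims to close that gap by introducing and demonstrating how to build a multimodal RAG system. The system can:

Understand the text and image content of a product manual.

Associate text with the images it relates to.

Retrieve the text-image information most relevant to a user query (text or image).

Integrate this multimodal context and pass it to an LLM that supports multimodal input, producing accurate, practical product usage guidance.

We show the key steps of this process through the architecture design and Python code examples.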

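2. Retrieval Augmented Generation (RAG) Basics

Before going into multimodal RAG, here is a brief review of how single-modality RAG works.

2.1 Core components of RAG: Retriever and Generator

A typical RAG system has two core components:

Retriever: given the user's query, retrieves the most relevant text fragments (context) from a prebuilt knowledge base (a document collection).

Generator: usually a large language model (LLM) that receives the user's query together with the fragments returned by the retriever and generates the final answer.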

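2.2 RAG workflow and value: fewer hallucinations, fresher knowledge

The RAG workflow is as follows:

1. Knowledge base construction: split the documents (for example, the text portion of a product manual) into chunks, extract the text passages, and convert them into vectors (embeddings). These vectors are usually produced by a pretrained language model (such as Sentence-BERT) and capture the semantics of the text.

2. Vector indexing: store the text vectors in a vector database so that similarity search is fast.

3. User query: the user asks a question.

4. Query vectorization: the query is embedded with the same model.

5. Retrieval: find the text fragments whose vectors are most similar to the query vector, that is, the most semantically relevant ones.

6. Context augmentation: combine the retrieved fragments with the original query into an augmented prompt.

7. LLM generation: pass the augmented prompt to the LLM, which generates an answer based on the provided context.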
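The workflow above can be sketched end to end in a few lines. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model such as Sentence-BERT, the three chunks are invented sample text, and the final LLM call is left out. The names `embed`, `retrieve`, and `build_prompt` are hypothetical helpers, not part of any library.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for Sentence-BERT)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 1-2: chunk the knowledge base and index the chunk vectors.
chunks = [
    "Connect the power adapter to the DC IN port on the back of the speaker.",
    "Press the volume buttons on the top panel to adjust loudness.",
    "If the speaker does not turn on, check that the power cable is firmly seated.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=2):
    """Steps 3-5: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, context):
    """Step 6: combine the retrieved chunks with the user query."""
    lines = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{lines}\n\nQuestion: {query}\nAnswer using only the context."

query = "How do I connect the power adapter?"
top_chunks = retrieve(query)
prompt = build_prompt(query, top_chunks)  # step 7 would send this prompt to an LLM
print(prompt)
```

Swapping `embed` for a real sentence encoder and the list-based `index` for a vector database changes the quality and scale of retrieval, but not the shape of the pipeline.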

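The value of RAG:

Fewer hallucinations: the LLM's answer is anchored to retrieved facts, which lowers the chance of fabricated or inaccurate output.

Knowledge updates: when the knowledge base changes, there is no need to retrain the LLM; updating the knowledge base is enough.

Domain adaptation: RAG can be applied to any specific domain (finance, medicine, law) by loading domain documents to improve the LLM's performance there.

2.3 Challenges for RAG: handling unstructured, multimodal data

For all its strengths, RAG is mostly limited to pure-text information. For multimodal documents that contain pictures, tables, and charts, standard RAG cannot interpret the visual content and loses a great deal of key information. A user might ask, "Which port does the arrow in Figure 3.5 point to?" Text-only RAG struggles with questions like this.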

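3. Why Multimodal RAG? Revealing the "Visual Language" of Product Manuals

3.1 Text and images working together: complementary information

A product manual is a model case of text and images working together to convey information. Text provides instructions, definitions, and explanations; images provide visual evidence, fill the gaps in the text description, and show complex context at a glance.

Images support text: an image shows what a component looks like and helps the user find it on the physical product.

Text explains images: text can label the key parts of an image and explain their meaning or function.

Spatial and structural information: images are the natural carrier for spatial layout, relative positions of components, connections, and operation sequences.

3.2 Typical text-image interaction scenarios in product manuals

Installation walkthroughs (part identification, assembly steps): "Following Figure 2.1, insert part A (marked in red) into slot C of base B." Here the user has to identify part A, understand its shape and position in the figure, and see how it goes into slot C.

Troubleshooting (diagnostic diagrams, connection schematics): "If indicator D blinks green, check whether cable E in Figure 4.3 is loose." The user has to find Figure 4.3, identify cable E, and confirm that it is connected as the schematic shows.

Feature introductions (button layout, interface navigation): "To enter the settings menu, press the button marked 'SET' on the front panel, as shown in Figure 1.1." The user has to find the "SET" button, and Figure 1.1 shows where it is.

3.3 Shortcomings of single-modality RAG: lost context

In these scenarios, a RAG system that retrieves only text may find the description "connect port A to port B" but cannot tell the user what port A looks like or where port B is. It may find "check whether cable E is loose" but cannot point to the specific cable in the figure or show what "loose" looks like. This loss of information sharply reduces how much RAG can help with product manuals.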

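4. Understanding Multimodal Data: Core Techniques and Models

To implement multimodal RAG, the first task is to let the AI system understand and process both text and images, and to associate them with each other.

4.1 Multimodal Embeddings

4.1.1 Separate per-modality embeddings

Traditional text models (BERT, RoBERTa, Sentence-BERT) and image models (ResNet, VGG, ViT) each produce embeddings for their own modality, but those vectors usually live in different spaces and cannot be compared directly across modalities.

4.1.2 Cross-modal Embeddings: the CLIP example

Cross-modal embedding models (such as OpenAI's CLIP, Google's ALIGN, and Microsoft's VL-BEiT) are the key to this problem. Trained on large datasets of image-text pairs, they learn to map different modalities into one joint embedding space.

CLIP (Contrastive Language-Image Pre-training): CLIP has a text encoder and an image encoder. Both encode their input into fixed-dimension vectors (embeddings). Through contrastive learning, CLIP pulls the embedding of a descriptive text close to the embedding of the image it describes in the joint space, and pushes it away from the embeddings of unrelated texts and images.

How CLIP works:

Training: given a large number of (image, text description) pairs, the training objective maximizes the similarity of correctly matched (image, text) pairs and minimizes the similarity of mismatched pairs.

Inference:

Given a piece of text (such as a user query), the text encoder produces its text embedding.

Given an image, the image encoder produces its image embedding.

The cosine similarity between the two in the joint space indicates how well the text describes the image. More importantly, a text embedding can be used to search for the most similar image embeddings, and vice versa.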
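To make the training objective concrete, here is a minimal pure-Python sketch of a symmetric contrastive loss over a small batch, in the spirit of CLIP's objective. It is an illustration under assumptions, not CLIP's actual training code: the 2-D vectors are hand-made, the temperature of 0.07 is just a fixed toy value, and the function names are hypothetical.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_matrix(image_embs, text_embs):
    """Entry [i][j] is the cosine similarity of image i and text j."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    sims = cosine_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    # Symmetric loss: average of the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))

# Toy 2-D embeddings: in the aligned case, pair i points in a similar direction.
image_embs = [[1.0, 0.1], [0.1, 1.0]]
aligned_texts = [[0.9, 0.2], [0.2, 0.9]]
swapped_texts = [[0.2, 0.9], [0.9, 0.2]]

loss_aligned = clip_contrastive_loss(image_embs, aligned_texts)
loss_swapped = clip_contrastive_loss(image_embs, swapped_texts)
print(f"aligned: {loss_aligned:.4f}, swapped: {loss_swapped:.4f}")
```

The loss is small when each image is closest to its own text and large when the pairs are swapped, which is the pressure that shapes the joint space during training.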

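4.2 Multimodal data representation and indexing

4.2.1 Chunking

The first step is to break a product manual (usually a PDF file) into meaningful units of information. A good chunking strategy preserves the association between text and images.

Page-level chunking: treat each page as one unit, containing the page's text and all images on that page.

Region-level chunking: at a finer grain, combine one passage on a page with its associated image (a particular diagram, or the image labeled "Figure 3.5") into a "multimodal chunk".

Text-image association: extract the images on a page and use OCR to read each image's caption or title, which then serves as the image's text description.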
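Region-level chunking with text-image association might look like the following sketch. It assumes that paragraphs and figure captions have already been extracted (for example by a PDF parser plus OCR) and that figures are referenced in the form "Figure 3.5" or "Fig. 3.5"; the `MultimodalChunk` class, the `build_chunks` helper, and the file paths are all hypothetical.

```python
import re
from dataclasses import dataclass, field

@dataclass
class MultimodalChunk:
    chunk_id: str
    text: str
    image_paths: list = field(default_factory=list)
    page_num: int = 0

# Matches references such as "Figure 3.5" or "Fig. 2.1" (an assumed naming scheme).
FIGURE_REF = re.compile(r"\bFig(?:ure|\.)?\s*(\d+(?:\.\d+)*)", re.IGNORECASE)

def build_chunks(page_num, paragraphs, figures):
    """Pair each paragraph with the figures it mentions.

    figures maps a figure number (e.g. "3.5") to an image path, as might be
    recovered from OCR'd captions. Paragraphs with no reference get no image.
    """
    chunks = []
    for i, paragraph in enumerate(paragraphs):
        refs = FIGURE_REF.findall(paragraph)
        # dict.fromkeys removes duplicate references while keeping their order
        paths = [figures[r] for r in dict.fromkeys(refs) if r in figures]
        chunks.append(MultimodalChunk(f"page_{page_num}_chunk_{i}", paragraph, paths, page_num))
    return chunks

paragraphs = [
    "Insert part A into slot C of base B, as shown in Figure 2.1.",
    "Tighten the screws by hand. No tools are required.",
    "Check cable E against Fig. 4.3 and Figure 2.1 before powering on.",
]
figures = {"2.1": "images/fig_2_1.png", "4.3": "images/fig_4_3.png"}

chunks = build_chunks(2, paragraphs, figures)
for chunk in chunks:
    print(chunk.chunk_id, chunk.image_paths)
```

Explicit figure references are the easy case. Manuals that place a figure next to a paragraph without naming it would need layout information (bounding boxes, reading order) instead of a regular expression.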

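4.2.2 Index construction

Once the document has been broken into multimodal chunks containing text and images (or image descriptions), we need an index that supports cross-modal semantic search.

Joint embeddings: use a model such as CLIP to generate embedding vectors in the joint space for each chunk's text and for the visual features of its associated image (or for the image's descriptive text obtained through OCR).

Vector database: store these embeddings together with their metadata (original text, image path, page number, original chunk ID) in a vector database. Vector stores such as FAISS, ChromaDB, Weaviate, and Pinecone perform efficient nearest-neighbor search over high-dimensional vectors, typically as Approximate Nearest Neighbor (ANN) search.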

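4.3 Multimodal Retrieval

4.3.1 Cross-modal queries

The user can query with either text or an image.

Text query: the user types a natural-language question, for example, "How do I connect the power cable? What does Figure 3.2 show?"

Image query: the user uploads a photo of the product and asks, "What is this part called?" or "Where can I see a connection like this one?"

4.3.2 Retrieval process

Query vectorization:

If the query is text, CLIP's text encoder converts it into a text embedding.

If the query is an image, CLIP's image encoder converts it into an image embedding.

Vector database search: using the query embedding, find the K multimodal chunk embeddings most similar to it (for example, by cosine similarity).

Result extraction: fetch the corresponding original text fragments and image information (such as the image file path or OCR text) from the database.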
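The retrieval process can be illustrated with a brute-force search over a tiny mixed-modality index. This is a sketch under assumptions: the 3-D vectors are hand-made stand-ins for CLIP embeddings, and the `index` layout and `search` helper are hypothetical. A real system would delegate this step to a vector database.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy index: text and image entries share one space, so they can be ranked together.
index = [
    {"id": "page_2_text", "type": "text", "vector": [0.9, 0.1, 0.0],
     "payload": "Connect adapter (A) to DC IN (B)."},
    {"id": "page_2_image", "type": "image", "vector": [0.8, 0.3, 0.1],
     "payload": "images/page_2_assembly.png"},
    {"id": "page_4_text", "type": "text", "vector": [0.0, 0.2, 0.9],
     "payload": "Volume buttons are on the top."},
    {"id": "page_4_image", "type": "image", "vector": [0.1, 0.1, 0.8],
     "payload": "images/page_4_buttons.png"},
]

def search(query_vector, k=2):
    """Return the k entries most similar to the query, regardless of modality."""
    scored = [(cosine(query_vector, entry["vector"]), entry) for entry in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"id": e["id"], "type": e["type"], "payload": e["payload"], "score": s}
        for s, e in scored[:k]
    ]

# Pretend this vector came from encoding a question about the power connection.
query_vector = [1.0, 0.2, 0.0]
hits = search(query_vector, k=2)
for hit in hits:
    print(hit["id"], hit["type"], round(hit["score"], 3))
```

Because both modalities are ranked in one list, a single text query can surface the instruction text and the assembly figure together.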

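5. Building a Multimodal RAG System: Architecture and Implementation

We now build a working multimodal RAG system for understanding product manuals. We use an end-to-end pipeline with CLIP for multimodal embedding and retrieval, and assume an LLM that supports multimodal input (such as GPT-4V or LLaVA).

5.1 Overall system architecture

Data processing layer:

Data loader: reads PDF or image files.

OCR engine: extracts text from the images of PDF pages (which may be scans).

Image processing: identifies the main figures and tables on a page and extracts the images associated with the text.

Text/image associator: binds a page's text to the main images or figures on that page.

Document chunker: splits the associated text-image information into logical units (multimodal chunks).

Embedding and indexing layer:

Multimodal embedder: uses a model such as CLIP to produce vectors in a shared space for each multimodal chunk.

Vector database: stores the embeddings and the corresponding original text/image metadata.

Retrieval layer (Retriever):

Query processor: accepts a text or image query and vectorizes it.

Similarity search: runs a K-NN search in the vector database.

Information aggregation: collects the most relevant text fragments and image information.

Generation layer (Generator):

Prompt builder: formats the retrieved multimodal information (text fragments plus image paths/descriptions) into a prompt suitable for a multimodal LLM.

Multimodal LLM (such as GPT-4V): receives the prompt, interprets the text and images, and generates the final answer.

5.2 Key modules in detail, with code examples

We demonstrate the core embedding, indexing, and retrieval parts. Data preprocessing and LLM integration are described conceptually.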

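Required libraries:

pip install transformers torch sentence-transformers faiss-cpu Pillow chromadb openai  # openai is only needed for the optional LLM API call

(Note: faiss-cpu is the CPU build of FAISS; the GPU build, faiss-gpu, additionally requires a CUDA setup.)

Simulated data preparation:

Suppose we have a product manual of a few pages. To keep things simple, page content is simulated with text strings, and placeholder images stand in for the figures.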

import os
import textwrap

import numpy as np
from PIL import Image  # Pillow, used to create and load images

# --- Simulated data structure ---
# Each page (or chunk) carries its text and the path of the associated image.
# Format: {'id': unique_id, 'text': page_text, 'image_path': path_to_image_or_None, 'metadata': {}}
simulated_manual_data = [
    {
        'id': 'page_1',
        'text': "This product is a smart speaker.\nPart 1: Unboxing and first-time setup.\nRefer to Figure 1.1 to check all components in the package.",
        'image_path': 'images/manual_page_1_fig_1.1.png',  # placeholder for Figure 1.1
        'metadata': {'page_num': 1},
    },
    {
        'id': 'page_2_setup',
        'text': "First-time setup steps:\n1. Connect the power adapter (A) to the DC IN port (B) on the back of the speaker.\n2. Plug it into a wall outlet.\n3. Press the power button (C) to start the device.",
        'image_path': 'images/manual_page_2_assembly.png',  # placeholder for the assembly image
        'metadata': {'page_num': 2},
    },
    {
        'id': 'page_3_troubleshoot',
        'text': "Troubleshooting:\nIf the speaker does not turn on, check that the power connection is secure; see Figure 3.1.",
        'image_path': 'images/manual_page_3_fig_3.1_power.png',  # placeholder for the troubleshooting diagram
        'metadata': {'page_num': 3},
    },
    {
        'id': 'page_4_buttons',
        'text': "Main function buttons:\nThe volume up/down buttons are on the top, the power button is on the back, and the mode switch button is on the side.",
        'image_path': 'images/manual_page_4_buttons_layout.png',  # placeholder for the button layout image
        'metadata': {'page_num': 4},
    },
]

# Create solid-color placeholder images so the example runs without a real manual.
dummy_image_dir = "images"
os.makedirs(dummy_image_dir, exist_ok=True)
Image.new('RGB', (60, 30), color='red').save(os.path.join(dummy_image_dir, 'manual_page_1_fig_1.1.png'))
Image.new('RGB', (100, 150), color='blue').save(os.path.join(dummy_image_dir, 'manual_page_2_assembly.png'))
Image.new('RGB', (80, 120), color='green').save(os.path.join(dummy_image_dir, 'manual_page_3_fig_3.1_power.png'))
Image.new('RGB', (70, 100), color='yellow').save(os.path.join(dummy_image_dir, 'manual_page_4_buttons_layout.png'))

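5.2.2 Multimodal embedding and vectorization

We load CLIP through the transformers library and use its two encoders for both text and images, because CLIP maps both modalities into the same vector space. The sentence-transformers library (Sentence-BERT) appears in the code as a text-only alternative, but its embeddings live in a different space (384 dimensions for all-MiniLM-L6-v2, versus 512 for this CLIP model) and cannot be compared with CLIP image embeddings.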

import os
import textwrap

import chromadb
import numpy as np
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import CLIPModel, CLIPProcessor

# --- Configuration ---
EMBEDDING_MODEL_TEXT = 'all-MiniLM-L6-v2'  # Sentence-BERT model (text-only space)
EMBEDDING_MODEL_CLIP = 'openai/clip-vit-base-patch32'  # CLIP model for the joint space
CHROMA_COLLECTION_NAME = "product_manuals_multimodal"

# --- Model initialization ---
# Sentence-BERT could serve a text-only retrieval path:
# text_model = SentenceTransformer(EMBEDDING_MODEL_TEXT)
# CLIP provides the multimodal embeddings (text and images in one space).
clip_model = CLIPModel.from_pretrained(EMBEDDING_MODEL_CLIP)
clip_processor = CLIPProcessor.from_pretrained(EMBEDDING_MODEL_CLIP)
clip_device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model.to(clip_device)

# --- ChromaDB client ---
# ChromaDB is a vector database that stores embeddings together with metadata.
chroma_client = chromadb.Client()  # in-memory client

# Indexing strategy: text and images are embedded separately with CLIP's two
# encoders and stored in ONE collection. The embeddings are computed outside
# ChromaDB and passed in explicitly, so the cross-modal logic lives in our own
# indexing and query code rather than in a ChromaDB embedding function.

class CLIPEmbeddingFunction:
    """First attempt: Sentence-BERT for text, CLIP's image encoder for images.

    The two encoders write into different spaces, so text and image vectors
    produced here are NOT directly comparable. Kept for reference; the rest of
    the example uses CLIPJointEmbeddingFunction below.
    """

    def __init__(self, model_name, device, model_type="text"):  # "text" or "image"
        self.device = device
        self.model_name = model_name
        self.model_type = model_type
        self.clip_model = CLIPModel.from_pretrained(model_name).to(self.device)
        self.clip_processor = CLIPProcessor.from_pretrained(model_name)
        if model_type == "text":
            self.model = SentenceTransformer(EMBEDDING_MODEL_TEXT, device=self.device)
            self.dim = self.model.get_sentence_embedding_dimension()
        elif model_type == "image":
            self.model = self.clip_model.get_image_features
            # get_image_features returns projected vectors of size projection_dim
            self.dim = self.clip_model.config.projection_dim
        else:
            raise ValueError("model_type must be 'text' or 'image'")

    def __call__(self, inputs_list):
        if self.model_type == "text":
            return self.model.encode(inputs_list, convert_to_numpy=True)
        # Image branch: accept a list of file paths or a list of PIL images.
        if all(isinstance(i, str) and os.path.exists(i) for i in inputs_list):
            images = [Image.open(p).convert("RGB") for p in inputs_list]
        elif all(isinstance(i, Image.Image) for i in inputs_list):
            images = list(inputs_list)
        else:
            raise ValueError("Input must be a list of image paths or PIL Images for image embedding.")
        inputs = self.clip_processor(images=images, return_tensors="pt").to(self.device)
        with torch.no_grad():
            image_features = self.clip_model.get_image_features(**inputs)
        return image_features.cpu().numpy()

# A cleaner approach: use CLIP's encoders for everything, so text and image
# embeddings share one space and can be compared directly.
class CLIPJointEmbeddingFunction:
    def __init__(self, model_name, device):
        self.device = device
        self.model_name = model_name
        self.clip_model = CLIPModel.from_pretrained(model_name).to(self.device)
        self.clip_processor = CLIPProcessor.from_pretrained(model_name)
        # Both encoders project into a shared space of size projection_dim.
        self.dim = self.clip_model.config.projection_dim

    def encode_text(self, texts):
        # CLIP's text encoder accepts at most 77 tokens, so longer chunks are truncated.
        inputs = self.clip_processor(
            text=texts, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)
        with torch.no_grad():
            text_features = self.clip_model.get_text_features(**inputs)
        return text_features.cpu().numpy()

    def encode_image(self, image_paths):
        images = [Image.open(p).convert("RGB") for p in image_paths if os.path.exists(p)]
        if not images:
            return np.array([])  # no valid images
        inputs = self.clip_processor(images=images, return_tensors="pt").to(self.device)
        with torch.no_grad():
            image_features = self.clip_model.get_image_features(**inputs)
        return image_features.cpu().numpy()

    def __call__(self, input_texts):
        # An embedding function is called with a list of strings, so this entry
        # point covers text only; images go through encode_image explicitly.
        # Storage strategy used below: each text chunk and its associated image
        # become separate items in one collection, linked through their ids and
        # metadata. Text queries use CLIP's text encoder, image queries use its
        # image encoder, and nearest-neighbor search then matches items
        # regardless of modality.
        return self.encode_text(input_texts)

# Instantiate the embedding function
clip_embedding_function = CLIPJointEmbeddingFunction(EMBEDDING_MODEL_CLIP, clip_device)

# --- ChromaDB setup and indexing ---
# Remove a previous collection of the same name, if any.
try:
    chroma_client.delete_collection(CHROMA_COLLECTION_NAME)
    print(f"Deleted existing collection: {CHROMA_COLLECTION_NAME}")
except Exception:
    pass  # the collection does not exist yet

# Create the collection. No dimension is passed here; both CLIP encoders output
# vectors of the same size, so text and image items fit in one collection.
collection = chroma_client.create_collection(
    name=CHROMA_COLLECTION_NAME,
    metadata={"hnsw:space": "cosine"},  # use cosine distance
)

# Helper that adds one multimodal chunk (text + image) to ChromaDB
def add_to_chroma(doc_id, text_content, image_path, metadata):
    # Embed the text with CLIP's text encoder
    text_embedding = clip_embedding_function.encode_text([text_content])

    # Embed the image with CLIP's image encoder, if the file exists
    image_embedding = np.array([])
    if image_path and os.path.exists(image_path):
        image_embedding = clip_embedding_function.encode_image([image_path])

    # Text item; doc_id in the metadata links it to its image item
    collection.add(
        embeddings=[text_embedding[0].tolist()],  # Chroma expects a list of embeddings
        documents=[text_content],
        metadatas=[{**metadata, 'doc_id': doc_id, 'type': 'text'}],
        ids=[f"{doc_id}_textelem"],
    )

    # Image item, stored with its path as the document and in the metadata
    has_image = image_embedding.size > 0
    if has_image:
        collection.add(
            embeddings=[image_embedding[0].tolist()],
            documents=[f"Image: {image_path}"],
            metadatas=[{**metadata, 'doc_id': doc_id, 'type': 'image', 'image_path': image_path}],
            ids=[f"{doc_id}_imageelem"],
        )
    print(f"Added {doc_id}: text" + (" and image." if has_image else " only."))

# Populate ChromaDB with the simulated data
print("\nPopulating ChromaDB...")
for item in simulated_manual_data:
    add_to_chroma(item['id'], item['text'], item['image_path'], item['metadata'])
print(f"\nChromaDB collection '{CHROMA_COLLECTION_NAME}' created with {collection.count()} items.")

# --- Multimodal retriever ---
# The query is embedded with the CLIP encoder that matches its modality, then
# ChromaDB returns the nearest stored items, text and image alike.
def query_multimodal_rag(query_text=None, query_image_path=None, k=3):
    if not query_text and not query_image_path:
        print("Error: Must provide either query_text or query_image_path.")
        return None, None

    if query_text:
        print(f"\nQuerying with text: '{query_text}'")
        query_embedding = clip_embedding_function.encode_text([query_text])
        query_type = "text"
    elif os.path.exists(query_image_path):
        print(f"\nQuerying with image: '{query_image_path}'")
        query_embedding = clip_embedding_function.encode_image([query_image_path])
        query_type = "image"
    else:
        print("Invalid image path provided for query.")
        return None, None

    # Similarity search: the query embedding is compared against all stored embeddings.
    results = collection.query(
        query_embeddings=[query_embedding[0].tolist()],  # Chroma expects a list of query embeddings
        n_results=k,
        include=['documents', 'metadatas', 'distances'],
    )

    # Flatten the per-query result lists into one list of dicts
    retrieved_items = []
    if results and results['ids'] and results['ids'][0]:
        for i in range(len(results['ids'][0])):
            retrieved_items.append({
                'id': results['ids'][0][i],
                'document': results['documents'][0][i],
                'metadata': results['metadatas'][0][i],
                'distance': results['distances'][0][i],
            })
    return retrieved_items, query_type

# --- Example usage ---
print("\n--- Testing Multimodal Retrieval ---")

# Example 1: text query that should surface both text and image context
text_query = "How to connect the power adapter?"
retrieved_docs_text, query_type_text = query_multimodal_rag(query_text=text_query, k=2)
print("\n--- Text Query Results ---")
if retrieved_docs_text:
    for item in retrieved_docs_text:
        print(f" ID: {item['id']}, Distance: {item['distance']:.4f}")
        print(f" Metadata: {item['metadata']}")
        print(f" Content: {textwrap.shorten(item['document'], width=100, placeholder='...')}\n")

# Example 2: image query (simulating a photo of the power connector).
# manual_page_2_assembly.png is reused as the query image to find related text.
image_query_path = os.path.join(dummy_image_dir, 'manual_page_2_assembly.png')
retrieved_docs_image, query_type_image = query_multimodal_rag(query_image_path=image_query_path, k=2)
print("\n--- Image Query Results ---")
if retrieved_docs_image:
    for item in retrieved_docs_image:
        print(f" ID: {item['id']}, Distance: {item['distance']:.4f}")
        print(f" Metadata: {item['metadata']}")
        print(f" Content: {textwrap.shorten(item['document'], width=100, placeholder='...')}\n")

# --- 5.2.5 Context aggregation and prompting for the LLM ---
def build_llm_prompt(query, retrieved_context, llm_type="multimodal"):
    prompt = f"User Query: {query}\n\n"
    prompt += "Context from Product Manual:\n"
    if llm_type == "multimodal":
        # For multimodal LLMs (like GPT-4V), present the text and reference the images.
        # How the image data itself is passed depends on the specific API;
        # here images are represented by their file paths.
        image_count = 0
        for item in retrieved_context:
            if item['metadata'].get('type') == 'image':
                prompt += (
                    f" - Image Snippet {image_count}: Referencing file "
                    f"'{item['metadata'].get('image_path', 'N/A')}'. Likely shows: "
                    f"{textwrap.shorten(item['document'], width=50, placeholder='...')}\n"
                )
                image_count += 1
            else:
                prompt += f" - Text Snippet: {textwrap.shorten(item['document'], width=150, placeholder='...')}\n"
        prompt += "\nBased on the above text and image context, answer the user's query.\n"
    else:
        # Text-only LLMs cannot see images. Passing image content would require an
        # image captioning model or manual descriptions, so only text is included.
        for item in retrieved_context:
            if item['metadata'].get('type') != 'image':
                prompt += f" - Text Snippet: {textwrap.shorten(item['document'], width=150, placeholder='...')}\n"
        prompt += "\nBased on the above text context, answer the user's query.\n"
    return prompt

# Example of building a prompt (conceptual, for a multimodal LLM)
if retrieved_docs_text:
    llm_context_prompt = build_llm_prompt(text_query, retrieved_docs_text, llm_type="multimodal")
    print("\n--- LLM Prompt Construction (Conceptual for Multimodal LLM) ---")
    print(llm_context_prompt)

# To make this fully functional, pass the prompt together with the actual image
# data to a multimodal LLM API. The commented sketch below uses the OpenAI chat
# completions message format; the model name is only an example (the original
# "gpt-4-vision-preview" has since been retired), so substitute any
# vision-capable model available to you.
#
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-4-vision-preview",
#     messages=[
#         {"role": "system", "content": "You are a helpful AI assistant assisting with product manuals."},
#         {"role": "user", "content": [
#             {"type": "text", "text": "User Query: How to connect the power adapter?"},
#             {"type": "text", "text": "Context from Product Manual:"},
#             # One entry per retrieved item. Text items are passed as text:
#             {"type": "text", "text": retrieved_docs_text[0]['document']},
#             # Image items need real image data: a public URL or a Base64 data URL.
#             # A local file path is not accepted as-is.
#             # {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_DATA>"}},
#             # ... add entries for all retrieved context ...
#         ]},
#     ],
# )
# print("\nLLM Response:", response.choices[0].message.content)

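5.2.2 Multimodal embedding and vectorization (code walkthrough)

Model selection: we use the openai/clip-vit-base-patch32 model, which includes a text encoder and an image encoder that map both modalities into the same vector space.

CLIPJointEmbeddingFunction: this class wraps the CLIP model. encode_text uses CLIP's text encoder and encode_image uses CLIP's image encoder.

ChromaDB integration: ChromaDB is an easy-to-use vector database. We define CLIPJointEmbeddingFunction, but a ChromaDB embedding function handles one input type (strings), so the code computes embeddings itself and passes them to add explicitly.

Strategy: we take a divide-and-link approach:

For each multimodal chunk (page/section), its text content and its associated image path are added to ChromaDB separately.

The text content is embedded with CLIP's text encoder.

The image content (loaded from its file path) is embedded with CLIP's image encoder.

The key point is that the text item and the image item of one chunk get ids derived from the same doc_id (for example page_1_textelem and page_1_imageelem) and share the same page_num metadata; the image item also records its image_path.

Implementation details:

The add_to_chroma function adds one multimodal chunk (text + image) to ChromaDB. It embeds the text and the image separately and stores them as two independent items whose ids derive from the same doc_id and which carry the same page metadata, so they can be re-associated after retrieval.

The query_multimodal_rag function handles a user query (text or image). Depending on the query type, it uses the matching CLIP encoder to produce the query embedding and then runs a similarity search in ChromaDB, which returns the nearest stored embeddings whether they came from text or from images.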
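Because text and image items are stored separately, a retrieval result may contain only one half of a chunk. The sketch below shows one possible way to regroup hits by chunk, assuming items shaped like those returned by query_multimodal_rag and ids ending in _textelem or _imageelem as in add_to_chroma. The helper names `base_doc_id` and `group_by_chunk` and the sample hits are hypothetical.

```python
def base_doc_id(item_id):
    """Strip the modality suffix to recover the chunk's doc_id."""
    for suffix in ("_textelem", "_imageelem"):
        if item_id.endswith(suffix):
            return item_id[: -len(suffix)]
    return item_id

def group_by_chunk(retrieved_items):
    """Merge text and image hits of the same chunk, ordered by best distance."""
    groups = {}
    for item in retrieved_items:
        doc_id = base_doc_id(item["id"])
        group = groups.setdefault(
            doc_id,
            {"doc_id": doc_id, "text": None, "image_path": None,
             "best_distance": item["distance"]},
        )
        group["best_distance"] = min(group["best_distance"], item["distance"])
        if item["id"].endswith("_imageelem"):
            group["image_path"] = item["metadata"].get("image_path")
        else:
            group["text"] = item["document"]
    return sorted(groups.values(), key=lambda g: g["best_distance"])

# Invented sample hits in the shape produced by query_multimodal_rag.
retrieved_items = [
    {"id": "page_2_setup_textelem", "document": "Connect adapter (A) to DC IN (B).",
     "metadata": {"page_num": 2}, "distance": 0.21},
    {"id": "page_3_troubleshoot_imageelem", "document": "Image: images/fig_3.1.png",
     "metadata": {"page_num": 3, "type": "image", "image_path": "images/fig_3.1.png"},
     "distance": 0.35},
    {"id": "page_2_setup_imageelem", "document": "Image: images/assembly.png",
     "metadata": {"page_num": 2, "type": "image", "image_path": "images/assembly.png"},
     "distance": 0.40},
]

grouped = group_by_chunk(retrieved_items)
for g in grouped:
    print(g["doc_id"], "| text:", g["text"], "| image:", g["image_path"])
```

A group whose text or image_path is still None marks a chunk where only one modality was retrieved. The missing half could then be fetched by its id (for example with the collection's get method) before building the prompt.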

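5.2.3 Vector indexing and storage

ChromaDB: initialized as a local, in-memory vector database.

Collection: we create a collection named product_manuals_multimodal and set it to use cosine similarity ("hnsw:space": "cosine").

Adding data: the add_to_chroma function stores both the text content (as documents) and the corresponding image path (as the document string, plus the image_path metadata field), with embeddings generated by CLIP.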

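5.2.4 Multimodal retriever

The query_multimodal_rag function holds the core retrieval logic.

It accepts a text query or an image path.

It uses clip_embedding_function to convert the query into a CLIP embedding.

It calls collection.query to run a similarity search in ChromaDB for the K items most similar to the query vector, whether they are text or image embeddings.

It returns the retrieved document content, metadata, and distances.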

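5.2.5 Context integration and Prompt Engineering

The build_llm_prompt function shows how to assemble the retrieved multimodal information (text fragments and image information) into a prompt suitable for a multimodal LLM.

For an LLM that supports multimodal input (such as GPT-4V), the prompt can reference the image files or include text descriptions of the images.

Important: actually sending an image to an LLM API requires a specific format. In the GPT-4V API, for example, an image is embedded in the message list as a URL or as a Base64-encoded string. The code example only shows how to organize this information.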
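As a sketch of that last point, the snippet below assembles a user message in the content-list format used by OpenAI's chat completions API, encoding local images as Base64 data URLs. It only builds the payload and makes no API call. The helper names `image_to_data_url` and `build_user_content` are hypothetical, and the demo file contains placeholder bytes rather than a real image.

```python
import base64
import mimetypes
import os
import tempfile

def image_to_data_url(path):
    """Read a local file and return it as a Base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"

def build_user_content(query, retrieved_items):
    """Turn the query and retrieved items into a list of text/image content parts."""
    content = [{"type": "text",
                "text": f"User query: {query}\n\nContext from the product manual:"}]
    for item in retrieved_items:
        meta = item["metadata"]
        if meta.get("type") == "image" and os.path.exists(meta.get("image_path", "")):
            content.append({"type": "image_url",
                            "image_url": {"url": image_to_data_url(meta["image_path"])}})
        else:
            # Text items, and image items whose file is missing, are passed as text.
            content.append({"type": "text", "text": item["document"]})
    return content

# Demo with a throwaway file standing in for a figure (the bytes are not a real image).
with tempfile.TemporaryDirectory() as tmp:
    fake_image = os.path.join(tmp, "fig_3.1.png")
    with open(fake_image, "wb") as f:
        f.write(b"\x89PNG placeholder bytes")
    items = [
        {"document": "Check that the power connection is secure; see Figure 3.1.",
         "metadata": {"type": "text", "page_num": 3}},
        {"document": f"Image: {fake_image}",
         "metadata": {"type": "image", "image_path": fake_image, "page_num": 3}},
    ]
    content = build_user_content("How do I connect the power adapter?", items)

for part in content:
    print(part["type"])
```

The resulting list would go into the "content" field of a user message. Large manuals can produce many images per query, so it is worth capping how many images are attached to a single request.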

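5.2.6 LLM (generator) integration

Multimodal LLM: finally, the multimodal context retrieved by RAG and assembled into the prompt is sent to an LLM that can interpret both text and images (such as GPT-4V, LLaVA, or BLIP-style models followed by an LLM).

Answer generation: the LLM analyzes the user query together with the provided text-image context and generates the final answer, which can draw on the visual information.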

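6. Practical Applications and Challenges

6.1 Case studies: intelligent customer service, product troubleshooting assistant

Intelligent customer service: a user uploads a photo of a device and asks, "What is this port for?" The system uses multimodal RAG to find the page containing that port's image and description and passes it to the LLM, which answers: "This is the USB-C charging port, used to connect the power adapter or transfer data."

Product troubleshooting assistant: a user describes a fault: "My printer is jammed, and the indicator is blinking as shown in picture X." The system retrieves the text-image information for that blinking pattern and, combined with the picture the user provided, guides the user through the corresponding steps.

6.2 Current challenges and limitations

Compute: running large CLIP models for embedding and retrieval requires substantial compute.

Data processing complexity:

OCR accuracy: for low-quality scans, complex layouts, or handwriting, OCR accuracy can drop, which in turn degrades text embeddings and retrieval.

Image understanding precision: models such as CLIP understand images well in general, but for very specific, highly technical figures (complex circuit diagrams, precise dimension annotations) their understanding may still fall short.

Automating text-image association: accurately linking a specific figure on a page to its most relevant text fragment is a challenging information extraction task.

Optimizing multimodal fusion: how to fuse text and image information into a single context that an LLM can readily interpret and actually make use of is an important direction for both Prompt Engineering and model design.

Establishing evaluation criteria: quantifying the "accuracy" and "usefulness" of a multimodal RAG system on product manuals requires dedicated metrics and datasets.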
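On the retrieval side, two commonly used metrics are Recall@K and Mean Reciprocal Rank (MRR). The sketch below computes both over a tiny hand-labeled evaluation set. The item ids and relevance labels are invented for illustration, and answer quality from the LLM would need separate evaluation.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, item_id in enumerate(retrieved_ids, start=1):
        if item_id in relevant:
            return 1.0 / rank
    return 0.0

# Each entry: what the retriever returned (in order) and what a human marked relevant.
eval_set = [
    {"retrieved": ["page_2_text", "page_2_image", "page_4_text"],
     "relevant": ["page_2_text", "page_2_image"]},
    {"retrieved": ["page_1_text", "page_3_image", "page_3_text"],
     "relevant": ["page_3_text", "page_3_image"]},
]

mean_recall_at_2 = sum(
    recall_at_k(e["retrieved"], e["relevant"], k=2) for e in eval_set
) / len(eval_set)
mrr = sum(reciprocal_rank(e["retrieved"], e["relevant"]) for e in eval_set) / len(eval_set)

print(f"Recall@2: {mean_recall_at_2:.2f}, MRR: {mrr:.2f}")
```

Marking both the text item and the image item of a chunk as relevant, as done here, makes the metrics sensitive to whether the retriever returns the figure as well as the instruction text.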

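7. Future Outlook

Multimodal RAG is developing quickly. Future research and applications will focus on:

More efficient and more robust multimodal models: improving the image understanding and text-image association abilities of CLIP-like models while lowering compute cost.

Finer-grained chunking and information extraction: automatically recognizing document structure, charts, and key components to pair text and images precisely.

Interactive multimodal RAG: letting users iteratively filter retrieval results and refine their queries to reach more precise answers.

Combining with other techniques: a knowledge graph, for example, can give a product manual a structured knowledge system, and combining it with multimodal RAG may enable deeper understanding.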

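8. Conclusion

Product manuals are an important bridge for users trying to understand and use complex products, and their value depends heavily on conveying information through text and images together. Traditional text-based RAG systems face a significant "visual gap" in this scenario.

The multimodal RAG approach described here uses cross-modal models such as CLIP to bring a manual's text and images into a joint embedding space, which enables cross-modal semantic retrieval. With a carefully designed data processing pipeline, vector indexing strategy, and Prompt Engineering, we can build a system in which an LLM can "read" the figures in a manual, answer users' questions about visible components, and respond with precise answers that combine text and images.

Challenges remain in compute requirements and data processing accuracy, but multimodal RAG already offers a strong technical path for AI to understand complex, unstructured documents, and it is a key step toward new forms of human-computer interaction and better AI services. As the technology matures, AI applications across more domains should become deeper and more user-friendly.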
