
RAG for Image Knowledge Bases: Letting AI Read Every Frame

news 2025/11/11 1:20:09

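Limitations of traditional approaches:

RAG (Retrieval-Augmented Generation) combines an external knowledge base with a generative model, which helps address the limited domain knowledge of large models. Traditional knowledge bases rely mainly on plain-text embeddings for semantic search and content retrieval, and they often hit performance bottlenecks on mixed-format documents (such as PDFs containing text, images, and tables) or on long-context content.

Cohere Embed v4 offers a new answer to these challenges. Its multimodal embedding capability lets it process documents that mix text and images, such as PDFs and presentation slides, in a unified way. It also supports a context length of up to 128K tokens, roughly 200 pages, which suits long documents. Together these features make RAG systems more capable and more widely applicable, so they can better handle multimodal data and complex documents.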

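Core principles:

Cohere Embed v4 is a multimodal embedding model built for enterprises, officially released on April 15, 2025. It handles text, images, and mixed formats (such as PDF and PPT) in a unified way, and is particularly suited to enterprise applications that must process complex documents.

Its core highlights:

- Multimodal capability: embeds text and image content together, so it can handle documents that combine several kinds of elements, such as reports, presentations, and mixed text-and-image material.
- Very long context: supports a context length of up to 128K tokens, roughly 200 pages of document content, which suits the processing and analysis of long material.
- Multilingual coverage: covers more than 100 languages and supports cross-lingual search, so global applications need no separate language detection or translation step.
- Enterprise-grade security and efficiency: designed for industries with strict security requirements such as finance and healthcare, it supports private-cloud or on-premises deployment and offers compressed embeddings that can save up to 83% of storage cost.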
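To give a feel for how a storage saving of that size can arise, here is a small arithmetic sketch. The dimensions (1536 shortened to 256) and the 4-byte float32 values are assumptions chosen for illustration; the article does not say how the 83% figure is derived.

```python
# Illustrative storage arithmetic. The dimensions and value sizes below are
# assumptions for this example, not figures taken from the article.

def bytes_per_vector(dimensions: int, bytes_per_value: int) -> int:
    """Storage needed for one embedding vector."""
    return dimensions * bytes_per_value


def saving(baseline_bytes: int, compressed_bytes: int) -> float:
    """Fraction of storage saved relative to the baseline."""
    return 1 - compressed_bytes / baseline_bytes


full = bytes_per_vector(1536, 4)      # assumed: 1536 float32 values
shortened = bytes_per_vector(256, 4)  # assumed: 256 float32 values

print(f"full: {full} bytes, shortened: {shortened} bytes")
print(f"saving: {saving(full, shortened):.1%}")
```

Under these assumptions the saving works out to roughly 83%; other combinations of dimension and value size would give different numbers.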

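Next, we put Cohere Embed v4 to a hands-on test to see how it performs in practice. Because an embedding model only produces semantic representations, it usually needs to be paired with a large language model, for example Gemini Flash 2.5, to build a complete AI application pipeline.

In the vision-based retrieval-augmented generation (RAG) system we build here, Cohere Embed v4 and Gemini Flash 2.5 each have a distinct role and work closely together.

Cohere Embed v4 handles retrieval. It converts images and text into vector representations (embeddings) and uses them to find the images most relevant to the user's question. Its multimodal embeddings cover text, images, tables, charts, and other data types in a single vector space, which enables fast, accurate search over complex documents such as PDF reports and presentations.

Gemini Flash 2.5 handles generation. As a vision-language model (VLM), it understands both images and text and generates answers based on them.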

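How Cohere Embed v4 and Gemini Flash 2.5 work together, step by step:

1. Image embedding:
Cohere Embed v4 processes every image, produces the corresponding image embedding, and stores it in a vector database.

2. Question embedding:
When the user asks a question, Cohere Embed v4 converts the question into an embedding for the retrieval step.

3. Retrieval:
The system computes the similarity between the question embedding and the stored image embeddings to find the image most relevant to the question.

4. Answer generation:
The retrieved image and the user's question are passed together to Gemini Flash 2.5, which combines the image content with the question to generate the final answer.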
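The four steps above can be sketched end to end with toy data. Everything here is a stand-in: the three-dimensional vectors, the file names, and the `embed_question` and `generate_answer` stubs are hypothetical and only mimic what an embedding model and a vision-language model would return.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Step 1: pretend these vectors came from an embedding model (made-up values).
image_index = {
    "chart_a.png": [0.9, 0.1, 0.0],
    "chart_b.png": [0.1, 0.9, 0.2],
    "chart_c.png": [0.0, 0.2, 0.9],
}


def embed_question(question):
    # Step 2: stand-in for the query embedding call (hypothetical lookup).
    fake_vectors = {"question about chart b": [0.2, 0.8, 0.1]}
    return fake_vectors[question]


def retrieve(query_vector, index):
    # Step 3: pick the image whose embedding is most similar to the query.
    return max(index, key=lambda name: cosine_similarity(query_vector, index[name]))


def generate_answer(question, image_name):
    # Step 4: stand-in for the vision-language model call.
    return f"(answer to {question!r} based on {image_name})"


question = "question about chart b"
best_image = retrieve(embed_question(question), image_index)
print(generate_answer(question, best_image))
```

In a real system the dictionary would be replaced by a vector database and the two stubs by API calls; the control flow stays the same.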

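With this division of labor, Cohere Embed v4 embeds and retrieves the multimodal data, while Gemini Flash 2.5 generates the answer from the retrieved results.

In short, Cohere Embed v4 acts as the "information scout" that locates the image matching the question, and Gemini Flash 2.5 is the "answer builder" that produces a response from that image and the question. Together they form a visual RAG system in which users can ask questions in everyday language and extract information from images.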

Hands-on:

pip install -q cohere

# Create the Cohere API client. Get your API key from cohere.com
import cohere

cohere_api_key = "<<YOUR_COHERE_KEY>>"  # Replace with your Cohere API key
co = cohere.ClientV2(api_key=cohere_api_key)

Go to Google AI Studio to generate an API key for Gemini, then install the Google Gen AI SDK.

pip install -q google-genai

from google import genai

gemini_api_key = "<<YOUR_GEMINI_KEY>>"  # Replace with your Gemini API key
client = genai.Client(api_key=gemini_api_key)

import requests
import os
import io
import base64
import PIL.Image  # import the submodule explicitly so PIL.Image.open is available
import tqdm
import time
import numpy as np

# Some helper functions to resize images and to convert them to base64 format
max_pixels = 1568 * 1568  # Max resolution for images


# Resize too large images (modifies the image in place)
def resize_image(pil_image):
    org_width, org_height = pil_image.size

    # Resize image if too large
    if org_width * org_height > max_pixels:
        scale_factor = (max_pixels / (org_width * org_height)) ** 0.5
        new_width = int(org_width * scale_factor)
        new_height = int(org_height * scale_factor)
        pil_image.thumbnail((new_width, new_height))


# Convert images to a base64 string before sending it to the API
def base64_from_image(img_path):
    pil_image = PIL.Image.open(img_path)
    img_format = pil_image.format if pil_image.format else "PNG"

    resize_image(pil_image)

    with io.BytesIO() as img_buffer:
        pil_image.save(img_buffer, format=img_format)
        img_buffer.seek(0)
        img_data = f"data:image/{img_format.lower()};base64," + base64.b64encode(img_buffer.read()).decode("utf-8")

    return img_data


# Image list: some entries are local files, others are fetched from the web.
# The local ./img/*.webp files must already exist, because they cannot be downloaded.
images = {
    "test1.webp": "./img/test1.webp",
    "test2.webp": "./img/test2.webp",
    "test3.webp": "./img/test3.webp",
    "tesla.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbef936e6-3efa-43b3-88d7-7ec620cdb33b_2744x1539.png",
    "netflix.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bd84c9-5b62-4526-b467-3088e27e4193_2744x1539.png",
    "nike.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5cd33ba-ae1a-42a8-a254-d85e690d9870_2741x1541.png",
    "google.png": "https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395dd3b9-b38e-4d1f-91bc-d37b642ee920_2741x1541.png",
    "accenture.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08b2227c-7dc8-49f7-b3c5-13cab5443ba6_2741x1541.png",
    "tecent.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ec8448c-c4d1-4aab-a8e9-2ddebe0c95fd_2741x1541.png",
}

# Download the images and compute an embedding for each one
img_folder = "img"
os.makedirs(img_folder, exist_ok=True)

img_paths = []
doc_embeddings = []
for name, url in tqdm.tqdm(images.items()):
    img_path = os.path.join(img_folder, name)
    img_paths.append(img_path)

    # Download the image (skipped when the file is already present locally)
    if not os.path.exists(img_path):
        response = requests.get(url)
        response.raise_for_status()

        with open(img_path, "wb") as fOut:
            fOut.write(response.content)

    # Get the base64 representation of the image
    api_input_document = {
        "content": [
            {"type": "image", "image": base64_from_image(img_path)},
        ]
    }

    # Call the Embed v4.0 model with the image information
    api_response = co.embed(
        model="embed-v4.0",
        input_type="search_document",
        embedding_types=["float"],
        inputs=[api_input_document],
    )

    # Append the embedding to our doc_embeddings list
    emb = np.asarray(api_response.embeddings.float[0])
    doc_embeddings.append(emb)

doc_embeddings = np.vstack(doc_embeddings)
print("\n\nEmbeddings shape:", doc_embeddings.shape)
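To make the resize rule in the helper above concrete, the following sketch reproduces only its arithmetic, without Pillow. The function name `target_size` is introduced here for illustration; the 2744x1539 input matches the dimensions that appear in some of the image URLs. Pillow's `thumbnail` may round slightly differently, so treat the result as approximate.

```python
MAX_PIXELS = 1568 * 1568  # same pixel cap as the helper above


def target_size(width, height, max_pixels=MAX_PIXELS):
    """Return a (width, height) that fits within max_pixels, keeping the aspect ratio."""
    if width * height <= max_pixels:
        return width, height  # already small enough, leave unchanged
    # Scale both sides by the same factor so the area shrinks to max_pixels.
    scale = (max_pixels / (width * height)) ** 0.5
    return int(width * scale), int(height * scale)


new_width, new_height = target_size(2744, 1539)
print(new_width, new_height, new_width * new_height <= MAX_PIXELS)
```

Taking the square root of the area ratio is what keeps the aspect ratio intact: both sides shrink by the same factor, and truncating with `int` guarantees the result never exceeds the cap.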

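The following shows a simple vision-based RAG (retrieval-augmented generation) pipeline.

First, search() computes an embedding for the question, uses that embedding to search the pre-embedded image collection for the most relevant image, and returns that image.
Then answer() sends the question and the image together to Gemini to obtain the final answer.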

# display() is available by default in Jupyter/Colab; the import makes that explicit
from IPython.display import display


# Search allows us to find relevant images for a given question using Cohere Embed v4
def search(question, max_img_size=800):
    # Compute the embedding for the query
    api_response = co.embed(
        model="embed-v4.0",
        input_type="search_query",
        embedding_types=["float"],
        texts=[question],
    )

    query_emb = np.asarray(api_response.embeddings.float[0])

    # Score each image by dot product with the query
    # (this equals cosine similarity only when the embeddings are unit-normalized)
    cos_sim_scores = np.dot(query_emb, doc_embeddings.T)

    # Get the most relevant image
    top_idx = np.argmax(cos_sim_scores)

    # Show the image
    print("Question:", question)

    hit_img_path = img_paths[top_idx]

    print("Most relevant image:", hit_img_path)
    image = PIL.Image.open(hit_img_path)
    max_size = (max_img_size, max_img_size)  # Adjust the size as needed
    image.thumbnail(max_size)
    display(image)
    return hit_img_path

# Answer the question based on the information from the image
# Here we use Gemini 2.5 as a powerful Vision-LLM
def answer(question, img_path):
    prompt = [
        f"""Answer the question based on the following image.
Don't use markdown.
Please provide enough context for your answer.

Question: {question}""",
        PIL.Image.open(img_path),
    ]

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-04-17",
        contents=prompt,
    )

    # Use a separate name so the local variable does not shadow the function
    answer_text = response.text
    print("LLM Answer:", answer_text)
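One way to chain the two steps is a small wrapper that takes the search and answer functions as arguments. This is a sketch rather than part of the original notebook: `ask`, `fake_search`, `fake_answer`, the example question, and the file path are all hypothetical, and the stubs exist only so the snippet runs without API keys.

```python
def ask(question, search_fn, answer_fn):
    """Retrieve the most relevant image, then answer the question from it."""
    image_path = search_fn(question)
    return answer_fn(question, image_path)


# Stubs so the sketch runs offline; in the notebook you would pass the
# real search and answer functions instead.
def fake_search(question):
    return "img/example.png"  # hypothetical path


def fake_answer(question, image_path):
    return f"answer to {question!r} from {image_path}"


print(ask("What does the chart show?", fake_search, fake_answer))
```

With the functions defined in this article the call would be `ask(question, search, answer)`. Note that `answer()` as written prints its result instead of returning it, so `ask` would return `None` in that case.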