
LLM Learning: Large-Model Fundamentals (Vision-Language Models and Using AutoDL)

news 2025/9/8 11:10:48

1. Common VLMs

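In the large-model field, VLM stands for Vision-Language Model: a multimodal, generative AI model that can understand and process video, images, and text.
A VLM is built by combining a large language model (LLM) with a vision encoder, which gives the LLM the ability to "see". It can then interpret the video, image, and text inputs in a prompt and generate a text response. Unlike traditional computer vision models, a VLM is not bound to a fixed set of classes or to one specific task. After training on large corpora of text and image/video-caption pairs, it can be instructed in natural language to handle many classic vision tasks as well as newer generative AI tasks such as summarization and visual question answering.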
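To make "an LLM combined with a vision encoder" more concrete, here is a toy sketch of how many open VLMs feed an image to the LLM: the image is cut into patches, each patch is mapped into the LLM's embedding space, and the resulting "visual tokens" are placed in the same sequence as the text tokens. Everything here is an assumption made for illustration: the sizes, the random weights, and the names `patchify` and `project`. The encoder is reduced to patch splitting plus one linear projection, whereas real models use a full ViT and trained weights.

```python
import random

PATCH = 2        # patch side length in pixels (toy value, assumed)
EMBED_DIM = 4    # embedding width of the toy "LLM" (assumed)


def patchify(image, patch=PATCH):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def project(vectors, weights):
    """Linear projection of each vector into the LLM embedding space."""
    return [[sum(x * w for x, w in zip(vec, col)) for col in weights]
            for vec in vectors]


rng = random.Random(0)
# One weight column per output dimension; random stand-ins for trained weights
weights = [[rng.uniform(-1, 1) for _ in range(PATCH * PATCH)]
           for _ in range(EMBED_DIM)]

# A 4x4 image gives four 2x2 patches, hence four visual tokens
image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]
image_tokens = project(patchify(image), weights)

# Stand-in embeddings for a 3-token text prompt
text_tokens = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(3)]

# The LLM receives a single sequence: visual tokens followed by text tokens
sequence = image_tokens + text_tokens
print(len(sequence), len(sequence[0]))
```

The point of the sketch is only the data flow: once the image has been turned into vectors of the same width as the text embeddings, the language model can attend over both in one sequence.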

Some common VLMs are:

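        GPT-4V: an analysis-type VLM (it interprets visual input and answers in text). It is a powerful vision-language model developed by OpenAI that accepts combined image and text input and generates a text response. It performs well on tasks such as visual question answering and image captioning.
        Qwen2.5-VL: Alibaba Cloud's flagship vision-language model, released in 3-billion, 7-billion, and 72-billion-parameter sizes. It uses a ViT vision encoder and the Qwen 2.5 LLM. It can understand videos longer than one hour and can navigate desktop and smartphone interfaces.
        Claude 4: another representative analysis-type VLM, developed by Anthropic. It has strong language understanding and generation abilities and also performs well on vision tasks, for example answering questions about image content accurately.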

2. Qwen-VL image understanding example

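This example uses Qwen-VL to read several images. The prompts and image names are read from an Excel file, and the final results are written to Excel as well.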

import os

import dashscope
import pandas as pd
from dashscope import MultiModalConversation

dashscope.api_key = os.getenv('DASHSCOPE_API_KEY')
absolute_path = os.path.dirname(os.path.abspath(__file__))


def get_response(user_prompt, image_url):
    # Build the messages; the image is a local .jpg file next to this script.
    # os.path.join keeps the path valid on both Windows and Linux.
    image_path = os.path.join(absolute_path, f'{image_url}.jpg')
    local_file_path = f'file://{image_path}'
    messages = [
        {'role': 'system', 'content': [{'text': 'You are a helpful assistant.'}]},
        {'role': 'user', 'content': [
            {'image': local_file_path},
            {'text': f'{user_prompt}.'},
        ]},
    ]
    print(messages)
    completion = MultiModalConversation.call(model='qwen-vl-plus', messages=messages)
    # Check whether the API call succeeded
    if completion is None:
        print('API call returned None; the request may have failed or there may be a network problem')
        return 'Error: API call failed and returned None'
    if completion.status_code != 200:
        print(f'API call failed: {completion.status_code}, {completion.message}')
        return f'Error: {completion.message}'
    # Extract the text of the first choice
    try:
        response = completion.output.choices[0]['message']['content'][0]['text']
        print(f'response={response}')
        return response
    except Exception as e:
        print(f'Failed to parse the response: {e}')
        return f'Error: could not parse the response, {e}'


df = pd.read_excel(os.path.join(absolute_path, 'prompt_template_cn.xlsx'))
df['response'] = ''
for index, row in df.iterrows():
    user_prompt = row['prompt']
    image_url = row['image']
    print(f'user_prompt:{user_prompt}')
    print(f'image_url:{image_url}')
    # Get the VLM result: either the answer text or a string starting with "Error"
    response = get_response(user_prompt, image_url)
    print(f'response:{response}')
    df.loc[index, 'response'] = response

# Write the results to Excel. The output file name is an assumption;
# the original code only displayed df at this point.
df.to_excel(os.path.join(absolute_path, 'prompt_template_cn_response.xlsx'), index=False)
print(df)
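
The loop above calls the API once per spreadsheet row. To check the row handling without an API key or network access, one option is to pass the model call in as a function and substitute a stub. The following is a minimal sketch under that assumption: `run_batch` and `fake_vlm` are hypothetical names that do not appear in the original code, and plain dictionaries stand in for the DataFrame rows (with the same `prompt`, `image`, and `response` columns).

```python
def run_batch(rows, ask_model):
    """Return a copy of each row with a 'response' field filled in.

    A failure on one row is recorded as text so the rest of the batch still runs.
    """
    results = []
    for row in rows:
        try:
            answer = ask_model(row['prompt'], row['image'])
        except Exception as exc:
            answer = f'Error: {exc}'
        results.append({**row, 'response': answer})
    return results


def fake_vlm(user_prompt, image_name):
    """Stand-in for the real API call; returns a canned string."""
    if not image_name:
        raise ValueError('missing image name')
    return f'[stub answer for {image_name}.jpg] {user_prompt}'


rows = [
    {'prompt': 'Describe the picture', 'image': 'cat'},
    {'prompt': 'How many people are there', 'image': ''},
]
for result in run_batch(rows, fake_vlm):
    print(result['image'], '->', result['response'])
```

With pandas, the rows could come from `df.to_dict('records')`, and the real `get_response` function could be passed in place of `fake_vlm` once the dry run looks right.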