当前位置：首页 > news >正文

阿里开源Qwen3-Omni-30B-A3B三剑客——Instruct、Thinking 和 Captioner

news 2025/10/11 5:31:44

在这里插入图片描述
Qwen3-Omni是原生端到端多语言全模态基础模型。它能处理文本、图像、音频和视频，并以文本和自然语音形式提供实时流式响应。我们引入了多项架构升级以提升性能与效率。核心特性：

全模态顶尖表现：早期文本优先预训练与混合多模态训练提供原生多模态支持。在实现强劲音频与音视频表现的同时，单模态文本和图像性能无倒退。在36项音频/视频基准测试中22项达到SOTA，其中32项为开源SOTA；语音识别(ASR)、音频理解及语音对话表现可比肩Gemini 2.5 Pro。
多语言支持：涵盖119种文本语言、19种语音输入语言及10种语音输出语言。
- 语音输入：英语、中文、韩语、日语、德语、俄语、意大利语、法语、西班牙语、葡萄牙语、马来语、荷兰语、印尼语、土耳其语、越南语、粤语、阿拉伯语、乌尔都语。
- 语音输出：英语、中文、法语、德语、俄语、意大利语、西班牙语、葡萄牙语、日语、韩语。
创新架构：基于混合专家(MoE)的Thinker-Talker设计配合AuT预训练实现强泛化表征，辅以多码本设计将延迟压至最低。
实时音视频交互：低延迟流式响应，支持自然轮转与即时文本/语音反馈。
灵活控制：通过系统提示词定制行为，实现细粒度控制与轻松适配。
高细节音频描述器：现已开源Qwen3-Omni-30B-A3B-Captioner——通用型、高细节、低幻听音频描述模型，填补了开源社区关键空白。

模型架构

在这里插入图片描述

快速开始

模型描述与下载

以下是所有Qwen3-Omni模型的描述。请根据需求选择并下载适合的模型。

模型名称	描述
Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-30B-A3B的指令微调模型，包含思考者与对话者模块，支持音频、视频及文本输入，可输出音频与文本。更多信息请阅读Qwen3-Omni技术报告。
Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-30B-A3B的思维模型，包含思考者组件，具备思维链推理能力，支持音频、视频及文本输入，输出为文本。更多信息请阅读Qwen3-Omni技术报告。
Qwen3-Omni-30B-A3B-Captioner	基于Qwen3-Omni-30B-A3B-Instruct微调的下游音频细粒度描述模型，可为任意音频输入生成细节丰富、低幻觉的文本描述。包含思考者模块，支持音频输入与文本输出。更多信息可参考模型使用指南。

在Hugging Face Transformers或vLLM加载时，系统会根据模型名称自动下载权重文件。若运行环境不便于执行时下载权重，可参考以下命令将模型权重手动下载至本地目录：

# Download through ModelScope (recommended for users in Mainland China)
pip install -U modelscope
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Instruct --local_dir ./Qwen3-Omni-30B-A3B-Instruct
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Thinking --local_dir ./Qwen3-Omni-30B-A3B-Thinking
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Captioner --local_dir ./Qwen3-Omni-30B-A3B-Captioner# Download through Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir ./Qwen3-Omni-30B-A3B-Instruct
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Thinking --local-dir ./Qwen3-Omni-30B-A3B-Thinking
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Captioner --local-dir ./Qwen3-Omni-30B-A3B-Captioner

Transformers 使用指南

安装说明

Qwen3-Omni的Hugging Face Transformers代码已成功合并，但PyPI包尚未发布。因此，您需要通过以下命令从源码安装。我们强烈建议您新建Python环境以避免出现环境运行问题。

# If you already have transformers installed, please uninstall it first, or create a new Python environment
# pip uninstall transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate

我们提供一套工具包，帮助您更方便地处理各类音视频输入，提供类似API的体验。这包括支持base64、URL以及交错的音频、图像和视频。您可以通过以下命令安装，并确保系统已安装ffmpeg：

pip install qwen-omni-utils -U

此外，我们建议在运行Hugging Face Transformers时使用FlashAttention 2以减少GPU内存占用。但若您主要使用vLLM进行推理，则无需安装此功能，因为vLLM默认已集成FlashAttention 2。

pip install -U flash-attn --no-build-isolation

此外，您需要确保硬件兼容 FlashAttention 2。更多信息请查阅 FlashAttention 代码库的官方文档。FlashAttention 2 仅当模型以 torch.float16 或 torch.bfloat16 格式加载时可用。

代码片段

以下是展示如何使用Qwen3-Omni搭配transformers和qwen_omni_utils的代码片段：

import soundfile as sffrom transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_infoMODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
# MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(MODEL_PATH,dtype="auto",device_map="auto",attn_implementation="flash_attention_2",
)processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)conversation = [{"role": "user","content": [{"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},{"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},{"type": "text", "text": "What can you see and hear? Answer in one short sentence."}],},
]# Set whether to use audio in video
USE_AUDIO_IN_VIDEO = True# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)# Inference: Generation of the output text and audio
text_ids, audio = model.generate(**inputs, speaker="Ethan", thinker_return_dict_in_generate=True,use_audio_in_video=USE_AUDIO_IN_VIDEO)text = processor.batch_decode(text_ids.sequences[:, inputs["input_ids"].shape[1] :],skip_special_tokens=True,clean_up_tokenization_spaces=False)
print(text)
if audio is not None:sf.write("output.wav",audio.reshape(-1).detach().cpu().numpy(),samplerate=24000,)

vLLM 使用指南

安装说明

我们强烈推荐使用vLLM进行Qwen3-Omni系列模型的推理与部署。由于当前我们的代码尚处于合并请求阶段，且指令模型的音频输出推理支持即将在近期发布，您可通过以下命令从源码安装vLLM。请注意，我们建议您新建Python虚拟环境以避免运行环境冲突与兼容性问题。更多源码编译vLLM的细节，请参阅vLLM官方文档。

git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you meet an "Undefined symbol" error while using VLLM_USE_PRECOMPILED=1, please use "pip install -e . -v" to build from source.
# Install the Transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation

推理

您可以使用以下代码进行vLLM推理。参数limit_mm_per_prompt指定每条消息允许的每种模态数据的最大数量。由于vLLM需要预分配GPU内存，较大的值将需要更多GPU内存；如果出现OOM问题，请尝试减小此值。将tensor_parallel_size设置为大于1可启用多GPU并行推理，提高并发性和吞吐量。此外，max_num_seqs表示vLLM在每次推理步骤中并行处理的序列数。较大的值需要更多GPU内存，但可实现更高的批量推理速度。更多详情，请参阅vLLM官方文档。以下是一个使用vLLM运行Qwen3-Omni的简单示例：

import os
import torchfrom vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_infoif __name__ == '__main__':# vLLM engine v1 not supported yetos.environ['VLLM_USE_V1'] = '0'MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"# MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"llm = LLM(model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,tensor_parallel_size=torch.cuda.device_count(),limit_mm_per_prompt={'image': 3, 'video': 3, 'audio': 3},max_num_seqs=8,max_model_len=32768,seed=1234,)sampling_params = SamplingParams(temperature=0.6,top_p=0.95,top_k=20,max_tokens=16384,)processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)messages = [{"role": "user","content": [{"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"}], }]text = processor.apply_chat_template(messages,tokenize=False,add_generation_prompt=True,)audios, images, videos = process_mm_info(messages, use_audio_in_video=True)inputs = {'prompt': text,'multi_modal_data': {},"mm_processor_kwargs": {"use_audio_in_video": True,},}if images is not None:inputs['multi_modal_data']['image'] = imagesif videos is not None:inputs['multi_modal_data']['video'] = videosif audios is not None:inputs['multi_modal_data']['audio'] = audiosoutputs = llm.generate([inputs], sampling_params=sampling_params)print(outputs[0].outputs[0].text)

使用技巧（推荐阅读）

最低GPU内存要求

Model	Precision	15s Video	30s Video	60s Video	120s Video
Qwen3-Omni-30B-A3B-Instruct	BF16	78.85 GB	88.52 GB	107.74 GB	144.81 GB
Qwen3-Omni-30B-A3B-Thinking	BF16	68.74 GB	77.79 GB	95.76 GB	131.65 GB

注意：上表展示了使用transformers和BF16精度进行推理时的理论最低内存需求（测试时采用attn_implementation="flash_attention_2"）。Instruct模型包含思考者与讲述者双组件，而Thinking模型仅包含思考者部分。

音视频交互提示模版

当使用Qwen3-Omni进行音视频多模态交互时（输入为视频及对应音频，音频作为查询内容），我们推荐采用以下系统提示词。该设置可使模型在保持高推理能力的同时，更好地扮演智能助手等交互角色，且思考者生成的文本更具可读性——采用自然对话语气，不含难以发声的复杂格式，从而使讲述者输出更稳定流畅的音频。您可根据实际需求，在系统提示词的user_system_prompt字段中定制角色设定等描述信息。

user_system_prompt = "You are Qwen-Omni, a smart voice assistant created by Alibaba Qwen."
message = {"role": "system","content": [{"type": "text", "text": f"{user_system_prompt} You are a virtual voice assistant with no gender or age.\nYou are communicating with the user.\nIn user messages, “I/me/my/we/our” refer to the user and “you/your” refer to the assistant. In your replies, address the user as “you/your” and yourself as “I/me/my”; never mirror the user’s pronouns—always shift perspective. Keep original pronouns only in direct quotes; if a reference is unclear, ask a brief clarifying question.\nInteract with users using short(no more than 50 words), brief, straightforward language, maintaining a natural tone.\nNever use formal phrasing, mechanical expressions, bullet points, overly structured language. \nYour output must consist only of the spoken content you want the user to hear. \nDo not include any descriptions of actions, emotions, sounds, or voice changes. \nDo not use asterisks, brackets, parentheses, or any other symbols to indicate tone or actions. \nYou must answer users' audio or text questions, do not directly describe the video content. \nYou should communicate in the same language strictly as the user unless they request otherwise.\nWhen you are uncertain (e.g., you can't see/hear clearly, don't understand, or the user makes a comment rather than asking a question), use appropriate questions to guide the user to continue the conversation.\nKeep replies concise and conversational, as if talking face-to-face."}]
}

思维模型最佳实践

Qwen3-Omni-30B-A3B-Thinking模型主要针对多模态输入（包括文本、音频、图像和视频）的理解与交互设计。为获得最佳性能，我们建议用户在每轮对话中提交多模态输入时，附带明确的文字指令或任务描述。这有助于厘清意图，并显著提升模型调用推理能力的效果。例如：

messages = [{"role": "user","content": [{"type": "audio", "audio": "/path/to/audio.wav"},{"type": "image", "image": "/path/to/image.png"},{"type": "video", "video": "/path/to/video.mp4"},{"type": "text", "text": "Analyze this audio, image, and video together."},], }
]

在视频中使用音频

在多模态交互中，用户提供的视频通常伴随着音频（例如口述问题或视频中事件的声音）。这些信息有助于模型提供更好的交互体验。我们为用户提供以下选项，以决定是否使用视频中的音频。

# In data preprocessing
audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

# For Transformers
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=True)
text_ids, audio = model.generate(..., use_audio_in_video=True)# For vLLM
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = {'prompt': text,'multi_modal_data': {},"mm_processor_kwargs": {"use_audio_in_video": True,},
}