Rapidly Deploy the Edge-Ready MiniCPM-V4.0 Large Vision Model with OpenVINO™ Day0 Support
MiniCPM-V4.0 is the latest efficient model in the MiniCPM-V series, with 4B parameters in total. In the OpenCompass evaluation, its image understanding capability surpasses GPT-4.1-mini-20250414, Qwen2.5-VL-3B-Instruct, and InternVL2.5-8B. With its compact parameter count and efficient architecture, MiniCPM-V4.0 is an ideal choice for mobile deployment.
OpenVINO™ is a cross-platform deep learning deployment toolkit that can greatly optimize the inference performance of large language models, fully utilizing the available hardware compute while reducing memory footprint. This article describes how to deploy the MiniCPM-V4.0 model locally with the OpenVINO™ toolkit.
Contents
1. Environment preparation
2. Model download and conversion
3. Model deployment
Step 1: Environment preparation
The following commands set up a Python-based environment for model deployment.
python -m venv py_venv
./py_venv/Scripts/activate.bat
pip install -q "torch>=2.1" "torchvision" "timm>=0.9.2" "transformers>=4.45" "Pillow" "gradio>=4.40" "tqdm" "sentencepiece" "peft" "huggingface-hub>=0.24.0" --extra-index-url https://download.pytorch.org/whl/cpu
pip install -q "nncf>=2.14.0"
pip install -q "git+https://github.com/openvino-dev-samples/optimum-intel.git@minicpm4v" --extra-index-url https://download.pytorch.org/whl/cpu
pip install -q "git+https://github.com/openvino-dev-samples/optimum.git@minicpm4v" --extra-index-url https://download.pytorch.org/whl/cpu
pip install -q -U --pre "openvino>=2025.0" "openvino-tokenizers>=2025.0" "openvino-genai>=2025.0" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
Step 2: Model download and conversion
Before deploying the model, we first need to convert the original PyTorch model into OpenVINO™'s IR static graph format and compress it, for a more lightweight deployment and the best performance. With the optimum-cli command-line tool provided by Optimum, the format conversion can be completed in a single command:
optimum-cli export openvino --model openbmb/MiniCPM-V-4 MiniCPM-V-4-ov --trust-remote-code --weight-format fp16 --task image-text-to-text
Here, "--model" is the model id on Hugging Face. You can also download the original model in advance and replace the model id with the local path of the downloaded weights. For developers in China, the ModelScope community is the recommended download channel for the original model; see the official ModelScope guide for details: https://www.modelscope.cn/docs/models/download (a minimal download sketch follows below).
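As a minimal sketch, the modelscope Python package can be used to fetch the weights and obtain a local path to pass to "--model". The model id "OpenBMB/MiniCPM-V-4" is an assumption here; adjust it to the actual id on ModelScope.
# pip install modelscope
from modelscope import snapshot_download
# "OpenBMB/MiniCPM-V-4" is an assumed ModelScope model id; replace it if the actual id differs
local_dir = snapshot_download("OpenBMB/MiniCPM-V-4")
print(local_dir)  # pass this path to optimum-cli via --model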
After conversion, optimum-cli splits the original model into 4 sub-models: language_model, resampler_model, text_embedding_model, and vision_embedding_model. Among them, the language_model backbone consumes the most resources, so after conversion we can quantize it separately to a lower precision to gain higher inference performance. With OpenVINO™'s NNCF tool, the model file can be quantized offline in a few lines of code. After quantization we obtain an INT4-precision OpenVINO™ model, consisting of an .xml and a .bin file; compared with the FP16 model files, the INT4 model is only about 1/4 of the size.
- NNCF model quantization example
import nncf
import openvino as ov
from pathlib import Path
# Directory produced by optimum-cli in the previous step
model_dir = Path("MiniCPM-V-4-ov")
ov_model_path = model_dir / "openvino_language_model.xml"
ov_int4_model_path = model_dir / "openvino_language_model_int4.xml"
# INT4 symmetric weight compression of the language model
compression_configuration = {"mode": nncf.CompressWeightsMode.INT4_SYM, "group_size": 64, "ratio": 1.0, "all_layers": True}
ov_model = ov.Core().read_model(ov_model_path)
ov_compressed_model = nncf.compress_weights(ov_model, **compression_configuration)
ov.save_model(ov_compressed_model, ov_int4_model_path)
Step 3: Model deployment
We currently recommend using openvino-genai to deploy large language models and other generative AI workloads. It supports both Python and C++, has an installation footprint of less than 200MB, and supports streaming output as well as multiple sampling strategies (a sampling configuration sketch follows the example below).
- GenAI API deployment example
import requests
import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from io import BytesIO
from PIL import Image

# Directory produced in the previous step; "GPU" can be replaced with "CPU"
model_dir = "MiniCPM-V-4-ov"
device = "GPU"
ov_model = ov_genai.VLMPipeline(model_dir, device)
config = ov_genai.GenerationConfig()
config.max_new_tokens = 100


def load_image(image_file):
    # Load an image from a URL or a local path and wrap it into an OpenVINO tensor
    if isinstance(image_file, str) and (image_file.startswith("http") or image_file.startswith("https")):
        response = requests.get(image_file)
        image = Image.open(BytesIO(response.content)).convert("RGB")
    else:
        image = Image.open(image_file).convert("RGB")
    image_data = np.array(image.getdata()).reshape(1, image.size[1], image.size[0], 3).astype(np.uint8)
    return image, ov.Tensor(image_data)


def streamer(subword: str) -> bool:
    """
    Args:
        subword: sub-word of the generated text.
    Returns: flag indicating whether generation should be stopped.
    """
    print(subword, end="", flush=True)
    return False


image_path = "cat.png"  # local path or URL of the input image
image, image_tensor = load_image(image_path)
question = "What is unusual on this image?"
ov_model.start_chat()
output = ov_model.generate(question, image=image_tensor, generation_config=config, streamer=streamer)
ov_model.finish_chat()
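The sampling strategies mentioned above are exposed through GenerationConfig. Below is a minimal sketch of enabling sampling; the parameter values are illustrative, not tuned recommendations.
# Illustrative sampling settings (values are examples, not recommendations)
config = ov_genai.GenerationConfig()
config.max_new_tokens = 256
config.do_sample = True  # switch from greedy decoding to sampling
config.temperature = 0.7
config.top_p = 0.9
config.top_k = 50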
openvino-genai also provides a chat mode: between pipe.start_chat() and pipe.finish_chat(), the history of a multi-turn conversation is kept in memory as KV cache, which improves runtime efficiency (see the multi-turn sketch below).
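The following is a minimal multi-turn sketch using the pipeline created above; the follow-up question is illustrative, and it assumes the image only needs to be passed with the first turn while later turns reuse the cached context.
# Hypothetical multi-turn conversation: history is kept as KV cache between calls
ov_model.start_chat()
ov_model.generate("What is unusual on this image?", image=image_tensor, generation_config=config, streamer=streamer)
ov_model.generate("Describe the surroundings of the cat in one sentence.", generation_config=config, streamer=streamer)
ov_model.finish_chat()  # releases the cached chat history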
Below is a sample output of the model on an image understanding task:
The unusual aspect of this image is the cat's relaxed and vulnerable position. Typically, cats avoid exposing their bellies, which are sensitive and vulnerable areas, to potential threats. In this image, the cat is lying on its back in a cardboard box, exposing its belly and hindquarters, which is not a common sight. This behavior could indicate that the cat feels safe and comfortable in its environment, suggesting a strong bond with its owner and a sense of security in its home.
The complete example code is available in this notebook PR: https://github.com/openvinotoolkit/openvino_notebooks/pull/3047
The Gradio demo included in it is shown below:
[Gradio demo: MiniCPM-V4.0]
As shown, with less than 5GB of memory usage, the model runs smoothly on an Intel iGPU platform.
Summary
With the OpenVINO™ toolkit, developers can easily deploy the converted MiniCPM-V series models on Intel hardware platforms and further build all kinds of local LLM-based services and applications on top of them.
References
- openvino-genai repository: https://github.com/openvinotoolkit/openvino.genai
- OpenVINO™ section on ModelScope: https://www.modelscope.cn/organization/OpenVINO
- OpenVINO™ Model Hub: https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/model-hub.html