Part 49: Tesla P40 + Fastllm + Hunyuan-A13B-Instruct, CPU/GPU Hybrid Deployment and Inference
Environment
OS: CentOS 7
CPU: E5-2680 v4, 14 cores / 28 threads
RAM: DDR4-2133, 32 GB × 2
GPU: Tesla P40, 24 GB
Driver: 535
CUDA: 12.2
Fastllm Highlights
🚀 Simple to install and use: one command to install, one command to run.
🚀 CPU + GPU hybrid inference for large MoE models (a single GPU is enough to run DeepSeek 671B).
🚀 Low-level operators implemented in-house in C++, with no dependency on PyTorch.
🚀 Good compatibility: the pip install supports cards as old as the P100 and MI50, and building from source supports even more devices.
🚀 Multi-GPU tensor-parallel inference, including odd card counts such as 3, 5, or 7.
🚀 GPU + CPU hybrid tensor-parallel inference.
🚀 FP8 computation on both CPU and GPU, so even old hardware can run FP8 models.
🚀 Multi-CPU acceleration that keeps only one copy of the weights in memory.
🚀 Support for ROCm / AMD GPUs; for Iluvatar, MetaX, and Enflame accelerators; and for Huawei Ascend.
🚀 Dynamic batching and streaming output; decoupled front end and back end, portable across platforms, and can be compiled directly on Android.
🚀 Custom model structures can be defined in Python.
Using Fastllm
Fastllm supports old cards like the Tesla P40, and Hunyuan-A13B-Instruct is an MoE model with a small active parameter count, which makes the two a good match for this setup.
Step 1: Download Hunyuan-A13B-Instruct
Model page: https://modelscope.cn/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct
Host directory: /local_models/Hunyuan-A13B-Instruct
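If you don't have the weights yet, one way to fetch them onto the host is the modelscope CLI. A minimal sketch, assuming a current modelscope installation (verify the --model and --local_dir flags against your CLI version):

pip install modelscope
modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct --local_dir /local_models/Hunyuan-A13B-Instruct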
Start the Container (manual)
docker run -itd -p 38080:8080 -v /local_models:/models --gpus all --name cuda_fastllm_12 nvidia/cuda:12.1.0-devel-ubuntu22.04
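Before installing anything, it is worth confirming the container can actually see the P40 (--gpus all requires the NVIDIA Container Toolkit on the host):

docker exec -it cuda_fastllm_12 nvidia-smi

If this prints the GPU table rather than an error, the GPU passthrough is working.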
Install fastllm Inside the Container (manual)
1. Enter the container:
docker exec -it -u root cuda_fastllm_12 bash
2. Install pip and related dependencies:
apt-get update
apt-get -y --no-install-recommends install wget build-essential python3.10 python3-pip
update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
pip install setuptools streamlit-chat -i https://mirrors.aliyun.com/pypi/simple
3. Install fastllm:
pip install ftllm -U -i https://mirrors.aliyun.com/pypi/simple
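If you would rather not repeat these steps by hand every time, the same setup can be baked into an image. A minimal Dockerfile sketch built from the exact commands above (untested here; adjust the mirror to taste):

FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
# Python 3.10, pip, and build tooling -- same packages as the manual steps
RUN apt-get update && \
    apt-get -y --no-install-recommends install wget build-essential python3.10 python3-pip && \
    rm -rf /var/lib/apt/lists/*
# Make `python` resolve to python3.10
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
# Install fastllm and the webui dependency from the Aliyun PyPI mirror
RUN pip install setuptools streamlit-chat ftllm -U -i https://mirrors.aliyun.com/pypi/simple

Build it with docker build -t fastllm:cu121 . and substitute fastllm:cu121 for the base image in the docker run command above.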
Run fastllm Inside the Container (manual)
ftllm webui /models/Hunyuan-A13B-Instruct --dtype int4 --port 8080 --device cuda --moe_device cpu
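These flags implement the hybrid split: --dtype int4 quantizes the weights to int4, --device cuda keeps the dense layers on the P40, and --moe_device cpu keeps the MoE expert weights in system RAM and runs them on the CPU, which is why VRAM usage stays around 3 GB below. If you want an OpenAI-compatible API instead of a web page, fastllm also provides a server subcommand; a sketch with the same flags (verify the subcommand against your ftllm version):

ftllm server /models/Hunyuan-A13B-Instruct --dtype int4 --port 8080 --device cuda --moe_device cpu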
Access Test
Back end: wait while get_model() runs and the model loads. Typical output:
Load libnuma.so.1
CPU Instruction Info: [AVX512F: OFF] [AVX512_VNNI: OFF] [AVX512_BF16: OFF]
Load libfastllm_tools.so
Loading 100
Warmup... finish.
Front end: once loading finishes, open http://192.168.31.222:38080/ and you can start chatting.
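If you started ftllm server instead of the webui, a quick smoke test from the host looks like this (the /v1/chat/completions path assumes fastllm's OpenAI-compatible route; the model name is whatever your server reports):

curl http://192.168.31.222:38080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Hunyuan-A13B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'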
VRAM Usage
Mon Sep  8 21:01:50 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      Off | 00000000:03:00.0 Off |                  Off |
| N/A   36C    P0              65W / 250W |   2952MiB / 24576MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2877      C   /usr/bin/python3                          2950MiB  |
+---------------------------------------------------------------------------------------+
RAM Usage
45.5 GB
CPU Usage
2800% (roughly all 28 hardware threads of the E5-2680 v4 in use)
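The RAM and CPU figures above were read from host-side monitoring; a convenient way to watch both live for just this container:

docker stats cuda_fastllm_12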
Speed
alive = 1, pending = 0, contextLen = 128, Speed: 4.943357 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.951659 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.941317 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.935721 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.921079 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.928165 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.907796 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.910485 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.904030 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.897908 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.886713 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.897940 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.877572 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.885381 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.879317 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.873287 tokens / s.
Summary
At roughly 4.9 tokens/s the speed is on the slow side, but for imported second-hand "e-waste" like the P40, that is already quite respectable.