Part 49: Tesla P40 + Fastllm + Hunyuan-A13B-Instruct, CPU/GPU Hybrid Deployment and Inference
Environment
OS: CentOS 7
CPU: E5-2680 v4, 14 cores / 28 threads
RAM: DDR4-2133, 32 GB × 2
GPU: Tesla P40, 24 GB
Driver: 535
CUDA: 12.2
Fastllm Highlights
🚀 Simple to install and use: one command to install, one command to run.
🚀 CPU + GPU hybrid inference for large MoE models (a single GPU is enough to run DeepSeek 671B).
🚀 Low-level operators implemented in-house in C++, with no dependency on PyTorch.
🚀 Good compatibility: the pip install supports cards as old as the P100 and MI50, and building from source supports even more devices.
🚀 Multi-GPU tensor-parallel inference, including odd card counts such as 3, 5, or 7.
🚀 GPU + CPU hybrid tensor-parallel inference.
🚀 FP8 computation on both CPU and GPU, so even old hardware can run FP8 models.
🚀 Multi-CPU acceleration that keeps only one copy of the weights in memory.
🚀 Support for ROCm / AMD GPUs; for Iluvatar, MetaX, and Enflame accelerators; and for Huawei Ascend.
🚀 Dynamic batching and streaming output; decoupled front end and back end, portable across platforms, and can be compiled directly on Android.
🚀 Custom model structures can be defined in Python.
Using Fastllm
Fastllm supports old cards like the Tesla P40, and Hunyuan-A13B-Instruct is an MoE model with a small active parameter count, which makes the two a good match for this setup.
Step 1: Download Hunyuan-A13B-Instruct
Model page: https://modelscope.cn/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct
Host directory: /local_models/Hunyuan-A13B-Instruct
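If you don't have the weights yet, one way to fetch them onto the host is the modelscope CLI. A minimal sketch, assuming a current modelscope installation (verify the --model and --local_dir flags against your CLI version):

pip install modelscope
modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct --local_dir /local_models/Hunyuan-A13B-Instruct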
Start the Container (manual)
docker run -itd -p 38080:8080 -v /local_models:/models --gpus all --name cuda_fastllm_12 nvidia/cuda:12.1.0-devel-ubuntu22.04
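Before installing anything, it is worth confirming the container can actually see the P40 (--gpus all requires the NVIDIA Container Toolkit on the host):

docker exec -it cuda_fastllm_12 nvidia-smi

If this prints the GPU table rather than an error, the GPU passthrough is working.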
Install fastllm Inside the Container (manual)
1. Enter the container:
docker exec -it -u root cuda_fastllm_12 bash
2. Install pip and related dependencies:
apt-get update
apt-get -y --no-install-recommends install wget build-essential python3.10 python3-pip
update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
pip install setuptools streamlit-chat -i https://mirrors.aliyun.com/pypi/simple
3. Install fastllm:
pip install ftllm -U -i https://mirrors.aliyun.com/pypi/simple
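If you would rather not repeat these steps by hand every time, the same setup can be baked into an image. A minimal Dockerfile sketch built from the exact commands above (untested here; adjust the mirror to taste):

FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
# Python 3.10, pip, and build tooling -- same packages as the manual steps
RUN apt-get update && \
    apt-get -y --no-install-recommends install wget build-essential python3.10 python3-pip && \
    rm -rf /var/lib/apt/lists/*
# Make `python` resolve to python3.10
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
# Install fastllm and the webui dependency from the Aliyun PyPI mirror
RUN pip install setuptools streamlit-chat ftllm -U -i https://mirrors.aliyun.com/pypi/simple

Build it with docker build -t fastllm:cu121 . and substitute fastllm:cu121 for the base image in the docker run command above.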
Run fastllm Inside the Container (manual)
ftllm webui /models/Hunyuan-A13B-Instruct --dtype int4 --port 8080 --device cuda --moe_device cpu
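These flags implement the hybrid split: --dtype int4 quantizes the weights to int4, --device cuda keeps the dense layers on the P40, and --moe_device cpu keeps the MoE expert weights in system RAM and runs them on the CPU, which is why VRAM usage stays around 3 GB below. If you want an OpenAI-compatible API instead of a web page, fastllm also provides a server subcommand; a sketch with the same flags (verify the subcommand against your ftllm version):

ftllm server /models/Hunyuan-A13B-Instruct --dtype int4 --port 8080 --device cuda --moe_device cpu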
Access Test
Back end: wait while get_model() runs and the model loads. Typical output:
Load libnuma.so.1
CPU Instruction Info: [AVX512F: OFF] [AVX512_VNNI: OFF] [AVX512_BF16: OFF]
Load libfastllm_tools.so
Loading 100
Warmup... finish.
Front end: once loading finishes, open http://192.168.31.222:38080/ and you can start chatting.
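If you started ftllm server instead of the webui, a quick smoke test from the host looks like this (the /v1/chat/completions path assumes fastllm's OpenAI-compatible route; the model name is whatever your server reports):

curl http://192.168.31.222:38080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Hunyuan-A13B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'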
VRAM Usage
Mon Sep  8 21:01:50 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      Off | 00000000:03:00.0 Off |                  Off |
| N/A   36C    P0              65W / 250W |   2952MiB / 24576MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2877      C   /usr/bin/python3                          2950MiB  |
+---------------------------------------------------------------------------------------+
RAM Usage
45.5 GB
CPU Usage
2800% (roughly all 28 hardware threads of the E5-2680 v4 in use)
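The RAM and CPU figures above were read from host-side monitoring; a convenient way to watch both live for just this container:

docker stats cuda_fastllm_12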
Speed
alive = 1, pending = 0, contextLen = 128, Speed: 4.943357 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.951659 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.941317 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.935721 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.921079 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.928165 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.907796 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.910485 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.904030 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.897908 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.886713 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.897940 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.877572 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.885381 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.879317 tokens / s.
alive = 1, pending = 0, contextLen = 128, Speed: 4.873287 tokens / s.
Summary
At roughly 4.9 tokens/s the speed is on the slow side, but for imported second-hand "e-waste" like the P40, that is already quite respectable.