[vLLM v0.10.2] Running the OpenAI-compatible server with Docker and using GGUF quantization
- Running with Docker
  - Start the service
  - Test the service
- GGUF quantization
  - Prepare the llama.cpp environment
  - Quantize the model
  - Test the quantized model
Note: quantization in vLLM only saves GPU memory; it does not speed up inference.
Running with Docker
Official deployment example: https://docs.vllm.ai/en/latest/deployment/docker.html
Official image: https://hub.docker.com/r/vllm/vllm-openai/tags
Start the service
# Pull the image (the tag should match the version you run below)
docker pull vllm/vllm-openai:v0.10.2

# Start command (adjust the host-side model path to your environment)
docker run -idt --restart=always -e TZ="Asia/Shanghai" --gpus device=0 \
    --name qwen2.5-test \
    -v /path/to/Qwen2.5-7B-Instruct:/Qwen2.5-7B-Instruct \
    -p 8192:8000 --ipc=host \
    vllm/vllm-openai:v0.10.2 \
    --model /Qwen2.5-7B-Instruct \
    --gpu-memory-utilization 0.90
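Once the container is running, a quick sanity check (assuming the container name and port mapping above) is to tail the logs and query the models endpoint:

# Watch the startup logs until the server reports it is listening
docker logs -f qwen2.5-test

# List the models the server exposes (should show /Qwen2.5-7B-Instruct)
curl http://localhost:8192/v1/models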
Test the service
Endpoint: http://{ip}:{port}/v1/chat/completions
POST request body:
{"model": "/Qwen2.5-7B-Instruct","messages": [{"role": "user","content": "介绍一下你自己"}],"max_tokens": 200,"stream": false
}
Response:
{"id": "chatcmpl-f8f9d2835b84420db9ec956d4f884320","object": "chat.completion","created": 1758552650,"model": "/Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/","choices": [{"index": 0,"message": {"role": "assistant","content": "您好!我叫Qwen,是阿里云推出的一种超大规模语言模型。我能够回答问题、创作文字,还能表达观点、撰写代码,是您的工作和生活中的得力助手。如果您有任何问题或需要帮助,请随时告诉我,我会尽力提供支持。","refusal": null,"annotations": null,"audio": null,"function_call": null,"tool_calls": [],"reasoning_content": null},"logprobs": null,"finish_reason": "stop","stop_reason": null,"token_ids": null}],"service_tier": null,"system_fingerprint": null,"usage": {"prompt_tokens": 31,"total_tokens": 89,"completion_tokens": 58,"prompt_tokens_details": null},"prompt_logprobs": null,"prompt_token_ids": null,"kv_transfer_params": null
}
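The same request can be sent from the command line with curl (assuming the server runs locally on the mapped port 8192):

curl http://localhost:8192/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "介绍一下你自己"}], "max_tokens": 200, "stream": false}'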
GGUF quantization
Prepare the llama.cpp environment
GitHub repo: https://github.com/ggml-org/llama.cpp
# Clone the repository (pinned to release b6545)
git clone -b b6545 https://github.com/ggml-org/llama.cpp.git

# Build llama.cpp with CUDA support
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF
cmake --build build --config Release -j20

# Test the build
cd build/bin
./llama-quantize --help
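The conversion script used in the next step also needs the repository's Python dependencies; installing them from the llama.cpp root should be enough:

# Back in the repository root, install the Python deps for the convert script
cd ../..
pip install -r requirements.txt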
Quantize the model
# Convert the Hugging Face model to GGUF (FP16)
python3 convert_hf_to_gguf.py Qwen2.5-7B-Instruct --outtype f16 --outfile Qwen2.5-7B-Instruct-FP16.gguf

# Quantize the FP16 GGUF model down to Q4_0
./build/bin/llama-quantize ./Qwen2.5-7B-Instruct-FP16.gguf ./Qwen2.5-7B-Instruct-Q4_0.gguf Q4_0
The quantization types supported by llama-quantize include:
Allowed quantization types:
   2  or  Q4_0      :  4.34G, +0.4685 ppl @ Llama-3-8B
   3  or  Q4_1      :  4.78G, +0.4511 ppl @ Llama-3-8B
  38  or  MXFP4_MOE : MXFP4 MoE
   8  or  Q5_0      :  5.21G, +0.1316 ppl @ Llama-3-8B
   9  or  Q5_1      :  5.65G, +0.1062 ppl @ Llama-3-8B
  19  or  IQ2_XXS   :  2.06 bpw quantization
  20  or  IQ2_XS    :  2.31 bpw quantization
  28  or  IQ2_S     :  2.5  bpw quantization
  29  or  IQ2_M     :  2.7  bpw quantization
  24  or  IQ1_S     :  1.56 bpw quantization
  31  or  IQ1_M     :  1.75 bpw quantization
  36  or  TQ1_0     :  1.69 bpw ternarization
  37  or  TQ2_0     :  2.06 bpw ternarization
  10  or  Q2_K      :  2.96G, +3.5199 ppl @ Llama-3-8B
  21  or  Q2_K_S    :  2.96G, +3.1836 ppl @ Llama-3-8B
  23  or  IQ3_XXS   :  3.06 bpw quantization
  26  or  IQ3_S     :  3.44 bpw quantization
  27  or  IQ3_M     :  3.66 bpw quantization mix
  12  or  Q3_K      : alias for Q3_K_M
  22  or  IQ3_XS    :  3.3 bpw quantization
  11  or  Q3_K_S    :  3.41G, +1.6321 ppl @ Llama-3-8B
  12  or  Q3_K_M    :  3.74G, +0.6569 ppl @ Llama-3-8B
  13  or  Q3_K_L    :  4.03G, +0.5562 ppl @ Llama-3-8B
  25  or  IQ4_NL    :  4.50 bpw non-linear quantization
  30  or  IQ4_XS    :  4.25 bpw non-linear quantization
  15  or  Q4_K      : alias for Q4_K_M
  14  or  Q4_K_S    :  4.37G, +0.2689 ppl @ Llama-3-8B
  15  or  Q4_K_M    :  4.58G, +0.1754 ppl @ Llama-3-8B
  17  or  Q5_K      : alias for Q5_K_M
  16  or  Q5_K_S    :  5.21G, +0.1049 ppl @ Llama-3-8B
  17  or  Q5_K_M    :  5.33G, +0.0569 ppl @ Llama-3-8B
  18  or  Q6_K      :  6.14G, +0.0217 ppl @ Llama-3-8B
   7  or  Q8_0      :  7.96G, +0.0026 ppl @ Llama-3-8B
   1  or  F16       : 14.00G, +0.0020 ppl @ Mistral-7B
  32  or  BF16      : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32       : 26.00G @ 7B
          COPY      : only copy tensors, no quantizing
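Before serving the quantized file with vLLM, it can be sanity-checked directly with llama.cpp's own CLI (a quick smoke test; the prompt and token count here are arbitrary):

# Run a short generation against the quantized model
./build/bin/llama-cli -m ./Qwen2.5-7B-Instruct-Q4_0.gguf -p "介绍一下你自己" -n 100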
Test the quantized model
Start the container:
# Remove the earlier container first if it still holds the name/port: docker rm -f qwen2.5-test
docker run -idt --restart=always -e TZ="Asia/Shanghai" --gpus device=0 \
    --name qwen2.5-test \
    -v /path/to/Qwen2.5-7B-Instruct-Q4_0:/Qwen2.5-7B-Instruct-Q4_0 \
    -p 8192:8000 --ipc=host \
    vllm/vllm-openai:v0.10.2 \
    --model /Qwen2.5-7B-Instruct-Q4_0/Qwen2.5-7B-Instruct-Q4_0.gguf \
    --gpu-memory-utilization 0.90
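One caveat: when serving a GGUF file, vLLM reconstructs the tokenizer from the GGUF metadata, which can be slow; the vLLM docs suggest pointing --tokenizer at the original Hugging Face model instead. A possible addition to the command above (the Hub ID below is the upstream Qwen repo):

# Optional: reuse the original HF tokenizer instead of the one embedded in the GGUF
    --tokenizer Qwen/Qwen2.5-7B-Instruct \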
Endpoint: http://{ip}:{port}/v1/chat/completions
POST request body:
{"model": "/Qwen2.5-7B-Instruct-Q4_0/Qwen2.5-7B-Instruct-Q4_0.gguf","messages": [{"role": "user","content": "介绍一下你自己"}],"max_tokens": 200,"stream": false
}
Response:
{"id": "chatcmpl-60b9025597c346518e6fc73e8fb431db","object": "chat.completion","created": 1758553950,"model": "/Qwen2.5-7B-Instruct-Q4_0/Qwen2.5-7B-Instruct-Q4_0.gguf","choices": [{"index": 0,"message": {"role": "assistant","content": "我是阿里云开发的一种超大规模语言模型,我叫Qwen。作为一个AI助手,我的主要功能是生成各种类型的文本,比如文章、故事、诗歌、故事等,并能够根据与用户的对话内容回答问题、表达观点、提供帮助。作为一款来自中国的大规模语言模型,我希望能够用自然、流畅的方式与人类进行交流,理解并满足用户的需求。同时,我也在不断学习和进步,以更好地服务大家。如果您有任何问题或需要帮助,欢迎随时与我交流!","refusal": null,"annotations": null,"audio": null,"function_call": null,"tool_calls": [],"reasoning_content": null},"logprobs": null,"finish_reason": "stop","stop_reason": 151645,"token_ids": null}],"service_tier": null,"system_fingerprint": null,"usage": {"prompt_tokens": 31,"total_tokens": 139,"completion_tokens": 108,"prompt_tokens_details": null},"prompt_logprobs": null,"prompt_token_ids": null,"kv_transfer_params": null
}
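To confirm the memory savings mentioned at the top, compare GPU usage while the FP16 and Q4_0 containers are running in turn (any standard NVIDIA tooling works; nvidia-smi is the simplest):

# Report current GPU memory usage on the host
nvidia-smi --query-gpu=memory.used,memory.total --format=csv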