vLLM Quantization 02: AWQ
This series uses Qwen2.5-7B to learn model quantization with vLLM, measuring serving performance with benchmark_serving.py and evaluating model accuracy with lm_eval.
Test environment:
OS: CentOS 7
GPU: NVIDIA L40
Driver: 550.54.15
CUDA: 12.3
This post is the second in the series: AWQ quantization.
1. Quantization
Quantize with AutoAWQ:
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "./Qwen2.5-7B"
quant_path = "./Qwen2.5-7B-awq-int4"

# 4-bit weights, group size 128, asymmetric (zero-point) quantization, GEMM kernels
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=False)

# Quantize (AutoAWQ runs activation-aware calibration on its default dataset)
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
```
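Before serving, it is worth a quick smoke test that the quantized checkpoint loads and generates. A minimal sketch using AutoAWQ's `from_quantized` loader; the prompt is arbitrary and a CUDA device is assumed:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "./Qwen2.5-7B-awq-int4"

# Load the 4-bit checkpoint produced above; fuse_layers speeds up inference
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

inputs = tokenizer("What is AWQ quantization?", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```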
2. Deployment
```bash
vllm serve Qwen2.5-7B-awq-int4 --disable-log-requests --quantization awq --dtype="half"
```
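Once the server is up (vLLM listens on port 8000 by default), a single request confirms the endpoint answers. A minimal sketch against the OpenAI-compatible `/v1/completions` route; the host, port, and prompt are assumptions if you changed the defaults:

```python
import requests

# Default vLLM OpenAI-compatible endpoint; adjust host/port if you changed them
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen2.5-7B-awq-int4",
        "prompt": "Briefly explain AWQ quantization.",
        "max_tokens": 64,
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["text"])
```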
3. Benchmark
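The benchmark replays conversations from the ShareGPT dataset. If the json file is not already on disk, it can be fetched first; a sketch assuming the `anon8231489123/ShareGPT_Vicuna_unfiltered` dataset repo on the Hugging Face Hub, where this file is commonly mirrored:

```python
from huggingface_hub import hf_hub_download

# Assumption: the ShareGPT V3 json lives in this community dataset repo
path = hf_hub_download(
    repo_id="anon8231489123/ShareGPT_Vicuna_unfiltered",
    filename="ShareGPT_V3_unfiltered_cleaned_split.json",
    repo_type="dataset",
)
print(path)  # local cache path; pass it to --dataset-path
```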
```bash
python /vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model Qwen2.5-7B-awq-int4 \
    --endpoint /v1/completions \
    --dataset-name sharegpt \
    --dataset-path ./ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 100
```
Results:
```
============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  48.32
Total input tokens:                      23260
Total generated tokens:                  21979
Request throughput (req/s):              2.07
Output token throughput (tok/s):         454.85
Total Token throughput (tok/s):          936.22
---------------Time to First Token----------------
Mean TTFT (ms):                          2132.58
Median TTFT (ms):                        2277.96
P99 TTFT (ms):                           3749.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          95.75
Median TPOT (ms):                        69.59
P99 TPOT (ms):                           309.96
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.81
Median ITL (ms):                         59.83
P99 ITL (ms):                            311.10
```
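The three throughput figures follow directly from the raw counts above, so they are easy to sanity-check (the small drift in the last digit comes from the duration being displayed rounded to 48.32 s):

```python
duration = 48.32          # benchmark duration (s)
in_tok, out_tok = 23260, 21979

print(100 / duration)                 # request throughput        -> ~2.07 req/s
print(out_tok / duration)             # output token throughput   -> ~454.9 tok/s
print((in_tok + out_tok) / duration)  # total token throughput    -> ~936.2 tok/s
```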
4. lm_eval
4.1 gsm8k
```bash
lm_eval --model vllm \
    --model_args pretrained="Qwen2.5-7B-awq-int4",add_bos_token=true,gpu_memory_utilization=0.5,quantization="AWQ",dtype="half" \
    --tasks gsm8k \
    --num_fewshot 5 \
    --limit 250
```
Results:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|------------------|-------:|-------------|:-:|------:|:-:|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.832 | ± | 0.0237 |
| | | strict-match | 5 | exact_match | ↑ | 0.740 | ± | 0.0278 |

The flexible-extract filter takes the last number in the generation, while strict-match requires the canonical `#### <answer>` format, which is why the strict score is lower.
4.2 mmlu
```bash
lm_eval --model vllm \
    --model_args pretrained="./Qwen2.5-7B-awq-int4/",add_bos_token=true,gpu_memory_utilization=0.5,quantization="AWQ",dtype="half" \
    --tasks mmlu \
    --num_fewshot 5 \
    --limit 250 \
    --batch_size 'auto'
```
Results:
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|--------------------|--------:|--------|-------:|--------|:-:|-------:|:-:|-------:|
| mmlu | 2 | none | | acc | ↑ | 0.7517 | ± | 0.0041 |
| - humanities | 2 | none | | acc | ↑ | 0.7576 | ± | 0.0082 |
| - other | 2 | none | | acc | ↑ | 0.7526 | ± | 0.0084 |
| - social sciences | 2 | none | | acc | ↑ | 0.8285 | ± | 0.0077 |
| - stem | 2 | none | | acc | ↑ | 0.6866 | ± | 0.0083 |