
vllm Quantization 02: AWQ

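This series uses Qwen2.5-7B to show how to run quantized models with vllm, measuring serving performance with benchmark_serving.py and evaluating model accuracy with lm_eval.
Test environment:

OS: CentOS 7
GPU: NVIDIA L40
Driver: 550.54.15
CUDA: 12.3

This is the second article in the series: AWQ quantization.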

1. Quantization

Quantize the model with AutoAWQ:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "./Qwen2.5-7B"
quant_path = "./Qwen2.5-7B-awq-int4"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=False)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')
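To make the quant_config values more concrete, here is a minimal pure-Python sketch of what "w_bit": 4, "q_group_size": 128 and "zero_point": True mean for the stored weights. It is an illustration only: it performs plain asymmetric group quantization and does not include the activation-aware scale search that AWQ performs. The function names are hypothetical and are not part of AutoAWQ, and the storage estimate assumes one fp16 scale and one 4-bit zero point per group.

```python
import math

def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one group to unsigned w_bit integers."""
    q_max = (1 << w_bit) - 1                  # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0.0:                          # constant group: avoid division by zero
        scale = 1.0
    zero = max(0, min(q_max, round(-w_min / scale)))
    q = [max(0, min(q_max, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    """Map the stored integers back to approximate float weights."""
    return [(v - zero) * scale for v in q]

def quantize(weights, q_group_size=128, w_bit=4):
    """Split a flat weight list into groups; each group gets its own scale and zero point."""
    return [
        quantize_group(weights[start:start + q_group_size], w_bit)
        for start in range(0, len(weights), q_group_size)
    ]

def effective_bits(w_bit=4, q_group_size=128, scale_bits=16):
    """Bits per weight, ASSUMING one fp16 scale and one w_bit zero point per group."""
    return w_bit + (scale_bits + w_bit) / q_group_size

# Toy weights that straddle zero, as real weight groups typically do.
weights = [math.sin(i * 0.37) * 0.05 for i in range(256)]
for q, scale, zero in quantize(weights):
    restored = dequantize_group(q, scale, zero)
    print(f"scale={scale:.6f} zero={zero} levels used={len(set(q))}")
print(f"effective bits per weight: {effective_bits()}")
```

A smaller group size gives each scale fewer weights to cover, which lowers the rounding error but raises the per-weight overhead of the scales and zero points.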

2. Deployment

vllm serve Qwen2.5-7B-awq-int4 --disable-log-requests --quantization awq --dtype="half"
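Once the server is up, a quick request confirms that the quantized model responds. The sketch below uses only the standard library and targets the same /v1/completions endpoint that the benchmark uses. It assumes the server is listening on vllm's default address, http://localhost:8000, and that the model name matches the argument given to vllm serve; the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"   # assumption: vllm's default host and port
MODEL = "Qwen2.5-7B-awq-int4"        # the name passed to `vllm serve`

def build_completion_payload(prompt, max_tokens=64, temperature=0.0):
    """Build the JSON body for an OpenAI-style completions request."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def complete(prompt, base_url=BASE_URL, timeout=60):
    """POST the prompt to /v1/completions and return the generated text."""
    body = json.dumps(build_completion_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    return data["choices"][0]["text"]

# Usage (requires the server from this section to be running):
#   print(complete("The capital of France is"))
```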

3. Benchmark

python /vllm/benchmarks/benchmark_serving.py --backend vllm --model Qwen2.5-7B-awq-int4 --endpoint /v1/completions --dataset-name sharegpt --dataset-path ./ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100

Results:

============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  48.32
Total input tokens:                      23260
Total generated tokens:                  21979
Request throughput (req/s):              2.07
Output token throughput (tok/s):         454.85
Total Token throughput (tok/s):          936.22
---------------Time to First Token----------------
Mean TTFT (ms):                          2132.58
Median TTFT (ms):                        2277.96
P99 TTFT (ms):                           3749.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          95.75
Median TPOT (ms):                        69.59
P99 TPOT (ms):                           309.96
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.81
Median ITL (ms):                         59.83
P99 ITL (ms):                            311.10
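The three throughput figures follow directly from the request count, the token totals and the benchmark duration. The short check below recomputes them from the numbers printed above. Because the printed duration is rounded to two decimals, the recomputed values match the reported ones only to within a few hundredths.

```python
# Values copied from the benchmark output above.
duration_s = 48.32
num_requests = 100
input_tokens = 23260
generated_tokens = 21979

# Throughput = amount of work / wall-clock duration.
request_throughput = num_requests / duration_s
output_token_throughput = generated_tokens / duration_s
total_token_throughput = (input_tokens + generated_tokens) / duration_s

print(f"Request throughput (req/s):      {request_throughput:.2f}")
print(f"Output token throughput (tok/s): {output_token_throughput:.2f}")
print(f"Total token throughput (tok/s):  {total_token_throughput:.2f}")
print(f"Mean generated tokens/request:   {generated_tokens / num_requests:.2f}")
```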

4. lm_eval

4.1 gsm8k

lm_eval --model vllm \
  --model_args pretrained="Qwen2.5-7B-awq-int4",add_bos_token=true,gpu_memory_utilization=0.5,quantization="AWQ",dtype="half" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --limit 250

Results:

Tasks  Version  Filter            n-shot  Metric       Value  Stderr
gsm8k  3        flexible-extract  5       exact_match  0.832  ±0.0237
                strict-match      5       exact_match  0.740  ±0.0278
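With --limit 250, each score is an average over only 250 problems, so the Stderr column matters when comparing against other quantization schemes. The sketch below assumes that lm_eval reports the sample standard error of the mean over the 250 per-example 0/1 scores; the reported values are consistent with that assumption. The function names are hypothetical.

```python
import math

def binomial_stderr(accuracy, n):
    """Standard error of the mean of n 0/1 scores, using the sample variance (n - 1)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))

def confidence_interval(accuracy, stderr, z=1.96):
    """Approximate 95% interval under a normal approximation."""
    return accuracy - z * stderr, accuracy + z * stderr

n = 250  # from --limit 250
for name, acc in [("flexible-extract", 0.832), ("strict-match", 0.740)]:
    se = binomial_stderr(acc, n)
    low, high = confidence_interval(acc, se)
    print(f"{name}: stderr={se:.4f}, 95% CI=({low:.3f}, {high:.3f})")
```

The intervals are roughly ±0.05 wide, so differences of a point or two between runs at this sample size should not be read as meaningful.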

4.2 mmlu

lm_eval --model vllm \
  --model_args pretrained="./Qwen2.5-7B-awq-int4/",add_bos_token=true,gpu_memory_utilization=0.5,quantization="AWQ",dtype="half" \
  --tasks mmlu \
  --num_fewshot 5 \
  --limit 250 \
  --batch_size 'auto'

Results:

Groups             Version  Filter  n-shot  Metric  Value   Stderr
mmlu               2        none            acc     0.7517  ±0.0041
- humanities       2        none            acc     0.7576  ±0.0082
- other            2        none            acc     0.7526  ±0.0084
- social sciences  2        none            acc     0.8285  ±0.0077
- stem             2        none            acc     0.6866  ±0.0083
