
AutoAWQ: an easy-to-use tool for 4-bit quantized models

news 2025/9/10 17:54:27

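Contents

- I. About AutoAWQ
- II. Installation
  - 1. Prerequisites
  - 2. Install from PyPI
- III. Usage
  - 1. INT4 GEMM vs INT4 GEMV vs FP16
    - Compute-bound vs memory-bound
  - 2. Fused modules
  - 3. Examples
    - Quantization
    - Inference
- IV. Benchmarks
  - 2. Multi-GPU
  - 3. CPU
- Reference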

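I. About AutoAWQ

AutoAWQ is an easy-to-use package for 4-bit quantized models.

Compared with FP16, AutoAWQ can speed up models by 3x and reduce memory requirements by 3x.

AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs.

AutoAWQ was created from, and improves upon, the original work from MIT.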

github : https://github.com/casper-hansen/AutoAWQ
Roadmap : https://github.com/casper-hansen/AutoAWQ/issues/32
Examples : https://github.com/casper-hansen/AutoAWQ/tree/main/examples
AWQ-quantized models on Hugging Face: https://huggingface.co/models?search=awq

Latest news 🔥

[2024/06] CPU inference support (x86), thanks to Intel. Cohere and Phi3 support.
[2024/04] StableLM and StarCoder2 support.
[2024/03] Gemma support.
[2024/02] PEFT-compatible training in FP16.
[2024/02] AMD ROCm support through ExLlamaV2 kernels.
[2024/01] Export to GGUF, ExLlamaV2 kernels, 60% faster context processing.
[2023/12] Mixtral, LLaVa, QWen, Baichuan model support.
[2023/11] AutoAWQ inference has been integrated into 🤗 transformers. Now includes CUDA 12.1 wheels.
[2023/10] Mistral (Fused Modules), Bigcode, Turing support, memory bug fix (saves 2GB VRAM).
[2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
[2023/09] Multi-GPU support, bug fixes, and better benchmark scripts.
[2023/08] PyPI package released and AutoModel class available.

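II. Installation

1. Prerequisites

NVIDIA:
- Your NVIDIA GPU must have Compute Capability 7.5 or higher. Turing and later architectures are supported.
- Your CUDA version must be 11.8 or later.
AMD:
- Your ROCm version must be compatible with Triton.
Intel CPU and Intel GPU:
- For best performance, your torch and intel_extension_for_pytorch versions should both be at least 2.4.
- Alternatively, you can use Triton kernels for GPU support, which requires installing intel-xpu-backend-for-triton together with compatible versions of torch and transformers. The easiest way is to use a prebuilt package; see the AutoAWQ repository for details.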
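
The NVIDIA requirement can be checked before installing anything. The following is a minimal sketch, assuming PyTorch is already present for the actual device query. `meets_compute_capability` and `check_local_gpu` are hypothetical helper names, not part of AutoAWQ; `torch.cuda.get_device_capability()` is the PyTorch call that returns a `(major, minor)` tuple.

```python
MIN_CAPABILITY = (7, 5)  # Turing, per the prerequisite above


def meets_compute_capability(capability, minimum=MIN_CAPABILITY):
    """Return True if a (major, minor) capability tuple is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)


def check_local_gpu():
    """Check the first CUDA device through PyTorch.

    Returns None when PyTorch or a CUDA device is not available.
    """
    try:
        import torch  # optional dependency, only needed for the live query
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return meets_compute_capability(torch.cuda.get_device_capability(0))


print(meets_compute_capability((8, 9)))  # RTX 4090 reports (8, 9) -> True
print(meets_compute_capability((7, 0)))  # Volta reports (7, 0) -> False
print(check_local_gpu())                 # depends on the local machine
```

Tuple comparison keeps the check correct for future architectures without listing them by name.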

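2. Install from PyPI

There are a few ways to install AutoAWQ:

Default:

pip install autoawq
Note: the default installation includes no external kernels and relies on Triton for inference.

From a release with kernels:

pip install autoawq[kernels]
Note: this requires your torch version to match the latest torch version the kernels were built with.
Note: this installs https://github.com/casper-hansen/AutoAWQ_kernels

From the main branch, for optimized performance on Intel CPU and Intel XPU:

pip install autoawq[cpu]
Note: torch 2.4.0 or later is required.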
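
The torch version constraint can be verified from Python with the standard library. This is a minimal sketch, not something AutoAWQ ships: `parse_major_minor` and `installed_at_least` are hypothetical helpers, and the parsing is deliberately simplistic (it assumes a `major.minor` prefix with numeric components).

```python
from importlib import metadata


def parse_major_minor(version):
    """Turn a version string such as '2.4.0+cu121' into (2, 4)."""
    release = version.split("+")[0]          # drop a local tag like '+cu121'
    major, minor = release.split(".")[:2]
    return int(major), int(minor)


def installed_at_least(package, minimum):
    """Return True/False, or None if `package` is not installed."""
    try:
        found = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return parse_major_minor(found) >= tuple(minimum)


# Result depends on the local environment: True, False, or None.
print(installed_at_least("torch", (2, 4)))
```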

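III. Usage

The Examples section shows how to quantize models, run inference, and benchmark AutoAWQ models.

1. INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of AWQ: GEMM and GEMV. Both names refer to how matrix multiplication runs under the hood. We suggest the following:

GEMV (quantized): 20% faster than GEMM, but only for batch size 1 (not good for large contexts).
GEMM (quantized): much faster than FP16 at batch sizes below 8 (works well with large contexts).
FP16 (non-quantized): recommended for highest throughput: vLLM.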
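
The three recommendations above can be written down as a small decision helper. This is only a sketch of the rule of thumb: `recommend_backend` and its `large_context` flag are hypothetical names introduced here, and the thresholds are taken directly from the list above. Real workloads should be benchmarked.

```python
def recommend_backend(batch_size, large_context=False):
    """Map a workload to the variant suggested by the rule of thumb above."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if batch_size == 1 and not large_context:
        return "GEMV"   # fastest at batch size 1 with short contexts
    if batch_size < 8:
        return "GEMM"   # handles large contexts and small batches
    return "FP16"       # highest throughput at large batch sizes


for bs in (1, 4, 16):
    print(bs, recommend_backend(bs))
```

The returned string matches the values accepted by the `"version"` key of the quantization config for the two quantized variants; `"FP16"` here simply means "do not quantize".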

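Compute-bound vs memory-bound

At small batch sizes with a small 7B model, we are memory-bound.

This means we are limited by the bandwidth at which the GPU can move the weights through memory, which in effect limits how many tokens can be generated per second.

Being memory-bound is what makes quantized models faster: the weights are 3x smaller, so they can be moved through memory faster.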
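
A back-of-the-envelope calculation makes the memory-bound argument concrete. The sketch below assumes that every generated token requires reading all weights once, ignores the KV cache and activations, and uses an assumed round bandwidth of 1 TB/s; none of these numbers come from the benchmarks in this article.

```python
def weight_bytes(n_params, bits_per_weight):
    """Size of the weights in bytes for a given storage precision."""
    return n_params * bits_per_weight / 8


def max_decode_tokens_per_s(n_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on decode speed when each token reads all weights once."""
    return bandwidth_bytes_per_s / weight_bytes(n_params, bits_per_weight)


N_PARAMS = 7e9       # a 7B model
BANDWIDTH = 1000e9   # assumed memory bandwidth: 1 TB/s

fp16 = max_decode_tokens_per_s(N_PARAMS, 16, BANDWIDTH)
int4 = max_decode_tokens_per_s(N_PARAMS, 4, BANDWIDTH)
print(f"FP16 bound: {fp16:.0f} tokens/s")   # 71
print(f"INT4 bound: {int4:.0f} tokens/s")   # 286
print(f"ratio: {int4 / fp16:.1f}x")         # 4.0x
```

For the weights alone the ideal ratio is 4x. The roughly 3x figure quoted in the text is lower, presumably because quantization metadata (scales and zero points) and any layers kept in FP16 also have to be read.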

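This differs from being compute-bound, where most of the time during generation is spent on matrix multiplication.
In the compute-bound scenario, which occurs at larger batch sizes, a W4A16 quantized model gives no speedup, because the dequantization overhead slows down overall generation.

This happens because AWQ quantized models store only the weights in INT4 but perform FP16 operations during inference, so the weights are in effect converted from INT4 to FP16 during inference.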
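
The INT4 storage and floating-point compute split can be illustrated with a generic zero-point quantizer for a single weight group. This is a pure-Python sketch of plain asymmetric quantization, not AutoAWQ's kernels, and it leaves out the activation-aware scaling that defines AWQ. `quantize_group` and `dequantize_group` are hypothetical names.

```python
def quantize_group(weights, n_bits=4):
    """Asymmetric (zero-point) quantization of one group of float weights."""
    q_max = 2 ** n_bits - 1                      # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0:                               # constant group
        scale = 1.0
    # The zero point is stored as an integer in the same 4-bit range.
    zero = min(q_max, max(0, round(-w_min / scale)))
    q = [min(q_max, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Convert stored integers back to floats before the matmul."""
    return [(v - zero) * scale for v in q]


weights = [-0.8, -0.1, 0.0, 0.3, 0.7]
q, scale, zero = quantize_group(weights)
restored = dequantize_group(q, scale, zero)
print(q)
print([round(w, 3) for w in restored])
```

In a real W4A16 kernel the `dequantize_group` step runs on the fly for every matrix multiplication, which is the overhead that dominates once the workload becomes compute-bound. The `"q_group_size": 128` and `"zero_point": True` settings in the quantization example correspond to the group length and the use of a zero point here.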

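2. Fused modules

Fused modules are a large part of the speedup you get from AutoAWQ.

The idea is to combine multiple layers into a single operation, which makes them more efficient.

Fused modules are a set of custom modules that work separately from Huggingface models.

They are compatible with model.generate() and other Huggingface methods, but activating fused modules comes with some restrictions on how you can use the model:

- Fused modules are activated when you use fuse_layers=True.
- A custom cache is implemented. It preallocates based on batch size and sequence length.
  - You cannot change the sequence length after you have created the model.
  - Reference: AutoAWQForCausalLM.from_quantized(max_seq_len=seq_len, batch_size=batch_size)
- The main accelerator in the fused modules comes from FasterTransformer, which is only compatible with Linux.
- The past_key_values returned by model.generate() are only dummy values, so they cannot be used after generation.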
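
The fixed sequence length follows from how a preallocated cache works. The toy class below is a sketch of the general idea only, not AutoAWQ's cache: `PreallocatedCache` is a hypothetical name, and it stores arbitrary Python objects where a real cache would hold key and value tensors.

```python
class PreallocatedCache:
    """Toy fixed-size cache with one slot per (batch row, position)."""

    def __init__(self, batch_size, max_seq_len):
        self.batch_size = batch_size
        self.max_seq_len = max_seq_len
        # All storage is allocated once, up front.
        self.slots = [[None] * max_seq_len for _ in range(batch_size)]
        self.length = 0  # number of positions filled so far

    def append(self, entries):
        """Store one new position for every row in the batch."""
        if len(entries) != self.batch_size:
            raise ValueError("expected one entry per batch row")
        if self.length >= self.max_seq_len:
            raise IndexError("cache is full; max_seq_len is fixed at creation")
        for row, entry in zip(self.slots, entries):
            row[self.length] = entry
        self.length += 1


cache = PreallocatedCache(batch_size=2, max_seq_len=3)
for step in range(3):
    cache.append([f"row0-step{step}", f"row1-step{step}"])
print(cache.length)  # 3
```

Because the buffer never grows, writes are cheap and memory use is predictable, at the cost of having to choose the batch size and sequence length when the model is created.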

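3. Examples

More examples can be found in the examples directory.

Quantization

Expect quantization to take 10-15 minutes for smaller 7B models and around 1 hour for 70B models.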

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
quant_path = 'mistral-instruct-v0.2-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')

Inference

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer
from awq.utils.utils import get_best_device

device = get_best_device()

quant_path = "TheBloke/zephyr-7B-beta-AWQ"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>"""

prompt = "You're standing on the surface of the Earth. "\
        "You walk one mile south, one mile west and one mile north. "\
        "You end up exactly where you started. Where are you?"

tokens = tokenizer(
    prompt_template.format(prompt=prompt), 
    return_tensors='pt'
).input_ids.to(device)

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_seq_len=512
)

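IV. Benchmarks

These benchmarks show the speed and memory usage of processing context (prefill) and generating tokens (decoding). The results cover different batch sizes and different versions of the AWQ kernels.

We aim to test models fairly, using the same benchmarking tool that you can use to reproduce the results.

Note that speed may vary not only between GPUs but also between CPUs. What matters most is a GPU with high memory bandwidth and a CPU with a high single-core clock speed.

Tested with AutoAWQ version 0.1.6
GPU: RTX 4090 (AMD Ryzen 9 7950X)
Command: python examples/benchmark.py --model_path <hf_model> --batch_size 1
🟢 for GEMV, 🔵 for GEMM, 🔴 for configurations to avoid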

Model Name	Size	Version	Batch Size	Prefill Length	Decode Length	Prefill tokens/s	Decode tokens/s	Memory (VRAM)
Vicuna	7B	🟢GEMV	1	64	64	639.65	198.848	4.50 GB (19.05%)
Vicuna	7B	🟢GEMV	1	2048	2048	1123.63	133.191	6.15 GB (26.02%)
…	…	…	…	…	…	…	…	…
Mistral	7B	🔵GEMM	1	64	64	1093.35	156.317	4.35 GB (18.41%)
Mistral	7B	🔵GEMM	1	2048	2048	3897.02	114.355	5.55 GB (23.48%)
Mistral	7B	🔵GEMM	8	64	64	4199.18	1185.25	4.35 GB (18.41%)
Mistral	7B	🔵GEMM	8	2048	2048	3661.46	829.754	16.82 GB (71.12%)
…	…	…	…	…	…	…	…	…
Mistral	7B	🟢GEMV	1	64	64	531.99	188.29	4.28 GB (18.08%)
Mistral	7B	🟢GEMV	1	2048	2048	903.83	130.66	5.55 GB (23.48%)
Mistral	7B	🔴GEMV	8	64	64	897.87	486.46	4.33 GB (18.31%)
Mistral	7B	🔴GEMV	8	2048	2048	884.22	411.893	16.82 GB (71.12%)
…	…	…	…	…	…	…	…	…
TinyLlama	1B	🟢GEMV	1	64	64	1088.63	548.993	0.86 GB (3.62%)
TinyLlama	1B	🟢GEMV	1	2048	2048	5178.98	431.468	2.10 GB (8.89%)
…	…	…	…	…	…	…	…	…
Llama 2	13B	🔵GEMM	1	64	64	820.34	96.74	8.47 GB (35.83%)
Llama 2	13B	🔵GEMM	1	2048	2048	2279.41	73.8213	10.28 GB (43.46%)
Llama 2	13B	🔵GEMM	3	64	64	1593.88	286.249	8.57 GB (36.24%)
Llama 2	13B	🔵GEMM	3	2048	2048	2226.7	189.573	16.90 GB (71.47%)
…	…	…	…	…	…	…	…	…
MPT	7B	🔵GEMM	1	64	64	1079.06	161.344	3.67 GB (15.51%)
MPT	7B	🔵GEMM	1	2048	2048	4069.78	114.982	5.87 GB (24.82%)
…	…	…	…	…	…	…	…	…
Falcon	7B	🔵GEMM	1	64	64	1139.93	133.585	4.47 GB (18.92%)
Falcon	7B	🔵GEMM	1	2048	2048	2850.97	115.73	6.83 GB (28.88%)
…	…	…	…	…	…	…	…	…
CodeLlama	34B	🔵GEMM	1	64	64	681.74	41.01	19.05 GB (80.57%)
CodeLlama	34B	🔵GEMM	1	2048	2048	1072.36	35.8316	20.26 GB (85.68%)
…	…	…	…	…	…	…	…	…
DeepSeek	33B	🔵GEMM	1	64	64	1160.18	40.29	18.92 GB (80.00%)
DeepSeek	33B	🔵GEMM	1	2048	2048	1012.1	34.0093	19.87 GB (84.02%)
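
The GEMV and GEMM recommendations from the usage section can be checked against the Mistral 7B rows of this table. The sketch below copies the decode speeds for prefill/decode length 64 and computes the ratios; `DECODE_TPS` and `speedup` are names introduced here for illustration.

```python
# Decode tokens/s for Mistral 7B at prefill/decode length 64, from the table.
DECODE_TPS = {
    ("GEMM", 1): 156.317,
    ("GEMM", 8): 1185.25,
    ("GEMV", 1): 188.29,
    ("GEMV", 8): 486.46,
}


def speedup(version_a, version_b, batch_size):
    """How many times faster version_a decodes than version_b."""
    return DECODE_TPS[(version_a, batch_size)] / DECODE_TPS[(version_b, batch_size)]


print(f"batch 1, GEMV over GEMM: {speedup('GEMV', 'GEMM', 1):.2f}x")
print(f"batch 8, GEMM over GEMV: {speedup('GEMM', 'GEMV', 8):.2f}x")
```

At batch size 1, GEMV comes out roughly 1.2x faster, in line with the 20% figure quoted earlier. At batch size 8, GEMM is roughly 2.4x faster, which is why the GEMV rows at that batch size are marked 🔴.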

2. Multi-GPU

GPU: 2x NVIDIA GeForce RTX 4090

Model	Size	Version	Batch Size	Prefill Length	Decode Length	Prefill tokens/s	Decode tokens/s	Memory (VRAM)
Mixtral	46.7B	🔵GEMM	1	32	32	149.742	93.406	25.28 GB (53.44%)
Mixtral	46.7B	🔵GEMM	1	64	64	1489.64	93.184	25.32 GB (53.53%)
Mixtral	46.7B	🔵GEMM	1	128	128	2082.95	92.9444	25.33 GB (53.55%)
Mixtral	46.7B	🔵GEMM	1	256	256	2428.59	91.5187	25.35 GB (53.59%)
Mixtral	46.7B	🔵GEMM	1	512	512	2633.11	89.1457	25.39 GB (53.67%)
Mixtral	46.7B	🔵GEMM	1	1024	1024	2598.95	84.6753	25.75 GB (54.44%)
Mixtral	46.7B	🔵GEMM	1	2048	2048	2446.15	77.0516	27.98 GB (59.15%)
Mixtral	46.7B	🔵GEMM	1	4096	4096	1985.78	77.5689	34.65 GB (73.26%)

3. CPU

CPU: 48 cores SPR (Intel 4th Gen Xeon CPU)
Command:

python examples/benchmark.py --model_path <hf_model> --batch_size 1 --generator hf

Model	Version	Batch Size	Prefill Length	Decode Length	Prefill tokens/s	Decode tokens/s	Memory
TinyLlama 1B	gemm	1	32	32	817.86	70.93	1.94 GB (0.00%)
TinyLlama 1B	gemm	1	2048	2048	5279.15	36.83	2.31 GB (0.00%)
Falcon 7B	gemm	1	32	32	337.51	26.41	9.57 GB (0.01%)
Falcon 7B	gemm	1	2048	2048	546.71	18.8	13.46 GB (0.01%)
Mistral 7B	gemm	1	32	32	343.08	28.46	9.74 GB (0.01%)
Mistral 7B	gemm	1	2048	2048	1135.23	13.23	10.35 GB (0.01%)
Vicuna 7B	gemm	1	32	32	340.73	28.86	9.59 GB (0.01%)
Vicuna 7B	gemm	1	2048	2048	1143.19	11.14	10.98 GB (0.01%)
Llama 2 13B	gemm	1	32	32	220.79	18.14	17.46 GB (0.02%)
Llama 2 13B	gemm	1	2048	2048	650.94	6.54	19.84 GB (0.02%)
DeepSeek Coder 33B	gemm	1	32	32	101.61	8.58	40.80 GB (0.04%)
DeepSeek Coder 33B	gemm	1	2048	2048	245.02	3.48	41.72 GB (0.04%)
Phind CodeLlama 34B	gemm	1	32	32	102.47	9.04	41.70 GB (0.04%)
Phind CodeLlama 34B	gemm	1	2048	2048	237.57	3.48	42.47 GB (0.04%)

Reference

If you find AWQ useful or relevant to your research, you can cite their paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}

2025-03-08 (Sat)