
YOLOv8 PTQ and QAT Quantization, DepGraph Pruning, and Accelerated Inference: A Working Implementation (with Code)

news 2025/7/18 11:40:31

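This project compresses and accelerates YOLO models and runs the accelerated inference with TensorRT. yolov8-pose is used as the example here; other models follow the same workflow, with some task-specific differences.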

Project Code Layout

Project code on GitHub: https://github.com/cquxl/EdgeLite

The repository is organized as follows:

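EdgeLite/
├── compression/
│   ├── quant/
│   │   ├── ptq/
│   │   │   ├── ptq_quant.py      # main class for building and exporting
│   │   │   ├── utils.py          # custom training data
│   │   ├── qat/
│   │   │   ├── qat_quant.py      # main class for quantization and export
│   │   │   ├── utils.py          # custom calibrator and data loader
│   ├── prune/
│   │   ├── prune.py              # engine model evaluation utilities
│   │   ├── utils.py              # engine model quantization utilities
├── datasets/                     # datasets and calibration images
├── weights/                      # YOLOv8 pretrained weights
├── output/                       # logs and outputs
├── main_prune.py                 # pruning entry point
├── main_quant.py                 # quantization entry point
└── README.md                     # project documentation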

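Model Overview

The YOLO-Pose Model

Model Tasks

Object detection -> bounding box (x, y, w, h, box confidence) + class confidence; the only class is 0: person -> [B, N, 5+C]
Pose estimation -> human keypoints (x, y, visible), where visible indicates whether the keypoint is visible; 17 keypoints in total

Pose estimation is the task of locating specific points in an image, usually called keypoints. Keypoints can represent parts of an object such as joints, landmarks, or other distinctive features. Their locations are usually expressed as a set of 2D [x, y] or 3D [x, y, visible] coordinates.

The output of a pose estimation model is a set of points representing the keypoints of the objects in the image, usually together with a confidence score for each point.

The yolov8 pose model uses 17 keypoints, each representing a different part of the human body.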

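1. Nose
2. Left eye
3. Right eye
4. Left ear
5. Right ear
6. Left shoulder
7. Right shoulder
8. Left elbow
9. Right elbow
10. Left wrist
11. Right wrist
12. Left hip
13. Right hip
14. Left knee
15. Right knee
16. Left ankle
17. Right ankle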
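
The left/right pairing in this list is what the flip_idx entry of the dataset YAML encodes: when an image is mirrored horizontally, each left keypoint has to swap places with its right counterpart. The sketch below derives that index list from the keypoint names. The snake_case names and the helper build_flip_idx are illustrative choices, not identifiers from the project, and the indices are 0-based while the list above is 1-based.

```python
# Keypoint order as listed above (index 0..16). The snake_case names are
# illustrative labels, not identifiers taken from the project code.
KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def build_flip_idx(names):
    """For each keypoint, return the index of its mirrored counterpart."""
    index_of = {name: i for i, name in enumerate(names)}
    flip_idx = []
    for name in names:
        if name.startswith("left_"):
            mirrored = "right_" + name[len("left_"):]
        elif name.startswith("right_"):
            mirrored = "left_" + name[len("right_"):]
        else:
            mirrored = name  # keypoints on the body axis (the nose) map to themselves
        flip_idx.append(index_of[mirrored])
    return flip_idx


flip_idx = build_flip_idx(KEYPOINT_NAMES)
print(flip_idx)
```

The printed list can be compared directly with the flip_idx line in coco-pose.yaml.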

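Background Concepts

NMS (Non-Maximum Suppression)

In object detection, a model usually outputs many bounding boxes, and several of them may correspond to the same object. NMS compares the confidence scores of these boxes, keeps the one with the highest confidence, and removes the other boxes that overlap it heavily.

In one sentence: pick the box with the highest confidence, delete every box whose IoU with it exceeds a threshold, and repeat on the remaining boxes.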
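
The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the NMS used inside Ultralytics or TensorRT: it assumes boxes in corner format (x1, y1, x2, y2) rather than (x, y, w, h), handles a single class, and the default threshold of 0.5 is an arbitrary choice.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive NMS, best score first."""
    # Candidate indices sorted by descending confidence
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining confidence
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

Here the first two boxes cover almost the same region, so only the higher-scoring one is kept, while the third box does not overlap either and survives.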
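
DLA (Deep Learning Accelerator)

DLA is a dedicated neural-network inference accelerator that NVIDIA designed for edge devices and SoCs such as the Jetson series.

It is available on some Jetson platforms (for example Jetson Xavier NX / AGX Xavier) and draws less power than the GPU, which makes it a good fit for power-sensitive deployments.

It is dedicated to deep learning inference and supports only a subset of TensorRT operations.

When running INT8 quantization, DLA does not support a global scale and instead requires:
- a scale for each activation tensor (per-activation tensor scale)
- so a calibrator that supports per-activation scaling is needed, such as IInt8EntropyCalibrator2
- DLA scales every activation tensor independently, which the older calibrator (IInt8EntropyCalibrator) does not support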

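Compression Methods Overview

This article focuses on applying compression methods to accelerate YOLO, so it does not explain the principles behind each method in detail. Detailed introductions can be found at https://github.com/liguodongiot/llm-action

Pruning: unstructured, structured, and semi-structured pruning

Unstructured: removes redundant individual weights
Structured: removes channels / neurons / layers
Semi-structured: n:m pruning, where every m consecutive weights contain n non-zero weights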
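
To make the n:m pattern concrete, here is a minimal sketch of 2:4 pruning on a flat list of weights. It assumes a magnitude criterion (keep the n largest absolute values in each group of m), which is a common choice but not something this article specifies, and the function name prune_n_m is hypothetical.

```python
def prune_n_m(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights and zero the rest (magnitude criterion is an assumption)."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions of the n largest |w| inside this group
        by_magnitude = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(by_magnitude[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.1, -0.9, 0.3, 0.05, 0.7, 0.2, -0.6, 0.0]
pruned = prune_n_m(weights, n=2, m=4)
print(pruned)
```

Every group of four consecutive weights ends up with at most two non-zero entries, which is the regular pattern that makes semi-structured sparsity easier to accelerate than unstructured sparsity.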

Quantization: mainly PTQ (post-training quantization) and QAT (quantization-aware training, i.e. quantization applied during training)
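
Both approaches rely on the same basic operation: mapping floating-point values to INT8 with a scale and mapping them back. PTQ derives the scale from calibration data after training, while QAT inserts this quantize/dequantize round trip into the forward pass during training. The sketch below shows symmetric per-tensor quantization with the scale taken from the maximum absolute value. That max-based rule is a simplifying assumption; TensorRT's entropy calibrators choose the scale differently, and all function names here are illustrative.

```python
def compute_scale(values, num_bits=8):
    """Symmetric per-tensor scale from the largest absolute value (max calibration)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer step and clamp to the symmetric INT8 range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integer codes back to floating point."""
    return [q * scale for q in q_values]


activations = [-1.27, -0.5, 0.0, 0.31, 1.0]
scale = compute_scale(activations)
q = quantize(activations, scale)
restored = dequantize(q, scale)
print(scale, q, restored)
```

The difference between activations and restored is the quantization error, bounded by half a step (scale / 2) for values inside the clamped range.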

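Fine-tuning: the main approaches are full-parameter fine-tuning, distillation-based fine-tuning of the compressed model (for example after pruning or quantization), and LoRA fine-tuning (mainly used for large language models)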

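Data and Model Preparation

Data: COCO-Pose

This article uses COCO-Pose for calibrating and validating the pruned and quantized models, and for training. The dataset is described at https://docs.ultralytics.com/zh/datasets/pose/coco/

Download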

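The coco-pose.yaml file in ultralytics lists the dataset download URLs in detail, as shown in the figure below.
[Figure: download URLs listed in coco-pose.yaml]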

ultralytics
- GitHub: https://github.com/ultralytics/ultralytics
- Path of coco-pose.yaml: ultralytics/cfg/datasets/coco-pose.yaml

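The images in coco-pose are a subset of the COCO dataset, so downloading from the official URLs always fetches the full COCO image set, and the images then need to be filtered. (Filtering is optional, because images are read according to train2017.txt. This article nevertheless uses the fully filtered set, so that every image in images/train2017 and images/val2017 belongs to coco-pose.)

Following the download URLs given in coco-pose.yaml: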

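Step 1: download coco2017labels-pose.zip and extract it: https://github.com/ultralytics/assets/releases/download/v0.0.0/coco2017labels-pose.zip

After extraction, make sure the root directory is named coco-pose and contains the following folders and files:

[Figure: folders and files under the coco-pose root directory]

At this point images already contains the train2017 and val2017 folders, but they are still empty, so the images have to be downloaded as well.

Step 2: download train2017.zip and val2017.zip and extract them. (The images directory of coco-pose has no test2017; test2017 belongs to the full COCO dataset, so only train2017.zip and val2017.zip are needed here.)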

train2017.zip download URL: http://images.cocodataset.org/zips/train2017.zip

val2017.zip download URL: http://images.cocodataset.org/zips/val2017.zip

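Step 3: either move the full train2017 and val2017 image sets over, or filter them down to the images that coco-pose actually uses.

Full image set: move the extracted train2017 and val2017 folders into the images directory (note that this is the full COCO image set).

coco-pose subset of the COCO images (run the script below from the directory that contains coco-pose, since its paths start with coco-pose/):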

from pathlib import Path
import shutil

coco_pose_train_txt = 'coco-pose/train2017.txt'
coco_pose_val_txt = 'coco-pose/val2017.txt'
coco_pose_train_img_dir = 'coco-pose/images/train2017'
coco_pose_val_img_dir = 'coco-pose/images/val2017'


def copy_from_txt(txt_file, dest_dir):
    """Copy every image path listed in txt_file (one per line) into dest_dir."""
    count = 0
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Wrap in Path so that plain string arguments work with .open()
    with Path(txt_file).open('r', encoding='utf-8') as f:
        for line in f:
            src_img = Path(line.strip())
            # Path('coco') is where the full train2017/val2017 images are stored;
            # change it to your own location (here: coco/images/train2017)
            src_img_path = Path('coco') / src_img
            if src_img_path.is_file():
                count += 1
                shutil.copy2(src_img_path, dest_dir / src_img_path.name)
            else:
                print(f'{src_img_path} does not exist, check it')
    return count


if __name__ == "__main__":
    count = copy_from_txt(coco_pose_val_txt, coco_pose_val_img_dir)
    print(f"val count:{count}")
    count = copy_from_txt(coco_pose_train_txt, coco_pose_train_img_dir)
    print(f"train count:{count}")

The resulting coco-pose split sizes:

train: 56599
val: 2346

Save the coco-pose data under ./datasets. The coco-pose directory is laid out as follows:

datasets/
├── coco-pose/
│   ├── annotations/
│   │   ├── instances_val2017.json
│   │   ├── person_keypoints_val2017.json
│   ├── images/
│   │   ├── train2017  
│   │   ├── val2017  
│   ├── labels/
│   │   ├── train2017   
│   │   ├── val2017  
│   ├── my-coco-pose.yaml

Model: yolov8-pose

yolov8s-pose is used as the example here.

Model: YOLOv8s-pose
Releases page: https://github.com/ultralytics/assets/releases
Version: v8.2.0
Download URL: https://github.com/ultralytics/assets/releases/download/v8.2.0/yolov8s-pose.pt

After downloading, save it to the ./weights directory.

Model Benchmarks

The official YOLO validation code is used to measure yolov8s-pose.pt as FP32, FP16 TensorRT, INT8 PTQ, and INT8 QAT. The code for each case is described below.

data_yaml_file = './datasets/my-coco-pose.yaml'  # validation data
device = 'cuda:0'

The contents of my-coco-pose.yaml are as follows:

# Ultralytics 🚀 AGPL-3.0 License - https://ultralytics.com/license

# COCO 2017 Keypoints dataset https://cocodataset.org by Microsoft
# Documentation: https://docs.ultralytics.com/datasets/pose/coco/
# Example usage: yolo train data=coco-pose.yaml
# parent
# ├── ultralytics
# └── datasets
#     └── coco-pose  ← downloads here (20.1 GB)

# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]
path: ./datasets/coco-pose # dataset root dir
train: train2017.txt # train images (relative to 'path') 56599 images
val: val2017.txt # val images (relative to 'path') 2346 images
test: test-dev2017.txt # 20288 of 40670 images, submit to https://codalab.lisn.upsaclay.fr/competitions/7403

# Keypoints
kpt_shape: [17, 3] # number of keypoints, number of dims (2 for x,y or 3 for x,y,visible)
flip_idx: [0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 16, 15]

# Classes
names:
  0: person

# Download script/URL (optional)
download: |
  from pathlib import Path

  from ultralytics.utils.downloads import download

  # Download labels
  dir = Path(yaml["path"])  # dataset root dir
  url = "https://github.com/ultralytics/assets/releases/download/v0.0.0/"
  urls = [f"{url}coco2017labels-pose.zip"]
  download(urls, dir=dir.parent)
  # Download data
  urls = [
      "http://images.cocodataset.org/zips/train2017.zip",  # 19G, 118k images
      "http://images.cocodataset.org/zips/val2017.zip",  # 1G, 5k images
      "http://images.cocodataset.org/zips/test2017.zip",  # 7G, 41k images (optional)
  ]
  download(urls, dir=dir / "images", threads=3)

FP32

from ultralytics import YOLO

org_pt_path = './weights/yolov8s-pose.pt'
dense_model = YOLO(org_pt_path, task='pose')
metrics = dense_model.val(data=data_yaml_file, device=device)

FP16 TRT

The official engine export can be used. It has a static and a dynamic mode, depending on whether dynamic=True is set.

Static

dense_model.export(format='engine', half=True, device=device)
fp16_trt_engine = YOLO("your engine path", task='pose')
metrics = fp16_trt_engine.val(data=data_yaml_file, device=device)

Dynamic

dense_model.export(format="engine", imgsz=640, dynamic=True, verbose=False, batch=8, workspace=2, half=True)
fp16_trt_engine = YOLO("your engine path", task='pose')
metrics = fp16_trt_engine.val(data=data_yaml_file, device=device, rect=False)  # rect=False is required

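Note that in dynamic mode rect=False must be passed at validation time. Otherwise the measured mAP is close to 0 and does not reflect the real accuracy of the engine.

rect controls whether validation uses rectangular inference. With rect=False, every image is letterboxed to the fixed imgsz x imgsz input size. Otherwise YOLO's default applies, which pads images only minimally so that the input keeps an aspect ratio close to that of the original image.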
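
The effect on the input shape can be illustrated with a small shape calculation. This is a simplified per-image sketch under stated assumptions, not the Ultralytics implementation: Ultralytics chooses one rectangular shape per batch, whereas the hypothetical helper letterbox_shape below only mimics fixed-square padding versus stride-aligned minimal padding for a single image, with an assumed stride of 32.

```python
def letterbox_shape(height, width, new_size=640, rect=False, stride=32):
    """Return the (height, width) of the network input after letterboxing.

    rect=False: pad to a fixed new_size x new_size square.
    rect=True:  pad only up to the next multiple of `stride`, so the
                result follows the aspect ratio of the image.
    """
    # Scale so that the longer side fits into new_size
    r = min(new_size / height, new_size / width)
    new_h, new_w = round(height * r), round(width * r)
    pad_h, pad_w = new_size - new_h, new_size - new_w
    if rect:
        # Minimal padding: keep only what is needed for stride alignment
        pad_h, pad_w = pad_h % stride, pad_w % stride
    return new_h + pad_h, new_w + pad_w


for h, w in [(480, 640), (427, 640), (640, 480)]:
    print((h, w), letterbox_shape(h, w, rect=False), letterbox_shape(h, w, rect=True))
```

With rect=False every image yields the same 640 x 640 input, while the rectangular variant produces a different shape for each aspect ratio.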

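INT8 PTQ

Option 1, official export: PTQ can use the official engine export (just set int8=True).
Option 2, hand-written Python export: convert the model to ONNX manually, then let TensorRT parse the ONNX file and follow the standard engine serialization flow: create the builder, network, parser, config, and INT8 calibrator (the calibrator is built from your own dataset), decide whether to set dynamic shapes (optShape, MinShape, MaxShape), and finally build the engine and write it to a file.
Option 3, the trtexec command: this requires the cache file of the INT8 calibrator (usually already produced by option 2).

The code for these three options is described below.

Official export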

dense_model.export(format='engine', imgsz=640, dynamic=True, verbose=False, batch=8, workspace=2, int8=True, data=data_yaml_file)
ptq_trt_engine = YOLO("your engine path", task='pose')
metrics = ptq_trt_engine.val(data=data_yaml_file, rect=False, device=device)

Hand-written Python

A dataloader for the calibration data is required:

import os
from pathlib import Path

import cv2
import numpy as np
from tqdm import tqdm


class CalibrationDataset:
    def __init__(self, dataset_dir, calibration_size, input_shape, cache_dir='./cache'):
        self.dataset_dir = Path(dataset_dir)
        self.calibration_size = calibration_size
        self.input_shape = input_shape
        self.image_paths = self._get_image_paths()
        self.cache_dir = cache_dir
        self.images = self._load_images()  # list[nparray(chw), ...]

    def _get_image_paths(self):
        """Collect the image paths used for calibration."""
        if not os.path.exists(self.dataset_dir):
            raise FileNotFoundError(f"data dir not found: {self.dataset_dir}")
        image_files = [p for p in self.dataset_dir.glob("*.jpg")]
        if not image_files:
            raise FileNotFoundError(f"no image files found: {self.dataset_dir}")
        return [str(p) for p in image_files[:self.calibration_size]]

    def _load_images(self) -> list:
        """Preprocess the images, or load them from the .npy cache if it exists."""
        dataset_name = str(self.dataset_dir).split('/')[-3] + '-' + str(self.dataset_dir).split('/')[-2] + '-' + str(self.dataset_dir).split('/')[-1]  # coco-pose-val2017
        self.calibration_size = min(self.calibration_size, len(self.image_paths))
        dataset_name = f'{dataset_name}-nsample{self.calibration_size}.npy'
        cache_data_path = os.path.join(self.cache_dir, dataset_name)
        if os.path.exists(cache_data_path):
            return np.load(cache_data_path, allow_pickle=True)
        # preprocess images
        images = []
        for idx, image_path in tqdm(enumerate(self.image_paths), desc="processing data", total=len(self.image_paths)):
            try:
                # image = self._preprocess_image(image_path)
                image = self._yolo_process_image(image_path)
                images.append(image)
            except Exception as e:
                print(f"Failed to preprocess image {image_path}: {e}")
        os.makedirs(self.cache_dir, exist_ok=True)  # np.save fails if the cache directory is missing
        np.save(cache_data_path, images)
        return images

    def __len__(self):
        return len(self.images)

    def _preprocess_image(self, image_path):
        """preprocess single image"""
        image = cv2.imread(image_path)
        if image is None:
            raise ValueError(f"cannot read image: {image_path}")
        image = cv2.resize(image, self.input_shape)
        image = image.astype(np.float32) / 255.0
        image = np.transpose(image, (2, 0, 1))  # HWC -> CHW
        return np.ascontiguousarray(image, dtype=np.float32)

    def _yolo_process_image(self, image_path):
        image = cv2.imread(image_path)
        if image is None:
            raise ValueError(f"cannot read image: {image_path}")
        # 1. BGR -> RGB
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        # 2. Letterbox resize
        resized, _, _ = self.letterbox(image, new_shape=self.input_shape, auto=False, scaleup=True)
        # 3. HWC -> CHW
        image = resized.transpose(2, 0, 1)
        # 4. Normalize to [0, 1]
        image = image.astype(np.float32) / 255.0
        return np.ascontiguousarray(image, dtype=np.float32)

    @staticmethod
    def letterbox(im, new_shape=(640, 640), color=(114, 114, 114), auto=False, scaleFill=False, scaleup=True):
        # resize a rectangular image to a padded rectangle
        shape = im.shape[:2]  # current shape [height, width]
        if isinstance(new_shape, int):
            new_shape = (new_shape, new_shape)
        r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
        if not scaleup:  # only scale down
            r