
YOLOv5: A Detailed Guide

news 2025/11/15 15:00:53

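1. Introduction to YOLOv5

1.1 What is YOLO?

YOLO (You Only Look Once) is a real-time object detection algorithm. Unlike traditional two-stage detectors such as the R-CNN family, YOLO treats detection as a regression problem: a single forward pass yields the location and class of every object.

1.2 Key features of YOLOv5

- Fast: a single-stage detector with low inference latency
- Accurate: reaches high detection accuracy while keeping that speed
- Easy to use: clean code structure that is easy to train and deploy
- Multiple sizes: five model sizes are provided (n, s, m, l, x)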

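1.3 YOLOv5 model comparison

Model	Parameters	mAP@0.5:0.95 (COCO val)	Inference speed	Typical use
YOLOv5n	1.9M	28.0	Fastest	Edge devices
YOLOv5s	7.2M	37.4	Fast	Mobile devices
YOLOv5m	21.2M	45.4	Medium	General purpose
YOLOv5l	46.5M	49.0	Slower	High-accuracy needs
YOLOv5x	86.7M	50.7	Slow	Highest accuracy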

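2. Object Detection Basics

2.1 Bounding Box

A bounding box marks where an object sits in the image and is usually described by four values:

- (x, y): coordinates of the box center
- w: box width
- h: box height

Two representations are common:

- Center format (x_center, y_center, width, height): used by YOLO
- Corner format (x_min, y_min, x_max, y_max): also called the xyxy format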
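Converting between the two formats is simple arithmetic. The sketch below is a minimal illustration in plain Python; the function names `xywh_to_xyxy` and `xyxy_to_xywh` are chosen here for clarity and are not taken from the YOLOv5 code base.

```python
def xywh_to_xyxy(box):
    """Convert (x_center, y_center, width, height) to (x_min, y_min, x_max, y_max)."""
    xc, yc, w, h = box
    return (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)


def xyxy_to_xywh(box):
    """Convert (x_min, y_min, x_max, y_max) to (x_center, y_center, width, height)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


corners = xywh_to_xyxy((50.0, 40.0, 20.0, 10.0))
print(corners)                # (40.0, 35.0, 60.0, 45.0)
print(xyxy_to_xywh(corners))  # (50.0, 40.0, 20.0, 10.0)
```

The center format is convenient for regression targets, while the corner format makes intersection areas easy to compute.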

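2.2 Anchor Boxes

Anchor boxes are a predefined set of boxes with different sizes and aspect ratios that help the network learn objects of different shapes.

Why are anchors needed?

- Objects come in different shapes and sizes
- Anchors give detection a "starting point"
- The network only has to predict an offset relative to an anchor rather than absolute coordinates

YOLOv5 anchor settings:

# Three detection layers, three anchors per layer (width,height pairs in pixels)
anchors = [
    [10,13, 16,30, 33,23],       # P3/8  - small-object layer
    [30,61, 62,45, 59,119],      # P4/16 - medium-object layer
    [116,90, 156,198, 373,326],  # P5/32 - large-object layer
]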

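2.3 IoU (Intersection over Union)

IoU measures how much two bounding boxes overlap:

IoU = (Area of Intersection) / (Area of Union)

It is used for:

- Matching predicted boxes to ground-truth boxes
- Non-maximum suppression (NMS)
- Evaluating detection accuracy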
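As a concrete illustration of the formula, here is a minimal plain-Python sketch for two boxes in xyxy format. The helper name `box_iou` is an assumption made for this example; it is not the batched tensor implementation used inside YOLOv5.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlap region (zero if the boxes do not touch)
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes overlapping in a 5x5 region: 25 / (100 + 100 - 25)
print(round(box_iou((0, 0, 10, 10), (5, 5, 15, 15)), 4))  # 0.1429
```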

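2.4 Non-Maximum Suppression (NMS)

NMS removes duplicate detections of the same object and keeps only the best one.

Procedure:

1. Sort all predicted boxes by confidence
2. Select the box A with the highest confidence
3. Compute the IoU between A and every remaining box
4. Remove boxes whose IoU with A exceeds the threshold (they are treated as duplicate detections)
5. Repeat steps 2-4 until every box has been processed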
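The five steps can be sketched in a few lines of plain Python. This is a simplified, single-class illustration with hypothetical helper names (`iou`, `nms`); production code normally relies on a vectorized routine such as `torchvision.ops.nms`.

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thres=0.45):
    """Greedy NMS. Returns the indices of the kept boxes, highest score first."""
    # Step 1: sort indices by confidence, descending
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # Step 2: take the most confident remaining box
        keep.append(best)
        # Steps 3-4: drop every box that overlaps it too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thres]
    return keep              # Step 5 is the loop itself


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed; the third box is disjoint and survives.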

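2.5 mAP (mean Average Precision)

mAP is the most widely used evaluation metric in object detection:

- Precision: the fraction of predicted positives that are truly positive
- Recall: the fraction of all ground-truth positives that are correctly predicted
- AP: the area under the precision-recall curve, i.e. precision averaged over recall levels
- mAP: the mean of AP over all classes

Common notations:

- mAP@0.5: mAP at an IoU threshold of 0.5
- mAP@0.5:0.95: mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05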
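To make the AP definition concrete, the sketch below computes AP for one class with all-point interpolation, which is one common convention; YOLOv5's own evaluation code may differ in details such as the interpolation grid. It assumes each detection has already been marked as a true or false positive at a fixed IoU threshold, and the function name `average_precision` is chosen for this example.

```python
def average_precision(detections, num_gt):
    """AP for one class.

    detections: list of (confidence, is_true_positive) pairs
    num_gt: number of ground-truth boxes of this class
    """
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in detections:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the best precision at any higher recall
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the resulting step curve
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Three detections, two ground-truth boxes: AP = 0.5 * 1.0 + 0.5 * (2/3)
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
print(round(ap, 4))  # 0.8333
```

mAP is then the mean of this value over all classes, and mAP@0.5:0.95 repeats the whole procedure for each IoU threshold before averaging.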

3. YOLOv5 Network Architecture

The YOLOv5 network consists of four stages (input, backbone, neck, and detection head) that together produce the output:

Input → Backbone → Neck → Head → Output

3.1 Overall architecture

Input (640x640x3)
  ↓
Backbone (CSPDarknet53)
  ├─ Focus/Conv
  ├─ CSP1_1 → P1
  ├─ CSP1_3 → P2
  ├─ CSP2_3 → P3 (80x80) ─────┐
  ├─ CSP2_3 → P4 (40x40) ─────┼─→ Neck (PANet)
  └─ CSP2_1+SPPF → P5 (20x20)─┘
  ↓
┌───────────────────────┐
│   Neck (PANet/FPN)    │
│  ┌─────────────────┐  │
│  │  P5 → Up → P4   │  │
│  │  P4 → Up → P3   │  │
│  │  P3 → Down → P4 │  │
│  │  P4 → Down → P5 │  │
│  └─────────────────┘  │
└───────────────────────┘
  ↓
Detection Head
┌────────┬────────┬────────┐
│  P3    │  P4    │  P5    │
│ 80x80  │ 40x40  │ 20x20  │
│ small  │ medium │ large  │
└────────┴────────┴────────┘
  ↓
[x, y, w, h, conf, class_probs]

3.2 Layer-by-layer shapes

Taking YOLOv5s as an example (640x640 input image). Every stride-2 convolution halves the spatial size, so P3, P4, and P5 end up at 80×80, 40×40, and 20×20:

Layer	Type	Output size	Parameters
0	Focus	320×320×32	-
1	Conv	160×160×64	k=3, s=2
2	C3	160×160×64	n=1
3	Conv	80×80×128	k=3, s=2
4	C3	80×80×128	n=2 → P3
5	Conv	40×40×256	k=3, s=2
6	C3	40×40×256	n=3 → P4
7	Conv	20×20×512	k=3, s=2
8	C3	20×20×512	n=1
9	SPPF	20×20×512	k=5 → P5
10	Conv	20×20×256	k=1, s=1
11	Upsample	40×40×256	scale=2
12	Concat	40×40×512	[11, P4]
13	C3	40×40×256	n=1
14	Conv	40×40×128	k=1, s=1
15	Upsample	80×80×128	scale=2
…	…	…	…

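4. Key Modules in Detail

4.1 Focus module

Purpose: reduce computation without losing information

Principle: move spatial information into the channel dimension

- Split an H×W×C image into 4 parts
- Each part is sampled at every other pixel, then the parts are concatenated along the channel dimension
- The output is H/2×W/2×4C

Implementation:

import torch
import torch.nn as nn


class Focus(nn.Module):
    """Focus spatial information into the channel dimension.

    Input:  (b, c, h, w)
    Output: (b, 4c, h/2, w/2) before the convolution, (b, c2, h/2, w/2) after it
    """

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):
        super().__init__()
        # Input channels grow 4x because four slices are concatenated
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act=act)

    def forward(self, x):
        # Strided sampling: [::2, ::2] starts at index 0 and takes every other pixel.
        # The last two dimensions are (height, width).
        return self.conv(torch.cat([
            x[..., ::2, ::2],    # even rows, even columns (top-left of each 2x2 patch)
            x[..., 1::2, ::2],   # odd rows, even columns (bottom-left)
            x[..., ::2, 1::2],   # even rows, odd columns (top-right)
            x[..., 1::2, 1::2],  # odd rows, odd columns (bottom-right)
        ], 1))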

Example:

Input: 640×640×3
↓
Strided sampling gives 4 feature maps of 320×320×3
↓
Concatenation: 320×320×12
↓
Convolution: 320×320×32
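The claim that no information is lost can be checked on a tiny example. The sketch below reproduces the four strided slices on a 4×4 single-channel grid using plain Python lists; `focus_slices` is a hypothetical helper written for this illustration, with the slices in the same order as in the `Focus` module.

```python
def focus_slices(img):
    """Split an H x W grid (list of rows) into four H/2 x W/2 grids by strided sampling."""
    return [
        [row[0::2] for row in img[0::2]],  # even rows, even columns
        [row[0::2] for row in img[1::2]],  # odd rows, even columns
        [row[1::2] for row in img[0::2]],  # even rows, odd columns
        [row[1::2] for row in img[1::2]],  # odd rows, odd columns
    ]


# A 4x4 "image" whose pixel values are 0..15
img = [[r * 4 + c for c in range(4)] for r in range(4)]
slices = focus_slices(img)
print(slices[0])  # [[0, 2], [8, 10]]
print(slices[3])  # [[5, 7], [13, 15]]

# Every original pixel appears exactly once across the four slices
flat = sorted(v for s in slices for row in s for v in row)
print(flat == list(range(16)))  # True
```

Each slice is half the size in both dimensions, and together the four slices contain every pixel exactly once, which is why stacking them along the channel axis preserves the input.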

4.2 Conv module (standard convolution block)

Composition: convolution + batch normalization + activation

Implementation:

import torch.nn as nn


class Conv(nn.Module):
    """Standard convolution block: Conv2d + BatchNorm + activation."""

    default_act = nn.SiLU()  # default activation: SiLU (Swish)

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """
        c1: input channels
        c2: output channels
        k: kernel size
        s: stride
        p: padding (computed automatically when None)
        g: groups
        d: dilation
        act: activation (True selects the default SiLU)
        """
        super().__init__()
        # Convolution
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        # Batch normalization
        self.bn = nn.BatchNorm2d(c2)
        # Activation
        self.act = self.default_act if act is True else \
            act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

Automatic padding helper:

def autopad(k, p=None, d=1):
    """Compute padding so that the output size is unchanged when stride=1."""
    if d > 1:
        # Effective kernel size of a dilated convolution
        k = d * (k - 1) + 1 if isinstance(k, int) else \
            [d * (x - 1) + 1 for x in k]
    if p is None:
        # "same" padding
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p

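4.3 Bottleneck module

Purpose: a bottleneck structure similar to ResNet's that reduces the parameter count

Characteristics:

- A 1×1 convolution reduces the channels, then a 3×3 convolution restores them
- Residual connection (shortcut)

Implementation:

class Bottleneck(nn.Module):
    """Standard bottleneck with an optional residual connection."""

    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):
        """
        c1: input channels
        c2: output channels
        shortcut: whether to use the residual connection
        g: number of groups for the grouped convolution
        e: channel expansion ratio (hidden channels = c2 * e)
        """
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)       # 1×1 channel reduction
        self.cv2 = Conv(c_, c2, 3, 1, g=g)  # 3×3 convolution
        # The residual is used only when shortcut=True and input/output channels match
        self.add = shortcut and c1 == c2

    def forward(self, x):
        # With residual:    out = x + conv(x)
        # Without residual: out = conv(x)
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))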

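4.4 C3 module (CSP Bottleneck)

Purpose: the core building block of YOLOv5, based on the CSPNet idea

Advantages of CSP (Cross Stage Partial):

- Less computation
- Better gradient flow
- Faster inference

Structure:

Input x
  ├─→ cv1 → Bottleneck × n ─┐
  │                          ├─→ concat → cv3 → output
  └─→ cv2 ──────────────────┘

Implementation:

class C3(nn.Module):
    """CSP Bottleneck with 3 convolutions."""

    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):
        """
        c1: input channels
        c2: output channels
        n: number of repeated bottlenecks
        shortcut: whether the bottlenecks use a residual connection
        g: number of groups for the grouped convolution
        e: channel expansion ratio
        """
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)   # first path
        self.cv2 = Conv(c1, c_, 1, 1)   # second path (direct)
        self.cv3 = Conv(2 * c_, c2, 1)  # fusion layer
        # n bottlenecks in series
        self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)))

    def forward(self, x):
        # Concatenate the two paths, then fuse
        return self.cv3(torch.cat((
            self.m(self.cv1(x)),  # through the bottleneck sequence
            self.cv2(x),          # direct connection
        ), 1))

Why use C3?

- The split design reduces duplicated gradient information
- It improves the network's learning efficiency
- It lowers the computational cost while maintaining accuracy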

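4.5 SPPF module (Spatial Pyramid Pooling - Fast)

Purpose: fuse multi-scale features and enlarge the receptive field

SPP vs SPPF:

- SPP: several pooling kernels in parallel (5×5, 9×9, 13×13)
- SPPF: several identical pooling kernels in series (5×5), which is faster

Structure comparison:

SPP:
input → conv → ┬─ maxpool(5) ─┐
               ├─ maxpool(9) ─┤→ concat → conv → output
               └─ maxpool(13)─┘

SPPF:
input → conv → maxpool(5) → maxpool(5) → maxpool(5) → concat → conv → output
                   ↓            ↓            ↓
                  keep         keep         keep
(the conv output itself is concatenated together with the three pooled maps)

Implementation:

class SPPF(nn.Module):
    """Fast spatial pyramid pooling.

    Equivalent to SPP(k=(5, 9, 13)) but faster.
    """

    def __init__(self, c1, c2, k=5):
        """
        c1: input channels
        c2: output channels
        k: pooling kernel size (default 5)
        """
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)      # channel reduction
        self.cv2 = Conv(c_ * 4, c2, 1, 1)  # channel expansion (4x because 4 maps are concatenated)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        # Pooling in series
        y1 = self.m(x)
        y2 = self.m(y1)
        y3 = self.m(y2)
        # Concatenate the original x with the three pooled results
        return self.cv2(torch.cat((x, y1, y2, y3), 1))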

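Receptive field:

One 5×5 MaxPool:        receptive field = 5
Two pools in series:    receptive field = 5 + 4 = 9
Three pools in series:  receptive field = 5 + 4 + 4 = 13

The result matches SPP with k=(5, 9, 13) but is cheaper to compute.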
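The equivalence can be verified numerically. The sketch below uses a one-dimensional stride-1 max pool written in plain Python (`maxpool1d` is a hypothetical helper, not a PyTorch call); because max pooling over a square window is separable into a row pass and a column pass, the 1-D check carries over to the 2-D case. Clipping the window at the borders plays the role of the implicit negative-infinity padding used by max pooling.

```python
def maxpool1d(xs, k):
    """Stride-1 max pooling with a window of size k centered on each element.

    The window is clipped at the borders, which is equivalent to padding with -inf.
    """
    pad = k // 2
    n = len(xs)
    return [max(xs[max(0, i - pad): min(n, i + pad + 1)]) for i in range(n)]


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]

twice = maxpool1d(maxpool1d(signal, 5), 5)
thrice = maxpool1d(twice, 5)

print(twice == maxpool1d(signal, 9))    # True: two 5-pools equal one 9-pool
print(thrice == maxpool1d(signal, 13))  # True: three 5-pools equal one 13-pool
```

SPPF therefore reuses each pooled map as the input of the next pool instead of running three separate, larger windows over the same tensor.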

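4.6 PANet (Path Aggregation Network)

Purpose: the neck of YOLOv5, used for multi-scale feature fusion

Design idea:

- Top-down: high-level features → low-level features (FPN)
- Bottom-up: low-level features → high-level features (the extra path added by PAN)

Structure:

Backbone outputs (YOLOv5s):
P3 (80×80×128)    P4 (40×40×256)    P5 (20×20×512)
      ↓                 ↓                 ↓
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
│  FPN (top-down)    │
│   P5 → Up → P4     │
│   P4 → Up → P3     │
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
│  PAN (bottom-up)   │
│   P3 → Down → P4   │
│   P4 → Down → P5   │
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
      ↓                 ↓                 ↓
Head P3           Head P4           Head P5
(small objects)   (medium objects)  (large objects)

Why is PANet needed?

- Objects at different scales: small, medium, and large objects need features from different levels
- Feature enhancement: low-level features (detail) are combined with high-level features (semantics)
- Localization accuracy: low-level features help localize objects precisely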

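4.7 Detect module (detection head)

Purpose: turn the feature maps into detection results

Input: three feature maps at different scales

- P3: 80×80 - detects small objects
- P4: 40×40 - detects medium objects
- P5: 20×20 - detects large objects

Output: each grid cell predicts 3 anchors, and each anchor predicts:

- (x, y): offset of the box center relative to the grid cell
- (w, h): box width and height, expressed as a scaling of the anchor
- confidence: objectness score
- class_probs: per-class probabilities (80 classes assumed here)

Output shape: (bs, na, ny, nx, no)

- bs: batch size
- na: anchors per grid cell = 3
- ny, nx: feature-map height and width
- no: outputs per anchor = 5 + nc (4 coordinates + 1 confidence + nc classes)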
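A quick arithmetic check of these shapes, assuming a 640×640 input, 80 classes, 3 anchors per cell, and strides of 8, 16, and 32. The helper `detect_output_shape` is written for this illustration only.

```python
def detect_output_shape(img_size=640, nc=80, na=3, strides=(8, 16, 32)):
    """Return (total predicted boxes, values per box, channels of each output conv)."""
    no = nc + 5  # 4 box coordinates + 1 objectness + nc class scores
    # Each layer has (img_size / stride)^2 cells and na anchors per cell
    total = sum(na * (img_size // s) ** 2 for s in strides)
    return total, no, na * no


print(detect_output_shape())  # (25200, 85, 255)
```

The three layers contribute 19200, 4800, and 1200 boxes respectively, which is where the 25200 rows of the inference output come from.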

Implementation:

import torch
import torch.nn as nn


class Detect(nn.Module):
    """YOLOv5 detection head."""

    stride = None  # downsampling factor of each layer relative to the input image

    def __init__(self, nc=80, anchors=(), ch=(), inplace=True):
        """
        nc: number of classes
        anchors: anchor configuration, e.g.
            [[10,13, 16,30, 33,23],
             [30,61, 62,45, 59,119],
             [116,90, 156,198, 373,326]]
        ch: input channels per layer, e.g. [256, 512, 1024] (YOLOv5l; YOLOv5s uses [128, 256, 512])
        """
        super().__init__()
        self.nc = nc                    # number of classes
        self.no = nc + 5                # outputs per anchor
        self.nl = len(anchors)          # number of detection layers = 3
        self.na = len(anchors[0]) // 2  # anchors per layer = 3
        self.grid = [torch.empty(0) for _ in range(self.nl)]
        self.anchor_grid = [torch.empty(0) for _ in range(self.nl)]
        # Registered as a buffer so the value is saved with the model
        self.register_buffer('anchors', torch.tensor(anchors).float().view(self.nl, -1, 2))
        # Output convolutions: ch[i] input channels → na * no prediction channels
        self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)
        self.inplace = inplace

    def forward(self, x):
        """
        x: list of 3 feature maps (channel counts shown for YOLOv5l)
            x[0]: (bs, 256, 80, 80)  - P3
            x[1]: (bs, 512, 40, 40)  - P4
            x[2]: (bs, 1024, 20, 20) - P5

        Returns:
            training:  x, the raw predictions
            inference: (inference_output, x), decoded predictions plus raw predictions
        """
        z = []  # inference output
        for i in range(self.nl):  # loop over the 3 detection layers
            x[i] = self.m[i](x[i])  # prediction convolution
            bs, _, ny, nx = x[i].shape
            # e.g. (bs, 255, 80, 80) → (bs, 3, 85, 80, 80) → (bs, 3, 80, 80, 85)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            if not self.training:  # inference mode
                # Build the grid when the feature-map size has changed
                if self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

                # Decode the predictions
                xy, wh, conf = x[i].sigmoid().split((2, 2, self.nc + 1), 4)
                # xy: the grid already contains the -0.5 offset; the result is in input-image pixels
                xy = (xy * 2 + self.grid[i]) * self.stride[i]
                # wh: scaling relative to the anchor
                wh = (wh * 2) ** 2 * self.anchor_grid[i]
                y = torch.cat((xy, wh, conf), 4)
                z.append(y.view(bs, self.na * nx * ny, self.no))

        return x if self.training else (torch.cat(z, 1), x)

    def _make_grid(self, nx=20, ny=20, i=0):
        """
        Build the cell-coordinate grid and the anchor grid.

        nx, ny: feature-map width and height
        i: index of the detection layer
        Returns: (grid, anchor_grid)
        """
        d = self.anchors[i].device
        t = self.anchors[i].dtype
        shape = 1, self.na, ny, nx, 2  # (1, 3, 20, 20, 2)
        # Cell coordinates
        y, x = torch.arange(ny, device=d, dtype=t), torch.arange(nx, device=d, dtype=t)
        yv, xv = torch.meshgrid(y, x, indexing='ij')
        # grid: top-left coordinate of every cell, shifted by -0.5
        grid = torch.stack((xv, yv), 2).expand(shape) - 0.5
        # anchor_grid: anchor size × stride
        anchor_grid = (self.anchors[i] * self.stride[i]).view((1, self.na, 1, 1, 2)).expand(shape)
        return grid, anchor_grid

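Decoding the predictions:

Raw predictions: (tx, ty, tw, th, conf, cls)

Center decoding:

# sigmoid limits the value to 0~1
# *2 widens the range to 0~2, and -0.5 shifts it to -0.5~1.5
# +grid_x, +grid_y add the cell coordinates
# *stride converts to input-image scale
cx = (sigmoid(tx) * 2 - 0.5 + grid_x) * stride
cy = (sigmoid(ty) * 2 - 0.5 + grid_y) * stride

Width and height decoding:

# sigmoid limits the value to 0~1
# *2 widens the range to 0~2
# **2 squares it, giving a range of 0~4
# *anchor_w/h scales relative to the anchor
w = (sigmoid(tw) * 2) ** 2 * anchor_w
h = (sigmoid(th) * 2) ** 2 * anchor_h

Confidence and classes:

conf = sigmoid(conf)       # objectness score
cls = sigmoid(cls_logits)  # per-class probabilities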
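The decoding formulas can be tried on a single prediction. The sketch below applies them in plain Python; the function `decode` and its argument names are assumptions made for this example, and the anchor size is given in pixels.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode(tx, ty, tw, th, grid_x, grid_y, anchor_w, anchor_h, stride):
    """Decode one raw prediction into (cx, cy, w, h) in input-image pixels."""
    cx = (sigmoid(tx) * 2 - 0.5 + grid_x) * stride
    cy = (sigmoid(ty) * 2 - 0.5 + grid_y) * stride
    w = (sigmoid(tw) * 2) ** 2 * anchor_w
    h = (sigmoid(th) * 2) ** 2 * anchor_h
    return cx, cy, w, h


# All-zero logits in cell (10, 5) of the stride-16 layer with the 30x61 anchor:
# the center lands in the middle of the cell and the box equals the anchor.
print(decode(0, 0, 0, 0, grid_x=10, grid_y=5, anchor_w=30, anchor_h=61, stride=16))
# (168.0, 88.0, 30.0, 61.0)
```

Because the width and height factor is bounded by 4, a predicted box can be at most four times larger than its anchor, which matches the default anchor matching threshold of 4.0 used during training.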

5. Loss Function

The YOLOv5 loss has three components:

Total Loss = λ₁ × Box Loss + λ₂ × Object Loss + λ₃ × Class Loss

5.1 Box Loss

Uses CIoU loss.

CIoU takes into account:

- Overlap area
- Distance between the centers
- Aspect ratio

import math

import torch


def bbox_iou(box1, box2, CIoU=True):
    """Compute the IoU or CIoU of bounding boxes.

    box1, box2: boxes in (x, y, w, h) format
    """
    # Convert to (x1, y1, x2, y2)
    b1_x1, b1_y1 = box1[..., 0] - box1[..., 2] / 2, box1[..., 1] - box1[..., 3] / 2
    b1_x2, b1_y2 = box1[..., 0] + box1[..., 2] / 2, box1[..., 1] + box1[..., 3] / 2
    b2_x1, b2_y1 = box2[..., 0] - box2[..., 2] / 2, box2[..., 1] - box2[..., 3] / 2
    b2_x2, b2_y2 = box2[..., 0] + box2[..., 2] / 2, box2[..., 1] + box2[..., 3] / 2

    # Intersection area
    inter = (torch.min(b1_x2, b2_x2) - torch.max(b1_x1, b2_x1)).clamp(0) * \
            (torch.min(b1_y2, b2_y2) - torch.max(b1_y1, b2_y1)).clamp(0)

    # Union area
    w1, h1 = b1_x2 - b1_x1, b1_y2 - b1_y1
    w2, h2 = b2_x2 - b2_x1, b2_y2 - b2_y1
    union = w1 * h1 + w2 * h2 - inter + 1e-16
    iou = inter / union

    if CIoU:
        # Smallest enclosing box
        cw = torch.max(b1_x2, b2_x2) - torch.min(b1_x1, b2_x1)
        ch = torch.max(b1_y2, b2_y2) - torch.min(b1_y1, b2_y1)
        # Squared diagonal of the enclosing box
        c2 = cw ** 2 + ch ** 2 + 1e-16
        # Squared distance between the centers
        rho2 = ((b2_x1 + b2_x2 - b1_x1 - b1_x2) ** 2 +
                (b2_y1 + b2_y2 - b1_y1 - b1_y2) ** 2) / 4
        # Aspect-ratio consistency
        v = (4 / math.pi ** 2) * torch.pow(
            torch.atan(w2 / (h2 + 1e-16)) - torch.atan(w1 / (h1 + 1e-16)), 2)
        alpha = v / (v - iou + 1 + 1e-16)
        # CIoU
        return iou - (rho2 / c2 + v * alpha)
    return iou

Box loss computation:

# Predicted box and ground-truth box
pbox = torch.cat((pxy, pwh), 1)  # predicted (x, y, w, h)
iou = bbox_iou(pbox, tbox[i], CIoU=True).squeeze()
lbox += (1.0 - iou).mean()  # CIoU loss

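5.2 Objectness Loss

Uses BCE loss, specifically nn.BCEWithLogitsLoss (binary cross-entropy applied to raw logits):

# Loss function for the objectness score
BCEobj = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h['obj_pw']]))

# Target assignment: the IoU is used as the objectness target
tobj[b, a, gj, gi] = iou.detach().clamp(0).type(tobj.dtype)

# Compute the loss
lobj += self.BCEobj(pi[..., 4], tobj) * self.balance[i]

Why use the IoU as the target?

- High IoU → the predicted box overlaps the ground truth well → the confidence should be high
- Low IoU → the predicted box overlaps the ground truth poorly → the confidence should be low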

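5.3 Classification Loss

Uses BCE loss (multi-label classification):

# Class targets (label smoothing is supported; eps=0.0 disables it)
cp, cn = smooth_BCE(eps=0.0)   # positive, negative targets
t = torch.full_like(pcls, cn)  # initialize with the negative target
t[range(n), tcls[i]] = cp      # set the positive target at the true class

# Compute the classification loss
lcls += self.BCEcls(pcls, t)

Label smoothing:

def smooth_BCE(eps=0.1):
    """Label smoothing to reduce overfitting.

    With eps=0.1:
    positive target: 1.0 → 1.0 - 0.5*eps = 0.95
    negative target: 0.0 → 0.5*eps = 0.05
    """
    return 1.0 - 0.5 * eps, 0.5 * eps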

5.4 Complete loss computation

class ComputeLoss:
    """YOLOv5 loss computation."""

    def __init__(self, model, autobalance=False):
        device = next(model.parameters()).device
        h = model.hyp  # hyperparameters

        # Loss functions
        BCEcls = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h['cls_pw']], device=device))
        BCEobj = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h['obj_pw']], device=device))

        # Label smoothing
        self.cp, self.cn = smooth_BCE(eps=h.get('label_smoothing', 0.0))

        # Focal loss (optional)
        g = h['fl_gamma']
        if g > 0:
            BCEcls, BCEobj = FocalLoss(BCEcls, g), FocalLoss(BCEobj, g)

        m = model.model[-1]  # Detect module
        self.balance = {3: [4.0, 1.0, 0.4]}.get(m.nl, [4.0, 1.0, 0.25, 0.06, 0.02])
        self.BCEcls, self.BCEobj = BCEcls, BCEobj
        self.hyp = h
        self.na = m.na  # number of anchors
        self.nc = m.nc  # number of classes
        self.nl = m.nl  # number of detection layers
        self.anchors = m.anchors
        self.device = device

    def __call__(self, p, targets):
        """
        p: predictions, a list with the outputs of the 3 detection layers
        targets: ground-truth labels (image_idx, class, x, y, w, h)
        Returns: (total_loss, loss_items)
        """
        lcls = torch.zeros(1, device=self.device)  # classification loss
        lbox = torch.zeros(1, device=self.device)  # box loss
        lobj = torch.zeros(1, device=self.device)  # objectness loss

        # Build the targets
        tcls, tbox, indices, anchors = self.build_targets(p, targets)

        # Loop over the detection layers
        for i, pi in enumerate(p):
            b, a, gj, gi = indices[i]  # image, anchor, gridy, gridx
            tobj = torch.zeros(pi.shape[:4], dtype=pi.dtype, device=self.device)

            n = b.shape[0]  # number of targets
            if n:
                # Pick the predictions at the assigned positions
                pxy, pwh, _, pcls = pi[b, a, gj, gi].split((2, 2, 1, self.nc), 1)

                # === Box loss ===
                pxy = pxy.sigmoid() * 2 - 0.5
                pwh = (pwh.sigmoid() * 2) ** 2 * anchors[i]
                pbox = torch.cat((pxy, pwh), 1)
                iou = bbox_iou(pbox, tbox[i], CIoU=True).squeeze()
                lbox += (1.0 - iou).mean()

                # === Objectness target ===
                tobj[b, a, gj, gi] = iou.detach().clamp(0).type(tobj.dtype)

                # === Classification loss ===
                if self.nc > 1:
                    t = torch.full_like(pcls, self.cn, device=self.device)
                    t[range(n), tcls[i]] = self.cp
                    lcls += self.BCEcls(pcls, t)

            # Objectness loss over all positions
            obji = self.BCEobj(pi[..., 4], tobj)
            lobj += obji * self.balance[i]

        # Apply the loss weights
        lbox *= self.hyp['box']
        lobj *= self.hyp['obj']
        lcls *= self.hyp['cls']
        bs = tobj.shape[0]

        return (lbox + lobj + lcls) * bs, torch.cat((lbox, lobj, lcls)).detach()

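5.5 Target assignment strategy

How is it decided which anchor is responsible for which object?

# Method of ComputeLoss
def build_targets(self, p, targets):
    """Assign suitable anchors to every target.

    Strategy:
    1. Anchor matching: keep the anchors whose width and height ratios to the target stay within a threshold
    2. Cross-grid matching: let anchors in neighboring cells predict the target as well
    """
    na, nt = self.na, targets.shape[0]
    tcls, tbox, indices, anch = [], [], [], []
    gain = torch.ones(7, device=self.device)

    # Replicate every target na times, one copy per anchor, and append the anchor index
    ai = torch.arange(na, device=self.device).float().view(na, 1).repeat(1, nt)
    targets = torch.cat((targets.repeat(na, 1, 1), ai[..., None]), 2)

    g = 0.5  # offset bias
    # Offsets for 5 cells: center, left, up, right, down
    off = torch.tensor([
        [0, 0],
        [1, 0], [0, 1], [-1, 0], [0, -1],
    ], device=self.device).float() * g

    for i in range(self.nl):  # loop over the detection layers
        anchors = self.anchors[i]
        gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain

        # Scale the target coordinates to the current feature map
        t = targets * gain
        if nt:
            # === Anchor matching ===
            r = t[..., 4:6] / anchors[:, None]  # width and height ratios
            j = torch.max(r, 1 / r).max(2)[0] < self.hyp['anchor_t']  # ratio threshold
            t = t[j]  # keep the matched targets

            # === Cross-grid matching ===
            gxy = t[:, 2:4]            # center coordinates
            gxi = gain[[2, 3]] - gxy   # distance to the bottom-right corner
            j, k = ((gxy % 1 < g) & (gxy > 1)).T  # close to the left or top edge
            l, m = ((gxi % 1 < g) & (gxi > 1)).T  # close to the right or bottom edge
            j = torch.stack((torch.ones_like(j), j, k, l, m))
            t = t.repeat((5, 1, 1))[j]
            offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]
        else:
            t = targets[0]
            offsets = 0

        # Extract the target information
        bc, gxy, gwh, a = t.chunk(4, 1)
        a, (b, c) = a.long().view(-1), bc.long().T
        gij = (gxy - offsets).long()
        gi, gj = gij.T

        # Store the results
        indices.append((b, a, gj.clamp_(0, gain[3] - 1), gi.clamp_(0, gain[2] - 1)))
        tbox.append(torch.cat((gxy - gij, gwh), 1))
        anch.append(anchors[a])
        tcls.append(c)

    return tcls, tbox, indices, anch

Key points:

- Anchor matching: an anchor matches only if the width and height ratios between target and anchor are within the threshold (4.0 by default)
- Cross-grid prediction: anchors in neighboring cells also take part, which increases the number of positive samples
- Multi-layer prediction: a target may be predicted on several detection layers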
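The anchor matching rule is easy to reproduce by hand. The following plain-Python sketch applies the same ratio test as `build_targets` to a single target; `matching_anchors` is a hypothetical helper written for this illustration, and sizes are given in pixels (the ratio is the same in grid units).

```python
def matching_anchors(target_wh, anchors, anchor_t=4.0):
    """Indices of anchors whose width and height ratios to the target are both below anchor_t."""
    tw, th = target_wh
    matched = []
    for idx, (aw, ah) in enumerate(anchors):
        rw, rh = tw / aw, th / ah
        # Worst ratio in either direction (target larger or smaller than the anchor)
        worst = max(rw, 1 / rw, rh, 1 / rh)
        if worst < anchor_t:
            matched.append(idx)
    return matched


p3_anchors = [(10, 13), (16, 30), (33, 23)]
# A 50x40 target is 5x wider than the first anchor, so only the other two match
print(matching_anchors((50, 40), p3_anchors))  # [1, 2]
```

A target can match several anchors on the same layer, and the test is repeated for every detection layer, which is how one object ends up with multiple positive samples.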

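6. Training

6.1 Data preparation

Data format (YOLO format):

# Image file: images/train/img1.jpg
# Label file: labels/train/img1.txt

# Label format (one object per line):
class_id x_center y_center width height

Example:

0 0.5 0.5 0.3 0.4
2 0.2 0.3 0.1 0.15

The first line describes an object of class 0 with center (0.5, 0.5), width 0.3, and height 0.4; the second describes an object of class 2 with center (0.2, 0.3), width 0.1, and height 0.15. All coordinates are normalized by the image width and height.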
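To see what the normalized values mean in pixels, the sketch below parses one label line and converts it to corner coordinates for a given image size. The function name `yolo_label_to_pixels` is an assumption made for this example.

```python
def yolo_label_to_pixels(line, img_w, img_h):
    """Parse 'class x_center y_center width height' (normalized) into
    (class_id, x_min, y_min, x_max, y_max) in pixels."""
    parts = line.split()
    cls = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:5])
    x1 = (xc - w / 2) * img_w
    y1 = (yc - h / 2) * img_h
    x2 = (xc + w / 2) * img_w
    y2 = (yc + h / 2) * img_h
    return cls, x1, y1, x2, y2


print(yolo_label_to_pixels("0 0.5 0.5 0.25 0.5", img_w=640, img_h=480))
# (0, 240.0, 120.0, 400.0, 360.0)
```

Because the labels are normalized, the same label file stays valid when the image is resized.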

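6.2 Data augmentation

YOLOv5 uses several data augmentation techniques:

1. Mosaic augmentation:

# Stitch 4 images into one
# ┌─────┬─────┐
# │ img1│ img2│
# ├─────┼─────┤
# │ img3│ img4│
# └─────┴─────┘

Advantages:

- More small objects
- More varied backgrounds
- Less need for many GPUs, because each sample combines four images and the effective batch size grows

2. Other augmentations:

- Random Flip
- Random Scale
- Random Crop
- Random HSV (color jitter)
- MixUp
- CutOut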

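6.3 Training configuration

Example hyperparameter file (hyp.yaml):

# Optimizer parameters
lr0: 0.01          # initial learning rate
lrf: 0.1           # final learning rate (lr0 * lrf)
momentum: 0.937    # SGD momentum
weight_decay: 0.0005  # weight decay

# Loss weights
box: 0.05          # box loss weight
cls: 0.5           # class loss weight
obj: 1.0           # object loss weight

# Anchor parameters
anchor_t: 4.0      # anchor matching threshold

# Augmentation parameters
hsv_h: 0.015       # HSV hue augmentation
hsv_s: 0.7         # HSV saturation augmentation
hsv_v: 0.4         # HSV value augmentation
degrees: 0.0       # rotation angle
translate: 0.1     # translation
scale: 0.5         # scaling
shear: 0.0         # shear
perspective: 0.0   # perspective transform
flipud: 0.0        # probability of a vertical flip
fliplr: 0.5        # probability of a horizontal flip
mosaic: 1.0        # probability of mosaic augmentation
mixup: 0.0         # probability of mixup augmentation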
6.4 Training workflow

Skeleton of the full training code:

def train(hyp, opt):
    """YOLOv5 training entry point (skeleton).

    hyp: hyperparameter dict
    opt: training options

    train_path, val_path, imgsz, batch_size and stride are assumed to come from
    the dataset configuration and opt; select_device, Model, create_dataloader,
    validate and tqdm are assumed to be imported.
    """
    epochs = opt.epochs
    best_fitness = 0.0

    # ==================== 1. Initialization ====================
    # Random seed
    torch.manual_seed(0)

    # Device
    device = select_device(opt.device)

    # Model
    model = Model(opt.cfg, ch=3, nc=opt.nc).to(device)

    # Freeze layers (optional)
    freeze = [f'model.{x}.' for x in range(opt.freeze)]
    for k, v in model.named_parameters():
        v.requires_grad = True
        if any(x in k for x in freeze):
            v.requires_grad = False

    # ==================== 2. Optimizer ====================
    # Parameter groups
    g0, g1, g2 = [], [], []
    for v in model.modules():
        if hasattr(v, 'bias') and isinstance(v.bias, nn.Parameter):
            g2.append(v.bias)    # biases
        if isinstance(v, nn.BatchNorm2d):
            g0.append(v.weight)  # BN weights (no weight decay)
        elif hasattr(v, 'weight') and isinstance(v.weight, nn.Parameter):
            g1.append(v.weight)  # convolution weights (with weight decay)

    optimizer = optim.SGD(g0, lr=hyp['lr0'], momentum=hyp['momentum'], nesterov=True)
    optimizer.add_param_group({'params': g1, 'weight_decay': hyp['weight_decay']})
    optimizer.add_param_group({'params': g2})  # biases

    # ==================== 3. Learning-rate scheduler ====================
    lf = lambda x: (1 - x / epochs) * (1.0 - hyp['lrf']) + hyp['lrf']
    scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)

    # ==================== 4. Data loaders ====================
    # create_dataloader returns (dataloader, dataset)
    train_loader, dataset = create_dataloader(train_path, imgsz, batch_size, stride,
                                              hyp=hyp, augment=True, cache=opt.cache)
    val_loader = create_dataloader(val_path, imgsz, batch_size, stride,
                                   hyp=hyp, augment=False, cache=opt.cache)[0]

    # ==================== 5. Loss function ====================
    compute_loss = ComputeLoss(model)

    # ==================== 6. Training loop ====================
    for epoch in range(epochs):
        model.train()
        pbar = tqdm(enumerate(train_loader), total=len(train_loader))
        for i, (imgs, targets, paths, _) in pbar:
            # Move the data to the GPU
            imgs = imgs.to(device).float() / 255.0  # normalize to 0-1
            targets = targets.to(device)

            # === Forward pass ===
            pred = model(imgs)
            loss, loss_items = compute_loss(pred, targets)

            # === Backward pass ===
            optimizer.zero_grad()  # clear the gradients
            loss.backward()        # backpropagate
            optimizer.step()       # update the parameters

            # === Logging ===
            pbar.set_description(
                f'Epoch {epoch}/{epochs} '
                f'loss: {loss.item():.4f} '
                f'box: {loss_items[0]:.4f} '
                f'obj: {loss_items[1]:.4f} '
                f'cls: {loss_items[2]:.4f}'
            )

        # ==================== 7. Validation ====================
        if epoch % opt.eval_interval == 0:
            results, maps = validate(model, val_loader, device, compute_loss)

            # Save the best model
            if maps > best_fitness:
                best_fitness = maps
                torch.save({
                    'epoch': epoch,
                    'model': model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                }, 'best.pt')

        # ==================== 8. Learning-rate update ====================
        scheduler.step()
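The `lf` lambda in the skeleton above is a linear decay from `lr0` down to `lr0 * lrf`. The sketch below evaluates that expression at a few epochs and, for comparison, a standard cosine form with the same endpoints. The function names are chosen for this example, and the cosine variant is shown as the textbook formula, an assumption rather than a quote from YOLOv5.

```python
import math


def linear_lr_factor(epoch, epochs, lrf):
    """Same expression as the lf lambda: decays linearly from 1.0 to lrf."""
    return (1 - epoch / epochs) * (1.0 - lrf) + lrf


def cosine_lr_factor(epoch, epochs, lrf):
    """Cosine decay from 1.0 to lrf over `epochs` epochs."""
    return ((1 - math.cos(epoch * math.pi / epochs)) / 2) * (lrf - 1) + 1


lr0, lrf, epochs = 0.01, 0.1, 100
for epoch in (0, 50, 100):
    linear = lr0 * linear_lr_factor(epoch, epochs, lrf)
    cosine = lr0 * cosine_lr_factor(epoch, epochs, lrf)
    print(f'epoch {epoch:3d}: linear lr = {linear:.5f}, cosine lr = {cosine:.5f}')
```

Both schedules start at 0.01 and end at 0.001 with these settings; they only differ in how quickly the rate falls in between.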

6.5 Key training techniques

1. Warmup:

# Use a smaller learning rate for the first few epochs
if epoch < warmup_epochs:
    xi = [0, warmup_epochs]
    for j, x in enumerate(optimizer.param_groups):
        # Bias group (j == 2) starts at warmup_bias_lr, the other groups start at 0.0
        x['lr'] = np.interp(epoch, xi, [warmup_bias_lr if j == 2 else 0.0, x['initial_lr'] * lf(epoch)])

2. EMA (exponential moving average):

# Keep a moving average of the model parameters for more stable results
ema = ModelEMA(model)
for epoch in range(epochs):
    # training ...
    ema.update(model)  # update the EMA model

3. Automatic anchors:

# Compute the best anchors for the dataset automatically
from utils.autoanchor import check_anchors
check_anchors(dataset, model, thr=4.0, imgsz=640)

4. Mixed-precision training:

# Speed up training with FP16
scaler = torch.cuda.amp.GradScaler()
with torch.cuda.amp.autocast():
    pred = model(imgs)
    loss, loss_items = compute_loss(pred, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

7. Inference

7.1 Inference workflow

def detect(model, source, device, conf_thres=0.25, iou_thres=0.45):
    """YOLOv5 inference function.

    model: trained model
    source: path to the input image (this sketch reads a single image)
    conf_thres: confidence threshold
    iou_thres: IoU threshold for NMS

    names (class names) and colors (one color per class) are assumed to be
    defined by the caller; letterbox, non_max_suppression, scale_boxes and
    plot_one_box are assumed to be imported.
    """
    # ==================== 1. Load the model ====================
    model.eval()
    model.to(device)

    # ==================== 2. Load the image ====================
    img0 = cv2.imread(source)  # BGR format

    # Preprocessing
    img = letterbox(img0, new_shape=640)[0]  # resize while keeping the aspect ratio
    img = img.transpose((2, 0, 1))[::-1]     # HWC to CHW, BGR to RGB
    img = np.ascontiguousarray(img)          # contiguous memory
    img = torch.from_numpy(img).to(device)
    img = img.float() / 255.0                # normalize
    if img.ndimension() == 3:
        img = img.unsqueeze(0)               # add the batch dimension

    # ==================== 3. Inference ====================
    with torch.no_grad():
        pred = model(img)[0]  # forward pass
    # pred shape: (1, 25200, 85)
    # 25200 = 80×80×3 + 40×40×3 + 20×20×3
    # 85 = 4 (coordinates) + 1 (confidence) + 80 (classes)

    # ==================== 4. NMS ====================
    pred = non_max_suppression(
        pred,
        conf_thres=conf_thres,  # confidence threshold
        iou_thres=iou_thres,    # IoU threshold
        max_det=300,            # maximum number of detections
    )

    # ==================== 5. Post-processing ====================
    for i, det in enumerate(pred):  # loop over the images
        if len(det):
            # Map the coordinates from 640×640 back to the original image size
            det[:, :4] = scale_boxes(img.shape[2:], det[:, :4], img0.shape).round()

            # Draw the results
            for *xyxy, conf, cls in reversed(det):
                label = f'{names[int(cls)]} {conf:.2f}'
                plot_one_box(xyxy, img0, label=label, color=colors[int(cls)])

    # ==================== 6. Save the result ====================
    cv2.imwrite('result.jpg', img0)
    return img0
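The resize and coordinate mapping steps are pure geometry. The sketch below assumes the simplest letterbox variant, which pads the image to a square of `new_size`; the real `letterbox` and `scale_boxes` helpers offer more options, and the names `letterbox_params` and `to_original` are hypothetical.

```python
def letterbox_params(orig_w, orig_h, new_size=640):
    """Scale ratio and per-side padding that fit an image into a new_size square
    while keeping its aspect ratio."""
    r = min(new_size / orig_w, new_size / orig_h)
    new_w, new_h = round(orig_w * r), round(orig_h * r)
    pad_x, pad_y = (new_size - new_w) / 2, (new_size - new_h) / 2
    return r, pad_x, pad_y


def to_original(box, r, pad_x, pad_y):
    """Map an (x_min, y_min, x_max, y_max) box from letterboxed coordinates
    back to the original image: remove the padding, then undo the scaling."""
    x1, y1, x2, y2 = box
    return ((x1 - pad_x) / r, (y1 - pad_y) / r, (x2 - pad_x) / r, (y2 - pad_y) / r)


r, pad_x, pad_y = letterbox_params(1280, 720)
print(r, pad_x, pad_y)  # 0.5 0.0 140.0
print(to_original((100, 200, 300, 400), r, pad_x, pad_y))
# (200.0, 120.0, 600.0, 520.0)
```

A 1280×720 frame is halved to 640×360 and padded with 140 pixels at the top and bottom, so detections must have that padding subtracted before being scaled back up.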

7.2 NMS implementation in detail

import torch
import torchvision


def non_max_suppression(prediction, conf_thres=0.25, iou_thres=0.45, classes=None, max_det=300):
    """Non-maximum suppression.

    prediction: (batch_size, num_boxes, 85) model output
    conf_thres: confidence threshold
    iou_thres: IoU threshold for NMS
    classes: keep only these classes (None keeps all classes)
    max_det: maximum number of detections per image

    Returns: a list with one (n, 6) tensor per image: [x1, y1, x2, y2, conf, cls]
    xywh2xyxy is assumed to be imported.
    """
    # ==================== 1. Filtering ====================
    xc = prediction[..., 4] > conf_thres  # candidates: objectness above the threshold

    # Settings
    min_wh, max_wh = 2, 7680  # minimum / maximum box width and height (pixels)
    max_nms = 30000           # maximum number of boxes passed to NMS
    time_limit = 10.0         # timeout in seconds (not used in this simplified version)

    output = [torch.zeros((0, 6), device=prediction.device)] * prediction.shape[0]

    # ==================== 2. Loop over the images ====================
    for xi, x in enumerate(prediction):
        x = x[xc[xi]]  # keep the candidates
        if not x.shape[0]:
            continue

        # === Final confidence ===
        x[:, 5:] *= x[:, 4:5]  # conf = obj_conf * cls_conf

        # === Coordinate conversion ===
        box = xywh2xyxy(x[:, :4])  # (center_x, center_y, w, h) → (x1, y1, x2, y2)

        # === Best class only (one label per box) ===
        conf, j = x[:, 5:].max(1, keepdim=True)  # highest class confidence and its index
        x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_thres]

        # === Class filter ===
        if classes is not None:
            x = x[(x[:, 5:6] == torch.tensor(classes, device=x.device)).any(1)]

        # === Limit the number of boxes ===
        n = x.shape[0]
        if not n:
            continue
        elif n > max_nms:
            x = x[x[:, 4].argsort(descending=True)[:max_nms]]

        # ==================== 3. NMS ====================
        c = x[:, 5:6] * max_wh  # per-class offset
        boxes, scores = x[:, :4] + c, x[:, 4]  # offset boxes so only same-class boxes suppress each other
        i = torchvision.ops.nms(boxes, scores, iou_thres)
        if i.shape[0] > max_det:
            i = i[:max_det]

        output[xi] = x[i]

    return output
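The per-class offset in step 3 deserves a closer look: shifting every box by `class_id * max_wh` moves boxes of different classes so far apart that they can never overlap, which lets a single class-agnostic NMS call behave like per-class NMS. The plain-Python sketch below illustrates the idea with hypothetical helpers `iou` and `offset_by_class`.

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def offset_by_class(box, cls, max_wh=7680):
    """Shift all four coordinates by cls * max_wh, as done before the NMS call."""
    shift = cls * max_wh
    return tuple(v + shift for v in box)


box_a, box_b = (10, 10, 50, 50), (12, 12, 52, 52)

# Same class: the boxes still overlap heavily, so NMS would suppress one of them
print(iou(offset_by_class(box_a, 0), offset_by_class(box_b, 0)) > 0.45)  # True

# Different classes: the offset separates them completely, so both survive
print(iou(offset_by_class(box_a, 0), offset_by_class(box_b, 1)))  # 0.0
```

This works as long as no box is larger than `max_wh` pixels, which is why that constant is set well above typical image sizes.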

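How NMS works:

1. Sort by confidence: [0.9, 0.8, 0.7, 0.6, ...]
2. Select the highest-scoring box A (0.9)
3. Compute the IoU between A and the other boxes
4. Remove the boxes with IoU > threshold
5. Repeat steps 2-4

Illustration:

Initial:  [A:0.9]  [B:0.8]  [C:0.7]  [D:0.6]
   ↓
Pick A:   [A:✓]    [B:?]    [C:?]    [D:?]
   ↓
IoU(A,B)=0.6 > 0.45 → remove B
IoU(A,C)=0.2 < 0.45 → keep C
IoU(A,D)=0.7 > 0.45 → remove D
   ↓
Result:   [A:✓]             [C:✓]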

7.3 Visualizing the results

import random

import cv2


def plot_one_box(xyxy, img, color=None, label=None, line_thickness=3):
    """Draw one bounding box on an image.

    xyxy: box coordinates (x1, y1, x2, y2)
    img: image (numpy array)
    color: box color (B, G, R)
    label: label text
    """
    tl = line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1
    color = color or [random.randint(0, 255) for _ in range(3)]
    c1, c2 = (int(xyxy[0]), int(xyxy[1])), (int(xyxy[2]), int(xyxy[3]))

    # Draw the rectangle
    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)

    # Draw the label
    if label:
        tf = max(tl - 1, 1)  # font thickness
        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]
        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3
        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled background
        cv2.putText(img, label, (c1[0], c1[1] - 2), 0, tl / 3,
                    [225, 255, 255], thickness=tf, lineType=cv2.LINE_AA)

8. Worked Code Examples

8.1 Complete training example

"""
train.py - YOLOv5 training script
"""

import argparse
import torch
from pathlib import Path
from models.yolo import Model
from utils.loss import ComputeLoss
from utils.dataloaders import create_dataloader


def train(opt):
    # === Configuration ===
    epochs = opt.epochs
    batch_size = opt.batch_size
    img_size = opt.img_size
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # === Load the model ===
    model = Model(opt.cfg, ch=3, nc=opt.nc).to(device)
    print(f'Model: {opt.cfg}')
    print(f'Classes: {opt.nc}')

    # === Optimizer ===
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.01,
        momentum=0.937,
        weight_decay=0.0005,
    )

    # === Learning-rate schedule ===
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    # === Data loading ===
    # create_dataloader returns (dataloader, dataset); stride=32 is the largest model stride
    train_loader = create_dataloader(
        path=opt.data / 'train',
        imgsz=img_size,
        batch_size=batch_size,
        stride=32,
        augment=True,
    )[0]
    val_loader = create_dataloader(
        path=opt.data / 'val',
        imgsz=img_size,
        batch_size=batch_size,
        stride=32,
        augment=False,
    )[0]

    # === Loss function ===
    compute_loss = ComputeLoss(model)

    # === Training loop ===
    best_fitness = 0.0
    for epoch in range(epochs):
        model.train()

        # Train for one epoch
        for batch_i, (imgs, targets, paths, _) in enumerate(train_loader):
            imgs = imgs.to(device).float() / 255.0
            targets = targets.to(device)

            # Forward
            pred = model(imgs)
            loss, loss_items = compute_loss(pred, targets)

            # Backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Logging
            if batch_i % 10 == 0:
                print(f'Epoch {epoch}/{epochs} '
                      f'Batch {batch_i}/{len(train_loader)} '
                      f'Loss {loss.item():.4f}')

        # Validation (validate is assumed to be defined elsewhere and to return a scalar fitness)
        if epoch % 5 == 0:
            fitness = validate(model, val_loader, device)
            if fitness > best_fitness:
                best_fitness = fitness
                torch.save(model.state_dict(), 'best.pt')
                print(f'Saved best model with fitness {fitness:.4f}')

        scheduler.step()


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--cfg', type=str, default='yolov5s.yaml')
    parser.add_argument('--data', type=Path, default='data/coco')
    parser.add_argument('--nc', type=int, default=80)
    parser.add_argument('--epochs', type=int, default=300)
    parser.add_argument('--batch-size', type=int, default=16)
    parser.add_argument('--img-size', type=int, default=640)
    opt = parser.parse_args()
    train(opt)

8.2 Complete inference example

"""
detect.py - YOLOv5 inference script
"""

import argparse
import cv2
import numpy as np
import torch
from models.experimental import attempt_load
from utils.augmentations import letterbox
from utils.general import non_max_suppression, scale_boxes
# plot_one_box is the helper from section 7.3; whether utils.plots still
# exports it depends on the YOLOv5 release you are using
from utils.plots import plot_one_box


def detect(opt):
    # === Configuration ===
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # === Load the model ===
    model = attempt_load(opt.weights, device=device)
    model.eval()
    stride = int(model.stride.max())
    names = model.names

    # === Load the image ===
    img0 = cv2.imread(opt.source)
    assert img0 is not None, f'Image Not Found {opt.source}'

    # Preprocessing
    img = letterbox(img0, new_shape=opt.img_size, stride=stride)[0]
    img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
    img = np.ascontiguousarray(img)
    img = torch.from_numpy(img).to(device)
    img = img.float() / 255.0
    if img.ndimension() == 3:
        img = img.unsqueeze(0)

    # === Inference ===
    with torch.no_grad():
        pred = model(img)[0]

    # === NMS ===
    pred = non_max_suppression(
        pred,
        conf_thres=opt.conf_thres,
        iou_thres=opt.iou_thres,
    )

    # === Process the detections ===
    for i, det in enumerate(pred):
        if len(det):
            # Map the coordinates back to the original image
            det[:, :4] = scale_boxes(img.shape[2:], det[:, :4], img0.shape).round()

            for *xyxy, conf, cls in reversed(det):
                label = f'{names[int(cls)]} {conf:.2f}'
                # Print the result
                print(f'Detected: {label} at {xyxy}')
                # Draw the box
                plot_one_box(xyxy, img0, label=label)

    # === Save the result ===
    cv2.imwrite(opt.output, img0)
    print(f'Results saved to {opt.output}')


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', type=str, default='yolov5s.pt')
    parser.add_argument('--source', type=str, default='data/images/bus.jpg')
    parser.add_argument('--output', type=str, default='result.jpg')
    parser.add_argument('--img-size', type=int, default=640)
    parser.add_argument('--conf-thres', type=float, default=0.25)
    parser.add_argument('--iou-thres', type=float, default=0.45)
    opt = parser.parse_args()
    detect(opt)

8.3 Training on a custom dataset

1. Prepare the dataset:

dataset/
├── images/
│   ├── train/
│   │   ├── img1.jpg
│   │   └── img2.jpg
│   └── val/
│       ├── img3.jpg
│       └── img4.jpg
└── labels/
    ├── train/
    │   ├── img1.txt
    │   └── img2.txt
    └── val/
        ├── img3.txt
        └── img4.txt

2. Create the data configuration file (data.yaml):

# Dataset paths
path: ../dataset  # dataset root directory
train: images/train  # training images
val: images/val      # validation images

# Classes
nc: 3  # number of classes
names: ['cat', 'dog', 'bird']  # class names

3. Create the model configuration file (custom.yaml):

# YOLOv5 custom model

nc: 3  # number of classes
depth_multiple: 0.33  # depth multiplier
width_multiple: 0.50  # width multiplier

anchors:
  - [10,13, 16,30, 33,23]
  - [30,61, 62,45, 59,119]
  - [116,90, 156,198, 373,326]

backbone:
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],    # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],    # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],    # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],   # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],      # 9
  ]

head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],
   [-1, 3, C3, [512, False]],

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]

4. Training command:

python train.py --data data.yaml --cfg custom.yaml --weights yolov5s.pt --epochs 100

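Summary

This document covered the main aspects of YOLOv5:

Key points

Network architecture:
- Backbone: CSPDarknet53, responsible for feature extraction
- Neck: PANet, responsible for multi-scale feature fusion
- Head: Detect, responsible for predicting bounding boxes and classes
Key modules:
- Focus: reduces computation while preserving information
- C3: CSP Bottleneck, improves efficiency
- SPPF: multi-scale feature fusion with a larger receptive field
- PANet: top-down and bottom-up feature fusion
Training strategy:
- Data augmentation (Mosaic, MixUp, and others)
- Several loss functions (CIoU, BCE)
- Learning-rate warmup and cosine annealing
- EMA and mixed-precision training
Inference optimization:
- Non-maximum suppression (NMS)
- Multi-scale prediction
- Post-processing and visualization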