SSD Object Detection Model: Explanation and Implementation

news 2025/11/7 10:27:05

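- 0. Introduction
- 1. SSD loss functions
- 2. SSD model architecture
- 3. Implementing the model architecture in Keras
- 4. Training the SSD model
- 5. Non-maximum suppression
- 6. Validating the SSD model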

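0. Introduction

Having covered the basic principles of object detection, this section focuses on real-time object detection, specifically the principles and Keras implementation of SSD (single-shot detection). Compared with other deep learning detection algorithms, SSD reaches real-time detection speed on modern GPUs without a noticeable drop in performance, and it is easy to train end to end.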

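1. SSD loss functions

SSD uses thousands of anchor boxes. As described in the section on the basic principles of object detection, the goal is to predict the class and the offsets of every anchor box. The following loss functions can be used for each prediction:

$\mathcal L_{cls}$: cross-entropy loss for the class prediction $y_{cls}$
$\mathcal L_{off}$: L1 or L2 loss for the offsets $y_{off}$ (only positive anchor boxes contribute to $\mathcal L_{off}$; the L1 loss is the mean absolute error and the L2 loss is the mean squared error)

The total loss is:
$\mathcal L = \mathcal L_{off} + \mathcal L_{cls}$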
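For every anchor box, the network predicts the following outputs:

$y_{cls}$: the class, as a one-hot vector
$y_{off}$: the offsets in pixel coordinates, $((x_{omin}, y_{omin}), (x_{omax}, y_{omax}))$, relative to the anchor box

For convenience in computation, the offsets are better expressed as:
$y_{off} = ((x_{omin}, y_{omin}), (x_{omax}, y_{omax}))$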
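SSD is a supervised object detection algorithm, so training requires the following ground-truth data:

$y_{label}$: the class label of the object to detect
$y_{gt}$: the ground-truth offsets, computed as:
$y_{gt} = (x_{bmin} - x_{amin}, x_{bmax} - x_{amax}, y_{bmin} - y_{amin}, y_{bmax} - y_{amax})$

In other words, the ground-truth offsets measure how far the object's bounding box is from the anchor box. For clarity, the subscripts of $y_{box}$ have been slightly adjusted here. As described in the section on object detection, these ground-truth values are computed by the get_gt_data() function.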
However, SSD recommends against predicting the raw pixel errors $y_{off}$ directly and uses normalized offsets instead. First, the coordinates of the ground-truth bounding box and of the anchor box are converted to centroid-dimension format:
$y_{box} = ((x_{bmin}, y_{bmin}), (x_{bmax}, y_{bmax})) \rightarrow (c_{bx}, c_{by}, w_b, h_b)\\ y_{anchor} = ((x_{amin}, y_{amin}), (x_{amax}, y_{amax})) \rightarrow (c_{ax}, c_{ay}, w_a, h_a)$
where:
$(c_{bx}, c_{by}) = (x_{min} + \frac{x_{max}-x_{min}}{2}, y_{min} + \frac{y_{max}-y_{min}}{2})$
are the coordinates of the bounding box center, and
$(w_b, h_b) = (x_{max} - x_{min}, y_{max} - y_{min})$
are its width and height. The anchor box follows the same conversion. The normalized ground-truth offsets are then:
$y_{gt} = (\frac{c_{bx} - c_{ax}}{w_a}, \frac{c_{by} - c_{ay}}{h_a}, \log\frac{w_b}{w_a}, \log\frac{h_b}{h_a})$
In general, the elements of $y_{gt}$ are small ($||y_{gt}|| \ll 1.0$), and the resulting small gradients can make it harder for training to converge.
To alleviate this, each element can be divided by its estimated standard deviation. The resulting ground-truth offsets are:

$y_{gt} = (\frac{\frac{c_{bx} - c_{ax}}{w_a}}{\sigma_x}, \frac{\frac{c_{by} - c_{ay}}{h_a}}{\sigma_y}, \frac{\log\frac{w_b}{w_a}}{\sigma_w}, \frac{\log\frac{h_b}{h_a}}{\sigma_h})$
The recommended values are $\sigma_x = \sigma_y = 0.1$ and $\sigma_w = \sigma_h = 0.2$. In other words, the expected range of pixel error along the x and y axes is ±10%, while for the width and height it is ±20%. These values are somewhat arbitrary.
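The normalization can be checked numerically. The following is a minimal NumPy sketch, not the article's get_gt_data(); the helper names `to_centroid` and `normalized_offsets` and the coordinate order (xmin, xmax, ymin, ymax) are assumptions made for illustration.

```python
import numpy as np

def to_centroid(box):
    """Convert (xmin, xmax, ymin, ymax) to centroid-dimension (cx, cy, w, h).

    The coordinate order is an assumption of this sketch.
    """
    xmin, xmax, ymin, ymax = box
    w = xmax - xmin
    h = ymax - ymin
    return xmin + 0.5 * w, ymin + 0.5 * h, w, h

def normalized_offsets(box, anchor, sigmas=(0.1, 0.1, 0.2, 0.2)):
    """Normalized ground-truth offsets of `box` relative to `anchor`."""
    cbx, cby, wb, hb = to_centroid(box)
    cax, cay, wa, ha = to_centroid(anchor)
    raw = np.array([
        (cbx - cax) / wa,   # center shift along x, in anchor widths
        (cby - cay) / ha,   # center shift along y, in anchor heights
        np.log(wb / wa),    # log width ratio
        np.log(hb / ha),    # log height ratio
    ])
    # divide each element by its estimated standard deviation
    return raw / np.array(sigmas)

anchor = (0.0, 10.0, 0.0, 10.0)
box = (1.0, 11.0, 0.0, 10.0)  # same size as the anchor, shifted 1 pixel right
print(normalized_offsets(box, anchor))
```

A shift of 10% of the anchor width maps to a normalized offset of about 1.0, which is the effect of dividing by $\sigma_x = 0.1$.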

    from tensorflow import keras  # assumed import; not shown in the original snippet

    def mask_offset(y_true, y_pred):
        # 1st 4 are offsets
        offset = y_true[..., 0:4]
        # last 4 are mask
        mask = y_true[..., 4:8]
        # pred is actually duplicated for alignment
        # either we get the 1st or last 4 offset pred
        # and apply the mask
        pred = y_pred[..., 0:4]
        offset *= mask
        pred *= mask
        return offset, pred

    def l1_loss(y_true, y_pred):
        offset, pred = mask_offset(y_true, y_pred)
        # we can use l1
        return keras.backend.mean(keras.backend.abs(pred - offset), axis=-1)

    def smooth_l1_loss(y_true, y_pred):
        offset, pred = mask_offset(y_true, y_pred)
        # huber loss as approx of smooth l1
        return keras.losses.Huber()(offset, pred)

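In addition, for the offset prediction loss, SSD follows Fast-RCNN and uses the smooth L1 loss instead of the plain L1 loss:
$\mathcal L_{off} = L1_{smooth}(u) = \begin{cases} \frac{(\sigma u)^2}{2} & \text{if } |u| < \frac{1}{\sigma^2} \\ |u| - \frac{1}{2\sigma^2} & \text{otherwise} \end{cases}$
where $u$ is an element of the error between the ground truth and the prediction:
$u = y_{gt} - y_{pred}$
The smooth L1 loss is more robust than the L1 loss and less sensitive to outliers. SSD sets $\sigma = 1$. As $\sigma \rightarrow \infty$, the smooth L1 loss approaches the L1 loss. The mask_offset() function ensures that the offset loss is computed only for predictions that have a ground-truth bounding box. With $\sigma = 1$, the smooth L1 function is equivalent to the Huber loss.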
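To make the role of $\sigma$ concrete, here is a small NumPy sketch of the element-wise smooth L1 formula. It is an illustration only (the article's Keras code uses `keras.losses.Huber` instead), and the function name `smooth_l1` is an assumption.

```python
import numpy as np

def smooth_l1(u, sigma=1.0):
    """Element-wise smooth L1 loss of the error u for a given sigma."""
    u = np.asarray(u, dtype=float)
    abs_u = np.abs(u)
    quadratic = 0.5 * (sigma * u) ** 2      # used when |u| < 1 / sigma^2
    linear = abs_u - 0.5 / sigma ** 2       # used otherwise
    return np.where(abs_u < 1.0 / sigma ** 2, quadratic, linear)

errors = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(smooth_l1(errors))               # quadratic near zero, linear in the tails
print(smooth_l1(errors, sigma=100.0))  # very close to |u|
```

With $\sigma = 1$ the loss is quadratic for small errors and linear for large ones; with a large $\sigma$ the quadratic region shrinks and the values approach $|u|$.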
As a further improvement to the loss function, RetinaNet recommends replacing the cross-entropy loss CE for class prediction with the focal loss FL:
$\mathcal L_{cls} = CE = -\sum_i y_i \log p_i\\ \mathcal L_{cls} = FL = -\alpha \sum_i y_i (1-p_i)^\gamma \log p_i$
The key difference is the extra factor $\alpha(1-p_i)^\gamma$. The RetinaNet experiments showed that object detection works best with $\gamma = 2$ and $\alpha = 0.25$.

    def focal_loss_categorical(y_true, y_pred):
        gamma = 2.0
        alpha = 0.25
        # scale to ensure sum of prob is 1.0
        y_pred /= keras.backend.sum(y_pred, axis=-1, keepdims=True)
        # clip the prediction value to prevent NaN and Inf
        epsilon = keras.backend.epsilon()
        y_pred = keras.backend.clip(y_pred, epsilon, 1. - epsilon)
        # calculate cross entropy
        cross_entropy = -y_true * keras.backend.log(y_pred)
        # calculate focal loss
        weight = alpha * keras.backend.pow(1 - y_pred, gamma)
        cross_entropy *= weight
        return keras.backend.sum(cross_entropy, axis=-1)

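The motivation behind focal loss is that, when we examine an image, most anchor boxes should be classified as background (negative anchor boxes), and only a few positive anchor boxes are good candidates for representing the target object. The negative anchor boxes are the main contributors to the cross-entropy loss, so during optimization the contribution of the positive anchor boxes is overwhelmed by that of the negative ones. This phenomenon is known as class imbalance, where one or a few classes dominate in number. With focal loss, we are confident early in the optimization that the negative anchor boxes belong to the background. Therefore, as $p_i \rightarrow 1.0$, the term $(1-p_i)^\gamma$ reduces the loss contribution of the negative anchor boxes. For positive anchor boxes, $p_i$ is still far from $1.0$, so their loss contribution remains significant.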
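The down-weighting can be seen with a few numbers. The sketch below evaluates the cross-entropy and focal loss of the true class for different predicted probabilities, using $\gamma = 2$ and $\alpha = 0.25$; the function names are illustrative and not part of the article's code.

```python
import numpy as np

def cross_entropy(p):
    """Cross-entropy of the true class predicted with probability p."""
    return -np.log(p)

def focal_loss(p, gamma=2.0, alpha=0.25):
    """Focal loss of the true class predicted with probability p."""
    return alpha * (1.0 - p) ** gamma * cross_entropy(p)

for p in (0.9, 0.5, 0.1):
    ce, fl = cross_entropy(p), focal_loss(p)
    # FL / CE equals the modulating factor alpha * (1 - p)^gamma
    print(f"p={p:.1f}  CE={ce:.4f}  FL={fl:.6f}  FL/CE={fl / ce:.4f}")
```

A confidently classified example (p = 0.9) keeps only 0.25% of its cross-entropy loss, while a poorly classified one (p = 0.1) keeps about 20%, so easy background anchors contribute far less to the total.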

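2. SSD model architecture

The SSD network takes an RGB image as input and outputs predictions at several levels. The base network (or backbone) extracts features for the downstream class and offset prediction tasks. After the backbone, object detection is carried out by the rest of the network, called the SSD head.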

[Figure: SSD]

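The backbone can be a pre-trained network (for example, a model trained for ImageNet classification) with frozen weights, or it can be trained jointly with the object detection task. Using a pre-trained base network reuses the feature-extraction filters learned from a large dataset, and because the backbone parameters are frozen, training is faster: only the top layers for object detection need to be trained. In this section, the backbone is trained jointly with the object detection task.
The backbone typically downsamples several times using convolutions with stride 2 or max pooling. Taking ResNet50 as an example, there are 4 rounds of downsampling, so the final feature map size becomes $(\frac{w}{16}, \frac{h}{16})$. For example, a 640×480 image yields a feature map with 40×30 = 1,200 cells. This number is multiplied by the number of sizes per anchor box: there are 6 sizes due to the different aspect ratios, plus one additional size for the 1:1 aspect ratio.
This section restricts the aspect ratios to $\alpha_{i \in \{0,1,3\}} = 1, 2, \frac{1}{2}$, so there are only 4 different sizes. For a 640×480 image, the first set of anchor boxes contains $n_1 = 4800$ boxes.
The first set of predictions is very large and covers many image patches. Every anchor box needs a class prediction and an offset prediction, for a total of $n_1$ class predictions and $n_1$ offset predictions. The dimension of the one-hot class prediction equals the number of object classes to detect plus 1 (for the background), and each offset prediction has dimension 4, corresponding to the $(x, y)$ offsets of the two corners of the predicted bounding box.
The class predictor consists of a convolutional layer followed by a softmax activation, which is used to compute the cross-entropy loss. The offset predictor is a separate convolutional layer with a linear activation.
Additional feature extraction blocks can be attached after the backbone, each of the form Conv2D-BN-ELU. After each feature extraction block, the feature map size is halved and the number of filters is doubled. For example, the first feature extraction block after the backbone produces a 20×15×(2·n_filters) feature map, from which $n_2 = 20 \times 15 \times 4 = 1200$ class predictions and offset predictions are generated.
The network can be extended further by adding more feature extraction blocks with class and offset predictors. For a 640×480 image, this can go as far as a 2×1×(2^5·n_filters) feature map, giving $n_6 = 2 \times 1 \times 4 = 8$ class and offset predictions. This corresponds to 6 layers of feature extraction and prediction blocks. After 6 blocks, the total number of anchor box predictions for a 640×480 image is 6,432 with 4 anchor boxes per cell (4800 + 1200 + 320 + 80 + 24 + 8); the frequently quoted figure of 9,648 corresponds to 6 anchor boxes per cell.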
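The anchor counts can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions: the first prediction layer sits at 1/16 of the input resolution, and each later layer halves the grid with rounding up. The helper names `feature_map_sizes` and `total_anchors` are illustrative.

```python
import math

def feature_map_sizes(width, height, n_layers, first_stride=16):
    """Grid size (w, h) of each prediction layer, halving (rounded up) per layer."""
    w, h = math.ceil(width / first_stride), math.ceil(height / first_stride)
    sizes = []
    for _ in range(n_layers):
        sizes.append((w, h))
        w, h = math.ceil(w / 2), math.ceil(h / 2)
    return sizes

def total_anchors(sizes, anchors_per_cell):
    """Total number of anchor boxes over all prediction layers."""
    return sum(w * h for w, h in sizes) * anchors_per_cell

sizes = feature_map_sizes(640, 480, n_layers=6)
print(sizes)                    # grid per layer, from 40x30 down to 2x1
print(total_anchors(sizes, 4))  # 4 anchor boxes per cell
print(total_anchors(sizes, 6))  # 6 anchor boxes per cell
```

The six grids contain 1,608 cells in total, which gives 6,432 anchor boxes at 4 per cell and 9,648 at 6 per cell.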
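For the sake of discussion, the anchor box scaling factors are listed in decreasing order:
$[(\frac{1}{2},1), (\frac{1}{3},\frac{1}{2}), (\frac{1}{5},\frac{1}{4}), (\frac{1}{10},\frac{1}{8}), (\frac{1}{20},\frac{1}{15}), (\frac{1}{40},\frac{1}{30})]$
Note, however, that the scaling factors actually start from the size of the feature map output by the backbone. In practice, they should be listed in increasing order:
$[(\frac{1}{40},\frac{1}{30}), (\frac{1}{20},\frac{1}{15}), (\frac{1}{10},\frac{1}{8}), (\frac{1}{5},\frac{1}{4}), (\frac{1}{3},\frac{1}{2}), (\frac{1}{2},1)]$
This means that if the number of feature extraction blocks is reduced to 4, the scaling factors are:
$[(\frac{1}{40},\frac{1}{30}), (\frac{1}{20},\frac{1}{15}), (\frac{1}{10},\frac{1}{8}), (\frac{1}{5},\frac{1}{4})]$
When the width or height of a feature map is not divisible by 2, the ceiling function is applied. In the original SSD implementation, however, the scaling factors are simplified to a linear scale over the range [0.2, 0.9], determined by the number of scaling factors (that is, the number of feature extraction layers, n_layers):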

s = np.linspace(0.2, 0.9, n_layers + 1)
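For a concrete feel of these values, the sketch below evaluates the same `np.linspace` call for `n_layers = 4`. Pairing consecutive values into a per-layer size range is an assumption made here for illustration; the source does not show how `s` is consumed.

```python
import numpy as np

n_layers = 4
s = np.linspace(0.2, 0.9, n_layers + 1)
print(s)  # n_layers + 1 evenly spaced values from 0.2 to 0.9

# Hypothetical pairing: layer i works with the range (s[i], s[i + 1]).
pairs = list(zip(s[:-1], s[1:]))
for i, (lo, hi) in enumerate(pairs):
    print(f"layer {i}: {lo:.3f} to {hi:.3f}")
```

With 4 layers the step between consecutive scaling factors is 0.175.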

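3. Implementing the model architecture in Keras

The code in this section is meant to illustrate the core concepts of multi-scale object detection, and some parts of the implementation could be optimized further, for example by caching the ground-truth anchor box classes, offsets, and masks. Here, the ground-truth values are computed by a separate thread every time an image is loaded from the file system.
The SSD object builds, trains, and evaluates the SSD model. The model uses ResNet as its backbone, and the dataset used by SSD consists of thousands of high-resolution images. A multi-threaded data generator loads these images from the file system and queues them, while also computing the ground-truth labels of the anchor boxes. Without a multi-threaded data generator, image loading, queueing, and ground-truth computation would make training extremely slow.
There are also many important supporting routines. They create anchor boxes, compute IoU, build ground-truth labels, perform non-maximum suppression, draw labels and bounding boxes, display detections on video frames, provide the loss functions, and so on.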

    class SSD:
        def __init__(self, args):
            self.args = args
            self.ssd = None
            self.train_generator = None
            self.build_model()

        def build_model(self):
            # store in a dictionary the list of image files and labels
            self.build_dictionary()
            # input shape is (480, 640, 3) by default
            self.input_shape = (self.args.height,
                                self.args.width,
                                self.args.channels)
            # build the backbone network (e.g. ResNet50)
            # the number of feature layers is equal to n_layers
            # feature layers are inputs to ssd network heads
            # for class and offsets predictions
            self.backbone = self.args.backbone(self.input_shape,
                                               n_layers=self.args.layers)
            # using the backbone, build ssd network
            # outputs of ssd are class and offsets predictions
            anchors, features, ssd = build_ssd(self.input_shape,
                                               self.backbone,
                                               n_layers=self.args.layers,
                                               n_classes=self.n_classes)
            # n_anchors = num of anchors per feature point (e.g. 4)
            self.n_anchors = anchors
            # feature_shapes is a list of feature map shapes
            # per output layer - used for computing anchor boxes sizes
            self.feature_shapes = features
            # ssd network model
            self.ssd = ssd

        def build_dictionary(self):
            # train dataset path
            path = os.path.join(self.args.data_path,
                                self.args.train_labels)
            # build dictionary
            # key=image filename, value=box coords + class label
            # self.classes is a list of class labels
            self.dictionary, self.classes = build_label_dictionary(path)
            self.n_classes = len(self.classes)
            self.keys = np.array(list(self.dictionary.keys()))

        def build_generator(self):
            self.train_generator = DataGenerator(
                args=self.args,
                dictionary=self.dictionary,
                n_classes=self.n_classes,
                feature_shapes=self.feature_shapes,
                n_anchors=self.n_anchors,
                shuffle=True)

        def train(self):
            # build the train data generator
            if self.train_generator is None:
                self.build_generator()
            # `learning_rate` replaces the deprecated `lr` argument
            optimizer = keras.optimizers.Adam(learning_rate=1e-3)
            # choice of loss function via args
            if self.args.improved_loss:
                print_log("Focal loss and smooth L1", self.args.verbose)
                loss = [focal_loss_categorical, smooth_l1_loss]
            elif self.args.smooth_l1:
                print_log("Smooth L1", self.args.verbose)
                loss = ['categorical_crossentropy', smooth_l1_loss]
            else:
                print_log("Crossentropy and L1", self.args.verbose)
                loss = ['categorical_crossentropy', l1_loss]
            self.ssd.compile(optimizer=optimizer, loss=loss)

            # model weights are saved for future validation
            # prepare model saving directory
            save_dir = os.path.join(os.getcwd(), self.args.save_dir)
            model_name = self.backbone.name
            model_name += '-' + str(self.args.layers) + 'layer'
            if self.args.normalize:
                model_name += '-norm'
            if self.args.improved_loss:
                model_name += '-improved_loss'
            elif self.args.smooth_l1:
                model_name += '-smooth_l1'
            if self.args.threshold < 1.0:
                model_name += '-extra_anchors'
            model_name += '-'
            model_name += self.args.dataset
            model_name += '-{epoch:03d}.h5'

            log = '# of classes %d' % self.n_classes
            print_log(log, self.args.verbose)
            log = 'Batch size: %d' % self.args.batch_size
            print_log(log, self.args.verbose)
            log = 'Weights filename: %s' % model_name
            print_log(log, self.args.verbose)
            if not os.path.isdir(save_dir):
                os.makedirs(save_dir)
            filepath = os.path.join(save_dir, model_name)

            # prepare callbacks for saving model weights
            # and learning rate scheduler
            # learning rate decreases by 50% every 20 epochs
            # after 60th epoch
            checkpoint = keras.callbacks.ModelCheckpoint(
                filepath=filepath,
                verbose=1,
                save_weights_only=True)
            scheduler = keras.callbacks.LearningRateScheduler(lr_scheduler)
            callbacks = [checkpoint, scheduler]
            # train the ssd network
            self.ssd.fit(self.train_generator,
                         use_multiprocessing=True,
                         callbacks=callbacks,
                         epochs=self.args.epochs)

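The SSD model creation function build_ssd() retrieves the n_layers output features from the backbone network by calling base_outputs = backbone(inputs).
The return value base_outputs is the list of output features that serve as inputs to the class and offset prediction layers. For example, the first output, base_outputs[0], is used to generate $n_1$ class predictions and $n_1$ offset predictions.
Inside the loop of build_ssd(), the class predictions are held in the classes variable and the offset predictions in the offsets variable. After the loop, all class predictions are concatenated into a single classes variable with dimensions (batch size, total number of anchors, number of classes).
The offsets variable goes through the same process and ends up with dimensions (batch size, total number of anchors, 4).
Here, batch size is the number of samples in a mini-batch, and total number of anchors is the count of all anchor boxes. The number of loop iterations equals n_layers, which is also the number of anchor box scaling factors needed, that is, the number of feature extraction blocks in the SSD head.
Finally, build_ssd() returns the number of anchor boxes per feature point, the feature map shape at the input of each class and offset prediction layer, and the SSD model itself.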

    def build_ssd(input_shape,
                  backbone,
                  n_layers=4,
                  n_classes=4,
                  aspect_ratios=(1, 2, 0.5)):
        # number of anchor boxes per feature map pt
        n_anchors = len(aspect_ratios) + 1
        inputs = keras.layers.Input(shape=input_shape)
        # no. of base_outputs depends on n_layers
        base_outputs = backbone(inputs)

        outputs = []
        feature_shapes = []
        out_cls = []
        out_off = []

        for i in range(n_layers):
            # each conv layer from backbone is used
            # as feature maps for class and offset predictions
            # also known as multi-scale predictions
            conv = base_outputs if n_layers == 1 else base_outputs[i]
            name = 'cls' + str(i + 1)
            classes = conv2d(conv,
                             n_anchors * n_classes,
                             kernel_size=3,
                             name=name)

            # offsets: (batch, height, width, n_anchors * 4)
            name = 'off' + str(i + 1)
            offsets = conv2d(conv,
                             n_anchors * 4,
                             kernel_size=3,
                             name=name)

            shape = np.array(keras.backend.int_shape(offsets))[1:]
            feature_shapes.append(shape)

            # reshape the class predictions, yielding 3D tensors of
            # shape (batch, height * width * n_anchors, n_classes)
            # last axis to perform softmax on them
            name = 'cls_res' + str(i + 1)
            classes = keras.layers.Reshape((-1, n_classes),
                                           name=name)(classes)

            # reshape the offset predictions, yielding 3D tensors of
            # shape (batch, height * width * n_anchors, 4)
            # last axis to compute the (smooth) L1 or L2 loss
            name = 'off_res' + str(i + 1)
            offsets = keras.layers.Reshape((-1, 4),
                                           name=name)(offsets)
            # concat for alignment with ground truth size
            # made of ground truth offsets and mask of same dim
            # needed during loss computation
            offsets = [offsets, offsets]
            name = 'off_cat' + str(i + 1)
            offsets = keras.layers.Concatenate(axis=-1,
                                               name=name)(offsets)

            # collect offset prediction per scale
            out_off.append(offsets)

            name = 'cls_out' + str(i + 1)
            # activation = 'sigmoid' if n_classes==1 else 'softmax'
            # print("Activation:", activation)
            classes = keras.layers.Activation('softmax',
                                              name=name)(classes)

            # collect class prediction per scale
            out_cls.append(classes)

        if n_layers > 1:
            # concat all class and offset from each scale
            name = 'offsets'
            offsets = keras.layers.Concatenate(axis=1,
                                               name=name)(out_off)
            name = 'classes'
            classes = keras.layers.Concatenate(axis=1,
                                               name=name)(out_cls)
        else:
            offsets = out_off[0]
            classes = out_cls[0]

        outputs = [classes, offsets]
        model = keras.Model(inputs=inputs,
                            outputs=outputs,
                            name='ssd_head')

        return n_anchors, feature_shapes, model
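The reshape-and-concatenate bookkeeping in build_ssd() can be checked without Keras. The NumPy sketch below stands in for the per-layer prediction tensors with zero arrays; the grid sizes assume a 640×480 input with 4 prediction layers, and all variable names are illustrative.

```python
import numpy as np

batch, n_anchors, n_classes = 2, 4, 4
grids = [(30, 40), (15, 20), (8, 10), (4, 5)]  # (height, width) per layer

out_cls, out_off = [], []
for h, w in grids:
    # stand-ins for the conv outputs of one prediction layer
    cls = np.zeros((batch, h, w, n_anchors * n_classes))
    off = np.zeros((batch, h, w, n_anchors * 4))
    # one row per anchor box, as the Reshape layers do
    out_cls.append(cls.reshape(batch, -1, n_classes))
    out_off.append(off.reshape(batch, -1, 4))

# concatenate along the anchor axis, as Concatenate(axis=1) does
classes = np.concatenate(out_cls, axis=1)
offsets = np.concatenate(out_off, axis=1)
print(classes.shape, offsets.shape)
```

Both results have one row per anchor box across all layers, here 6,400 rows. In build_ssd() the offsets are additionally duplicated along the last axis, so the model's offset output has a last dimension of 8 to match the ground-truth offsets plus mask.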

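The SSD object detection model needs a large number of labeled high-resolution images, so SSD uses a multi-threaded data generator. Its job is to load batches of images together with their labels. With multiple threads, while one thread feeds data to the GPU, other CPU threads can prepare the next batch in the queue or load images from the file system and compute the ground-truth values, which keeps the GPU busy.
The DataGenerator class inherits from the Keras Sequence class so that multiprocessing is supported. It guarantees that the whole dataset is used in every epoch. The __len__() method returns the number of batches per epoch for the given batch size. Each mini-batch is requested through the __getitem__() method. After every epoch, on_epoch_end() is called and shuffles the data if self.shuffle is True.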

    class DataGenerator(Sequence):
        def __init__(self,
                     args,
                     dictionary,
                     n_classes,
                     feature_shapes=[],
                     n_anchors=4,
                     shuffle=True):
            self.args = args
            self.dictionary = dictionary
            self.n_classes = n_classes
            self.keys = np.array(list(self.dictionary.keys()))
            self.input_shape = (args.height,
                                args.width,
                                args.channels)
            self.feature_shapes = feature_shapes
            self.n_anchors = n_anchors
            self.shuffle = shuffle
            self.on_epoch_end()
            self.get_n_boxes()

        def __len__(self):
            # number of batches per epoch
            blen = np.floor(len(self.dictionary) / self.args.batch_size)
            return int(blen)

        def __getitem__(self, index):
            # keys (image filenames) of the requested mini-batch
            start_index = index * self.args.batch_size
            end_index = (index + 1) * self.args.batch_size
            keys = self.keys[start_index:end_index]
            x, y = self.__data_generation(keys)
            return x, y

        def on_epoch_end(self):
            # shuffle after each epoch
            if self.shuffle == True:
                np.random.shuffle(self.keys)

        def get_n_boxes(self):
            # total number of anchor boxes over all feature layers
            self.n_boxes = 0
            for shape in self.feature_shapes:
                self.n_boxes += np.prod(shape) // self.n_anchors
            return self.n_boxes
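The three methods that DataGenerator relies on can be illustrated with a toy class that follows the same protocol but deliberately does not subclass the Keras Sequence class, so it runs with NumPy alone. The class name `ToyBatchSequence` and its `seed` parameter are assumptions of this sketch.

```python
import numpy as np

class ToyBatchSequence:
    """Mimics the __len__ / __getitem__ / on_epoch_end protocol on a key list."""

    def __init__(self, keys, batch_size, shuffle=True, seed=0):
        self.keys = np.array(keys)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = np.random.default_rng(seed)
        self.on_epoch_end()

    def __len__(self):
        # number of full batches per epoch; a trailing partial batch is dropped
        return len(self.keys) // self.batch_size

    def __getitem__(self, index):
        # keys of the mini-batch at position `index`
        start = index * self.batch_size
        return self.keys[start:start + self.batch_size]

    def on_epoch_end(self):
        # reshuffle the keys so that batches differ between epochs
        if self.shuffle:
            self.rng.shuffle(self.keys)

seq = ToyBatchSequence([f"img{i}.jpg" for i in range(10)], batch_size=4)
print(len(seq))
print(seq[0], seq[1])
```

With 10 images and a batch size of 4, each epoch has 2 batches and 2 images are left out of that epoch, which matches the floor division used in DataGenerator.__len__().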

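The core work of the data generator is done by the __data_generation() method, which does the following for every mini-batch:

Reads the image from the file system with imread()
Looks up the bounding boxes and class labels stored in the dictionary with labels = self.dictionary[key] (the first 4 items are the bounding box coordinates and the last one is the class label)
Calls anchor_boxes() to generate the anchor boxes
Uses iou() to compute the IoU of every anchor box with the ground-truth bounding boxes
Uses get_gt_data() to assign the ground-truth class and offsets to every anchor box

The method also includes data augmentation (such as adding random noise, intensity rescaling, and exposure adjustment). __data_generation() finally returns the input-output pair (x, y), where the tensor x holds the input images and the tensor y bundles the classes, offsets, and masks.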

    def __data_generation(self, keys):
        # train input data
        x = np.zeros((self.args.batch_size, *self.input_shape))
        dim = (self.args.batch_size, self.n_boxes, self.n_classes)
        # class ground truth
        gt_class = np.zeros(dim)
        # every offset and mask entry has 4 elements
        # (the original used self.n_anchors, which is only correct when it is 4)
        dim = (self.args.batch_size, self.n_boxes, 4)
        # offsets ground truth
        gt_offset = np.zeros(dim)
        # mask of valid bounding boxes
        gt_mask = np.zeros(dim)

        for i, key in enumerate(keys):
            # images are assumed to be stored in self.args.data_path
            # key is the image filename
            image_path = os.path.join(self.args.data_path, key)
            image = skimage.img_as_float(imread(image_path))
            # assign image to a batch index
            x[i] = image
            # a label entry is made of 4-dim bounding box coords
            # and a 1-dim class label
            labels = self.dictionary[key]
            labels = np.array(labels)
            # 4 bounding box coords are the first four items of labels
            # last item is object class label
            boxes = labels[:, 0:-1]
            for index, feature_shape in enumerate(self.feature_shapes):
                # generate anchor boxes
                anchors = anchor_boxes(feature_shape,
                                       image.shape,
                                       index=index,
                                       n_layers=self.args.layers)
                # each feature layer has a row of anchor boxes
                anchors = np.reshape(anchors, [-1, 4])
                # compute IoU of each anchor box
                # with respect to each ground-truth bounding box
                iou = layer_utils.iou(anchors, boxes)

                # generate ground truth class, offsets & mask
                gt = get_gt_data(iou,
                                 n_classes=self.n_classes,
                                 anchors=anchors,
                                 labels=labels,
                                 normalize=self.args.normalize,
                                 threshold=self.args.threshold)
                gt_cls, gt_off, gt_msk = gt
                if index == 0:
                    cls = np.array(gt_cls)
                    off = np.array(gt_off)
                    msk = np.array(gt_msk)
                else:
                    cls = np.append(cls, gt_cls, axis=0)
                    off = np.append(off, gt_off, axis=0)
                    msk = np.append(msk, gt_msk, axis=0)

            gt_class[i] = cls
            gt_offset[i] = off
            gt_mask[i] = msk

        y = [gt_class, np.concatenate((gt_offset, gt_mask), axis=-1)]
        return x, y
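The IoU step at the heart of ground-truth assignment can be sketched in isolation. The function below is not the article's layer_utils.iou(); it is a minimal NumPy version that assumes boxes in (xmin, xmax, ymin, ymax) order and returns one row per anchor box and one column per ground-truth box.

```python
import numpy as np

def iou_matrix(anchors, boxes):
    """IoU of every anchor (rows) against every ground-truth box (columns).

    Both inputs are arrays of (xmin, xmax, ymin, ymax); this order is an
    assumption of the sketch.
    """
    a = anchors[:, None, :]  # shape (n_anchors, 1, 4)
    b = boxes[None, :, :]    # shape (1, n_boxes, 4)
    inter_w = np.clip(np.minimum(a[..., 1], b[..., 1])
                      - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3])
                      - np.maximum(a[..., 2], b[..., 2]), 0, None)
    inter = inter_w * inter_h
    area_a = (a[..., 1] - a[..., 0]) * (a[..., 3] - a[..., 2])
    area_b = (b[..., 1] - b[..., 0]) * (b[..., 3] - b[..., 2])
    return inter / (area_a + area_b - inter)

anchors = np.array([[0, 2, 0, 2], [1, 3, 0, 2], [4, 6, 4, 6]], dtype=float)
boxes = np.array([[0, 2, 0, 2]], dtype=float)
iou = iou_matrix(anchors, boxes)
# the anchor box with the highest IoU becomes the positive anchor of each box
best_anchor = np.argmax(iou, axis=0)
print(iou.ravel(), best_anchor)
```

The first anchor coincides with the ground-truth box (IoU 1), the second overlaps half of it (IoU 1/3), and the third does not overlap at all (IoU 0), so the first anchor is chosen as the positive anchor.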

4. Training the SSD model

Run the following command to train the SSD model for 200 epochs:

$ python ssd-11.6.1.py --train

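The default batch size, --batch-size=4, can be adjusted to fit the GPU memory. --train selects model training.
The --normalize option enables normalization of the bounding box offsets. To use the improved loss functions, add the --improved-loss option. To use only the smooth L1 loss (without focal loss), specify --smooth-l1. Examples: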

L1, no normalization:
```
python ssd-11.6.1.py --train
```

Improved loss functions, no normalization:

python ssd-11.6.1.py --train --improved-loss

Improved loss functions, with normalization:

python ssd-11.6.1.py --train --improved-loss --normalize

Smooth L1, with normalization:

python ssd-11.6.1.py --train --smooth-l1 --normalize

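Once the SSD network is trained, one problem remains: how to handle multiple predicted bounding boxes for the same object. Before testing the trained model, we first discuss the non-maximum suppression (NMS) algorithm.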

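5. Non-maximum suppression

After training, the network predicts bounding box offsets and the corresponding classes. In some cases, two or more bounding boxes refer to the same object, producing redundant predictions. To remove them, the non-maximum suppression (NMS) algorithm is applied. This section covers both classic NMS and soft NMS; both assume that the bounding boxes and their confidence scores or probabilities are known.
In classic NMS, the final bounding boxes are selected based on their probabilities and stored in list $D$, with the corresponding scores in list $S$. All bounding boxes and their probabilities are initially stored in lists $B$ and $P$. The algorithm uses the bounding box $b_m$ with the maximum score $p_m$ as the reference.
The reference box is added to the final list $D$ and removed from $B$; its score is likewise added to $S$ and removed from $P$. Any remaining bounding box whose IoU with $b_m$ is greater than or equal to the threshold $N_t$ is removed from $B$, and its score is removed from $P$. This step eliminates all overlapping bounding boxes with lower scores.
Once all remaining boxes have been checked, the algorithm repeats until $B$ is empty, and finally returns the selected bounding boxes $D$ and their scores $S$.
The drawback of classic NMS is that a bounding box containing a different object is discarded outright if it has a significant IoU with $b_m$. Soft NMS proposes an improvement: instead of removing an overlapping box, its score is decreased by a negative exponential of the square of its IoU with $b_m$. This gives overlapping boxes a second chance: boxes with a smaller IoU are more likely to survive later iterations and may turn out to contain an object different from that of $b_m$.
Soft NMS is a drop-in replacement for classic NMS and does not require retraining the SSD network. In practice, soft NMS achieves a higher average precision than classic NMS.
Both variants of NMS are implemented next. Besides the final bounding boxes and scores, the corresponding object classes are also returned. The code terminates NMS early when the maximum score among the remaining bounding boxes falls below a given threshold.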

    import math
    import numpy as np

    def nms(args, classes, offsets, anchors):
        # get all non-zero (non-background) objects
        objects = np.argmax(classes, axis=1)
        # non-zero indexes are not background
        nonbg = np.nonzero(objects)[0]
        # indexes of the selected boxes (lists D and S)
        indexes = []
        while True:
            # list of zero probability values
            scores = np.zeros((classes.shape[0],))
            # set probability values of non-background
            scores[nonbg] = np.amax(classes[nonbg], axis=1)
            # pick the box with the max probability (b_m, p_m)
            score_idx = np.argmax(scores, axis=0)
            score_max = scores[score_idx]
            # remove the max from the remaining non-background boxes
            nonbg = nonbg[nonbg != score_idx]
            # if max obj probability is less than threshold (def 0.8)
            if score_max < args.class_threshold:
                # we are done
                break
            # add the reference box to the selected list
            indexes.append(score_idx)
            score_anc = anchors[score_idx]
            score_off = offsets[score_idx][0:4]
            score_box = score_anc + score_off
            score_box = np.expand_dims(score_box, axis=0)
            nonbg_copy = np.copy(nonbg)
            # check all remaining predictions for overlap
            # and perform Non-Max Suppression (NMS)
            for idx in nonbg_copy:
                anchor = anchors[idx]
                offset = offsets[idx][0:4]
                box = anchor + offset
                box = np.expand_dims(box, axis=0)
                iou = layer_utils.iou(box, score_box)[0][0]
                # if soft NMS is chosen
                if args.soft_nms:
                    # decay the score by exp(-2 * iou^2)
                    iou = -2 * iou * iou
                    classes[idx] *= math.exp(iou)
                # else classic NMS (iou threshold def 0.2)
                elif iou >= args.iou_threshold:
                    # remove overlapping predictions with iou >= threshold
                    nonbg = nonbg[nonbg != idx]
            # nothing else to process
            if nonbg.size == 0:
                break
        # get the array of object scores
        scores = np.zeros((classes.shape[0],))
        scores[indexes] = np.amax(classes[indexes], axis=1)
        return objects, indexes, scores
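To see the suppression logic apart from the SSD data structures, the following self-contained sketch runs classic and soft NMS on three hand-made boxes. It is a simplified illustration, not the article's nms(): boxes are plain (xmin, xmax, ymin, ymax) tuples, each box has a single score, and the names `iou`, `nms_sketch`, and `sigma` are assumptions. The default thresholds mirror the defaults mentioned in the code comments (0.2 for IoU, 0.8 for the score), and `sigma=0.5` reproduces the exp(-2·IoU²) decay.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (xmin, xmax, ymin, ymax)."""
    iw = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[2], b[2]))
    inter = iw * ih
    union = ((a[1] - a[0]) * (a[3] - a[2])
             + (b[1] - b[0]) * (b[3] - b[2]) - inter)
    return inter / union

def nms_sketch(boxes, scores, iou_threshold=0.2, score_threshold=0.8,
               soft=False, sigma=0.5):
    """Return (kept indices, their scores) using classic or soft NMS."""
    scores = np.array(scores, dtype=float)  # copy, so the input is untouched
    remaining = list(range(len(boxes)))
    keep, keep_scores = [], []
    while remaining:
        # reference box b_m: highest score among the remaining boxes
        m = max(remaining, key=lambda i: scores[i])
        if scores[m] < score_threshold:
            break  # early termination: nothing confident is left
        keep.append(m)
        keep_scores.append(scores[m])
        remaining.remove(m)
        survivors = []
        for i in remaining:
            overlap = iou(boxes[i], boxes[m])
            if soft:
                # soft NMS: decay the score instead of dropping the box
                scores[i] *= np.exp(-overlap ** 2 / sigma)
                survivors.append(i)
            elif overlap < iou_threshold:
                # classic NMS: keep only boxes that overlap little with b_m
                survivors.append(i)
        remaining = survivors
    return keep, keep_scores

boxes = [(0, 10, 0, 10), (1, 11, 0, 10), (20, 30, 20, 30)]
scores = [0.95, 0.90, 0.85]
print(nms_sketch(boxes, scores))
print(nms_sketch(boxes, scores, soft=True))
```

In this example both variants keep the first and third boxes. Classic NMS drops the second box outright because its IoU with the first is about 0.82, while soft NMS lowers its score until it falls below the score threshold.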

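6. Validating the SSD model

After 200 epochs of training, the performance of the SSD model can be validated. Three metrics are used: IoU, precision, and recall.
The first metric is the mean IoU (mIoU). On the test set, the IoU between each ground-truth bounding box and the predicted bounding boxes after NMS is computed, and the average of all IoUs is the mIoU:
$\frac{1}{n_{box}} \times \sum_{i \in \{1,2,...,n_{box}\}} \underset{j \in \{1,2,...,n_{pred}\}}{\max} IoU(b_i, d_j)$
where $n_{box}$ is the number of ground-truth bounding boxes $b_i$ and $n_{pred}$ is the number of predicted bounding boxes $d_j$.
The second metric is precision: the number of correctly predicted object classes (true positives, TP) divided by the sum of the correct predictions (TP) and the wrong predictions (false positives, FP). It measures how well the SSD model correctly identifies the objects in an image; the closer to 1.0, the better:
$\frac{TP}{TP + FP}$
The third metric is recall: the number of correctly predicted object classes (true positives, TP) divided by the sum of the correct predictions (TP) and the missed objects (false negatives, FN). It measures how well the SSD model avoids missing objects in an image; the closer to 1.0, the better:
$\frac{TP}{TP + FN}$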
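The three metrics are simple to compute once the IoU matrix and the counts are available. The sketch below is a minimal illustration with made-up numbers; the function names are assumptions, and the IoU matrix has one row per ground-truth box and one column per predicted box.

```python
import numpy as np

def mean_iou(iou):
    """mIoU: best-matching prediction for each ground-truth box, averaged."""
    return float(np.mean(np.max(iou, axis=1)))

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# illustrative values only: 2 ground-truth boxes, 2 predicted boxes
iou = np.array([[0.8, 0.1],
                [0.2, 0.6]])
print(mean_iou(iou))
print(precision(tp=3, fp=1))
print(recall(tp=3, fn=2))
```

Here the best IoUs per ground-truth box are 0.8 and 0.6, so the mIoU is about 0.7; with 3 true positives, 1 false positive, and 2 false negatives, precision is 0.75 and recall is 0.6.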
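Averaged over all images in the test set, these are called the average precision and average recall. In object detection, performance is usually assessed with precision-recall curves at different IoU thresholds. For simplicity, we compute these metrics only for a specific class threshold. The model can be validated with the following commands: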

No normalization:

python ssd.py --restore-weights=ResNet56v2-4layer-extra_anchors-drinks-200.h5 --evaluate

No normalization, smooth L1:

python ssd.py --restore-weights=ResNet56v2-4layer-smooth_l1-extra_anchors-drinks-200.h5 --evaluate

With normalization:

python3 ssd.py --restore-weights=ResNet56v2-4layer-norm-extra_anchors-drinks-200.h5 --evaluate --normalize

With normalization, smooth L1:

python ssd.py --restore-weights=ResNet56v2-4layer-norm-smooth_l1-extra_anchors-drinks-200.h5 --evaluate --normalize

With normalization, smooth L1, and focal loss:

python ssd.py --restore-weights=ResNet56v2-4layer-norm-improved_loss-extra_anchors-drinks-200.h5 --evaluate --normalize

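On mean IoU, the option without offset normalization performs best, while normalized offsets achieve the highest average precision and recall. Given that the training dataset contains only 1,000 images and no data augmentation was applied, it is expected that the performance is not optimal.
The results also show that using the improved loss functions (smooth L1, focal loss, or both) actually degrades performance. Object detection on an image can be run with the following command: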

python ssd-11.6.1.py --restore-weights=<weights file> --image-file=<target image file> --evaluate

For example, to run object detection on dataset/drinks/0010050.jpg from the dataset:

python ssd-11.6.1.py --restore-weights=ResNet56v2-4layer-extra_anchors-drinks-200.h5 --image-file=dataset/drinks/0010050.jpg --evaluate

    def evaluate(self, image_file=None, image=None):
        """Evaluate image based on image (np tensor) or filename"""
        show = False
        if image is None:
            image = skimage.img_as_float(imread(image_file))
            show = True
        image, classes, offsets = self.detect_object(image)
        class_names, rects, _, _ = show_boxes(self.args,
                                              image,
                                              classes,
                                              offsets,
                                              self.feature_shapes,
                                              show=show)
        return class_names, rects

    def evaluate_test(self):
        # test labels csv path
        path = os.path.join(self.args.data_path,
                            self.args.test_labels)
        # test dictionary
        dictionary, _ = build_label_dictionary(path)
        keys = np.array(list(dictionary.keys()))
        # sum of precision
        s_precision = 0
        # sum of recall
        s_recall = 0
        # sum of IoUs
        s_iou = 0
        # evaluate per image
        for key in keys:
            # ground truth labels
            labels = np.array(dictionary[key])
            # 4 box coords are the first four items of labels
            gt_boxes = labels[:, 0:-1]
            # last one is class
            gt_class_ids = labels[:, -1]
            # load image id by key
            image_file = os.path.join(self.args.data_path, key)
            image = skimage.img_as_float(imread(image_file))
            image, classes, offsets = self.detect_object(image)
            # perform nms
            _, _, class_ids, boxes = show_boxes(self.args,
                                                image,
                                                classes,
                                                offsets,
                                                self.feature_shapes,
                                                show=False)
            boxes = np.reshape(np.array(boxes), (-1, 4))
            # compute IoUs
            iou = layer_utils.iou(gt_boxes, boxes)
            # skip empty IoUs
            if iou.size == 0:
                continue
            # the class of predicted box w/ max iou
            maxiou_class = np.argmax(iou, axis=1)

            # true positive
            tp = 0
            # false positive
            fp = 0
            # sum of objects iou per image
            s_image_iou = []
            for n in range(iou.shape[0]):
                # ground truth bbox has a label
                if iou[n, maxiou_class[n]] > 0:
                    s_image_iou.append(iou[n, maxiou_class[n]])
                    # true positive has the same class and gt
                    if gt_class_ids[n] == class_ids[maxiou_class[n]]:
                        tp += 1
                    else:
                        fp += 1

            # objects that we missed (false negative)
            fn = abs(len(gt_class_ids) - tp)
            s_iou += (np.sum(s_image_iou) / iou.shape[0])
            s_precision += (tp / (tp + fp))
            s_recall += (tp / (tp + fn))

        n_test = len(keys)
        print_log("mIoU: %f" % (s_iou / n_test),
                  self.args.verbose)
        print_log("Precision: %f" % (s_precision / n_test),
                  self.args.verbose)
        print_log("Recall: %f" % (s_recall / n_test),
                  self.args.verbose)