1. Core Ideas of the Paper
The paper proposes three improvements on top of FPN, each compensating for a kind of information loss that arises during feature fusion:
1. Consistent Supervision: narrows the semantic gap between scales (compensating for the semantic information lost when adjacent levels are fused).
2. Residual Feature Augmentation: reduces the information lost when features of different scales are fused by summation (compensating for the information the topmost level loses in the channel reduction before fusion).
3. Soft RoI Selection: extracts better RoI features from the feature pyramid for classification (compensating for the information lost by pooling RoI features from a single pyramid level).
2. Consistent Supervision
In FPN, adjacent scales are adapted by two convolutions and then fused by direct element-wise addition. Because this ignores the semantic gap between the levels, the final feature pyramid ends up suboptimal.
AugFPN attaches a detection head (RPN head + R-CNN head) directly to each feature map before fusion.
During training, the overall network loss becomes:
total loss = λ × (localization loss of the pre-fusion heads) + (localization loss of the post-fusion heads) + β × (classification loss of the pre-fusion heads + classification loss of the post-fusion heads)
Moreover, the pre-fusion heads share their weights across scales, which helps the supervision at the different scales: 1) it further strengthens the links between features at different scales; and 2) it pushes the low-level features to learn richer semantics, guided by the high-level supervision.
At inference time, all of these shared pre-fusion heads can simply be removed.
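A minimal sketch of how this combined loss could be assembled during training. The head callables, their signatures, and the default weights here are hypothetical placeholders, not the repo's actual API:

# Sketch only: `shared_aux_head` and `main_head` are hypothetical callables
# returning (classification_loss, localization_loss) for the given targets.
def consistent_supervision_loss(raw_laterals, pyramid_feats, targets,
                                shared_aux_head, main_head,
                                lam=0.5, beta=1.0):  # placeholder weights
    aux_cls = aux_loc = 0.0
    # The auxiliary head shares its weights across all pre-fusion levels.
    for feat in raw_laterals:
        cls_loss, loc_loss = shared_aux_head(feat, targets)
        aux_cls, aux_loc = aux_cls + cls_loss, aux_loc + loc_loss
    # The main head runs on the fused pyramid features.
    main_cls, main_loc = main_head(pyramid_feats, targets)
    # Combination as stated above: lambda scales the pre-fusion localization
    # term, beta scales the two classification terms.
    return lam * aux_loc + main_loc + beta * (aux_cls + main_cls)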
3. Residual Feature Augmentation
In the original FPN, M5 is obtained from C5 by a 1x1 convolution that reduces the channel count, yet unlike the other levels it has no extra feature to fuse with, so the information lost in the reduction is never recovered. AugFPN proposes Residual Feature Augmentation to compensate for the information lost at M5. A condensed sketch of the module follows.
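The sketch below distills the Residual Feature Augmentation path out of the HighFPN code in Section 5; the module and variable names are mine, and the default pooling ratios are the ones used in that code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualFeatureAugmentation(nn.Module):
    """Condensed sketch: ratio-invariant adaptive pooling + spatial fusion."""

    def __init__(self, in_channels, out_channels, ratios=(0.1, 0.2, 0.3)):
        super().__init__()
        self.ratios = ratios
        # One 1x1 conv per pooling ratio to reduce the channel count.
        self.reduce = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, 1) for _ in ratios)
        # Adaptive Spatial Fusion: predict one weight map per ratio.
        self.attn = nn.Sequential(
            nn.Conv2d(out_channels * len(ratios), out_channels, 1),
            nn.ReLU(),
            nn.Conv2d(out_channels, len(ratios), 3, padding=1))

    def forward(self, c5):
        h, w = c5.shape[2:]
        # Pool C5 at several ratios of its own size, then upsample back.
        ctx = [
            F.interpolate(
                conv(F.adaptive_avg_pool2d(
                    c5, (max(1, int(h * r)), max(1, int(w * r))))),
                size=(h, w), mode='bilinear', align_corners=True)
            for r, conv in zip(self.ratios, self.reduce)
        ]
        weights = torch.sigmoid(self.attn(torch.cat(ctx, dim=1)))
        # Weighted sum over ratios; the caller adds this to M5 as a residual.
        return sum(weights[:, i:i + 1] * ctx[i] for i in range(len(ctx)))

The caller would then add the output residually, roughly m5 = m5 + rfa(c5), before the top-down pathway runs.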
4. Soft RoI Selection
In FPN, the features of each RoI are obtained by pooling on one specific feature level, chosen heuristically according to the RoI's scale: smaller RoIs are assigned to lower-level features, larger RoIs to higher-level features.
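As a worked example of that heuristic, here is the assignment rule in isolation, using the finest_scale=56 threshold from the extractor code below (and assuming strides [4, 8, 16, 32], so that level 0 corresponds to P2):

import math

# FPN-style hard RoI-to-level assignment; mirrors map_roi_levels below.
def roi_level(w, h, finest_scale=56, num_levels=4):
    scale = math.sqrt(w * h)
    lvl = math.floor(math.log2(scale / finest_scale + 1e-6))
    return min(max(lvl, 0), num_levels - 1)

print(roi_level(100, 100))  # 0 -> pooled from P2 only
print(roi_level(300, 300))  # 2 -> pooled from P4 only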
Soft RoI Selection instead introduces adaptive weights to better measure the importance of the features inside the RoI region at each level. The final RoI feature is generated from these adaptive weights, rather than by a hard selection method such as RoI-to-level assignment or a max operation (as in PANet).
Soft RoI Selection first pools each RoI's features from all pyramid levels, then fuses them with an Adaptive Spatial Fusion (ASF) module: a separate spatial weight map is generated for the RoI features of each level, and the features are fused by weighted aggregation.
Its implementation is as follows:
import torch
import torch.nn as nn
from mmcv.cnn import xavier_init

# Imports assume the mmdetection v1-style layout used by the AugFPN repo.
from mmdet import ops
from ..registry import ROI_EXTRACTORS


@ROI_EXTRACTORS.register_module
class SoftRoIExtractor(nn.Module):
    """Extract RoI features from all feature map levels.

    Instead of mapping each RoI to a single level according to its scale,
    RoI features are pooled from every level and fused adaptively.

    Args:
        roi_layer (dict): Specify RoI layer type and arguments.
        out_channels (int): Output channels of RoI layers.
        featmap_strides (int): Strides of input feature maps.
        finest_scale (int): Scale threshold of mapping to level 0.
    """

    def __init__(self, roi_layer, out_channels, featmap_strides,
                 finest_scale=56):
        super(SoftRoIExtractor, self).__init__()
        self.roi_layers = self.build_roi_layers(roi_layer, featmap_strides)
        self.out_channels = out_channels
        self.featmap_strides = featmap_strides
        self.finest_scale = finest_scale
        # Adaptive Spatial Fusion: predicts one spatial weight map per level
        # from the concatenated per-level RoI features.
        self.spatial_attention_conv = nn.Sequential(
            nn.Conv2d(out_channels * len(featmap_strides), out_channels, 1),
            nn.ReLU(),
            nn.Conv2d(out_channels, len(featmap_strides), 3, padding=1))

    @property
    def num_inputs(self):
        """int: Input feature map levels."""
        return len(self.featmap_strides)

    def init_weights(self):
        for m in self.spatial_attention_conv.modules():
            if isinstance(m, nn.Conv2d):
                xavier_init(m, distribution='uniform')

    def map_roi_levels(self, rois, num_levels):
        """Map rois to corresponding feature levels by scales.

        - scale < finest_scale * 2: level 0
        - finest_scale * 2 <= scale < finest_scale * 4: level 1
        - finest_scale * 4 <= scale < finest_scale * 8: level 2
        - scale >= finest_scale * 8: level 3

        Args:
            rois (Tensor): Input RoIs, shape (k, 5).
            num_levels (int): Total level number.

        Returns:
            Tensor: Level index (0-based) of each RoI, shape (k, )
        """
        scale = torch.sqrt(
            (rois[:, 3] - rois[:, 1] + 1) * (rois[:, 4] - rois[:, 2] + 1))
        target_lvls = torch.floor(torch.log2(scale / self.finest_scale + 1e-6))
        target_lvls = target_lvls.clamp(min=0, max=num_levels - 1).long()
        return target_lvls

    def build_roi_layers(self, layer_cfg, featmap_strides):
        cfg = layer_cfg.copy()
        layer_type = cfg.pop('type')
        assert hasattr(ops, layer_type)
        layer_cls = getattr(ops, layer_type)
        roi_layers = nn.ModuleList(
            [layer_cls(spatial_scale=1 / s, **cfg) for s in featmap_strides])
        return roi_layers

    def forward(self, feats, rois):
        if len(feats) == 1:
            return self.roi_layers[0](feats[0], rois)

        out_size = self.roi_layers[0].out_size
        num_levels = len(feats)
        # new_zeros keeps device/dtype consistent with the input features
        # (the original code used torch.cuda.FloatTensor, which assumes CUDA).
        roi_feats = feats[0].new_zeros(rois.size(0), self.out_channels,
                                       out_size, out_size)
        # Pool each RoI from every pyramid level.
        roi_feats_list = [
            self.roi_layers[i](feats[i], rois) for i in range(num_levels)
        ]
        # Predict per-level spatial weight maps from the concatenated features.
        concat_roi_feats = torch.cat(roi_feats_list, dim=1)
        spatial_attention_map = self.spatial_attention_conv(concat_roi_feats)
        # Fuse by weighted aggregation instead of hard level selection.
        for i in range(num_levels):
            roi_feats += (torch.sigmoid(spatial_attention_map[:, i, None, :, :])
                          * roi_feats_list[i])
        return roi_feats
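A hedged usage sketch, assuming an mmdetection v1-style setup; the roi_layer dict mirrors common configs from that codebase, and the exact op arguments depend on the installed version:

# Hypothetical construction, mirroring mmdetection v1 config conventions.
extractor = SoftRoIExtractor(
    roi_layer=dict(type='RoIAlign', out_size=7, sample_num=2),
    out_channels=256,
    featmap_strides=[4, 8, 16, 32])
extractor.init_weights()

# feats: four pyramid levels (e.g. P2..P5), each with 256 channels;
# rois: (k, 5) tensor of (batch_index, x1, y1, x2, y2).
# roi_feats = extractor(feats, rois)  # -> (k, 256, 7, 7), fused over levels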
5. Network Structure
The complete neck combines FPN with the Residual Feature Augmentation branch on the top level, and also returns the raw pre-fusion laterals needed for Consistent Supervision. Its implementation is as follows:
import torch
import torch.nn as nn
import torch.nn.functional as F
from mmcv.cnn import xavier_init

# Imports assume the mmdetection v1-style layout used by the AugFPN repo.
from ..registry import NECKS
from ..utils import ConvModule


@NECKS.register_module
class HighFPN(nn.Module):

    def __init__(self,
                 in_channels,
                 out_channels,
                 num_outs,
                 pool_ratios=[0.1, 0.2, 0.3],
                 start_level=0,
                 end_level=-1,
                 add_extra_convs=False,
                 normalize=None,
                 activation=None):
        super(HighFPN, self).__init__()
        assert isinstance(in_channels, list)
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.num_ins = len(in_channels)
        self.num_outs = num_outs
        self.activation = activation
        self.with_bias = normalize is None

        if end_level == -1:
            self.backbone_end_level = self.num_ins
            assert num_outs >= self.num_ins - start_level
        else:
            # if end_level < inputs, no extra level is allowed
            self.backbone_end_level = end_level
            assert end_level <= len(in_channels)
            assert num_outs == end_level - start_level
        self.start_level = start_level
        self.end_level = end_level
        self.add_extra_convs = add_extra_convs

        self.lateral_convs = nn.ModuleList()
        self.fpn_convs = nn.ModuleList()
        for i in range(self.start_level, self.backbone_end_level):
            l_conv = ConvModule(
                in_channels[i],
                out_channels,
                1,
                padding=0,
                normalize=normalize,
                bias=self.with_bias,
                activation=self.activation,
                inplace=False)
            fpn_conv = ConvModule(
                out_channels,
                out_channels,
                3,
                padding=1,
                normalize=normalize,
                bias=self.with_bias,
                activation=self.activation,
                inplace=False)
            self.lateral_convs.append(l_conv)
            self.fpn_convs.append(fpn_conv)

        # add lateral convs for the features generated by ratio-invariant
        # adaptive pooling (Residual Feature Augmentation)
        self.adaptive_pool_output_ratio = pool_ratios
        self.high_lateral_conv = nn.ModuleList()
        self.high_lateral_conv.extend([
            nn.Conv2d(in_channels[-1], out_channels, 1)
            for k in range(len(self.adaptive_pool_output_ratio))
        ])
        self.high_lateral_conv_attention = nn.Sequential(
            nn.Conv2d(out_channels * len(self.adaptive_pool_output_ratio),
                      out_channels, 1), nn.ReLU(),
            nn.Conv2d(out_channels, len(self.adaptive_pool_output_ratio), 3,
                      padding=1))

        # add extra conv layers (e.g., RetinaNet)
        extra_levels = num_outs - self.backbone_end_level + self.start_level
        if add_extra_convs and extra_levels >= 1:
            for i in range(extra_levels):
                in_channels = (self.in_channels[self.backbone_end_level - 1]
                               if i == 0 else out_channels)
                extra_fpn_conv = ConvModule(
                    in_channels,
                    out_channels,
                    3,
                    stride=2,
                    padding=1,
                    normalize=normalize,
                    bias=self.with_bias,
                    activation=self.activation,
                    inplace=False)
                self.fpn_convs.append(extra_fpn_conv)

    # default init_weights for conv(msra) and norm in ConvModule
    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                xavier_init(m, distribution='uniform')
        for m in self.high_lateral_conv_attention.modules():
            if isinstance(m, nn.Conv2d):
                xavier_init(m, distribution='uniform')

    def forward(self, inputs):
        assert len(inputs) == len(self.in_channels)

        # build laterals
        laterals = [
            lateral_conv(inputs[i + self.start_level])
            for i, lateral_conv in enumerate(self.lateral_convs)
        ]

        # Residual Feature Augmentation
        h, w = inputs[-1].size(2), inputs[-1].size(3)
        # Ratio-invariant adaptive pooling: pool C5 at several ratios of its
        # own size, reduce channels with 1x1 convs, and upsample back to (h, w).
        AdapPool_Features = [
            F.interpolate(
                self.high_lateral_conv[j](
                    F.adaptive_avg_pool2d(
                        inputs[-1],
                        output_size=(
                            max(1, int(h * self.adaptive_pool_output_ratio[j])),
                            max(1, int(w * self.adaptive_pool_output_ratio[j]))))),
                size=(h, w),
                mode='bilinear',
                align_corners=True)
            for j in range(len(self.adaptive_pool_output_ratio))
        ]

        # Adaptive Spatial Fusion of the pooled context features.
        Concat_AdapPool_Features = torch.cat(AdapPool_Features, dim=1)
        fusion_weights = self.high_lateral_conv_attention(
            Concat_AdapPool_Features)
        fusion_weights = torch.sigmoid(fusion_weights)
        adap_pool_fusion = 0
        for i in range(len(self.adaptive_pool_output_ratio)):
            adap_pool_fusion += torch.unsqueeze(
                fusion_weights[:, i, :, :], dim=1) * AdapPool_Features[i]

        # for Consistent Supervision: keep the laterals before fusion
        raw_laterals = [laterals[i].clone() for i in range(len(laterals))]

        # build top-down path
        laterals[-1] += adap_pool_fusion
        used_backbone_levels = len(laterals)
        for i in range(used_backbone_levels - 1, 0, -1):
            laterals[i - 1] += F.interpolate(
                laterals[i], scale_factor=2, mode='nearest')

        # build outputs
        # part 1: from original levels
        outs = [
            self.fpn_convs[i](laterals[i]) for i in range(used_backbone_levels)
        ]
        # part 2: add extra levels
        if self.num_outs > len(outs):
            # use max pool to get more levels on top of outputs
            # (e.g., Faster R-CNN, Mask R-CNN)
            if not self.add_extra_convs:
                for i in range(self.num_outs - used_backbone_levels):
                    outs.append(F.max_pool2d(outs[-1], 1, stride=2))
            # add conv layers on top of original feature maps (RetinaNet)
            else:
                orig = inputs[self.backbone_end_level - 1]
                outs.append(self.fpn_convs[used_backbone_levels](orig))
                for i in range(used_backbone_levels + 1, self.num_outs):
                    # BUG: we should add relu before each extra conv
                    outs.append(self.fpn_convs[i](outs[-1]))
        return tuple(outs), tuple(raw_laterals)
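A quick shape check, assuming ResNet-50 C2..C5 inputs for an 800x1344 image; the instantiation arguments follow the code above, and this is a sketch rather than a config from the repo:

import torch

neck = HighFPN(
    in_channels=[256, 512, 1024, 2048],  # ResNet-50 C2..C5 channels
    out_channels=256,
    num_outs=5,
    pool_ratios=[0.1, 0.2, 0.3])
neck.init_weights()

# C2..C5 feature maps for an 800x1344 input (strides 4, 8, 16, 32).
feats = [torch.randn(1, c, 200 // 2**i, 336 // 2**i)
         for i, c in enumerate([256, 512, 1024, 2048])]
outs, raw_laterals = neck(feats)
print([tuple(o.shape) for o in outs])          # 5 fused pyramid levels
print([tuple(r.shape) for r in raw_laterals])  # 4 pre-fusion laterals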
6. References
AugFPN: Improving Multi-scale Feature Learning for Object Detection
GitHub - Gus-Guo/AugFPN: source code of AugFPN
Paper notes: AugFPN: Improving Multi-scale Feature Learning - Zhihu