
modulated_deform_conv2d in Ascend DrivingSDK (Part 1)

Ascend DrivingSDK is an operator and model acceleration library for autonomous-driving workloads, built on the Ascend NPU platform. It provides a set of high-performance operators and model-acceleration interfaces, and supports the PyTorch framework.

modulated_deform_conv2d in Ascend DrivingSDK is one of the few fused operators: it completes the entire Deformable Convolution computation in a single kernel. However, since the 910B uses a split vector-core / cube-core architecture, synchronization between the two is expensive. The 910B series provides 96 MB to 192 MB of L2 cache, enabled by default, so the inputs of the modulated_deform_conv2d operator essentially always sit in L2.

At the Ascend C level, modulated_deform_conv2d is backed by two operators; v2 is optimized for 3x3 convolutions. Oddly, the choice was to add a new operator rather than a new kernel. For reasons unknown, the two versions also order their parameter lists differently.

| op | PreProcess | ComputeWeight | ComputeBilinearInterpolation | ProcessCube | Output |
| --- | --- | --- | --- | --- | --- |
| deformable_conv2d | caches one row of convolution-window indices, reused across $H_o$ | vector-computes weights for $W_o k_h k_w$ points | loads $4C_i$ of data per instruction, then interpolates | computes $W_o$ results per pass | stashes im2col for $\frac{\partial L}{\partial W}$ |
| deformable_conv2d_v2 | / | vector-computes weights for $k_h k_w$ points | issues $k_h k_w$ loads of $C_i$ each, then interpolates | computes 128 output points per pass | / |

v2 performs better than v1, presumably because v2's memory copies synchronize at longer intervals (9:4). Global-memory access latency is high: v1 copies $4C_i$ in a single instruction, whereas v2 loads $C_i$ per instruction but only synchronizes after 9 instructions. v1 caches the im2col result to save computation, but the savings are limited, since computing $\frac{\partial L}{\partial \Delta \mathbf{m}_n}$ still requires the bilinear-interpolation results.

Both operators are functionally incomplete: they do not support the half type, deform_group, or bias, among other gaps, and v2 is the more bare-bones of the two. On top of that there is no documentation and no input validation, so using them inevitably takes some fumbling. It is hard to imagine this as industrial-grade code, let alone automotive-grade. The one saving grace is that, like AMD, it is open source, so users are expected to track down and fix problems themselves.

ModulatedDeformConv2dFunction

The inputs are converted to NHWC format.

```python
class ModulatedDeformConv2dFunction(Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)
    # pylint: disable=huawei-too-many-arguments
    def forward(
        ctx,
        x: torch.Tensor,
        offset: torch.Tensor,
        mask: torch.Tensor,
        weight: torch.Tensor,
        bias: Optional[nn.Parameter] = None,
        stride: Union[int, Tuple[int, ...]] = 1,
        padding: Union[int, Tuple[int, ...]] = 0,
        dilation: Union[int, Tuple[int, ...]] = 1,
        groups: int = 1,
        deformable_groups: int = 1,
    ):
        ctx.kernel_size = [weight.size(2), weight.size(3)]
        ctx.stride = _pair(stride)
        ctx.padding = _pair(padding)
        ctx.dilation = _pair(dilation)
        ctx.groups = groups
        ctx.deformable_groups = deformable_groups
        nhwc_x = x.permute(0, 2, 3, 1).contiguous()
        nhwc_offset = offset.permute(0, 2, 3, 1).contiguous()
        nhwc_weight = weight.permute(0, 2, 3, 1).contiguous()
        nhwc_mask = mask.permute(0, 2, 3, 1).contiguous()
        out, offset_output = mx_driving._C.modulated_deformable_conv2d(
            nhwc_x,
            nhwc_offset,
            nhwc_mask,
            nhwc_weight,
            None,
            ctx.kernel_size,
            ctx.stride,
            ctx.padding,
            ctx.dilation,
            ctx.groups,
            ctx.deformable_groups,
            False,
        )
        ctx.save_for_backward(nhwc_x, nhwc_offset, nhwc_weight, nhwc_mask, offset_output)
        return out
```
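A minimal usage sketch (the shapes and the `.npu()` device calls are illustrative and assume a torch_npu environment; this is not from the library's documentation):

```python
import torch
# import torch_npu  # assumed available; provides the .npu() device methods

N, C_in, H, W = 2, 256, 32, 32   # C_in in {256, 512} with groups == 1 selects the v2 path below
C_out, kh, kw = 256, 3, 3

x = torch.randn(N, C_in, H, W).npu()
weight = torch.randn(C_out, C_in, kh, kw).npu()
offset = torch.randn(N, 2 * kh * kw, H, W).npu()           # (dy, dx) per kernel point
mask = torch.sigmoid(torch.randn(N, kh * kw, H, W)).npu()  # one modulation scalar per point

# positional args after weight: bias, stride, padding, dilation, groups, deformable_groups
out = ModulatedDeformConv2dFunction.apply(x, offset, mask, weight, None, 1, 1, 1, 1, 1)
```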

ModulatedDeformConv2dFunction.backward

$\frac{\partial L}{\partial Y}$ is transposed to $N \times H_o \times C_o \times W_o$.

```python
    @staticmethod
    @once_differentiable
    @custom_bwd
    # pylint: disable=huawei-too-many-arguments,too-many-return-values
    def backward(ctx, grad_out):
        nhwc_x, nhwc_offset, nhwc_weight, nhwc_mask, offset_output = ctx.saved_tensors
        nhwc_grad_out = grad_out.permute(0, 2, 1, 3).contiguous()
        grad_x, grad_weight, _, grad_offset, grad_mask = mx_driving._C.modulated_deformable_conv2d_backward(
            nhwc_x,
            nhwc_offset,
            nhwc_mask,
            nhwc_weight,
            None,
            offset_output,
            nhwc_grad_out,
            ctx.kernel_size,
            ctx.stride,
            ctx.padding,
            ctx.dilation,
            ctx.groups,
            ctx.deformable_groups,
            False,
        )
        return (
            grad_x,
            grad_offset,
            grad_mask,
            grad_weight,
            None,
            None,
            None,
            None,
            None,
            None,
        )
```

modulated_deformable_conv2d

Dispatch: when groups == 1 (subject to the channel condition below), modulated_deformable_conv2d calls DeformableConv2dV2; otherwise it calls DeformableConv2d.

TORCH_CHECK_NPU verifies that the input tensors are all stored on an NPU device.

```cpp
std::tuple<at::Tensor, at::Tensor> modulated_deformable_conv2d(const at::Tensor& input, const at::Tensor& offset,
    const at::Tensor& mask, const at::Tensor& weight, const c10::optional<at::Tensor>& bias_opt,
    at::IntArrayRef kernel_size, at::IntArrayRef stride, at::IntArrayRef padding, at::IntArrayRef dilation,
    int64_t groups, int64_t deformable_groups, int64_t with_bias)
{
    TORCH_CHECK_NPU(input);
    TORCH_CHECK_NPU(offset);
    TORCH_CHECK_NPU(mask);
    TORCH_CHECK_NPU(weight);
```

Dimensions and parameters are checked.

```cpp
    TORCH_CHECK(input.dim() == INPUT_DIM, "input must to be a 4D Tensor, but got: ", input.dim());
    TORCH_CHECK(offset.dim() == INPUT_DIM, "offset has to be a 4D Tensor, but got: ", offset.dim());
    TORCH_CHECK(mask.dim() == INPUT_DIM, "mask has to be a 4D Tensor, but got: ", mask.dim());
    TORCH_CHECK(weight.dim() == INPUT_DIM, "weight has to be a 4D Tensor, but got: ", weight.dim());
    TORCH_CHECK(stride[0] > 0 && stride[1] > 0, "stride must be greater than 0");
    TORCH_CHECK(kernel_size[0] > 0 && kernel_size[1] > 0, "kernel_size must be greater than 0");
    TORCH_CHECK(dilation[0] > 0 && dilation[1] > 0, "dilation must be greater than 0");
```

c10::value_or_else is deprecated; std::optional::value_or is the recommended replacement.
The optional bias_opt is handled safely.

```cpp
    const at::Tensor& bias = c10::value_or_else(bias_opt, [] { return at::Tensor(); });
    uint32_t n = static_cast<uint32_t>(input.size(0));
    uint32_t c_in = static_cast<uint32_t>(input.size(3));
    uint32_t h_in = static_cast<uint32_t>(input.size(1));
    uint32_t w_in = static_cast<uint32_t>(input.size(2));
    uint32_t h_out = static_cast<uint32_t>(offset.size(1));
    uint32_t w_out = static_cast<uint32_t>(offset.size(2));
    uint32_t c_out = static_cast<uint32_t>(weight.size(0));
    uint32_t kh = static_cast<uint32_t>(weight.size(1));
    uint32_t kw = static_cast<uint32_t>(weight.size(2));
    TORCH_CHECK(kh == kernel_size[0] && kw == kernel_size[1], "kernel size mismatch");
    TORCH_CHECK(mask.size(-1) == kh * kw, "The shape of the mask is invalid");
    TORCH_CHECK(groups > 0, "groups must be greater than 0");
    TORCH_CHECK(c_out % groups == 0, "weight's out channel should be divided by groups");
    TORCH_CHECK(c_in % groups == 0, "input's channel should be divided by groups");
    bool modulated = true;
```

If the convolution is ungrouped and the input channel count is 256 or 512, DeformableConv2dV2 is called; otherwise DeformableConv2d. The two operators take their parameters in different orders.

The DeformableConv2dV2 operator has two outputs: output, of shape $N \times H_o \times W_o \times C_o$, and offset_output, of shape $N \times H_o W_o \times k_h k_w \times C_i$.
The DeformableConv2d operator has two outputs: output, of shape $N \times H_o \times C_o \times W_o$, and offset_output, of shape $N \times H_o \times W_o \times G \times \frac{k_h k_w C_i}{G}$.

Note: DeformableConv2dV2 requires $k_h k_w = 9$, but no such check is made here.

```cpp
    if ((groups == 1) && ((c_in == CHANNEL_256) || (c_in == CHANNEL_512))) {
        at::Tensor output = at::empty({n, h_out, w_out, c_out}, input.options());
        at::Tensor offset_output = at::empty({n, h_out * w_out, kh * kw, c_in}, input.options());
        EXEC_NPU_CMD(aclnnDeformableConv2dV2, input, offset, mask, weight, bias, kernel_size, stride, padding, dilation,
            groups, deformable_groups, modulated, with_bias, output, offset_output);
        output = output.permute({0, 3, 1, 2});
        return std::tie(output, offset_output);
    } else {
        at::Tensor output = at::empty({n, h_out, c_out, w_out}, input.options());
        at::Tensor offset_output = at::empty({n, h_out, w_out, groups, kh * kw * c_in / groups}, input.options());
        EXEC_NPU_CMD(aclnnDeformableConv2d, input, weight, bias, offset, mask, kernel_size, stride, padding, dilation,
            groups, deformable_groups, modulated, with_bias, output, offset_output);
        output = output.permute({0, 2, 1, 3});
        return std::tie(output, offset_output);
    }
}
```

DeformableConv2dKernel::Process

Call graph: DeformableConv2dKernel::Process runs PreProcess once, then ProcessVector and ProcessCube per task.
```cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::Process()
{
    PreProcess();
    for (uint32_t taskIdx = start_; taskIdx < end_; ++taskIdx) {
        ProcessVector(taskIdx);
        ProcessCube(taskIdx);
    }
    mm_.End();
}
```

DeformableConv2dKernel::PreProcess

All vector cores cooperate to compute the indices needed for one row of output, similar to an Allgather pattern.
TBuf::Get obtains a tensor of a specified length from the TBuf, or the full-length tensor.
auxH and auxW have size $W_o k_h k_w$ and store the convolution-window coordinates $(h_i, w_i)$ for one output row.
$$\begin{aligned} w_{i\_start} &= w_o \cdot s - p \\ h_{i\_start} &= h_o \cdot s - p \end{aligned}$$
Since auxH and auxW are precomputed once and reused for the indices of multiple output rows, auxH holds only the relative offsets within the window, without the $h_o \cdot s$ term.
auxStart_ is the starting index the current core is responsible for.
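For concreteness, a NumPy sketch (names illustrative) of the index tables PreProcess builds; in the kernel, each core fills only its [auxStart_, auxEnd_) slice of the w loop, and auxH deliberately omits the $h_o \cdot s$ term:

```python
import numpy as np

def build_aux(w_out, kh, kw, stride, pad, dilation):
    """Convolution-window coordinates for one output row, kh*kw entries per w."""
    auxW = np.empty(w_out * kh * kw, dtype=np.float32)
    auxH = np.empty_like(auxW)
    idx = 0
    for w in range(w_out):          # each core handles a sub-range of w
        for i in range(kh):
            for j in range(kw):
                auxW[idx] = w * stride - pad + j * dilation
                auxH[idx] = -pad + i * dilation   # relative; h_o * s is added per row later
                idx += 1
    return auxH, auxW
```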

```cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::PreProcess()
{
    LocalTensor<float> auxH = auxHBuf_.Get<float>();
    LocalTensor<float> auxW = auxWBuf_.Get<float>();
    uint32_t idx = 0;
    for (int32_t w = auxStart_; w < auxEnd_; ++w) {
        for (int32_t i = 0; i < kH_; ++i) {
            for (int32_t j = 0; j < kW_; ++j) {
                auxW.SetValue(idx, static_cast<float>(w * strideW_ - padW_ + j * dilationW_));
                auxH.SetValue(idx, static_cast<float>(-padH_ + i * dilationH_));
                ++idx;
            }
        }
    }
```

GlobalTensor::operator[] returns a new GlobalTensor shifted by the given offset.
valRptTimes_ is the vector repeat count covering $C_i$.
Each core copies its locally computed indices out to auxWGm_ and auxHGm_.

```cpp
    DataCopyPad(auxWGm_[auxStart_ * kernelSize_], auxW,
        {1, static_cast<uint16_t>(B32_BYTE_SIZE * (auxEnd_ - auxStart_) * kernelSize_), 0, 0});
    DataCopyPad(auxHGm_[auxStart_ * kernelSize_], auxH,
        {1, static_cast<uint16_t>(B32_BYTE_SIZE * (auxEnd_ - auxStart_) * kernelSize_), 0, 0});
    SyncAll();
```

After synchronization, each core copies the complete set of indices for one output row back from global memory.

Note: the problem here is that these two global-memory accesses have high latency. Convolution kernels are typically 3x3, so when $W_o$ is small, having each core compute the indices itself may well be cheaper than this cooperative scheme.

```cpp
    DataCopy(auxW, auxWGm_, {1, rowOffsetBlk_, 0, 0});
    DataCopy(auxH, auxHGm_, {1, rowOffsetBlk_, 0, 0});
```

feature is zeroed.

```cpp
    LocalTensor<float> feature = featureBuf_.Get<float>();
    Duplicate<float, false>(feature, 0.f, MASK_PLACEHOLDER, 4 * valRptTimes_, 1, 8);
}
```

DeformableConv2dKernel::ProcessVector

Call graph: DeformableConv2dKernel::ProcessVector calls CopyInOffset, ComputeWeight, and ComputeBilinearInterpolation.
```cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::ProcessVector(uint32_t taskIdx)
{
    uint32_t batch = taskIdx / hOut_;
    srcOffset_ = batch * hIn_ * wIn_ * cIn_;
    dstOffset_ = taskIdx * rowIn_;
    LocalTensor<float> offset = offsetBuf_.Get<float>();
    LocalTensor<float> auxW = auxWBuf_.Get<float>();
    LocalTensor<float> auxH = auxHBuf_.Get<float>();
    LocalTensor<int32_t> offsetInt = offsetIntBuf_.Get<int32_t>();
    LocalTensor<float> weight = weightBuf_.Get<float>();
    LocalTensor<float> feature = featureBuf_.Get<float>();
    LocalTensor<float> mask;
    if (modulated) {
        mask = maskBuf_.Get<float>();
    }
    LocalTensor<float> offsetOutput = offsetOutputBuf_.Get<float>();
```

DeformableConv2dKernel::CopyInOffset copies one row of $\Delta p_n$ and $\Delta m$ and de-interleaves $\Delta p_n$.
DeformableConv2dKernel::ComputeWeight computes the sampling positions and interpolation weights.

```cpp
    CopyInOffset(taskIdx, offset, mask);
    ComputeWeight(taskIdx, auxW, auxH, offset, offsetInt, weight, mask);
    SetFlag<HardEvent::V_MTE2>(calEvt_);
    WaitFlag<HardEvent::V_MTE2>(calEvt_);
    SetFlag<HardEvent::MTE3_V>(0);
    SetFlag<HardEvent::MTE3_V>(1);
    uint8_t ping = 0;
```

DeformableConv2dKernel::ComputeBilinearInterpolation does the loading, computing, and storing.

```cpp
    for (uint32_t w = 0; w < wOut_; ++w) {
        WaitFlag<HardEvent::MTE3_V>(ping);
        ComputeBilinearInterpolation(w, offset, offsetInt, feature, weight, offsetOutput[ping * kwIn_]);
        SetFlag<HardEvent::MTE3_V>(ping);
        ping = 1 - ping;
    }
    WaitFlag<HardEvent::MTE3_V>(0);
    WaitFlag<HardEvent::MTE3_V>(1);
}
```

DeformableConv2dKernel::CopyInOffset

```cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::CopyInOffset(
    uint32_t taskIdx, const LocalTensor<float>& offset, const LocalTensor<float>& mask)
{
    uint32_t offsetIdx = taskIdx * rowOffset_ * 2;
    DataCopy(offset, offsetGm_[offsetIdx], {1, doubleRowOffsetBlk_, 0, 0});
    if (modulated) {
        DataCopy(mask, maskGm_[taskIdx * rowOffset_], {1, rowOffsetBlk_, 0, 0});
    }
    SetFlag<HardEvent::MTE2_V>(copyEvt_);
    WaitFlag<HardEvent::MTE2_V>(copyEvt_);
    uint64_t cnt;
    GatherMask(offset[2 * alignedRowOffset_], offset, 2, false, MASK_PLACEHOLDER, gatherParams_, cnt);
    GatherMask(offset[3 * alignedRowOffset_], offset, 1, false, MASK_PLACEHOLDER, gatherParams_, cnt);
    SetVectorMask<float>(FULL_MASK, FULL_MASK);
}
```
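If I read the built-in gather patterns correctly (pattern 1 selects even-indexed elements, pattern 2 odd-indexed), the two GatherMask calls amount to this NumPy de-interleave of the (Δy, Δx) pairs:

```python
import numpy as np

# interleaved (dy, dx) pairs for four kernel points (values illustrative)
interleaved = np.arange(8, dtype=np.float32)  # [dy0, dx0, dy1, dx1, ...]
dx = interleaved[1::2]  # odd elements  (gather pattern 2, per my reading of the docs)
dy = interleaved[0::2]  # even elements (gather pattern 1)
```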

DeformableConv2dKernel::ComputeWeight

offset is an intermediate variable, and offsetInt and weight are outputs, yet all are passed as const references.

```cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::ComputeWeight(uint32_t taskIdx,
    const LocalTensor<float>& auxW, const LocalTensor<float>& auxH, const LocalTensor<float>& offset,
    const LocalTensor<int32_t>& offsetInt, const LocalTensor<float>& weight, const LocalTensor<float>& mask)
{
```

offset has size $4 \times W_o k_h k_w$ and serves as scratch space.
A Copy instruction fetches $x_i$.
h is $y_o$; adding $y_o \cdot s$ to auxH gives the actual coordinate $y_i$.
The first half of offset holds the convolution-window indices $p + p_n$.

```cpp
    int32_t h = taskIdx % hOut_;
    Copy<float, false>(offset, auxW, MASK_PLACEHOLDER, rptTimes_, {1, 1, 8, 8});
    Adds<float, false>(offset[alignedRowOffset_], auxH, float(h * strideH_), MASK_PLACEHOLDER, rptTimes_, {1, 1, 8, 8});
```

Because the memory is contiguous, a single add instruction computes the floating-point coordinates $p + p_n + \Delta p_n$:

```cpp
    Add<float, false>(offset, offset, offset[2 * alignedRowOffset_], MASK_PLACEHOLDER, 2 * rptTimes_, {1, 1, 1, 8, 8, 8});
```

offsetInt is cast to the integer coordinates $(y_1, x_1)$, and the second half of offset stores the floating-point top-left coordinates.

```cpp
    Cast<int32_t, float, false>(offsetInt, offset, RoundMode::CAST_FLOOR, MASK_PLACEHOLDER, 2 * rptTimes_, {1, 1, 8, 8});
    Cast<float, int32_t, false>(offset[2 * alignedRowOffset_], offsetInt, RoundMode::CAST_NONE, MASK_PLACEHOLDER,
        2 * rptTimes_, {1, 1, 8, 8});
```

The first half becomes the differences $y - y_1$ and $x - x_1$.
weight is filled with 1, so the second half becomes $y_2 - y$ and $x_2 - x$.

```cpp
    Sub<float, false>(offset, offset, offset[2 * alignedRowOffset_], MASK_PLACEHOLDER, 2 * rptTimes_,
        {1, 1, 1, 8, 8, 8}); // lw, lh
    Duplicate<float, false>(weight, 1.f, MASK_PLACEHOLDER, 2 * rptTimes_, 1, 8);
    Sub<float, false>(offset[2 * alignedRowOffset_], weight, offset, MASK_PLACEHOLDER, 2 * rptTimes_,
        {1, 1, 1, 8, 8, 8}); // hw, hh
```

Multiplying across the two dimensions gives the weights of the four points:
$$\begin{aligned} w_{11} &= (x_2 - x)(y_2 - y) \\ w_{21} &= (x - x_1)(y_2 - y) \\ w_{12} &= (x_2 - x)(y - y_1) \\ w_{22} &= (x - x_1)(y - y_1) \end{aligned}$$
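For reference, a NumPy sketch of these four weights (the kernel computes them vectorized over all $W_o k_h k_w$ points at once):

```python
import numpy as np

def bilinear_weights(y, x):
    """w11, w21, w12, w22 for sampling point (y, x); y1 = floor(y), y2 = y1 + 1."""
    y1, x1 = np.floor(y), np.floor(x)
    ly, lx = y - y1, x - x1        # y - y1, x - x1
    hy, hx = 1.0 - ly, 1.0 - lx    # y2 - y,  x2 - x
    return hx * hy, lx * hy, hx * ly, lx * ly
```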

```cpp
    Mul<float, false>(weight, offset[2 * alignedRowOffset_], offset[3 * alignedRowOffset_], MASK_PLACEHOLDER, rptTimes_,
        {1, 1, 1, 8, 8, 8}); // hw * hh
    Mul<float, false>(weight[alignedRowOffset_], offset, offset[3 * alignedRowOffset_], MASK_PLACEHOLDER, rptTimes_,
        {1, 1, 1, 8, 8, 8}); // lw * hh
    Mul<float, false>(weight[2 * alignedRowOffset_], offset[alignedRowOffset_], offset[2 * alignedRowOffset_],
        MASK_PLACEHOLDER, rptTimes_, {1, 1, 1, 8, 8, 8}); // hw * lh
    Mul<float, false>(weight[3 * alignedRowOffset_], offset, offset[alignedRowOffset_], MASK_PLACEHOLDER, rptTimes_,
        {1, 1, 1, 8, 8, 8}); // lh * lw
```

The modulation weights $\Delta m$ are multiplied into the four interpolation weights.
weight has shape $4 \times W_o k_h k_w$ while mask has shape $W_o k_h k_w$; the length mismatch forces four separate calls.
The copied code comments were never cleaned up.

```cpp
    if (modulated) {
        Mul<float, false>(weight, weight, mask, MASK_PLACEHOLDER, rptTimes_, {1, 1, 1, 8, 8, 8});
        Mul<float, false>(weight[alignedRowOffset_], weight[alignedRowOffset_], mask, MASK_PLACEHOLDER, rptTimes_,
            {1, 1, 1, 8, 8, 8}); // lw * hh
        Mul<float, false>(weight[2 * alignedRowOffset_], weight[2 * alignedRowOffset_], mask, MASK_PLACEHOLDER,
            rptTimes_, {1, 1, 1, 8, 8, 8}); // hw * lh
        Mul<float, false>(weight[3 * alignedRowOffset_], weight[3 * alignedRowOffset_], mask, MASK_PLACEHOLDER,
            rptTimes_, {1, 1, 1, 8, 8, 8}); // lh * lw
    }
}
```

DeformableConv2dKernel::ComputeBilinearInterpolation

The offset parameter is unused.

```cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::ComputeBilinearInterpolation(uint32_t w,
    const LocalTensor<float>& offset, const LocalTensor<int32_t>& offsetInt, const LocalTensor<float>& feature,
    const LocalTensor<float>& weight, const LocalTensor<float>& offsetOutput)
{
```

First, offsetOutput is zeroed; its shape is $k_h k_w C_i$.

```cpp
    Duplicate<float, false>(offsetOutput, 0.f, MASK_PLACEHOLDER, kernelSize_ * valRptTimes_, 1, 8);
    uint8_t ping = 0;
    uint32_t kernelOffset = w * kernelSize_;
    SetFlag<HardEvent::V_MTE2>(0);
    SetFlag<HardEvent::V_MTE2>(1);
```

The parameter w passed in is $x_o$.
pw and ph are indices into the arrays.
gmOffset is the one-dimensional offset of an input point.
SetFlag is the synchronization instruction between different pipelines within one core.

The Ascend C best-practices guide recommends moving data in as large blocks as possible per transfer.

```cpp
#pragma bisheng auto_sync parallel
    for (uint32_t kIdx = 0; kIdx < kernelSize_; ++kIdx) {
        uint32_t pw = kIdx + kernelOffset;
        uint32_t ph = pw + alignedRowOffset_;
        int32_t w0 = offsetInt.GetValue(pw);
        int32_t h0 = offsetInt.GetValue(ph);
        int32_t w1 = w0 + 1;
        int32_t h1 = h0 + 1;
        uint32_t outOffset = kIdx * cIn_;
        uint32_t ftOffset = ping * featureOffset_;
        WaitFlag<HardEvent::V_MTE2>(ping);
```

For each input point $(y, x)$: if $(y_1,x_1), (y_1,x_2), (y_2,x_1), (y_2,x_2)$ all lie inside the image, all four points are loaded in one copy.
Axpy multiplies the input elements by a scalar and accumulates the products into the destination.
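In NumPy terms, the four Axpy calls for one fully in-bounds kernel point amount to the following (sizes illustrative):

```python
import numpy as np

c_in = 64
out = np.zeros(c_in, dtype=np.float32)
corners = np.random.rand(4, c_in).astype(np.float32)  # f(y1,x1), f(y1,x2), f(y2,x1), f(y2,x2)
w = np.random.rand(4).astype(np.float32)              # w11, w21, w12, w22
for k in range(4):
    out += w[k] * corners[k]                          # Axpy: dst += scalar * src
```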

```cpp
        if (0 < h1 && h1 < hIn_) {
            if (0 < w1 && w1 < wIn_) {
                uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
                DataCopy(feature[ftOffset], xGm_[gmOffset], cpQuadValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
                    weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
                    weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            } else if (w1 == 0) {
                uint64_t gmOffset = srcOffset_ + (h0 * wIn_) * cIn_;
                DataCopy(feature[ftOffset + cIn_], xGm_[gmOffset], cpColDoubleValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
                    weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            } else if (w1 == wIn_) {
                uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
                DataCopy(feature[ftOffset], xGm_[gmOffset], cpColDoubleValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
                    weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            }
        } else if (h1 == 0) {
            if (0 < w1 && w1 < wIn_) {
                uint64_t gmOffset = srcOffset_ + w0 * cIn_;
                DataCopy(feature[ftOffset + 2 * cIn_], xGm_[gmOffset], cpRowDoubleValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
                    weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
                    weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            } else if (w1 == 0) {
                uint64_t gmOffset = srcOffset_;
                DataCopy(feature[ftOffset + 3 * cIn_], xGm_[gmOffset], cpOneValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
                    weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            } else if (w1 == wIn_) {
                uint64_t gmOffset = srcOffset_ + w0 * cIn_;
                DataCopy(feature[ftOffset + 2 * cIn_], xGm_[gmOffset], cpOneValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
                    weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            }
        } else if (h1 == hIn_) {
            if (0 < w1 && w1 < wIn_) {
                uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
                DataCopy(feature[ftOffset], xGm_[gmOffset], cpRowDoubleValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            } else if (w1 == 0) {
                uint64_t gmOffset = srcOffset_ + (h0 * wIn_) * cIn_;
                DataCopy(feature[ftOffset + cIn_], xGm_[gmOffset], cpOneValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            } else if (w1 == wIn_) {
                uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
                DataCopy(feature[ftOffset], xGm_[gmOffset], cpOneValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            }
        }
        SetFlag<HardEvent::V_MTE2>(ping);
        ping = 1 - ping;
    }
```

The interpolated $k_h k_w C_i$ row is copied out in segments to the global-memory offsetOutputGm_, whose shape is $G \times W_o \times \frac{k_h k_w C_i}{G}$ per output row.

```cpp
    SetFlag<HardEvent::V_MTE3>(calEvt_);
    WaitFlag<HardEvent::V_MTE3>(calEvt_);
    for (uint32_t i = 0; i < groups_; ++i) {
        DataCopy(offsetOutputGm_[dstOffset_ + rowInPerGroup_ * i], offsetOutput[i * cInPerGroup_], cpOffsetOutParams_);
    }
    dstOffset_ += kwInPerGroup_;
    WaitFlag<HardEvent::V_MTE2>(0);
    WaitFlag<HardEvent::V_MTE2>(1);
}
```

DeformableConv2dKernel::ProcessCube

SetTensorB marks the right-hand matrix B of the matmul as transposed.
Batch matmul is implemented with a loop: weight has shape $G \times \frac{C_o}{G} \times \frac{k_h k_w C_i}{G}$, im2col has shape $G \times W_o \times \frac{k_h k_w C_i}{G}$, and the output is $G \times \frac{C_o}{G} \times W_o$.
Across all cores this yields an output of shape $N \times H_o \times C_o \times W_o$.
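A NumPy sketch of one output row's grouped matmul (sizes illustrative; the per-group transpose corresponds to SetTensorB(..., true) in the code below):

```python
import numpy as np

G, Co, Wo, K = 2, 8, 5, 18   # K = kh * kw * Ci / G
weight = np.random.rand(G, Co // G, K).astype(np.float32)
im2col = np.random.rand(G, Wo, K).astype(np.float32)
out = np.einsum('gck,gwk->gcw', weight, im2col)   # (G, Co/G, Wo): A @ B^T per group
```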

```cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::ProcessCube(uint32_t taskIdx)
{
    uint64_t aOffset = 0;
    uint64_t bOffset = taskIdx * rowIn_;
    uint64_t cOffset = taskIdx * rowOut_;
    for (uint32_t i = 0; i < groups_; ++i) {
        mm_.SetTensorA(weightGm_[aOffset]);
        mm_.SetTensorB(offsetOutputGm_[bOffset], true);
        mm_.template IterateAll<false>(yGm_[cOffset]);
        aOffset += kernelPerGroup_;
        bOffset += rowInPerGroup_;
        cOffset += rowOutPerGroup_;
    }
}
```

DeformableConv2dV2Kernel::Process

Call graph: DeformableConv2dV2Kernel::Process calls ProcessVector and ProcessCube.

DeformableConv2dV2Kernel::ProcessVector produces one row of the im2col matrix per call. Once cubeTileTaskCount_ rows have accumulated, DeformableConv2dV2Kernel::ProcessCube is invoked to perform the convolution.

```cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dV2Kernel<modulated>::Process()
{
    for (int32_t taskIdx = start_; taskIdx < end_; taskIdx++) {
        ProcessVector(taskIdx);
        int32_t innerCubeTaskIdx = (taskIdx - start_) % cubeTileTaskCount_;
        bool startCubeFlag = (innerCubeTaskIdx == cubeTileTaskCount_ - 1) || (taskIdx == end_ - 1);
        if (startCubeFlag) {
            ProcessCube(taskIdx, innerCubeTaskIdx);
        }
    }
    mm_.End();
}
```

DeformableConv2dV2Kernel::ProcessVector

Call graph: DeformableConv2dV2Kernel::ProcessVector calls CopyInFeature.

Each call handles the inputs corresponding to one point of the convolution output feature map, i.e., one row of the im2col matrix: the values within one kernel window are unrolled and written to img2colMatGm_.
taskIdx is decoded into the corresponding (n, h_out, w_out).

```cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dV2Kernel<modulated>::ProcessVector(uint32_t taskIdx)
{
    int16_t batchIdx = taskIdx / (featureMapSize_);
    int16_t hOutIdx = (taskIdx % (featureMapSize_)) / wOut_;
    int16_t wOutIdx = taskIdx % wOut_;
```

The $\Delta p$ and $\Delta m$ for the current taskIdx are loaded into local memory. Each copy moves only 18 or 9 elements, which is far too few.

```cpp
    // CopyIn Offset
    DataCopy(copyInOffsetLocal_, offsetGm_[taskIdx * OFFSET_SIZE], OFFSET_ALIGNED_SIZE);
    SetFlag<HardEvent::MTE2_V>(copyInOffsetEventID);
    if (modulated) {
        DataCopy(maskLocal_, maskGm_[taskIdx * X_OFFSET_SIZE], X_OFFSET_ALIGNED_SIZE);
        SetFlag<HardEvent::MTE2_V>(copyInMaskEventID);
    }
    WaitFlag<HardEvent::MTE2_V>(copyInOffsetEventID);
```

The interleaved $(\Delta y, \Delta x)$ pairs are separated into the independent xOffsetLocal_ and yOffsetLocal_ buffers.
Adding the convolution-window coordinates gives $p_n + \Delta p_n$.

```cpp
    GatherMask(xOffsetLocal_, copyInOffsetLocal_, 1, true, maskForGatherMask_, {1, 1, 8, 0}, cnt_);
    GatherMask(yOffsetLocal_, copyInOffsetLocal_, 2, true, maskForGatherMask_, {1, 1, 8, 0}, cnt_);
    Add(xOffsetLocal_, xOffsetLocal_, constKHIdxLocal_, X_OFFSET_ALIGNED_SIZE);
    Add(yOffsetLocal_, yOffsetLocal_, constKWIdxLocal_, X_OFFSET_ALIGNED_SIZE);
```

Flooring the floating-point coordinates $(i + \Delta h_i, j + \Delta w_i)$ yields the corner coordinates in the four directions needed for bilinear interpolation.
The fractional offsets are then computed.

```cpp
    Floor(topPosLocal_, xOffsetLocal_, X_OFFSET_ALIGNED_SIZE);
    Floor(leftPosLocal_, yOffsetLocal_, X_OFFSET_ALIGNED_SIZE);
    Adds(bottomPosLocal_, topPosLocal_, 1.0f, X_OFFSET_ALIGNED_SIZE);
    Adds(rightPosLocal_, leftPosLocal_, 1.0f, X_OFFSET_ALIGNED_SIZE);
```

fracHfracW为单个方向上的插值权重 y−y1y-y_1yy1x−x1x-x_1xx1

```cpp
    Sub(fracHLocal_, xOffsetLocal_, topPosLocal_, X_OFFSET_ALIGNED_SIZE);
    Sub(fracWLocal_, yOffsetLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE);
```

Subtracting the kernel radius from the output point's coordinates locates the start of the corresponding input region. This assumes $s_h = 1, s_w = 1$, yet the operator entry point never enforces that condition.
The formula for the top-left corner $(h_0, w_0)$ of the convolution window is:
$$\begin{aligned} h_0 &= h_o \cdot s_h - p_h \\ w_0 &= w_o \cdot s_w - p_w \end{aligned}$$
Adding the relative coordinates gives the coordinates of all points in the window, $p + p_n + \Delta p_n$.
topPosLocal_ and leftPosLocal_ hold $(y_1, x_1)$.

```cpp
    // global position
    Adds(topPosLocal_, topPosLocal_, hOutIdx - kH_ / 2 + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);
    Adds(leftPosLocal_, leftPosLocal_, wOutIdx - kW_ / 2 + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);
```

The one-dimensional memory offsets of the four interpolation corners are computed:
$$\begin{aligned} \mathrm{offset}_1 &= (y_1 \cdot W_i + x_1)C_i \\ \mathrm{offset}_2 &= (y_1 \cdot W_i + x_2)C_i \\ \mathrm{offset}_3 &= (y_2 \cdot W_i + x_1)C_i \\ \mathrm{offset}_4 &= (y_2 \cdot W_i + x_2)C_i \end{aligned}$$
topLeftOffsetLocal_, topRightOffsetLocal_, bottomLeftOffsetLocal_, and bottomRightOffsetLocal_ are contiguous in memory, so a single instruction can process all four.

```cpp
    // global Offset
    Muls(topPosLocal_, topPosLocal_, wOut_ + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);
    Add(topLeftOffsetLocal_, topPosLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE); // global (h * wOut + w)
    Add(topRightOffsetLocal_, topPosLocal_, rightPosLocal_, X_OFFSET_ALIGNED_SIZE);
    Add(bottomLeftOffsetLocal_, bottomPosLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE);
    Add(bottomRightOffsetLocal_, bottomPosLocal_, rightPosLocal_, X_OFFSET_ALIGNED_SIZE);
    Muls(topLeftOffsetLocal_, topLeftOffsetLocal_, cIn_ + 0.0f, 4 * X_OFFSET_ALIGNED_SIZE);
    Adds(topLeftOffsetLocal_, topLeftOffsetLocal_, batchIdx * featureMapElementsSize_ + 0.0f,
        4 * X_OFFSET_ALIGNED_SIZE); // global offset
```

CompareScalar compares each element of a tensor against a scalar, writing each result to the corresponding bit of the output.
topPosLocal_, bottomPosLocal_, leftPosLocal_, and rightPosLocal_ are contiguous in memory, each of size X_OFFSET_ALIGNED_SIZE; the code simply uses 64 as the length.
Evidently, due to address-alignment constraints, the 36 valid elements are padded up to 64.
inGlobalLocal_ has size IN_GLOBAL_BUF_SIZE * sizeof(uint32_t) and records whether the four corner sets are in-bounds along each axis.
inGlobalLocal_ is of type uint32_t; each CompareScalar processes 64 elements and stores the result bits in the first two uint32 elements of its segment of inGlobalLocal_.
The comparisons are $0 \le y_1$, $0 \le y_2$, $0 \le x_1$, $0 \le x_2$ and $y_1 < H_i$, $y_2 < H_i$, $x_1 < W_i$, $x_2 < W_i$.
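In boolean-array form (CompareScalar actually writes bitmasks), the tests and the AND-merges that follow look like this:

```python
import numpy as np

h_in, w_in = 16, 16
y1 = np.random.randint(-2, h_in + 2, 9)   # topPos
x1 = np.random.randint(-2, w_in + 2, 9)   # leftPos
y2, x2 = y1 + 1, x1 + 1                   # bottomPos, rightPos

in_y1, in_y2 = (y1 >= 0) & (y1 < h_in), (y2 >= 0) & (y2 < h_in)
in_x1, in_x2 = (x1 >= 0) & (x1 < w_in), (x2 >= 0) & (x2 < w_in)

top_left, bottom_right = in_y1 & in_x1, in_y2 & in_x2
top_right, bottom_left = in_y1 & in_x2, in_y2 & in_x1
```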

```cpp
    // in global flag
    CompareScalar(inGlobalLocal_.ReinterpretCast<uint8_t>(), topPosLocal_, 0.0f, CMPMODE::GE, 64);
    CompareScalar(inGlobalLocal_[8].ReinterpretCast<uint8_t>(), bottomPosLocal_, 0.0f, CMPMODE::GE, 64);
    CompareScalar(inGlobalLocal_[16].ReinterpretCast<uint8_t>(), leftPosLocal_, 0.0f, CMPMODE::GE, 64);
    CompareScalar(inGlobalLocal_[24].ReinterpretCast<uint8_t>(), rightPosLocal_, 0.0f, CMPMODE::GE, 64);
    CompareScalar(inGlobalLocal_[32].ReinterpretCast<uint8_t>(), topPosLocal_, featureMapSize_ + 0.0f, CMPMODE::LT, 64);
    CompareScalar(inGlobalLocal_[40].ReinterpretCast<uint8_t>(), bottomPosLocal_, featureMapSize_ + 0.0f, CMPMODE::LT, 64);
    CompareScalar(inGlobalLocal_[48].ReinterpretCast<uint8_t>(), leftPosLocal_, wOut_ + 0.0f, CMPMODE::LT, 64);
    CompareScalar(inGlobalLocal_[56].ReinterpretCast<uint8_t>(), rightPosLocal_, wOut_ + 0.0f, CMPMODE::LT, 64);
```

The results along the two directions are merged, i.e., $0 \le y_1 < H_i$, $0 \le y_2 < H_i$, $0 \le x_1 < W_i$, $0 \le x_2 < W_i$:

```cpp
    And(inGlobalLocal_[32].ReinterpretCast<uint16_t>(), inGlobalLocal_.ReinterpretCast<uint16_t>(),
        inGlobalLocal_[32].ReinterpretCast<uint16_t>(), 64);
```

The validity of $(y_1, x_1)$ and $(y_2, x_2)$ is computed:

```cpp
    And(inGlobalLocal_.ReinterpretCast<uint16_t>(), inGlobalLocal_[32].ReinterpretCast<uint16_t>(),
        inGlobalLocal_[48].ReinterpretCast<uint16_t>(), 32); // TopLeft, BottomRight
```

The validity of $(y_1, x_2)$ and $(y_2, x_1)$ is computed:

```cpp
    And(inGlobalLocal_[16].ReinterpretCast<uint16_t>(), inGlobalLocal_[32].ReinterpretCast<uint16_t>(),
        inGlobalLocal_[56].ReinterpretCast<uint16_t>(), 16); // TopRight
    And(inGlobalLocal_[24].ReinterpretCast<uint16_t>(), inGlobalLocal_[40].ReinterpretCast<uint16_t>(),
        inGlobalLocal_[48].ReinterpretCast<uint16_t>(), 16); // BottomLeft
```

Select picks elements according to the bits of selMask (the selection mask).
The out-of-bounds positions of the four corner sets are set to -1.0f, so the subsequent copy can directly discard them or treat them as 0.

```cpp
    Select(topLeftOffsetLocal_, inGlobalLocal_.ReinterpretCast<uint16_t>(), topLeftOffsetLocal_, -1.0f,
        SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
    Select(bottomRightOffsetLocal_, inGlobalLocal_[8].ReinterpretCast<uint16_t>(), bottomRightOffsetLocal_, -1.0f,
        SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
    Select(topRightOffsetLocal_, inGlobalLocal_[16].ReinterpretCast<uint16_t>(), topRightOffsetLocal_, -1.0f,
        SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
    Select(bottomLeftOffsetLocal_, inGlobalLocal_[24].ReinterpretCast<uint16_t>(), bottomLeftOffsetLocal_, -1.0f,
        SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
```

A scalar-waits-on-vector synchronization has to be inserted here.
oneSubFracHLocal_ and oneSubFracWLocal_ are contiguous in memory.
The one-dimensional interpolation weights $y_2 - y$ and $x_2 - x$ are computed.

```cpp
    SetFlag<HardEvent::V_S>(V_SEventID);
    WaitFlag<HardEvent::V_S>(V_SEventID);
    Muls(oneSubFracHLocal_, fracHLocal_, -1.0f, 2 * X_OFFSET_ALIGNED_SIZE);
    Adds(oneSubFracHLocal_, oneSubFracHLocal_, 1.0f, 2 * X_OFFSET_ALIGNED_SIZE); // 1-fracH, 1-fracW
```

The modulation weight is multiplied into the four interpolation weights: $\Delta m(y - y_1)$, $\Delta m(x - x_1)$, $\Delta m(y_2 - y)$, $\Delta m(x_2 - x)$.

```cpp
    if (modulated) {
        WaitFlag<HardEvent::MTE2_V>(copyInMaskEventID);
        Mul(fracHLocal_, fracHLocal_, maskLocal_, X_OFFSET_ALIGNED_SIZE);
        Mul(oneSubFracHLocal_, oneSubFracHLocal_, maskLocal_, X_OFFSET_ALIGNED_SIZE);
    }
```

Brcb takes 8 elements of the input tensor at a time and writes each into one datablock (32 bytes) of the result tensor, one element per datablock.
Multiplying the interpolation coefficients with the input requires a low-axis broadcast. In the computation below the two are of unequal length, so each coefficient is broadcast to $\frac{C_i}{8}$.
fracHBroadcastLocal_ has capacity $9 \times \frac{C_i}{8} \times \mathrm{block}$.
brcbParams_ sets the element stride to $\frac{C_i}{64}$ blocks and the repeat stride to $\frac{C_i}{8}$ blocks; that is, $C_i$ is split into eight equal parts, with the datablock at each split point holding a valid value and the other positions undefined.
The span covered is $16 \times \frac{C_i}{64} \times \mathrm{block} = 2C_i$.
Each element of fracHLocal_ fills one datablock of fracHBroadcastLocal_, adjacent elements being $\frac{C_i}{64}$ datablocks apart (8 blocks for $C_i = 512$).

```cpp
    // Broadcast
    Brcb(fracHBroadcastLocal_, fracHLocal_, 2, brcbParams_);
    Brcb(fracWBroadcastLocal_, fracWLocal_, 2, brcbParams_);
    Brcb(oneSubFracHBroadcastLocal_, oneSubFracHLocal_, 2, brcbParams_);
    Brcb(oneSubFracWBroadcastLocal_, oneSubFracWLocal_, 2, brcbParams_);
```

DATA_BLOCK_SIZE is 8, FOUR_CORNERS is 4, and X_OFFSET_SIZE is 9.
maskForBroadcast_ equals dataBlockPerInputChannel_ - DATA_BLOCK_SIZE.
A single Copy instruction then broadcasts the data of the first datablock to the other blocks within each $C_i$; the overall shape is $4 \times 9 \times C_i$.
The number of blocks copied per iteration is:
$$N = \left\lceil \frac{\mathrm{Mask}}{8} \right\rceil = \left\lceil \frac{C_i/8 - 8}{8} \right\rceil = \left\lceil \frac{C_i}{64} \right\rceil - 1$$
The srcRepeatSize and dstRepeatSize parameters are set to $\frac{C_i}{64}$.
Since the first broadcast step spaced adjacent elements $\frac{C_i}{64}$ blocks apart, each set of interpolation weights ends up with a valid length of $\frac{9C_i}{8}$.
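The net effect of Brcb plus this Copy (stated functionally, ignoring the instruction-level block layout): each of the 9 coefficients ends up applied to all $C_i$ channels of its feature point. In NumPy:

```python
import numpy as np

ci, k = 64, 9                                         # illustrative C_i
coeff = np.random.rand(k).astype(np.float32)          # one weight per kernel point
feature = np.random.rand(k * ci).astype(np.float32)   # 9 points x C_i channels
out = feature * np.repeat(coeff, ci)                  # low-axis broadcast of the coefficients
```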

```cpp
    Copy(fracHBroadcastLocal_[DATA_BLOCK_SIZE], fracHBroadcastLocal_, maskForBroadcast_, FOUR_CORNERS * X_OFFSET_SIZE,
        copyParams_);
```

The DeformableConv2dV2Kernel::CopyInFeature function loads the inputs according to topLeftOffsetLocal_ and fracHBroadcastLocal_ and interpolates.
The result in outFeatureLocal_ is then copied out to global memory.

```cpp
    CopyInFeature();
    SetFlag<HardEvent::V_MTE3>(copyOutEventID);
    WaitFlag<HardEvent::V_MTE3>(copyOutEventID);
    DataCopyPad(img2colMatGm_[taskIdx * elementsCountPerTask_], outFeatureLocal_,
        {1, static_cast<uint32_t>(elementsCountPerTask_ * FP32_BYTE_SIZE), 0, 0, 0});
}
```

DeformableConv2dV2Kernel::CopyInFeature

The function takes no parameters, which obscures the variables it depends on.
topLeft0 and its siblings should be compared against an integer, not -1.0f.
The code is fully unrolled; it could seemingly be written as a for loop, as in v1.
After the channels of the 9 input points are loaded, they are multiplied by the weights.
topLeftWeightLocal_ is $\Delta m \cdot w_{11} = \Delta m (y_2 - y)(x_2 - x)$.
Only the first $\frac{9C_i}{8}$ elements of topLeftWeightLocal_ are valid.

```cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dV2Kernel<modulated>::CopyInFeature()
{
    int32_t topLeft0 = topLeftOffsetLocal_.GetValue(0);
    int32_t topLeft1 = topLeftOffsetLocal_.GetValue(1);
    int32_t topLeft2 = topLeftOffsetLocal_.GetValue(2);
    int32_t topLeft3 = topLeftOffsetLocal_.GetValue(3);
    int32_t topLeft4 = topLeftOffsetLocal_.GetValue(4);
    int32_t topLeft5 = topLeftOffsetLocal_.GetValue(5);
    int32_t topLeft6 = topLeftOffsetLocal_.GetValue(6);
    int32_t topLeft7 = topLeftOffsetLocal_.GetValue(7);
    int32_t topLeft8 = topLeftOffsetLocal_.GetValue(8);
    (topLeft0 == -1.0f) ? Duplicate(topLeftFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
        DataCopy(topLeftFeatureLocal_[0 * cIn_], xGm_[topLeft0], cIn_);
    (topLeft1 == -1.0f) ? Duplicate(topLeftFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
        DataCopy(topLeftFeatureLocal_[1 * cIn_], xGm_[topLeft1], cIn_);
    (topLeft2 == -1.0f) ? Duplicate(topLeftFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
        DataCopy(topLeftFeatureLocal_[2 * cIn_], xGm_[topLeft2], cIn_);
    (topLeft3 == -1.0f) ? Duplicate(topLeftFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
        DataCopy(topLeftFeatureLocal_[3 * cIn_], xGm_[topLeft3], cIn_);
    (topLeft4 == -1.0f) ? Duplicate(topLeftFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
        DataCopy(topLeftFeatureLocal_[4 * cIn_], xGm_[topLeft4], cIn_);
    (topLeft5 == -1.0f) ? Duplicate(topLeftFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
        DataCopy(topLeftFeatureLocal_[5 * cIn_], xGm_[topLeft5], cIn_);
    (topLeft6 == -1.0f) ? Duplicate(topLeftFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
        DataCopy(topLeftFeatureLocal_[6 * cIn_], xGm_[topLeft6], cIn_);
    (topLeft7 == -1.0f) ? Duplicate(topLeftFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
        DataCopy(topLeftFeatureLocal_[7 * cIn_], xGm_[topLeft7], cIn_);
    (topLeft8 == -1.0f) ? Duplicate(topLeftFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
        DataCopy(topLeftFeatureLocal_[8 * cIn_], xGm_[topLeft8], cIn_);
    Mul(topLeftWeightLocal_, oneSubFracHBroadcastLocal_, oneSubFracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);
```

The Mul sets src1BlkStride to 0, implementing a low-axis broadcast multiply: each datablock of topLeftWeightLocal_ multiplies 8 consecutive datablocks of topLeftFeatureLocal_.
src1RepStride is 1.
repeatTimes_ equals $\frac{9C_i}{8 \times 8}$, i.e., $9C_i$ elements are processed in total.
To make the multiply cover the 9 points, the weights must be laid out in segments of length ci / DATA_SIZE_PER_REPEAT.
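My reading of this stride pattern, as a NumPy sketch: within each repeat the same weight datablock multiplies all 8 feature datablocks, and the next repeat advances the weight by one datablock:

```python
import numpy as np

repeats, blk = 4, 8                                          # illustrative sizes
feat = np.random.rand(repeats, blk, blk).astype(np.float32)  # (repeat, block, lane)
w = np.random.rand(repeats, blk).astype(np.float32)          # one datablock per repeat
out = feat * w[:, None, :]        # src1BlkStride = 0: datablock reused across a repeat
```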

```cpp
    SetFlag<HardEvent::MTE3_V>(MTE3_VEventID);
    WaitFlag<HardEvent::MTE3_V>(MTE3_VEventID);
    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    Mul(outFeatureLocal_, topLeftFeatureLocal_, topLeftWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});
```
```cpp
    int32_t topRight0 = topRightOffsetLocal_.GetValue(0);
    int32_t topRight1 = topRightOffsetLocal_.GetValue(1);
    int32_t topRight2 = topRightOffsetLocal_.GetValue(2);
    int32_t topRight3 = topRightOffsetLocal_.GetValue(3);
    int32_t topRight4 = topRightOffsetLocal_.GetValue(4);
    int32_t topRight5 = topRightOffsetLocal_.GetValue(5);
    int32_t topRight6 = topRightOffsetLocal_.GetValue(6);
    int32_t topRight7 = topRightOffsetLocal_.GetValue(7);
    int32_t topRight8 = topRightOffsetLocal_.GetValue(8);
    (topRight0 == -1.0f) ? Duplicate(topRightFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
        DataCopy(topRightFeatureLocal_[0 * cIn_], xGm_[topRight0], cIn_);
    (topRight1 == -1.0f) ? Duplicate(topRightFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
        DataCopy(topRightFeatureLocal_[1 * cIn_], xGm_[topRight1], cIn_);
    (topRight2 == -1.0f) ? Duplicate(topRightFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
        DataCopy(topRightFeatureLocal_[2 * cIn_], xGm_[topRight2], cIn_);
    (topRight3 == -1.0f) ? Duplicate(topRightFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
        DataCopy(topRightFeatureLocal_[3 * cIn_], xGm_[topRight3], cIn_);
    (topRight4 == -1.0f) ? Duplicate(topRightFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
        DataCopy(topRightFeatureLocal_[4 * cIn_], xGm_[topRight4], cIn_);
    (topRight5 == -1.0f) ? Duplicate(topRightFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
        DataCopy(topRightFeatureLocal_[5 * cIn_], xGm_[topRight5], cIn_);
    (topRight6 == -1.0f) ? Duplicate(topRightFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
        DataCopy(topRightFeatureLocal_[6 * cIn_], xGm_[topRight6], cIn_);
    (topRight7 == -1.0f) ? Duplicate(topRightFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
        DataCopy(topRightFeatureLocal_[7 * cIn_], xGm_[topRight7], cIn_);
    (topRight8 == -1.0f) ? Duplicate(topRightFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
        DataCopy(topRightFeatureLocal_[8 * cIn_], xGm_[topRight8], cIn_);
    Mul(topRightWeightLocal_, oneSubFracHBroadcastLocal_, fracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);
    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    MulAddDst(outFeatureLocal_, topRightFeatureLocal_, topRightWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});
    int32_t bottomLeft0 = bottomLeftOffsetLocal_.GetValue(0);
    int32_t bottomLeft1 = bottomLeftOffsetLocal_.GetValue(1);
    int32_t bottomLeft2 = bottomLeftOffsetLocal_.GetValue(2);
    int32_t bottomLeft3 = bottomLeftOffsetLocal_.GetValue(3);
    int32_t bottomLeft4 = bottomLeftOffsetLocal_.GetValue(4);
    int32_t bottomLeft5 = bottomLeftOffsetLocal_.GetValue(5);
    int32_t bottomLeft6 = bottomLeftOffsetLocal_.GetValue(6);
    int32_t bottomLeft7 = bottomLeftOffsetLocal_.GetValue(7);
    int32_t bottomLeft8 = bottomLeftOffsetLocal_.GetValue(8);
    (bottomLeft0 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomLeftFeatureLocal_[0 * cIn_], xGm_[bottomLeft0], cIn_);
    (bottomLeft1 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomLeftFeatureLocal_[1 * cIn_], xGm_[bottomLeft1], cIn_);
    (bottomLeft2 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomLeftFeatureLocal_[2 * cIn_], xGm_[bottomLeft2], cIn_);
    (bottomLeft3 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomLeftFeatureLocal_[3 * cIn_], xGm_[bottomLeft3], cIn_);
    (bottomLeft4 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomLeftFeatureLocal_[4 * cIn_], xGm_[bottomLeft4], cIn_);
    (bottomLeft5 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomLeftFeatureLocal_[5 * cIn_], xGm_[bottomLeft5], cIn_);
    (bottomLeft6 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomLeftFeatureLocal_[6 * cIn_], xGm_[bottomLeft6], cIn_);
    (bottomLeft7 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomLeftFeatureLocal_[7 * cIn_], xGm_[bottomLeft7], cIn_);
    (bottomLeft8 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomLeftFeatureLocal_[8 * cIn_], xGm_[bottomLeft8], cIn_);
    Mul(bottomLeftWeightLocal_, oneSubFracWBroadcastLocal_, fracHBroadcastLocal_, 9 * dataBlockPerInputChannel_);
    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    MulAddDst(outFeatureLocal_, bottomLeftFeatureLocal_, bottomLeftWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});
    int32_t bottomRight0 = bottomRightOffsetLocal_.GetValue(0);
    int32_t bottomRight1 = bottomRightOffsetLocal_.GetValue(1);
    int32_t bottomRight2 = bottomRightOffsetLocal_.GetValue(2);
    int32_t bottomRight3 = bottomRightOffsetLocal_.GetValue(3);
    int32_t bottomRight4 = bottomRightOffsetLocal_.GetValue(4);
    int32_t bottomRight5 = bottomRightOffsetLocal_.GetValue(5);
    int32_t bottomRight6 = bottomRightOffsetLocal_.GetValue(6);
    int32_t bottomRight7 = bottomRightOffsetLocal_.GetValue(7);
    int32_t bottomRight8 = bottomRightOffsetLocal_.GetValue(8);
    (bottomRight0 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomRightFeatureLocal_[0 * cIn_], xGm_[bottomRight0], cIn_);
    (bottomRight1 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomRightFeatureLocal_[1 * cIn_], xGm_[bottomRight1], cIn_);
    (bottomRight2 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomRightFeatureLocal_[2 * cIn_], xGm_[bottomRight2], cIn_);
    (bottomRight3 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomRightFeatureLocal_[3 * cIn_], xGm_[bottomRight3], cIn_);
    (bottomRight4 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomRightFeatureLocal_[4 * cIn_], xGm_[bottomRight4], cIn_);
    (bottomRight5 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomRightFeatureLocal_[5 * cIn_], xGm_[bottomRight5], cIn_);
    (bottomRight6 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomRightFeatureLocal_[6 * cIn_], xGm_[bottomRight6], cIn_);
    (bottomRight7 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomRightFeatureLocal_[7 * cIn_], xGm_[bottomRight7], cIn_);
    (bottomRight8 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
        DataCopy(bottomRightFeatureLocal_[8 * cIn_], xGm_[bottomRight8], cIn_);
    Mul(bottomRightWeightLocal_, fracHBroadcastLocal_, fracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);
    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    MulAddDst(outFeatureLocal_, bottomRightFeatureLocal_, bottomRightWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});
}
```

DeformableConv2dV2Kernel::ProcessCube

innerCubeTaskIdx is the index of the tile's last element. Assuming the tile's starting index is 0, the number of im2col rows, cubeTaskCount, follows.
elementsCountPerTask_ is $k_h k_w C_i$.
aOffset and cOffset are the current core's starting offsets into the A and C matrices.

```cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dV2Kernel<modulated>::ProcessCube(uint32_t taskIdx, const int32_t& innerCubeTaskIdx)
{
    int32_t cubeTaskCount = innerCubeTaskIdx + 1;
    uint64_t aOffset = (taskIdx - innerCubeTaskIdx) * elementsCountPerTask_;
    uint64_t cOffset = (taskIdx - innerCubeTaskIdx) * cOut_;
```

SetTensorA sets the left matrix A of the matmul.
SetTensorB sets the right matrix B.
SetSingleShape sets the single-core matmul shape singleMIn, singleNIn, singleKIn, in elements.

IterateAll computes a C matrix of size singleCoreM * singleCoreN; the iteration order can be adjusted via the tiling parameter iterateOrder.

img2col has shape $128 \times k_h k_w C_i$, weight has shape $C_o \times k_h k_w C_i$, and the output has shape $128 \times C_o$.
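Equivalently, per cube tile (sizes illustrative; the transpose matches SetTensorB(weightGm_, true) in the code below):

```python
import numpy as np

rows, co, k = 128, 64, 9 * 256           # k = kh * kw * Ci
im2col = np.random.rand(rows, k).astype(np.float32)
weight = np.random.rand(co, k).astype(np.float32)
out = im2col @ weight.T                  # (rows, Co)
```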

```cpp
    mm_.SetTensorA(img2colMatGm_[aOffset]);
    mm_.SetTensorB(weightGm_, true);
    mm_.SetSingleShape(cubeTaskCount, cOut_, elementsCountPerTask_);
    mm_.template IterateAll<false>(yGm_[cOffset]);
}
```

References:

  • SetL2CacheHint
  • Some Thoughts on Ascend C
  • Pushing the Limits: Huawei’s AI Chip Tests U.S. Export Controls
  • FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs
  • Notes on Using the Ascend 310P
  • PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
  • Hardware Architecture Abstraction
  • Unaligned Scenarios
  • Kernel Functions
  • Usage Instructions
  • An Introduction to the NPU's Hardened Task Scheduler
  • Ascend-CC: Confidential Computing on Heterogeneous NPU for Emerging Generative AI Workloads
  • Nvidia GPUs vs. Huawei NPUs
  • 7.5. Compute Scheduling and Execution
  • Automatic Insertion of High-Performance Synchronization Primitives for Ascend Processors
  • Introduction to Synchronization Control
  • Setting the Number of AI CPUs, Control CPUs, and Data CPUs on a Given Chip
  • Broadcast
  • AIV and AIC Combined-Launch Issues