
modulated_deform_conv2d in Ascend DrivingSDK (Part 1)

2025/8/12 15:36:40

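Ascend DrivingSDK is an operator and model acceleration library for autonomous-driving scenarios, built on the Ascend NPU platform. It provides a set of high-performance operators and model acceleration interfaces and supports the PyTorch framework.

modulated_deform_conv2d in Ascend DrivingSDK is one of the few fused operators: a single kernel performs the whole Deformable Convolution. However, the 910B separates its vector cores from its cube cores, so synchronization between the two is expensive. 910B-series chips have between 96 MB and 192 MB of L2 cache, which is enabled by default, so the inputs of modulated_deform_conv2d essentially all reside in L2 cache.

At the Ascend C level, modulated_deform_conv2d is backed by two operators; v2 is optimized for 3x3 convolutions. Oddly, the optimization was added as a new operator rather than as a new kernel. For reasons unknown, the two versions also order their parameter lists differently.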

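| op | PreProcess | ComputeWeight | ComputeBilinearInterpolation | ProcessCube | Output |
| --- | --- | --- | --- | --- | --- |
| deformable_conv2d | Caches the convolution-window indices of one row and reuses them $H_o$ times | Vectorized computation of the weights for $W_o k_h k_w$ points | A single load of $4C_i$ values, then interpolation | Computes the result for $W_o$ outputs per call | Stores im2col for $\frac{\partial L}{\partial W}$ |
| deformable_conv2d_v2 | / | Vectorized computation of the weights for $k_h k_w$ points | $k_h k_w$ load instructions of $C_i$ values each, then interpolation | Computes 128 output points per call | / |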

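v2 is faster than v1, probably because it synchronizes its memory copies less often (9:4). Global memory latency is high: v1 copies $4C_i$ values with a single instruction, whereas v2 loads $C_i$ values per instruction but synchronizes only after 9 instructions. v1 caches im2col to save computation, but the saving is small, because computing $\frac{\partial L}{\partial \Delta \mathbf{m}_n}$ still needs the bilinear interpolation results.

Neither operator is feature-complete: for example, they do not support the half type, deform_group, or bias. The v2 operator is even more rudimentary. With no documentation and no guard checks for these limitations, using them takes some trial and error. It is hard to regard this as industrial-grade code, let alone automotive-grade. The one upside is that, like AMD, it is open source, leaving users to track down and fix problems themselves.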

ModulatedDeformConv2dFunction

Converts the inputs to NHWC format.

class ModulatedDeformConv2dFunction(Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)
    # pylint: disable=huawei-too-many-arguments
    def forward(
        ctx,
        x: torch.Tensor,
        offset: torch.Tensor,
        mask: torch.Tensor,
        weight: torch.Tensor,
        bias: Optional[nn.Parameter] = None,
        stride: Union[int, Tuple[int, ...]] = 1,
        padding: Union[int, Tuple[int, ...]] = 0,
        dilation: Union[int, Tuple[int, ...]] = 1,
        groups: int = 1,
        deformable_groups: int = 1,
    ):
        ctx.kernel_size = [weight.size(2), weight.size(3)]
        ctx.stride = _pair(stride)
        ctx.padding = _pair(padding)
        ctx.dilation = _pair(dilation)
        ctx.groups = groups
        ctx.deformable_groups = deformable_groups
        nhwc_x = x.permute(0, 2, 3, 1).contiguous()
        nhwc_offset = offset.permute(0, 2, 3, 1).contiguous()
        nhwc_weight = weight.permute(0, 2, 3, 1).contiguous()
        nhwc_mask = mask.permute(0, 2, 3, 1).contiguous()
        out, offset_output = mx_driving._C.modulated_deformable_conv2d(
            nhwc_x,
            nhwc_offset,
            nhwc_mask,
            nhwc_weight,
            None,
            ctx.kernel_size,
            ctx.stride,
            ctx.padding,
            ctx.dilation,
            ctx.groups,
            ctx.deformable_groups,
            False,
        )
        ctx.save_for_backward(nhwc_x, nhwc_offset, nhwc_weight, nhwc_mask, offset_output)
        return out

ModulatedDeformConv2dFunction.backward

Transposes $\frac{\partial L}{\partial Y}$ to $N\times H_o \times C_o \times W_o$.

    @staticmethod
    @once_differentiable
    @custom_bwd
    # pylint: disable=huawei-too-many-arguments,too-many-return-values
    def backward(ctx, grad_out):
        nhwc_x, nhwc_offset, nhwc_weight, nhwc_mask, offset_output = ctx.saved_tensors
        nhwc_grad_out = grad_out.permute(0, 2, 1, 3).contiguous()
        grad_x, grad_weight, _, grad_offset, grad_mask = mx_driving._C.modulated_deformable_conv2d_backward(
            nhwc_x,
            nhwc_offset,
            nhwc_mask,
            nhwc_weight,
            None,
            offset_output,
            nhwc_grad_out,
            ctx.kernel_size,
            ctx.stride,
            ctx.padding,
            ctx.dilation,
            ctx.groups,
            ctx.deformable_groups,
            False,
        )
        return (
            grad_x,
            grad_offset,
            grad_mask,
            grad_weight,
            None,
            None,
            None,
            None,
            None,
            None,
        )

modulated_deformable_conv2d

TORCH_CHECK_NPU verifies that every input tensor resides on the NPU device.

std::tuple<at::Tensor, at::Tensor> modulated_deformable_conv2d(const at::Tensor& input, const at::Tensor& offset,
    const at::Tensor& mask, const at::Tensor& weight, const c10::optional<at::Tensor>& bias_opt,
    at::IntArrayRef kernel_size, at::IntArrayRef stride, at::IntArrayRef padding, at::IntArrayRef dilation,
    int64_t groups, int64_t deformable_groups, int64_t with_bias)
{
    TORCH_CHECK_NPU(input);
    TORCH_CHECK_NPU(offset);
    TORCH_CHECK_NPU(mask);
    TORCH_CHECK_NPU(weight);

Validates the dimensions and parameters.

    TORCH_CHECK(input.dim() == INPUT_DIM, "input must to be a 4D Tensor, but got: ", input.dim());
    TORCH_CHECK(offset.dim() == INPUT_DIM, "offset has to be a 4D Tensor, but got: ", offset.dim());
    TORCH_CHECK(mask.dim() == INPUT_DIM, "mask has to be a 4D Tensor, but got: ", mask.dim());
    TORCH_CHECK(weight.dim() == INPUT_DIM, "weight has to be a 4D Tensor, but got: ", weight.dim());
    TORCH_CHECK(stride[0] > 0 && stride[1] > 0, "stride must be greater than 0");
    TORCH_CHECK(kernel_size[0] > 0 && kernel_size[1] > 0, "kernel_size must be greater than 0");
    TORCH_CHECK(dilation[0] > 0 && dilation[1] > 0, "dilation must be greater than 0");

c10::value_or_else is deprecated; std::optional::value_or is recommended instead.
It safely unwraps the optional bias_opt.

    const at::Tensor& bias = c10::value_or_else(bias_opt, [] { return at::Tensor(); });
    uint32_t n = static_cast<uint32_t>(input.size(0));
    uint32_t c_in = static_cast<uint32_t>(input.size(3));
    uint32_t h_in = static_cast<uint32_t>(input.size(1));
    uint32_t w_in = static_cast<uint32_t>(input.size(2));
    uint32_t h_out = static_cast<uint32_t>(offset.size(1));
    uint32_t w_out = static_cast<uint32_t>(offset.size(2));
    uint32_t c_out = static_cast<uint32_t>(weight.size(0));
    uint32_t kh = static_cast<uint32_t>(weight.size(1));
    uint32_t kw = static_cast<uint32_t>(weight.size(2));
    TORCH_CHECK(kh == kernel_size[0] && kw == kernel_size[1], "kernel size mismatch");
    TORCH_CHECK(mask.size(-1) == kh * kw, "The shape of the mask is invalid");
    TORCH_CHECK(groups > 0, "groups must be greater than 0");
    TORCH_CHECK(c_out % groups == 0, "weight's out channel should be divided by groups");
    TORCH_CHECK(c_in % groups == 0, "input's channel should be divided by groups");
    bool modulated = true;

For an ungrouped convolution with 256 or 512 input channels, DeformableConv2dV2 is called; otherwise DeformableConv2d is called. The two operators take their parameters in different orders.

DeformableConv2dV2 has two outputs: output has shape $N\times H_o \times W_o \times C_o$ and offset_output has shape $N\times H_o W_o \times k_h k_w \times C_i$;
DeformableConv2d has two outputs: output has shape $N\times H_o \times C_o \times W_o$ and offset_output has shape $N\times H_o \times W_o \times G \times \frac{k_h k_w C_i}{G}$.

Note: DeformableConv2dV2 requires $k_h k_w = 9$, but no such check is made here.

    if ((groups == 1) && ((c_in == CHANNEL_256) || (c_in == CHANNEL_512))) {
        at::Tensor output = at::empty({n, h_out, w_out, c_out}, input.options());
        at::Tensor offset_output = at::empty({n, h_out * w_out, kh * kw, c_in}, input.options());
        EXEC_NPU_CMD(aclnnDeformableConv2dV2, input, offset, mask, weight, bias, kernel_size, stride, padding, dilation,
            groups, deformable_groups, modulated, with_bias, output, offset_output);
        output = output.permute({0, 3, 1, 2});
        return std::tie(output, offset_output);
    } else {
        at::Tensor output = at::empty({n, h_out, c_out, w_out}, input.options());
        at::Tensor offset_output = at::empty({n, h_out, w_out, groups, kh * kw * c_in / groups}, input.options());
        EXEC_NPU_CMD(aclnnDeformableConv2d, input, weight, bias, offset, mask, kernel_size, stride, padding, dilation,
            groups, deformable_groups, modulated, with_bias, output, offset_output);
        output = output.permute({0, 2, 1, 3});
        return std::tie(output, offset_output);
    }
}
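The dispatch rule and the two output layouts can be summarized in a few lines of plain Python. This is only a sketch for checking shapes by hand: `plan_outputs` is a hypothetical helper, not part of DrivingSDK, and the 3x3 guard in the v2 branch is an assumption added here to illustrate the missing check, not something the wrapper does.

```python
from typing import Tuple

CHANNEL_256, CHANNEL_512 = 256, 512  # constant names taken from the C++ wrapper


def plan_outputs(n: int, h_out: int, w_out: int, c_in: int, c_out: int,
                 kh: int, kw: int, groups: int) -> Tuple[str, tuple, tuple]:
    """Return (operator name, raw output shape, offset_output shape)."""
    use_v2 = groups == 1 and c_in in (CHANNEL_256, CHANNEL_512)
    if use_v2:
        # Hypothetical guard: the original wrapper does not perform this check.
        if (kh, kw) != (3, 3):
            raise ValueError("DeformableConv2dV2 assumes a 3x3 kernel")
        return ("DeformableConv2dV2",
                (n, h_out, w_out, c_out),            # permuted to NCHW afterwards
                (n, h_out * w_out, kh * kw, c_in))
    return ("DeformableConv2d",
            (n, h_out, c_out, w_out),                # permuted to NCHW afterwards
            (n, h_out, w_out, groups, kh * kw * c_in // groups))


print(plan_outputs(2, 8, 8, 256, 64, 3, 3, 1))
print(plan_outputs(2, 8, 8, 128, 64, 3, 3, 2))
```

Both branches end with a permute back to $N\times C_o\times H_o\times W_o$, so the caller never sees the difference in raw layout; only offset_output keeps an operator-specific shape.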

DeformableConv2dKernel::Process

template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::Process()
{
    PreProcess();
    for (uint32_t taskIdx = start_; taskIdx < end_; ++taskIdx) {
        ProcessVector(taskIdx);
        ProcessCube(taskIdx);
    }
    mm_.End();
}

DeformableConv2dKernel::PreProcess

All VectorCores cooperate to compute the indices needed for one output row, similar to an Allgather pattern.
TBuf::Get returns a Tensor of a given length from a TBuf, or a Tensor covering its full length.
auxH and auxW each hold $W_o k_h k_w$ elements and store the convolution-window coordinates $(h_i, w_i)$ for one output row.
$\begin{aligned} w_{i\_start} &= w_o \cdot s - p \\ h_{i\_start} &= h_o \cdot s - p \end{aligned}$
Because auxH and auxW are computed once and then reused as indices for many output rows, auxH holds only the relative offset within the window, without the $h_o \cdot s$ term.
auxStart_ is the first index the current core handles.

template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::PreProcess()
{
    LocalTensor<float> auxH = auxHBuf_.Get<float>();
    LocalTensor<float> auxW = auxWBuf_.Get<float>();
    uint32_t idx = 0;
    for (int32_t w = auxStart_; w < auxEnd_; ++w) {
        for (int32_t i = 0; i < kH_; ++i) {
            for (int32_t j = 0; j < kW_; ++j) {
                auxW.SetValue(idx, static_cast<float>(w * strideW_ - padW_ + j * dilationW_));
                auxH.SetValue(idx, static_cast<float>(-padH_ + i * dilationH_));
                ++idx;
            }
        }
    }

GlobalTensor::operator[] returns a new GlobalTensor shifted by the given offset.
valRptTimes_ is the number of repeats needed to copy $C_i$ elements.
Each core copies its locally computed indices to auxWGm_ and auxHGm_.

    DataCopyPad(auxWGm_[auxStart_ * kernelSize_], auxW,
        {1, static_cast<uint16_t>(B32_BYTE_SIZE * (auxEnd_ - auxStart_) * kernelSize_), 0, 0});
    DataCopyPad(auxHGm_[auxStart_ * kernelSize_], auxH,
        {1, static_cast<uint16_t>(B32_BYTE_SIZE * (auxEnd_ - auxStart_) * kernelSize_), 0, 0});
    SyncAll();

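After synchronizing, each core copies the complete set of indices for one output row back from global memory.

Note: one concern here is the high latency of the two global memory accesses. Kernels are usually 3x3, so when $W_o$ is small, having each core compute the indices itself may cost less than computing them cooperatively.

    DataCopy(auxW, auxWGm_, {1, rowOffsetBlk_, 0, 0});
    DataCopy(auxH, auxHGm_, {1, rowOffsetBlk_, 0, 0});

Zeroes feature.

    LocalTensor<float> feature = featureBuf_.Get<float>();
    Duplicate<float, false>(feature, 0.f, MASK_PLACEHOLDER, 4 * valRptTimes_, 1, 8);
}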
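As a rough illustration of what PreProcess leaves in auxW and auxH after the gather, the following pure-Python sketch builds the same two tables for a whole row on one "core". The function name `aux_indices` and its argument names are hypothetical; only the two index formulas are taken from the kernel loop.

```python
def aux_indices(w_out, k_h, k_w, stride_w, pad_h, pad_w, dil_h, dil_w):
    """Build per-row window coordinates, W_o * k_h * k_w entries each."""
    aux_w, aux_h = [], []
    for w in range(w_out):
        for i in range(k_h):
            for j in range(k_w):
                # Absolute column of the sampling point in the input.
                aux_w.append(float(w * stride_w - pad_w + j * dil_w))
                # Row offset relative to the window; h_o * stride_h is added later.
                aux_h.append(float(-pad_h + i * dil_h))
    return aux_w, aux_h


aux_w, aux_h = aux_indices(w_out=2, k_h=3, k_w=3, stride_w=1,
                           pad_h=1, pad_w=1, dil_h=1, dil_w=1)
print(aux_w[:9])
print(aux_h[:9])
```

Since aux_h does not depend on the output row, the same table can be reused for every row, which is the reuse the kernel relies on.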

DeformableConv2dKernel::ProcessVector

template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::ProcessVector(uint32_t taskIdx)
{
    uint32_t batch = taskIdx / hOut_;
    srcOffset_ = batch * hIn_ * wIn_ * cIn_;
    dstOffset_ = taskIdx * rowIn_;
    LocalTensor<float> offset = offsetBuf_.Get<float>();
    LocalTensor<float> auxW = auxWBuf_.Get<float>();
    LocalTensor<float> auxH = auxHBuf_.Get<float>();
    LocalTensor<int32_t> offsetInt = offsetIntBuf_.Get<int32_t>();
    LocalTensor<float> weight = weightBuf_.Get<float>();
    LocalTensor<float> feature = featureBuf_.Get<float>();
    LocalTensor<float> mask;
    if (modulated) {
        mask = maskBuf_.Get<float>();
    }
    LocalTensor<float> offsetOutput = offsetOutputBuf_.Get<float>();

DeformableConv2dKernel::CopyInOffset copies one row of $\Delta p_n$ and $\Delta m$ and de-interleaves $\Delta p_n$.
DeformableConv2dKernel::ComputeWeight computes the sampling positions and the interpolation weights.

    CopyInOffset(taskIdx, offset, mask);
    ComputeWeight(taskIdx, auxW, auxH, offset, offsetInt, weight, mask);
    SetFlag<HardEvent::V_MTE2>(calEvt_);
    WaitFlag<HardEvent::V_MTE2>(calEvt_);
    SetFlag<HardEvent::MTE3_V>(0);
    SetFlag<HardEvent::MTE3_V>(1);
    uint8_t ping = 0;

DeformableConv2dKernel::ComputeBilinearInterpolation loads, computes, and stores.

    for (uint32_t w = 0; w < wOut_; ++w) {
        WaitFlag<HardEvent::MTE3_V>(ping);
        ComputeBilinearInterpolation(w, offset, offsetInt, feature, weight, offsetOutput[ping * kwIn_]);
        SetFlag<HardEvent::MTE3_V>(ping);
        ping = 1 - ping;
    }
    WaitFlag<HardEvent::MTE3_V>(0);
    WaitFlag<HardEvent::MTE3_V>(1);
}

DeformableConv2dKernel::CopyInOffset

template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::CopyInOffset(
    uint32_t taskIdx, const LocalTensor<float>& offset, const LocalTensor<float>& mask)
{
    uint32_t offsetIdx = taskIdx * rowOffset_ * 2;
    DataCopy(offset, offsetGm_[offsetIdx], {1, doubleRowOffsetBlk_, 0, 0});
    if (modulated) {
        DataCopy(mask, maskGm_[taskIdx * rowOffset_], {1, rowOffsetBlk_, 0, 0});
    }
    SetFlag<HardEvent::MTE2_V>(copyEvt_);
    WaitFlag<HardEvent::MTE2_V>(copyEvt_);
    uint64_t cnt;
    GatherMask(offset[2 * alignedRowOffset_], offset, 2, false, MASK_PLACEHOLDER, gatherParams_, cnt);
    GatherMask(offset[3 * alignedRowOffset_], offset, 1, false, MASK_PLACEHOLDER, gatherParams_, cnt);
    SetVectorMask<float>(FULL_MASK, FULL_MASK);
}
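The two GatherMask calls amount to a de-interleave. The sketch below assumes that built-in pattern 1 keeps the first element of every pair and pattern 2 keeps the second; that reading, and the conclusion that the first element is the vertical offset and the second the horizontal one, are inferred from how the results are later added to auxH and auxW, so treat them as assumptions. `deinterleave` is a hypothetical helper.

```python
def deinterleave(interleaved):
    """Split [dy0, dx0, dy1, dx1, ...] into (dy list, dx list)."""
    dy = interleaved[0::2]  # assumed equivalent of GatherMask pattern 1
    dx = interleaved[1::2]  # assumed equivalent of GatherMask pattern 2
    return dy, dx


row = [0.5, -0.25, 1.0, 2.0, -1.5, 0.0]
dy, dx = deinterleave(row)
print(dy)
print(dx)
```

In the kernel the two halves land at `offset[3 * alignedRowOffset_]` and `offset[2 * alignedRowOffset_]`, right behind the region that will hold the window coordinates, which is what later allows a single Add to cover both axes.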

DeformableConv2dKernel::ComputeWeight

offset is an intermediate variable, while offsetInt and weight are outputs, yet all of them are passed as const references.

template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::ComputeWeight(uint32_t taskIdx,
    const LocalTensor<float>& auxW, const LocalTensor<float>& auxH, const LocalTensor<float>& offset,
    const LocalTensor<int32_t>& offsetInt, const LocalTensor<float>& weight, const LocalTensor<float>& mask)
{

offset holds $4\times W_o k_h k_w$ elements and serves as scratch space.
A Copy instruction fetches $x_i$.
h is $y_o$; adding $y_o \cdot s$ to auxH gives the actual coordinate $y_i$.
The first half of offset holds the convolution-window indices $p + p_n$.

    int32_t h = taskIdx % hOut_;
    Copy<float, false>(offset, auxW, MASK_PLACEHOLDER, rptTimes_, {1, 1, 8, 8});
    Adds<float, false>(offset[alignedRowOffset_], auxH, float(h * strideH_), MASK_PLACEHOLDER, rptTimes_,
        {1, 1, 8, 8});

Because the memory is contiguous, a single add instruction computes the floating-point coordinates $p_n+\Delta p_n$ for both axes.

    Add<float, false>(offset, offset, offset[2 * alignedRowOffset_], MASK_PLACEHOLDER, 2 * rptTimes_, {1, 1, 1, 8, 8, 8});

offsetInt is converted to the integer coordinates $(y_1, x_1)$; the second half of offset stores the top-left coordinates as floats.

    Cast<int32_t, float, false>(offsetInt, offset, RoundMode::CAST_FLOOR, MASK_PLACEHOLDER, 2 * rptTimes_,
        {1, 1, 8, 8});
    Cast<float, int32_t, false>(offset[2 * alignedRowOffset_], offsetInt, RoundMode::CAST_NONE, MASK_PLACEHOLDER,
        2 * rptTimes_, {1, 1, 8, 8});

The first half becomes the differences $y-y_1$ and $x-x_1$.
weight is filled with 1, so the second half becomes $y_2-y$ and $x_2-x$.

    Sub<float, false>(offset, offset, offset[2 * alignedRowOffset_], MASK_PLACEHOLDER, 2 * rptTimes_,
        {1, 1, 1, 8, 8, 8}); // lw, lh
    Duplicate<float, false>(weight, 1.f, MASK_PLACEHOLDER, 2 * rptTimes_, 1, 8);
    Sub<float, false>(offset[2 * alignedRowOffset_], weight, offset, MASK_PLACEHOLDER, 2 * rptTimes_,
        {1, 1, 1, 8, 8, 8}); // hw, hh

Multiplying the two dimensions gives the weights of the four points:
$\begin{aligned} w_{11} &= (x_2 - x)(y_2 - y) \\ w_{21} &= (x - x_1)(y_2 - y) \\ w_{12} &= (x_2 - x)(y - y_1) \\ w_{22} &= (x - x_1)(y - y_1) \end{aligned}$

    Mul<float, false>(weight, offset[2 * alignedRowOffset_], offset[3 * alignedRowOffset_], MASK_PLACEHOLDER, rptTimes_,
        {1, 1, 1, 8, 8, 8}); // hw * hh
    Mul<float, false>(weight[alignedRowOffset_], offset, offset[3 * alignedRowOffset_], MASK_PLACEHOLDER, rptTimes_,
        {1, 1, 1, 8, 8, 8}); // lw * hh
    Mul<float, false>(weight[2 * alignedRowOffset_], offset[alignedRowOffset_], offset[2 * alignedRowOffset_],
        MASK_PLACEHOLDER, rptTimes_, {1, 1, 1, 8, 8, 8}); // hw * lh
    Mul<float, false>(weight[3 * alignedRowOffset_], offset, offset[alignedRowOffset_], MASK_PLACEHOLDER, rptTimes_,
        {1, 1, 1, 8, 8, 8}); // lh * lw

Multiplies the modulation weight $\Delta m$ into the four interpolation weights.
weight has shape $4\times W_o k_h k_w$ and mask has shape $W_o k_h k_w$; because their lengths differ, Mul has to be called four times.
The stale comments copied from the previous block were not removed.

    if (modulated) {
        Mul<float, false>(weight, weight, mask, MASK_PLACEHOLDER, rptTimes_, {1, 1, 1, 8, 8, 8});
        Mul<float, false>(weight[alignedRowOffset_], weight[alignedRowOffset_], mask, MASK_PLACEHOLDER, rptTimes_,
            {1, 1, 1, 8, 8, 8}); // lw * hh
        Mul<float, false>(weight[2 * alignedRowOffset_], weight[2 * alignedRowOffset_], mask, MASK_PLACEHOLDER,
            rptTimes_, {1, 1, 1, 8, 8, 8}); // hw * lh
        Mul<float, false>(weight[3 * alignedRowOffset_], weight[3 * alignedRowOffset_], mask, MASK_PLACEHOLDER,
            rptTimes_, {1, 1, 1, 8, 8, 8}); // lh * lw
    }
}
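For a single sampling point, the whole of ComputeWeight reduces to the scalar arithmetic below. This is a sketch only: `bilinear_weights` is a hypothetical name, and the variable names lw, lh, hw, hh simply follow the comments in the kernel.

```python
import math


def bilinear_weights(x, y, mask=1.0):
    """Return (w11, w21, w12, w22) for a float position, scaled by the mask."""
    x1, y1 = math.floor(x), math.floor(y)
    lw, lh = x - x1, y - y1          # distance to the top-left corner
    hw, hh = 1.0 - lw, 1.0 - lh      # distance to the bottom-right corner
    return (hw * hh * mask,          # w11 = (x2 - x)(y2 - y)
            lw * hh * mask,          # w21 = (x - x1)(y2 - y)
            hw * lh * mask,          # w12 = (x2 - x)(y - y1)
            lw * lh * mask)          # w22 = (x - x1)(y - y1)


print(bilinear_weights(1.25, 2.5))
print(bilinear_weights(1.25, 2.5, mask=0.5))
```

Without modulation the four weights sum to 1; with a mask value $\Delta m$ they sum to $\Delta m$, which is a convenient sanity check when comparing against a reference implementation.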

DeformableConv2dKernel::ComputeBilinearInterpolation

offset is unused.

template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::ComputeBilinearInterpolation(uint32_t w,
    const LocalTensor<float>& offset, const LocalTensor<int32_t>& offsetInt, const LocalTensor<float>& feature,
    const LocalTensor<float>& weight, const LocalTensor<float>& offsetOutput)
{

First zeroes offsetOutput, whose shape is $k_h k_w C_i$.

    Duplicate<float, false>(offsetOutput, 0.f, MASK_PLACEHOLDER, kernelSize_ * valRptTimes_, 1, 8);
    uint8_t ping = 0;
    uint32_t kernelOffset = w * kernelSize_;
    SetFlag<HardEvent::V_MTE2>(0);
    SetFlag<HardEvent::V_MTE2>(1);

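$x_o$ is passed in.
pw and ph are indices into the arrays.
gmOffset is the flat offset of the input point.
SetFlag is the synchronization instruction between different pipelines within one core.

The Ascend C best practices recommend moving data in blocks as large as possible per transfer.

#pragma bisheng auto_sync parallel
    for (uint32_t kIdx = 0; kIdx < kernelSize_; ++kIdx) {
        uint32_t pw = kIdx + kernelOffset;
        uint32_t ph = pw + alignedRowOffset_;
        int32_t w0 = offsetInt.GetValue(pw);
        int32_t h0 = offsetInt.GetValue(ph);
        int32_t w1 = w0 + 1;
        int32_t h1 = h0 + 1;
        uint32_t outOffset = kIdx * cIn_;
        uint32_t ftOffset = ping * featureOffset_;
        WaitFlag<HardEvent::V_MTE2>(ping);

For each input point $(y, x)$, if $(y_1, x_1), (y_1, x_2), (y_2, x_1), (y_2, x_2)$ all lie inside the image, the four points are loaded in a single copy.
Axpy multiplies the source elements by a scalar and accumulates the product into the destination elements.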

        if (0 < h1 && h1 < hIn_) {
            if (0 < w1 && w1 < wIn_) {
                uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
                DataCopy(feature[ftOffset], xGm_[gmOffset], cpQuadValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
                    weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
                    weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            } else if (w1 == 0) {
                uint64_t gmOffset = srcOffset_ + (h0 * wIn_) * cIn_;
                DataCopy(feature[ftOffset + cIn_], xGm_[gmOffset], cpColDoubleValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
                    weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            } else if (w1 == wIn_) {
                uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
                DataCopy(feature[ftOffset], xGm_[gmOffset], cpColDoubleValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
                    weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            }
        } else if (h1 == 0) {
            if (0 < w1 && w1 < wIn_) {
                uint64_t gmOffset = srcOffset_ + w0 * cIn_;
                DataCopy(feature[ftOffset + 2 * cIn_], xGm_[gmOffset], cpRowDoubleValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
                    weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
                    weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            } else if (w1 == 0) {
                uint64_t gmOffset = srcOffset_;
                DataCopy(feature[ftOffset + 3 * cIn_], xGm_[gmOffset], cpOneValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
                    weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            } else if (w1 == wIn_) {
                uint64_t gmOffset = srcOffset_ + w0 * cIn_;
                DataCopy(feature[ftOffset + 2 * cIn_], xGm_[gmOffset], cpOneValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
                    weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            }
        } else if (h1 == hIn_) {
            if (0 < w1 && w1 < wIn_) {
                uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
                DataCopy(feature[ftOffset], xGm_[gmOffset], cpRowDoubleValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            } else if (w1 == 0) {
                uint64_t gmOffset = srcOffset_ + (h0 * wIn_) * cIn_;
                DataCopy(feature[ftOffset + cIn_], xGm_[gmOffset], cpOneValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            } else if (w1 == wIn_) {
                uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
                DataCopy(feature[ftOffset], xGm_[gmOffset], cpOneValParams_);
                SetFlag<HardEvent::MTE2_V>(copyEvt_);
                WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                PipeBarrier<PIPE_V>();
                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
            }
        }
        SetFlag<HardEvent::V_MTE2>(ping);
        ping = 1 - ping;
    }

Copies the interpolated $k_h k_w C_i$ values out, group by group, to the global-memory buffer offsetOutputGm_, whose shape is $G\times W_o\times \frac{k_h k_w C_i}{G}$.

    SetFlag<HardEvent::V_MTE3>(calEvt_);
    WaitFlag<HardEvent::V_MTE3>(calEvt_);
    for (uint32_t i = 0; i < groups_; ++i) {
        DataCopy(offsetOutputGm_[dstOffset_ + rowInPerGroup_ * i], offsetOutput[i * cInPerGroup_], cpOffsetOutParams_);
    }
    dstOffset_ += kwInPerGroup_;
    WaitFlag<HardEvent::V_MTE2>(0);
    WaitFlag<HardEvent::V_MTE2>(1);
}
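The nine-way branch above is a hand-unrolled form of "skip every corner that falls outside the image". A single-channel sketch in plain Python shows the same behavior in a loop; `sample_bilinear` is a hypothetical reference helper, not the kernel's algorithm, and it ignores the modulation mask.

```python
import math
from typing import List


def sample_bilinear(img: List[List[float]], y: float, x: float) -> float:
    """Bilinear sample with zero padding: out-of-range corners contribute 0."""
    h_in, w_in = len(img), len(img[0])
    y1, x1 = math.floor(y), math.floor(x)
    lh, lw = y - y1, x - x1
    corners = [
        (y1, x1, (1 - lh) * (1 - lw)),      # top-left
        (y1, x1 + 1, (1 - lh) * lw),        # top-right
        (y1 + 1, x1, lh * (1 - lw)),        # bottom-left
        (y1 + 1, x1 + 1, lh * lw),          # bottom-right
    ]
    acc = 0.0
    for yy, xx, wgt in corners:
        if 0 <= yy < h_in and 0 <= xx < w_in:
            acc += wgt * img[yy][xx]
    return acc


img = [[1.0, 2.0], [3.0, 4.0]]
print(sample_bilinear(img, 0.5, 0.5))
print(sample_bilinear(img, -0.5, 0.0))
```

The kernel trades this compact loop for explicit cases so that the fully in-range case can fetch all four corners with one DataCopy, and the edge cases fetch two corners or one.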

DeformableConv2dKernel::ProcessCube

SetTensorB sets the right-hand matrix B of the matmul, which needs to be transposed.
Batch Matmul is implemented as a loop: weight has shape $G\times \frac{C_o}{G} \times \frac{k_h k_w C_i}{G}$, im2col has shape $G\times W_o\times \frac{k_h k_w C_i}{G}$, and the output has shape $G\times \frac{C_o}{G}\times W_o$.
As a result, the combined output across cores has shape $N\times H_o \times C_o \times W_o$.

template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::ProcessCube(uint32_t taskIdx)
{
    uint64_t aOffset = 0;
    uint64_t bOffset = taskIdx * rowIn_;
    uint64_t cOffset = taskIdx * rowOut_;
    for (uint32_t i = 0; i < groups_; ++i) {
        mm_.SetTensorA(weightGm_[aOffset]);
        mm_.SetTensorB(offsetOutputGm_[bOffset], true);
        mm_.template IterateAll<false>(yGm_[cOffset]);
        aOffset += kernelPerGroup_;
        bOffset += rowInPerGroup_;
        cOffset += rowOutPerGroup_;
    }
}
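The per-group loop computes, for one output row, $Y_g = W_g \cdot \mathrm{im2col}_g^{\top}$. A small list-based sketch of that product follows; `grouped_row_matmul` is a hypothetical helper and uses nested lists instead of tensors purely to stay dependency-free.

```python
def grouped_row_matmul(weight, im2col):
    """weight[g]: (C_o/G) x K, im2col[g]: W_o x K, result[g]: (C_o/G) x W_o.

    K stands for k_h * k_w * C_i / G. The right operand is used transposed,
    mirroring SetTensorB(..., true) in the kernel.
    """
    out = []
    for w_g, col_g in zip(weight, im2col):
        out.append([[sum(a * b for a, b in zip(w_row, col_row))
                     for col_row in col_g]
                    for w_row in w_g])
    return out


weight = [[[1.0, 2.0]]]                  # G=1, C_o/G=1, K=2
im2col = [[[3.0, 4.0], [5.0, 6.0]]]      # G=1, W_o=2, K=2
print(grouped_row_matmul(weight, im2col))
```

Because each task produces a $C_o \times W_o$ slab for one $(n, h_o)$ pair, stacking the slabs yields the $N\times H_o\times C_o\times W_o$ layout that the wrapper later permutes back to NCHW.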

DeformableConv2dV2Kernel::Process

DeformableConv2dV2Kernel::ProcessVector generates one row of the im2col matrix per call. Once cubeTileTaskCount_ rows have accumulated, DeformableConv2dV2Kernel::ProcessCube is called to perform the convolution.

template<bool modulated>
__aicore__ inline void DeformableConv2dV2Kernel<modulated>::Process()
{
    for (int32_t taskIdx = start_; taskIdx < end_; taskIdx++) {
        ProcessVector(taskIdx);
        int32_t innerCubeTaskIdx = (taskIdx - start_) % cubeTileTaskCount_;
        bool startCubeFlag = (innerCubeTaskIdx == cubeTileTaskCount_ - 1) || (taskIdx == end_ - 1);
        if (startCubeFlag) {
            ProcessCube(taskIdx, innerCubeTaskIdx);
        }
    }
    mm_.End();
}
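To see when the cube stage fires, the loop condition can be replayed in plain Python. `cube_launches` is a hypothetical helper; it returns, for each launch, the task index that triggered it and how many im2col rows have accumulated since the previous launch.

```python
def cube_launches(start, end, tile):
    """Replay the startCubeFlag logic of Process() for tasks in [start, end)."""
    launches = []
    for task in range(start, end):
        inner = (task - start) % tile
        # Fire when a tile is full, or on the last task to flush a partial tile.
        if inner == tile - 1 or task == end - 1:
            launches.append((task, inner + 1))
    return launches


print(cube_launches(0, 10, 4))
print(cube_launches(5, 9, 4))
```

The second condition matters only when the task count of a core is not a multiple of the tile size; otherwise the last task already completes a full tile.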

DeformableConv2dV2Kernel::ProcessVector

Each call handles the inputs for one point of the output feature map, i.e. one row of the im2col matrix. The values in one kernel window are flattened and written to img2colMatGm_.
Decodes taskIdx into the corresponding (n, h_out, w_out).

template<bool modulated>
__aicore__ inline void DeformableConv2dV2Kernel<modulated>::ProcessVector(uint32_t taskIdx)
{
    int16_t batchIdx = taskIdx / (featureMapSize_);
    int16_t hOutIdx = (taskIdx % (featureMapSize_)) / wOut_;
    int16_t wOutIdx = taskIdx % wOut_;
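The decoding is ordinary row-major index arithmetic. The sketch below assumes that featureMapSize_ equals $H_o \cdot W_o$, which matches how it is used in the kernel but is not shown explicitly in the excerpt; `decode_task` is a hypothetical name.

```python
def decode_task(task_idx, h_out, w_out):
    """Map a flat task index to (batch, h_out index, w_out index)."""
    feature_map_size = h_out * w_out  # assumed meaning of featureMapSize_
    batch = task_idx // feature_map_size
    h = (task_idx % feature_map_size) // w_out
    w = task_idx % w_out
    return batch, h, w


print(decode_task(27, h_out=4, w_out=5))
```

A quick round-trip check is that `(batch * h_out + h) * w_out + w` gives back the original task index.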

Loads the $\Delta p$ and $\Delta m$ for the current taskIdx into local memory. Each copy moves only 18 or 9 elements, which is too few.

    // CopyIn Offset
    DataCopy(copyInOffsetLocal_, offsetGm_[taskIdx * OFFSET_SIZE], OFFSET_ALIGNED_SIZE);
    SetFlag<HardEvent::MTE2_V>(copyInOffsetEventID);
    if (modulated) {
        DataCopy(maskLocal_, maskGm_[taskIdx * X_OFFSET_SIZE], X_OFFSET_ALIGNED_SIZE);
        SetFlag<HardEvent::MTE2_V>(copyInMaskEventID);
    }
    WaitFlag<HardEvent::MTE2_V>(copyInOffsetEventID);

Separates the interleaved $(\Delta y, \Delta x)$ into the independent buffers xOffsetLocal_ and yOffsetLocal_.
Adding the convolution-window coordinates gives $p_n +\Delta p_n$.

    GatherMask(xOffsetLocal_, copyInOffsetLocal_, 1, true, maskForGatherMask_, {1, 1, 8, 0}, cnt_);
    GatherMask(yOffsetLocal_, copyInOffsetLocal_, 2, true, maskForGatherMask_, {1, 1, 8, 0}, cnt_);
    Add(xOffsetLocal_, xOffsetLocal_, constKHIdxLocal_, X_OFFSET_ALIGNED_SIZE);
    Add(yOffsetLocal_, yOffsetLocal_, constKWIdxLocal_, X_OFFSET_ALIGNED_SIZE);

Floors the floating-point coordinates $(i+\Delta h_i, j+\Delta w_i)$ to obtain the coordinates in the four directions needed for bilinear interpolation.
Computes the fractional offsets.

    Floor(topPosLocal_, xOffsetLocal_, X_OFFSET_ALIGNED_SIZE);
    Floor(leftPosLocal_, yOffsetLocal_, X_OFFSET_ALIGNED_SIZE);
    Adds(bottomPosLocal_, topPosLocal_, 1.0f, X_OFFSET_ALIGNED_SIZE);
    Adds(rightPosLocal_, leftPosLocal_, 1.0f, X_OFFSET_ALIGNED_SIZE);

fracH and fracW are the per-axis interpolation weights $y-y_1$ and $x-x_1$.

    Sub(fracHLocal_, xOffsetLocal_, topPosLocal_, X_OFFSET_ALIGNED_SIZE);
    Sub(fracWLocal_, yOffsetLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE);

Subtracts the kernel radius from the output-point coordinates to find the start of the corresponding input region. This assumes $s_h =1, s_w=1$, yet the operator entry point does not enforce that condition.
The top-left corner $(h_0,w_0)$ of the convolution window is given by:
$\begin{aligned} h_0 &= h_o \cdot s_h - p_h \\ w_0 &= w_o \cdot s_w - p_w \end{aligned}$
Adding the relative coordinates gives the coordinates of every point in the window, $p_n +\Delta p_n$.
topPosLocal_ and leftPosLocal_ are $(y_1, x_1)$.

    // global position
    Adds(topPosLocal_, topPosLocal_, hOutIdx - kH_ / 2 + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);
    Adds(leftPosLocal_, leftPosLocal_, wOutIdx - kW_ / 2 + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);
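A small comparison makes the stride assumption concrete. The sketch contrasts the general window-origin formula with the shortcut used by v2; both function names are hypothetical, and `k // 2` assumes that `kH_ / 2` in the kernel is an integer division.

```python
def window_origin(o, stride, pad):
    """General formula: first input coordinate covered by output index o."""
    return o * stride - pad


def v2_origin(o, k):
    """Shortcut used by v2: output index minus the kernel radius."""
    return o - k // 2


# With a 3x3 kernel, stride 1 and padding 1, the two agree everywhere.
print([window_origin(o, 1, 1) == v2_origin(o, 3) for o in range(4)])
# With stride 2 they diverge.
print(window_origin(3, 2, 1), v2_origin(3, 3))
```

So the shortcut is exact only for stride 1 with padding equal to the kernel radius; any other configuration routed to v2 would sample from the wrong window.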

Computes the flat memory offsets of the four groups of interpolation points:
$\begin{aligned} \mathrm{offset}_1 &=(y_1\cdot W_i + x_1)C_i\\ \mathrm{offset}_2 &=(y_1\cdot W_i + x_2)C_i\\ \mathrm{offset}_3 &=(y_2\cdot W_i + x_1)C_i\\ \mathrm{offset}_4 &=(y_2\cdot W_i + x_2)C_i \end{aligned}$
The four variables topLeftOffsetLocal_, topRightOffsetLocal_, bottomLeftOffsetLocal_, and bottomRightOffsetLocal_ are contiguous in memory, so a single instruction can process all of them.

    // global Offset
    Muls(topPosLocal_, topPosLocal_, wOut_ + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);
    Add(topLeftOffsetLocal_, topPosLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE); // global (h * wOut + w)
    Add(topRightOffsetLocal_, topPosLocal_, rightPosLocal_, X_OFFSET_ALIGNED_SIZE);
    Add(bottomLeftOffsetLocal_, bottomPosLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE);
    Add(bottomRightOffsetLocal_, bottomPosLocal_, rightPosLocal_, X_OFFSET_ALIGNED_SIZE);
    Muls(topLeftOffsetLocal_, topLeftOffsetLocal_, cIn_ + 0.0f, 4 * X_OFFSET_ALIGNED_SIZE);
    Adds(topLeftOffsetLocal_, topLeftOffsetLocal_, batchIdx * featureMapElementsSize_ + 0.0f,
        4 * X_OFFSET_ALIGNED_SIZE); // global offset
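The four offsets are plain NHWC addressing. The sketch below reproduces them for one sampling point; `corner_offsets` is a hypothetical helper, and `elems_per_image` is assumed to be $H_i W_i C_i$, which is the apparent meaning of featureMapElementsSize_. Note that the kernel multiplies by wOut_ rather than an input width, which is only equivalent when the input and output feature maps have the same width.

```python
def corner_offsets(batch, y1, x1, w_in, c_in, elems_per_image):
    """Flat NHWC offsets of (y1,x1), (y1,x2), (y2,x1), (y2,x2)."""
    y2, x2 = y1 + 1, x1 + 1
    base = batch * elems_per_image
    return tuple(base + (y * w_in + x) * c_in
                 for y, x in ((y1, x1), (y1, x2), (y2, x1), (y2, x2)))


print(corner_offsets(batch=0, y1=1, x1=2, w_in=4, c_in=8, elems_per_image=4 * 4 * 8))
```

Horizontally adjacent corners are $C_i$ elements apart and vertically adjacent corners are $W_i C_i$ elements apart, which is why v1 can fetch two or four corners with one strided copy.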

CompareScalar compares each element of a tensor with a scalar and writes the result to the corresponding bit of the output.
The four variables topPosLocal_, bottomPosLocal_, leftPosLocal_, and rightPosLocal_ are contiguous in memory, each of size X_OFFSET_ALIGNED_SIZE. Here 64 is used directly as the length.
Evidently, because of address-alignment constraints, the 36 valid elements are padded to 64.
inGlobalLocal_ has size IN_GLOBAL_BUF_SIZE * sizeof(uint32_t) and records whether the four groups of points are inside the bounds along both axes.
inGlobalLocal_ is of type uint32_t; each CompareScalar processes 64 elements and stores its result in the first two elements of each segment of inGlobalLocal_.
The comparisons are $0 \le y_1,\enspace 0 \le y_2,\enspace 0 \le x_1,\enspace 0 \le x_2$ and $y_1 < H_i,\enspace y_2 < H_i,\enspace x_1 < W_i,\enspace x_2 < W_i$.

    // in global flag
    CompareScalar(inGlobalLocal_.ReinterpretCast<uint8_t>(), topPosLocal_, 0.0f, CMPMODE::GE, 64);
    CompareScalar(inGlobalLocal_[8].ReinterpretCast<uint8_t>(), bottomPosLocal_, 0.0f, CMPMODE::GE, 64);
    CompareScalar(inGlobalLocal_[16].ReinterpretCast<uint8_t>(), leftPosLocal_, 0.0f, CMPMODE::GE, 64);
    CompareScalar(inGlobalLocal_[24].ReinterpretCast<uint8_t>(), rightPosLocal_, 0.0f, CMPMODE::GE, 64);
    CompareScalar(inGlobalLocal_[32].ReinterpretCast<uint8_t>(), topPosLocal_, featureMapSize_ + 0.0f, CMPMODE::LT, 64);
    CompareScalar(inGlobalLocal_[40].ReinterpretCast<uint8_t>(), bottomPosLocal_, featureMapSize_ + 0.0f, CMPMODE::LT, 64);
    CompareScalar(inGlobalLocal_[48].ReinterpretCast<uint8_t>(), leftPosLocal_, wOut_ + 0.0f, CMPMODE::LT, 64);
    CompareScalar(inGlobalLocal_[56].ReinterpretCast<uint8_t>(), rightPosLocal_, wOut_ + 0.0f, CMPMODE::LT, 64);

Merges the results of the two directions, i.e. $0 \le y_1 < H_i,\enspace 0 \le y_2 < H_i,\enspace 0 \le x_1 < W_i,\enspace 0 \le x_2 < W_i$.

    And(inGlobalLocal_[32].ReinterpretCast<uint16_t>(), inGlobalLocal_.ReinterpretCast<uint16_t>(),inGlobalLocal_[32].ReinterpretCast<uint16_t>(), 64);

Computes the validity of $(y_1, x_1)$ and $(y_2, x_2)$.

    And(inGlobalLocal_.ReinterpretCast<uint16_t>(), inGlobalLocal_[32].ReinterpretCast<uint16_t>(),inGlobalLocal_[48].ReinterpretCast<uint16_t>(), 32); // TopLeft, BottomRight

Computes the validity of $(y_1, x_2)$ and $(y_2, x_1)$.

    And(inGlobalLocal_[16].ReinterpretCast<uint16_t>(), inGlobalLocal_[32].ReinterpretCast<uint16_t>(),
        inGlobalLocal_[56].ReinterpretCast<uint16_t>(), 16); // TopRight
    And(inGlobalLocal_[24].ReinterpretCast<uint16_t>(), inGlobalLocal_[40].ReinterpretCast<uint16_t>(),
        inGlobalLocal_[48].ReinterpretCast<uint16_t>(), 16); // BottomLeft

Select picks elements according to the bit values of selMask (the mask used for selection).
The out-of-bounds positions of the four groups of points are set to -1.0f, so that the later copy can discard them or treat them as 0.

    Select(topLeftOffsetLocal_, inGlobalLocal_.ReinterpretCast<uint16_t>(), topLeftOffsetLocal_, -1.0f,
        SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
    Select(bottomRightOffsetLocal_, inGlobalLocal_[8].ReinterpretCast<uint16_t>(), bottomRightOffsetLocal_, -1.0f,
        SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
    Select(topRightOffsetLocal_, inGlobalLocal_[16].ReinterpretCast<uint16_t>(), topRightOffsetLocal_, -1.0f,
        SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
    Select(bottomLeftOffsetLocal_, inGlobalLocal_[24].ReinterpretCast<uint16_t>(), bottomLeftOffsetLocal_, -1.0f,
        SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
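Stripped of the bit-mask plumbing, the CompareScalar, And, and Select sequence computes an elementwise "keep the offset if the corner is in range, otherwise write a sentinel". A list-based sketch of that end result, with the hypothetical helper `mask_offsets`:

```python
def mask_offsets(offsets, ys, xs, h_in, w_in, sentinel=-1.0):
    """Replace the offset of every out-of-range corner with the sentinel."""
    return [off if (0 <= y < h_in and 0 <= x < w_in) else sentinel
            for off, y, x in zip(offsets, ys, xs)]


offsets = [0.0, 8.0, 16.0]
ys = [0, -1, 1]
xs = [0, 0, 4]
print(mask_offsets(offsets, ys, xs, h_in=4, w_in=4))
```

Using -1.0 as the sentinel works because a valid flat offset is never negative, so the later scalar code can tell the two cases apart with a single comparison.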

A synchronization in which the scalar pipeline waits for the vector pipeline must be inserted here.
oneSubFracHLocal_ and oneSubFracWLocal_ are contiguous in memory.
Computes the one-dimensional interpolation weights $y_2 - y$ and $x_2 - x$.

    SetFlag<HardEvent::V_S>(V_SEventID);
    WaitFlag<HardEvent::V_S>(V_SEventID);
    Muls(oneSubFracHLocal_, fracHLocal_, -1.0f, 2 * X_OFFSET_ALIGNED_SIZE);
    Adds(oneSubFracHLocal_, oneSubFracHLocal_, 1.0f, 2 * X_OFFSET_ALIGNED_SIZE); // 1-fracH, 1-fracW

The modulation weight is multiplied into the H-direction factors only, giving $\Delta m(y - y_1)$ and $\Delta m(y_2 - y)$. Because each of the four corner weights is the product of one H factor and one W factor, every corner weight ends up carrying $\Delta m$ exactly once.

    if (modulated) {
        WaitFlag<HardEvent::MTE2_V>(copyInMaskEventID);
        Mul(fracHLocal_, fracHLocal_, maskLocal_, X_OFFSET_ALIGNED_SIZE);
        Mul(oneSubFracHLocal_, oneSubFracHLocal_, maskLocal_, X_OFFSET_ALIGNED_SIZE);
    }

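Brcb takes an input tensor and, on each iteration, fills 8 datablocks (32 bytes each) of the result tensor with 8 values from the input, one value per datablock.
Multiplying the interpolation coefficients by the input requires broadcasting along the lowest dimension. In the computation below the two operands differ in length, so each coefficient is broadcast to $\frac{C_i}{8}$.
fracHBroadcastLocal_ occupies $9\times \frac{C_i}{8}\times \mathrm{block}$.
brcbParams_ sets the element stride to $\frac{C_i}{64}$ blocks and the iteration stride to $\frac{C_i}{8}$ blocks. That is, $C_i$ is split into eight equal parts; the datablock at each split point holds a valid value and the other positions do not.
The span covered is $16\times \frac{C_i}{64}\times\mathrm{block} = 2C_i$.
Each element of fracHLocal_ fills one datablock of fracHBroadcastLocal_, and adjacent elements are 8 datablocks apart, i.e. $\frac{C_i}{64}$.

    // Broadcast
    Brcb(fracHBroadcastLocal_, fracHLocal_, 2, brcbParams_);
    Brcb(fracWBroadcastLocal_, fracWLocal_, 2, brcbParams_);
    Brcb(oneSubFracHBroadcastLocal_, oneSubFracHLocal_, 2, brcbParams_);
    Brcb(oneSubFracWBroadcastLocal_, oneSubFracWLocal_, 2, brcbParams_);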

DATA_BLOCK_SIZE is 8, FOUR_CORNERS is 4, and X_OFFSET_SIZE is 9.
maskForBroadcast_ equals dataBlockPerInputChannel_ - DATA_BLOCK_SIZE.
A single Copy instruction broadcasts the data of the first datablock to the other blocks within $C_i$, producing a shape of $4\times 9\times C_i$.
The number of blocks copied per iteration is:
$\begin{aligned} N &= \left\lceil\frac{\mathrm{Mask}}{8}\right\rceil \\ &= \left\lceil\frac{\frac{C_i}{8}-8}{8}\right\rceil \\ &= \left\lceil\frac{C_i}{64}\right\rceil-1 \end{aligned}$
The srcRepeatSize and dstRepeatSize parameters are set to $\frac{C_i}{64}$.
In the first broadcast step adjacent elements are $\frac{C_i}{64}$ apart, so each group of interpolation weights has a valid length of $\frac{9C_i}{8}$.

    Copy(fracHBroadcastLocal_[DATA_BLOCK_SIZE], fracHBroadcastLocal_, maskForBroadcast_, FOUR_CORNERS * X_OFFSET_SIZE,copyParams_);
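The block-count formula is easy to check numerically for the two channel counts that reach v2. The helper below is hypothetical and simply evaluates the derivation, assuming 8 float32 values per 32-byte datablock.

```python
import math


def copy_blocks_per_repeat(c_in, floats_per_block=8):
    """Blocks moved per Copy repeat: ceil((C_i / 8 - 8) / 8)."""
    mask = c_in // floats_per_block - floats_per_block  # maskForBroadcast_
    return math.ceil(mask / floats_per_block)


for c_in in (256, 512):
    print(c_in, copy_blocks_per_repeat(c_in), c_in // 64 - 1)
```

For both 256 and 512 the result equals $\frac{C_i}{64}-1$: the first datablock of each segment is already filled by Brcb, and the Copy fills the remaining ones.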

DeformableConv2dV2Kernel::CopyInFeature loads the input and interpolates, based on topLeftOffsetLocal_ and fracHBroadcastLocal_.
The result in outFeatureLocal_ is then copied to global memory.

    CopyInFeature();
    SetFlag<HardEvent::V_MTE3>(copyOutEventID);
    WaitFlag<HardEvent::V_MTE3>(copyOutEventID);
    DataCopyPad(img2colMatGm_[taskIdx * elementsCountPerTask_], outFeatureLocal_,
        {1, static_cast<uint32_t>(elementsCountPerTask_ * FP32_BYTE_SIZE), 0, 0, 0});
}

DeformableConv2dV2Kernel::CopyInFeature

The function takes no parameters, so the variables it depends on are not visible from its signature.
Values such as topLeft0 should be compared against integers.
The code is fully unrolled; it seems it could be written as a for loop, as in V1.
After the channels of the 9 input points are loaded, they are multiplied by the weights.
topLeftWeightLocal_ is $\Delta m \cdot w_{11}=\Delta m(y_2 - y)(x_2 -x)$.
Only the first $\frac{9C_i}{8}$ elements of topLeftWeightLocal_ are valid.

template<bool modulated>
__aicore__ inline void DeformableConv2dV2Kernel<modulated>::CopyInFeature()
{
    int32_t topLeft0 = topLeftOffsetLocal_.GetValue(0);
    int32_t topLeft1 = topLeftOffsetLocal_.GetValue(1);
    int32_t topLeft2 = topLeftOffsetLocal_.GetValue(2);
    int32_t topLeft3 = topLeftOffsetLocal_.GetValue(3);
    int32_t topLeft4 = topLeftOffsetLocal_.GetValue(4);
    int32_t topLeft5 = topLeftOffsetLocal_.GetValue(5);
    int32_t topLeft6 = topLeftOffsetLocal_.GetValue(6);
    int32_t topLeft7 = topLeftOffsetLocal_.GetValue(7);
    int32_t topLeft8 = topLeftOffsetLocal_.GetValue(8);
    (topLeft0 == -1.0f) ? Duplicate(topLeftFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
                          DataCopy(topLeftFeatureLocal_[0 * cIn_], xGm_[topLeft0], cIn_);
    (topLeft1 == -1.0f) ? Duplicate(topLeftFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
                          DataCopy(topLeftFeatureLocal_[1 * cIn_], xGm_[topLeft1], cIn_);
    (topLeft2 == -1.0f) ? Duplicate(topLeftFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
                          DataCopy(topLeftFeatureLocal_[2 * cIn_], xGm_[topLeft2], cIn_);
    (topLeft3 == -1.0f) ? Duplicate(topLeftFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
                          DataCopy(topLeftFeatureLocal_[3 * cIn_], xGm_[topLeft3], cIn_);
    (topLeft4 == -1.0f) ? Duplicate(topLeftFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
                          DataCopy(topLeftFeatureLocal_[4 * cIn_], xGm_[topLeft4], cIn_);
    (topLeft5 == -1.0f) ? Duplicate(topLeftFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
                          DataCopy(topLeftFeatureLocal_[5 * cIn_], xGm_[topLeft5], cIn_);
    (topLeft6 == -1.0f) ? Duplicate(topLeftFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
                          DataCopy(topLeftFeatureLocal_[6 * cIn_], xGm_[topLeft6], cIn_);
    (topLeft7 == -1.0f) ? Duplicate(topLeftFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
                          DataCopy(topLeftFeatureLocal_[7 * cIn_], xGm_[topLeft7], cIn_);
    (topLeft8 == -1.0f) ? Duplicate(topLeftFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
                          DataCopy(topLeftFeatureLocal_[8 * cIn_], xGm_[topLeft8], cIn_);
    Mul(topLeftWeightLocal_, oneSubFracHBroadcastLocal_, oneSubFracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);

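Mul sets src1BlkStride to 0, which turns it into a multiplication that broadcasts along the low dimension: each datablock of topLeftWeightLocal_ is multiplied with 8 consecutive datablocks of topLeftFeatureLocal_.
src1RepStride is 1, so the weight operand advances by one datablock per repeat.
repeatTimes_ equals $\frac{9C_i}{8\times 8}$, so $9C_i$ elements are processed in total.

For the multiplication to cover all 9 sampling points, the weights have to be laid out in segments of length ci/DATA_SIZE_PER_REPEAT, one segment per point.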
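The stride pattern {1, 1, 0, 8, 8, 1} is easier to follow when emulated on the host. The sketch below is a plain-Python model, not Ascend C. It assumes float32 data (8 elements per 32-byte datablock), 8 datablocks per repeat, and a mask that enables all 64 elements of a repeat. The function name mul_block_broadcast is hypothetical.

```python
BLOCK = 8              # float32 elements per 32-byte datablock (assumed)
BLOCKS_PER_REPEAT = 8  # datablocks handled by one repeat (assumed)

def mul_block_broadcast(src0, src1, repeat_times):
    """Emulate Mul with repeat params {1, 1, 0, 8, 8, 1} on flat lists."""
    dst = [0.0] * (repeat_times * BLOCKS_PER_REPEAT * BLOCK)
    for r in range(repeat_times):
        # src1RepStride = 1: src1 advances by one datablock per repeat
        w_block = src1[r * BLOCK:(r + 1) * BLOCK]
        for b in range(BLOCKS_PER_REPEAT):
            # dst/src0: RepStride = 8 and BlkStride = 1, i.e. contiguous data
            base = (r * BLOCKS_PER_REPEAT + b) * BLOCK
            for i in range(BLOCK):
                # src1BlkStride = 0: the same weight datablock is reused
                dst[base + i] = src0[base + i] * w_block[i]
    return dst

c_in = 64                                   # one repeat per sampling point
repeat_times = 9 * c_in // (BLOCK * BLOCKS_PER_REPEAT)
features = [1.0] * (9 * c_in)
weights = []
for point in range(9):
    weights.extend([float(point + 1)] * BLOCK)  # one datablock per point
out = mul_block_broadcast(features, weights, repeat_times)
```

With C_i = 64 each sampling point owns exactly one weight datablock, so the weight buffer holds 9 × 8 = 72 valid elements, matching the $\frac{9C_i}{8}$ figure quoted earlier.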

    SetFlag<HardEvent::MTE3_V>(MTE3_VEventID);
    WaitFlag<HardEvent::MTE3_V>(MTE3_VEventID);
    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    Mul(outFeatureLocal_, topLeftFeatureLocal_, topLeftWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});

    int32_t topRight0 = topRightOffsetLocal_.GetValue(0);
    int32_t topRight1 = topRightOffsetLocal_.GetValue(1);
    int32_t topRight2 = topRightOffsetLocal_.GetValue(2);
    int32_t topRight3 = topRightOffsetLocal_.GetValue(3);
    int32_t topRight4 = topRightOffsetLocal_.GetValue(4);
    int32_t topRight5 = topRightOffsetLocal_.GetValue(5);
    int32_t topRight6 = topRightOffsetLocal_.GetValue(6);
    int32_t topRight7 = topRightOffsetLocal_.GetValue(7);
    int32_t topRight8 = topRightOffsetLocal_.GetValue(8);
    (topRight0 == -1.0f) ? Duplicate(topRightFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[0 * cIn_], xGm_[topRight0], cIn_);
    (topRight1 == -1.0f) ? Duplicate(topRightFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[1 * cIn_], xGm_[topRight1], cIn_);
    (topRight2 == -1.0f) ? Duplicate(topRightFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[2 * cIn_], xGm_[topRight2], cIn_);
    (topRight3 == -1.0f) ? Duplicate(topRightFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[3 * cIn_], xGm_[topRight3], cIn_);
    (topRight4 == -1.0f) ? Duplicate(topRightFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[4 * cIn_], xGm_[topRight4], cIn_);
    (topRight5 == -1.0f) ? Duplicate(topRightFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[5 * cIn_], xGm_[topRight5], cIn_);
    (topRight6 == -1.0f) ? Duplicate(topRightFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[6 * cIn_], xGm_[topRight6], cIn_);
    (topRight7 == -1.0f) ? Duplicate(topRightFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[7 * cIn_], xGm_[topRight7], cIn_);
    (topRight8 == -1.0f) ? Duplicate(topRightFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[8 * cIn_], xGm_[topRight8], cIn_);
    Mul(topRightWeightLocal_, oneSubFracHBroadcastLocal_, fracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);
    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    MulAddDst(outFeatureLocal_, topRightFeatureLocal_, topRightWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});

    int32_t bottomLeft0 = bottomLeftOffsetLocal_.GetValue(0);
    int32_t bottomLeft1 = bottomLeftOffsetLocal_.GetValue(1);
    int32_t bottomLeft2 = bottomLeftOffsetLocal_.GetValue(2);
    int32_t bottomLeft3 = bottomLeftOffsetLocal_.GetValue(3);
    int32_t bottomLeft4 = bottomLeftOffsetLocal_.GetValue(4);
    int32_t bottomLeft5 = bottomLeftOffsetLocal_.GetValue(5);
    int32_t bottomLeft6 = bottomLeftOffsetLocal_.GetValue(6);
    int32_t bottomLeft7 = bottomLeftOffsetLocal_.GetValue(7);
    int32_t bottomLeft8 = bottomLeftOffsetLocal_.GetValue(8);
    (bottomLeft0 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[0 * cIn_], xGm_[bottomLeft0], cIn_);
    (bottomLeft1 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[1 * cIn_], xGm_[bottomLeft1], cIn_);
    (bottomLeft2 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[2 * cIn_], xGm_[bottomLeft2], cIn_);
    (bottomLeft3 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[3 * cIn_], xGm_[bottomLeft3], cIn_);
    (bottomLeft4 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[4 * cIn_], xGm_[bottomLeft4], cIn_);
    (bottomLeft5 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[5 * cIn_], xGm_[bottomLeft5], cIn_);
    (bottomLeft6 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[6 * cIn_], xGm_[bottomLeft6], cIn_);
    (bottomLeft7 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[7 * cIn_], xGm_[bottomLeft7], cIn_);
    (bottomLeft8 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[8 * cIn_], xGm_[bottomLeft8], cIn_);
    Mul(bottomLeftWeightLocal_, oneSubFracWBroadcastLocal_, fracHBroadcastLocal_, 9 * dataBlockPerInputChannel_);
    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    MulAddDst(outFeatureLocal_, bottomLeftFeatureLocal_, bottomLeftWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});

    int32_t bottomRight0 = bottomRightOffsetLocal_.GetValue(0);
    int32_t bottomRight1 = bottomRightOffsetLocal_.GetValue(1);
    int32_t bottomRight2 = bottomRightOffsetLocal_.GetValue(2);
    int32_t bottomRight3 = bottomRightOffsetLocal_.GetValue(3);
    int32_t bottomRight4 = bottomRightOffsetLocal_.GetValue(4);
    int32_t bottomRight5 = bottomRightOffsetLocal_.GetValue(5);
    int32_t bottomRight6 = bottomRightOffsetLocal_.GetValue(6);
    int32_t bottomRight7 = bottomRightOffsetLocal_.GetValue(7);
    int32_t bottomRight8 = bottomRightOffsetLocal_.GetValue(8);
    (bottomRight0 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[0 * cIn_], xGm_[bottomRight0], cIn_);
    (bottomRight1 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[1 * cIn_], xGm_[bottomRight1], cIn_);
    (bottomRight2 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[2 * cIn_], xGm_[bottomRight2], cIn_);
    (bottomRight3 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[3 * cIn_], xGm_[bottomRight3], cIn_);
    (bottomRight4 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[4 * cIn_], xGm_[bottomRight4], cIn_);
    (bottomRight5 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[5 * cIn_], xGm_[bottomRight5], cIn_);
    (bottomRight6 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[6 * cIn_], xGm_[bottomRight6], cIn_);
    (bottomRight7 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[7 * cIn_], xGm_[bottomRight7], cIn_);
    (bottomRight8 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[8 * cIn_], xGm_[bottomRight8], cIn_);
    Mul(bottomRightWeightLocal_, fracHBroadcastLocal_, fracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);
    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    MulAddDst(outFeatureLocal_, bottomRightFeatureLocal_, bottomRightWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});
}
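Taken as a whole, CopyInFeature gathers C_i channels for each of the 9 taps at four corners, substitutes zeros where the offset is the sentinel -1, and accumulates the weighted sum. The sketch below is a scalar Python model of that data flow. The name interpolate_taps is hypothetical, offsets are assumed to be element offsets into a flattened NHWC tensor, and one scalar weight per tap stands in for the broadcast weight datablocks.

```python
def interpolate_taps(x_flat, corner_offsets, corner_weights, c_in):
    """x_flat:         flattened NHWC feature map
    corner_offsets: 4 lists (one per corner) of 9 offsets, -1 = out of bounds
    corner_weights: 4 lists (one per corner) of 9 scalar weights
    Returns 9 * c_in interpolated values, i.e. one im2col row."""
    out = [0.0] * (9 * c_in)
    for offsets, weights in zip(corner_offsets, corner_weights):
        for tap, (off, w) in enumerate(zip(offsets, weights)):
            if off == -1:
                continue  # kernel zero-fills this corner, so it adds nothing
            for c in range(c_in):
                out[tap * c_in + c] += w * x_flat[off + c]
    return out

c_in = 2
x_flat = [float(v) for v in range(8)]       # 4 pixels with 2 channels each
missing = [-1] * 8
corner_offsets = [[0] + missing, [2] + missing, [4] + missing, [6] + missing]
corner_weights = [[0.25] * 9 for _ in range(4)]
row = interpolate_taps(x_flat, corner_offsets, corner_weights, c_in)
```

Only tap 0 has valid corners in this example, so the first C_i outputs are the average of the four pixels and every other output stays zero.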

DeformableConv2dV2Kernel::ProcessCube

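innerCubeTaskIdx is the index of the last element in the current batch. Indices are assumed to start at 0, so the number of im2col rows is cubeTaskCount = innerCubeTaskIdx + 1.
elementsCountPerTask_ is $k_h k_w C_i$.
aOffset and cOffset are the current core's starting offsets into the A and C matrices, respectively.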
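The index arithmetic can be checked with concrete numbers. The helper below mirrors the three assignments at the top of ProcessCube. The function name cube_task_range and the example sizes (3x3 kernel, C_i = 64, C_o = 256) are assumptions chosen for illustration.

```python
def cube_task_range(task_idx, inner_cube_task_idx, elements_per_task, c_out):
    """Return (row count, A offset, C offset) for one batch of cube work."""
    cube_task_count = inner_cube_task_idx + 1     # batch indices start at 0
    first_task = task_idx - inner_cube_task_idx   # first row of this batch
    a_offset = first_task * elements_per_task     # start row in im2col (A)
    c_offset = first_task * c_out                 # start row in output (C)
    return cube_task_count, a_offset, c_offset

elements_per_task = 3 * 3 * 64                    # k_h * k_w * C_i
count, a_off, c_off = cube_task_range(255, 127, elements_per_task, 256)
```

Here task 255 closes the second full batch of 128 rows, so the batch starts at row 128 in both A and C.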

template<bool modulated>
__aicore__ inline void DeformableConv2dV2Kernel<modulated>::ProcessCube(uint32_t taskIdx, const int32_t& innerCubeTaskIdx)
{
    int32_t cubeTaskCount = innerCubeTaskIdx + 1;
    uint64_t aOffset = (taskIdx - innerCubeTaskIdx) * elementsCountPerTask_;
    uint64_t cOffset = (taskIdx - innerCubeTaskIdx) * cOut_;

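SetTensorA sets the left matrix A of the matrix multiplication.
SetTensorB sets the right matrix B of the matrix multiplication.
SetSingleShape sets the shape of the single-core Matmul computation (singleMIn, singleNIn, singleKIn), in elements.

IterateAll computes a C matrix of size singleCoreM * singleCoreN. The iteration order can be adjusted through the tiling parameter iterateOrder.

The im2col matrix has shape $128 \times k_h k_w C_i$, the weight has shape $C_o \times k_h k_w C_i$, and the output has shape $128 \times C_o$.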
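Since A is $M \times K$ and the weight is stored as $N \times K$, the product has to be $A B^{\top}$, which is presumably why the kernel passes true as the second argument of SetTensorB. The following plain-Python sketch shows the shape relationship on a tiny example. The function name matmul_bt is hypothetical, and nested lists stand in for the global-memory tensors.

```python
def matmul_bt(a, b):
    """Compute C = A @ B^T for row-major nested lists.
    a: M x K (im2col rows), b: N x K (weight, one row per output channel)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]

a = [[1, 2, 3],
     [4, 5, 6]]            # M = 2 output points, K = 3
b = [[1, 0, 0],
     [0, 1, 0],
     [1, 1, 1],
     [0, 0, 2]]            # N = 4 output channels, K = 3
c = matmul_bt(a, b)        # M x N = 2 x 4
```

In the kernel, M is cubeTaskCount (up to 128), N is cOut_, and K is elementsCountPerTask_, matching the arguments of SetSingleShape.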

    mm_.SetTensorA(img2colMatGm_[aOffset]);
    mm_.SetTensorB(weightGm_, true);
    mm_.SetSingleShape(cubeTaskCount, cOut_, elementsCountPerTask_);
    mm_.template IterateAll<false>(yGm_[cOffset]);
}
