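modulated_deform_conv2d in Ascend DrivingSDK (Part 1)
Ascend DrivingSDK is an operator and model acceleration library for autonomous-driving workloads, built on the Ascend NPU platform. It provides a set of high-performance operators and model acceleration interfaces and supports the PyTorch framework.
modulated_deform_conv2d is one of the few fused operators in Ascend DrivingSDK: a single kernel performs the whole Deformable Convolution. However, the 910B separates vector cores from cube cores, so synchronization between the two is expensive. 910B-series chips have as much as 96 MB to 192 MB of L2 cache, enabled by default, so the inputs of modulated_deform_conv2d mostly reside in L2.
At the Ascend C level, modulated_deform_conv2d is backed by two operators; v2 is optimized for 3x3 convolutions. Oddly, the developers added a new operator rather than a new kernel. For reasons unknown, the two versions also take their parameters in different orders.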
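op | PreProcess | ComputeWeight | ComputeBilinearInterpolation | ProcessCube | Output |
---|---|---|---|---|---|
deformable_conv2d | Caches the convolution-window indices of one row and reuses them $H_o$ times | Vectorized weight computation for $W_o k_h k_w$ points | Loads $4C_i$ values in a single copy, then interpolates | Computes $W_o$ results per call | Stores im2col for computing $\frac{\partial L}{\partial W}$ |
deformable_conv2d_v2 | / | Vectorized weight computation for $k_h k_w$ points | Issues $k_h k_w$ load instructions of $C_i$ values each, then interpolates | Computes 128 output points per call | / |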
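v2 outperforms v1, most likely because v2 synchronizes its memory copies less often (9:4). Global-memory latency is high: v1 copies $4C_i$ values with one instruction and then synchronizes, while v2 loads $C_i$ values per instruction but synchronizes only after 9 instructions. v1 caches im2col to save computation, but the saving is small, because computing $\frac{\partial L}{\partial \Delta \mathbf{m}_n}$ still needs the bilinear interpolation results.
Both operators are functionally incomplete: they do not support the half type, deform_group, bias, and so on. The v2 operator is even more bare-bones. On top of that there is no documentation and no input validation, so using them takes some trial and error. It is hard to believe this is industrial-grade code, let alone automotive-grade. Its only merit is that, like AMD, it is open source, leaving users to track down and fix problems themselves.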
ModulatedDeformConv2dFunction
Converts the inputs to NHWC layout.

    class ModulatedDeformConv2dFunction(Function):
        @staticmethod
        @custom_fwd(cast_inputs=torch.float32)
        # pylint: disable=huawei-too-many-arguments
        def forward(
            ctx,
            x: torch.Tensor,
            offset: torch.Tensor,
            mask: torch.Tensor,
            weight: torch.Tensor,
            bias: Optional[nn.Parameter] = None,
            stride: Union[int, Tuple[int, ...]] = 1,
            padding: Union[int, Tuple[int, ...]] = 0,
            dilation: Union[int, Tuple[int, ...]] = 1,
            groups: int = 1,
            deformable_groups: int = 1,
        ):
            ctx.kernel_size = [weight.size(2), weight.size(3)]
            ctx.stride = _pair(stride)
            ctx.padding = _pair(padding)
            ctx.dilation = _pair(dilation)
            ctx.groups = groups
            ctx.deformable_groups = deformable_groups
            nhwc_x = x.permute(0, 2, 3, 1).contiguous()
            nhwc_offset = offset.permute(0, 2, 3, 1).contiguous()
            nhwc_weight = weight.permute(0, 2, 3, 1).contiguous()
            nhwc_mask = mask.permute(0, 2, 3, 1).contiguous()
            out, offset_output = mx_driving._C.modulated_deformable_conv2d(
                nhwc_x, nhwc_offset, nhwc_mask, nhwc_weight, None,
                ctx.kernel_size, ctx.stride, ctx.padding, ctx.dilation,
                ctx.groups, ctx.deformable_groups, False,
            )
            ctx.save_for_backward(nhwc_x, nhwc_offset, nhwc_weight, nhwc_mask, offset_output)
            return out
ModulatedDeformConv2dFunction.backward
Transposes $\frac{\partial L}{\partial Y}$ to $N \times H_o \times C_o \times W_o$.

        @staticmethod
        @once_differentiable
        @custom_bwd
        # pylint: disable=huawei-too-many-arguments,too-many-return-values
        def backward(ctx, grad_out):
            nhwc_x, nhwc_offset, nhwc_weight, nhwc_mask, offset_output = ctx.saved_tensors
            nhwc_grad_out = grad_out.permute(0, 2, 1, 3).contiguous()
            grad_x, grad_weight, _, grad_offset, grad_mask = mx_driving._C.modulated_deformable_conv2d_backward(
                nhwc_x, nhwc_offset, nhwc_mask, nhwc_weight, None,
                offset_output, nhwc_grad_out,
                ctx.kernel_size, ctx.stride, ctx.padding, ctx.dilation,
                ctx.groups, ctx.deformable_groups, False,
            )
            return (
                grad_x, grad_offset, grad_mask, grad_weight,
                None, None, None, None, None, None,
            )
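The layout changes in forward and backward are easy to lose track of, so here is a minimal shape-bookkeeping sketch in plain Python. The helper permute_shape and the concrete sizes are hypothetical and only mimic what tensor.permute does to a shape; no PyTorch is involved.

```python
def permute_shape(shape, dims):
    """Return the shape that tensor.permute(*dims) would produce."""
    return tuple(shape[d] for d in dims)


# Hypothetical sizes: N=2, C_i=256, H=W=32, C_o=128, 3x3 kernel.
x_nchw = (2, 256, 32, 32)          # (N, C_i, H_i, W_i)
weight_oihw = (128, 256, 3, 3)     # (C_o, C_i, k_h, k_w)
grad_out_nchw = (2, 128, 32, 32)   # (N, C_o, H_o, W_o)

# forward: everything goes to channels-last
x_nhwc = permute_shape(x_nchw, (0, 2, 3, 1))            # (N, H_i, W_i, C_i)
weight_ohwi = permute_shape(weight_oihw, (0, 2, 3, 1))  # (C_o, k_h, k_w, C_i)

# backward: only dims 1 and 2 are swapped, giving (N, H_o, C_o, W_o)
grad_out_nhcw = permute_shape(grad_out_nchw, (0, 2, 1, 3))
```

Note that the variable named nhwc_grad_out in backward therefore holds an $N \times H_o \times C_o \times W_o$ tensor, not an NHWC one.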
modulated_deformable_conv2d
TORCH_CHECK_NPU checks that every input tensor is stored on an NPU device.

    std::tuple<at::Tensor, at::Tensor> modulated_deformable_conv2d(const at::Tensor& input, const at::Tensor& offset,
        const at::Tensor& mask, const at::Tensor& weight, const c10::optional<at::Tensor>& bias_opt,
        at::IntArrayRef kernel_size, at::IntArrayRef stride, at::IntArrayRef padding, at::IntArrayRef dilation,
        int64_t groups, int64_t deformable_groups, int64_t with_bias)
    {
        TORCH_CHECK_NPU(input);
        TORCH_CHECK_NPU(offset);
        TORCH_CHECK_NPU(mask);
        TORCH_CHECK_NPU(weight);

Checks the dimensions and parameters.

        TORCH_CHECK(input.dim() == INPUT_DIM, "input must to be a 4D Tensor, but got: ", input.dim());
        TORCH_CHECK(offset.dim() == INPUT_DIM, "offset has to be a 4D Tensor, but got: ", offset.dim());
        TORCH_CHECK(mask.dim() == INPUT_DIM, "mask has to be a 4D Tensor, but got: ", mask.dim());
        TORCH_CHECK(weight.dim() == INPUT_DIM, "weight has to be a 4D Tensor, but got: ", weight.dim());
        TORCH_CHECK(stride[0] > 0 && stride[1] > 0, "stride must be greater than 0");
        TORCH_CHECK(kernel_size[0] > 0 && kernel_size[1] > 0, "kernel_size must be greater than 0");
        TORCH_CHECK(dilation[0] > 0 && dilation[1] > 0, "dilation must be greater than 0");

c10::value_or_else is deprecated; std::optional::value_or is the recommended replacement. Here it is used to handle the optional bias_opt safely.

        const at::Tensor& bias = c10::value_or_else(bias_opt, [] { return at::Tensor(); });
        uint32_t n = static_cast<uint32_t>(input.size(0));
        uint32_t c_in = static_cast<uint32_t>(input.size(3));
        uint32_t h_in = static_cast<uint32_t>(input.size(1));
        uint32_t w_in = static_cast<uint32_t>(input.size(2));
        uint32_t h_out = static_cast<uint32_t>(offset.size(1));
        uint32_t w_out = static_cast<uint32_t>(offset.size(2));
        uint32_t c_out = static_cast<uint32_t>(weight.size(0));
        uint32_t kh = static_cast<uint32_t>(weight.size(1));
        uint32_t kw = static_cast<uint32_t>(weight.size(2));
        TORCH_CHECK(kh == kernel_size[0] && kw == kernel_size[1], "kernel size mismatch");
        TORCH_CHECK(mask.size(-1) == kh * kw, "The shape of the mask is invalid");
        TORCH_CHECK(groups > 0, "groups must be greater than 0");
        TORCH_CHECK(c_out % groups == 0, "weight's out channel should be divided by groups");
        TORCH_CHECK(c_in % groups == 0, "input's channel should be divided by groups");
        bool modulated = true;
For an ungrouped convolution with 256 or 512 input channels, DeformableConv2dV2 is called; otherwise DeformableConv2d is called. The two operators take their arguments in different orders.
DeformableConv2dV2 has two outputs: output has shape $N \times H_o \times W_o \times C_o$, and offset_output has shape $N \times H_o W_o \times k_h k_w \times C_i$.
DeformableConv2d has two outputs: output has shape $N \times H_o \times C_o \times W_o$, and offset_output has shape $N \times H_o \times W_o \times G \times \frac{k_h k_w C_i}{G}$.
Note: DeformableConv2dV2 requires $k_h k_w = 9$, but no such check is made here.

        if ((groups == 1) && ((c_in == CHANNEL_256) || (c_in == CHANNEL_512))) {
            at::Tensor output = at::empty({n, h_out, w_out, c_out}, input.options());
            at::Tensor offset_output = at::empty({n, h_out * w_out, kh * kw, c_in}, input.options());
            EXEC_NPU_CMD(aclnnDeformableConv2dV2, input, offset, mask, weight, bias, kernel_size, stride, padding,
                dilation, groups, deformable_groups, modulated, with_bias, output, offset_output);
            output = output.permute({0, 3, 1, 2});
            return std::tie(output, offset_output);
        } else {
            at::Tensor output = at::empty({n, h_out, c_out, w_out}, input.options());
            at::Tensor offset_output = at::empty({n, h_out, w_out, groups, kh * kw * c_in / groups}, input.options());
            EXEC_NPU_CMD(aclnnDeformableConv2d, input, weight, bias, offset, mask, kernel_size, stride, padding,
                dilation, groups, deformable_groups, modulated, with_bias, output, offset_output);
            output = output.permute({0, 2, 1, 3});
            return std::tie(output, offset_output);
        }
    }
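The dispatch rule and the two output layouts can be restated as a small Python sketch. The function select_kernel is hypothetical (it is not part of DrivingSDK) and simply mirrors the branch in the C++ code above.

```python
CHANNEL_256 = 256
CHANNEL_512 = 512


def select_kernel(n, h_out, w_out, c_in, c_out, kh, kw, groups):
    """Mirror the branch above: return (operator name, output shape, offset_output shape)."""
    if groups == 1 and c_in in (CHANNEL_256, CHANNEL_512):
        # v2 path: channels-last output, im2col rows indexed by H_o * W_o
        return ("DeformableConv2dV2",
                (n, h_out, w_out, c_out),
                (n, h_out * w_out, kh * kw, c_in))
    # v1 path: output is N x H_o x C_o x W_o, im2col is split per group
    return ("DeformableConv2d",
            (n, h_out, c_out, w_out),
            (n, h_out, w_out, groups, kh * kw * c_in // groups))


v2 = select_kernel(1, 16, 16, 256, 128, 3, 3, groups=1)
v1 = select_kernel(1, 16, 16, 64, 128, 3, 3, groups=2)
```

A 5x5 kernel with 256 input channels would also take the v2 path in this sketch, which is exactly the missing $k_h k_w = 9$ check noted above.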
DeformableConv2dKernel::Process
    template<bool modulated>
    __aicore__ inline void DeformableConv2dKernel<modulated>::Process()
    {
        PreProcess();
        for (uint32_t taskIdx = start_; taskIdx < end_; ++taskIdx) {
            ProcessVector(taskIdx);
            ProcessCube(taskIdx);
        }
        mm_.End();
    }
DeformableConv2dKernel::PreProcess
All vector cores cooperate to compute the indices needed for one output row, in an Allgather-like pattern.
TBuf::Get obtains a Tensor of a specified length, or of the full length, from a TBuf.
auxH and auxW each hold $W_o k_h k_w$ elements and store the convolution-window coordinates $(h_i, w_i)$ for one output row.

$$\begin{aligned} w_{i\_start} &= w_o \cdot s - p \\ h_{i\_start} &= h_o \cdot s - p \end{aligned}$$

Because auxH and auxW are computed once and then reused as indices for many output rows, auxH holds only the relative offset within the window, without the $h_o \cdot s$ term.
auxStart_ is the first index the current core is responsible for.

    template<bool modulated>
    __aicore__ inline void DeformableConv2dKernel<modulated>::PreProcess()
    {
        LocalTensor<float> auxH = auxHBuf_.Get<float>();
        LocalTensor<float> auxW = auxWBuf_.Get<float>();
        uint32_t idx = 0;
        for (int32_t w = auxStart_; w < auxEnd_; ++w) {
            for (int32_t i = 0; i < kH_; ++i) {
                for (int32_t j = 0; j < kW_; ++j) {
                    auxW.SetValue(idx, static_cast<float>(w * strideW_ - padW_ + j * dilationW_));
                    auxH.SetValue(idx, static_cast<float>(-padH_ + i * dilationH_));
                    ++idx;
                }
            }
        }
GlobalTensor::operator[] returns a new GlobalTensor shifted by the given offset.
valRptTimes_ is the number of repeats needed to cover $C_i$ elements.
Each core copies its locally computed indices to auxWGm_ and auxHGm_.

        DataCopyPad(auxWGm_[auxStart_ * kernelSize_], auxW,
            {1, static_cast<uint16_t>(B32_BYTE_SIZE * (auxEnd_ - auxStart_) * kernelSize_), 0, 0});
        DataCopyPad(auxHGm_[auxStart_ * kernelSize_], auxH,
            {1, static_cast<uint16_t>(B32_BYTE_SIZE * (auxEnd_ - auxStart_) * kernelSize_), 0, 0});
        SyncAll();

After the synchronization, each core copies the complete set of indices for one output row back from global memory.
Note: one concern is that these two global-memory accesses have high latency. Kernels are usually 3x3, so when $W_o$ is small it may be cheaper for each core to compute all the indices itself than to compute them cooperatively.

        DataCopy(auxW, auxWGm_, {1, rowOffsetBlk_, 0, 0});
        DataCopy(auxH, auxHGm_, {1, rowOffsetBlk_, 0, 0});

Clears feature to zero.

        LocalTensor<float> feature = featureBuf_.Get<float>();
        Duplicate<float, false>(feature, 0.f, MASK_PLACEHOLDER, 4 * valRptTimes_, 1, 8);
    }
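To make the index precomputation concrete, here is a single-core Python sketch of what PreProcess produces for one output row. The names build_aux and row_coords are hypothetical, and the multi-core split over auxStart_/auxEnd_ is deliberately left out.

```python
def build_aux(w_out, kh, kw, stride_w, pad_h, pad_w, dil_h, dil_w):
    """Window coordinates for one output row, W_o * k_h * k_w entries each.

    aux_w is the absolute column index; aux_h is only the offset inside
    the window, because the row term h_o * stride_h is added later.
    """
    aux_w, aux_h = [], []
    for w in range(w_out):
        for i in range(kh):
            for j in range(kw):
                aux_w.append(float(w * stride_w - pad_w + j * dil_w))
                aux_h.append(float(-pad_h + i * dil_h))
    return aux_w, aux_h


def row_coords(aux_h, h_o, stride_h):
    """Turn the relative aux_h into absolute input rows for output row h_o."""
    return [v + float(h_o * stride_h) for v in aux_h]


aux_w, aux_h = build_aux(w_out=2, kh=3, kw=3, stride_w=1,
                         pad_h=1, pad_w=1, dil_h=1, dil_w=1)
```

Because aux_h does not depend on h_o, the same two arrays serve every output row a core processes.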
DeformableConv2dKernel::ProcessVector
    template<bool modulated>
    __aicore__ inline void DeformableConv2dKernel<modulated>::ProcessVector(uint32_t taskIdx)
    {
        uint32_t batch = taskIdx / hOut_;
        srcOffset_ = batch * hIn_ * wIn_ * cIn_;
        dstOffset_ = taskIdx * rowIn_;
        LocalTensor<float> offset = offsetBuf_.Get<float>();
        LocalTensor<float> auxW = auxWBuf_.Get<float>();
        LocalTensor<float> auxH = auxHBuf_.Get<float>();
        LocalTensor<int32_t> offsetInt = offsetIntBuf_.Get<int32_t>();
        LocalTensor<float> weight = weightBuf_.Get<float>();
        LocalTensor<float> feature = featureBuf_.Get<float>();
        LocalTensor<float> mask;
        if (modulated) {
            mask = maskBuf_.Get<float>();
        }
        LocalTensor<float> offsetOutput = offsetOutputBuf_.Get<float>();

DeformableConv2dKernel::CopyInOffset copies one row of $\Delta p_n$ and $\Delta m$ and deinterleaves $\Delta p_n$.
DeformableConv2dKernel::ComputeWeight computes the sampling positions and the interpolation weights.

        CopyInOffset(taskIdx, offset, mask);
        ComputeWeight(taskIdx, auxW, auxH, offset, offsetInt, weight, mask);
        SetFlag<HardEvent::V_MTE2>(calEvt_);
        WaitFlag<HardEvent::V_MTE2>(calEvt_);
        SetFlag<HardEvent::MTE3_V>(0);
        SetFlag<HardEvent::MTE3_V>(1);
        uint8_t ping = 0;

DeformableConv2dKernel::ComputeBilinearInterpolation loads, computes, and stores.

        for (uint32_t w = 0; w < wOut_; ++w) {
            WaitFlag<HardEvent::MTE3_V>(ping);
            ComputeBilinearInterpolation(w, offset, offsetInt, feature, weight, offsetOutput[ping * kwIn_]);
            SetFlag<HardEvent::MTE3_V>(ping);
            ping = 1 - ping;
        }
        WaitFlag<HardEvent::MTE3_V>(0);
        WaitFlag<HardEvent::MTE3_V>(1);
    }
DeformableConv2dKernel::CopyInOffset
    template<bool modulated>
    __aicore__ inline void DeformableConv2dKernel<modulated>::CopyInOffset(
        uint32_t taskIdx, const LocalTensor<float>& offset, const LocalTensor<float>& mask)
    {
        uint32_t offsetIdx = taskIdx * rowOffset_ * 2;
        DataCopy(offset, offsetGm_[offsetIdx], {1, doubleRowOffsetBlk_, 0, 0});
        if (modulated) {
            DataCopy(mask, maskGm_[taskIdx * rowOffset_], {1, rowOffsetBlk_, 0, 0});
        }
        SetFlag<HardEvent::MTE2_V>(copyEvt_);
        WaitFlag<HardEvent::MTE2_V>(copyEvt_);
        uint64_t cnt;
        GatherMask(offset[2 * alignedRowOffset_], offset, 2, false, MASK_PLACEHOLDER, gatherParams_, cnt);
        GatherMask(offset[3 * alignedRowOffset_], offset, 1, false, MASK_PLACEHOLDER, gatherParams_, cnt);
        SetVectorMask<float>(FULL_MASK, FULL_MASK);
    }
DeformableConv2dKernel::ComputeWeight
offset is scratch space, while offsetInt and weight are outputs, yet all of them are passed by const reference.

    template<bool modulated>
    __aicore__ inline void DeformableConv2dKernel<modulated>::ComputeWeight(uint32_t taskIdx,
        const LocalTensor<float>& auxW, const LocalTensor<float>& auxH, const LocalTensor<float>& offset,
        const LocalTensor<int32_t>& offsetInt, const LocalTensor<float>& weight, const LocalTensor<float>& mask)
    {

offset holds $4 \times W_o k_h k_w$ elements and serves as temporary storage.
A Copy instruction fetches $x_i$.
h is $y_o$; adding $y_o \cdot s$ to auxH gives the actual coordinate $y_i$.
The first half of offset then holds the convolution-window index $p + p_n$.

        int32_t h = taskIdx % hOut_;
        Copy<float, false>(offset, auxW, MASK_PLACEHOLDER, rptTimes_, {1, 1, 8, 8});
        Adds<float, false>(offset[alignedRowOffset_], auxH, float(h * strideH_), MASK_PLACEHOLDER, rptTimes_,
            {1, 1, 8, 8});
Because the memory is contiguous, a single Add instruction computes the floating-point coordinates $p + p_n + \Delta p_n$.

        Add<float, false>(offset, offset, offset[2 * alignedRowOffset_], MASK_PLACEHOLDER, 2 * rptTimes_,
            {1, 1, 1, 8, 8, 8});

offsetInt receives the integer coordinates $(y_1, x_1)$, and the second half of offset stores the top-left coordinates as floats.

        Cast<int32_t, float, false>(offsetInt, offset, RoundMode::CAST_FLOOR, MASK_PLACEHOLDER, 2 * rptTimes_,
            {1, 1, 8, 8});
        Cast<float, int32_t, false>(offset[2 * alignedRowOffset_], offsetInt, RoundMode::CAST_NONE, MASK_PLACEHOLDER,
            2 * rptTimes_, {1, 1, 8, 8});
The first half holds the differences $y - y_1$ and $x - x_1$.
weight is filled with 1, so the second half becomes $y_2 - y$ and $x_2 - x$.

        Sub<float, false>(offset, offset, offset[2 * alignedRowOffset_], MASK_PLACEHOLDER, 2 * rptTimes_,
            {1, 1, 1, 8, 8, 8}); // lw, lh
        Duplicate<float, false>(weight, 1.f, MASK_PLACEHOLDER, 2 * rptTimes_, 1, 8);
        Sub<float, false>(offset[2 * alignedRowOffset_], weight, offset, MASK_PLACEHOLDER, 2 * rptTimes_,
            {1, 1, 1, 8, 8, 8}); // hw, hh

Multiplying the two dimensions gives the weights of the 4 points:

$$\begin{aligned} w_{11} &= (x_2 - x)(y_2 - y) \\ w_{21} &= (x - x_1)(y_2 - y) \\ w_{12} &= (x_2 - x)(y - y_1) \\ w_{22} &= (x - x_1)(y - y_1) \end{aligned}$$

        Mul<float, false>(weight, offset[2 * alignedRowOffset_], offset[3 * alignedRowOffset_], MASK_PLACEHOLDER,
            rptTimes_, {1, 1, 1, 8, 8, 8}); // hw * hh
        Mul<float, false>(weight[alignedRowOffset_], offset, offset[3 * alignedRowOffset_], MASK_PLACEHOLDER,
            rptTimes_, {1, 1, 1, 8, 8, 8}); // lw * hh
        Mul<float, false>(weight[2 * alignedRowOffset_], offset[alignedRowOffset_], offset[2 * alignedRowOffset_],
            MASK_PLACEHOLDER, rptTimes_, {1, 1, 1, 8, 8, 8}); // hw * lh
        Mul<float, false>(weight[3 * alignedRowOffset_], offset, offset[alignedRowOffset_], MASK_PLACEHOLDER,
            rptTimes_, {1, 1, 1, 8, 8, 8}); // lh * lw
The modulation weight $\Delta m$ is multiplied into the 4 interpolation weights.
weight has shape $4 \times W_o k_h k_w$ while mask has shape $W_o k_h k_w$; because the lengths differ, Mul has to be called 4 times.
The stale comments copied from the previous block were not removed.

        if (modulated) {
            Mul<float, false>(weight, weight, mask, MASK_PLACEHOLDER, rptTimes_, {1, 1, 1, 8, 8, 8});
            Mul<float, false>(weight[alignedRowOffset_], weight[alignedRowOffset_], mask, MASK_PLACEHOLDER,
                rptTimes_, {1, 1, 1, 8, 8, 8}); // lw * hh
            Mul<float, false>(weight[2 * alignedRowOffset_], weight[2 * alignedRowOffset_], mask, MASK_PLACEHOLDER,
                rptTimes_, {1, 1, 1, 8, 8, 8}); // hw * lh
            Mul<float, false>(weight[3 * alignedRowOffset_], weight[3 * alignedRowOffset_], mask, MASK_PLACEHOLDER,
                rptTimes_, {1, 1, 1, 8, 8, 8}); // lh * lw
        }
    }
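As a scalar reference for what ComputeWeight computes per sampling point, the following Python sketch derives the four modulated bilinear weights from one floating-point coordinate. The function bilinear_weights is hypothetical; the lw/lh/hw/hh names follow the comments in the kernel.

```python
import math


def bilinear_weights(x, y, mask=1.0):
    """Return the floored corner (x1, y1) and the weights (w11, w21, w12, w22).

    lw, lh are the fractional parts x - x1 and y - y1;
    hw, hh are their complements x2 - x and y2 - y (with x2 = x1 + 1).
    """
    x1, y1 = math.floor(x), math.floor(y)
    lw, lh = x - x1, y - y1
    hw, hh = 1.0 - lw, 1.0 - lh
    w11 = hw * hh * mask  # corner (y1, x1)
    w21 = lw * hh * mask  # corner (y1, x2)
    w12 = hw * lh * mask  # corner (y2, x1)
    w22 = lw * lh * mask  # corner (y2, x2)
    return (x1, y1), (w11, w21, w12, w22)


corner, weights = bilinear_weights(2.25, 3.5)
_, half = bilinear_weights(2.25, 3.5, mask=0.5)
```

Without modulation the four weights sum to 1; with a mask value $\Delta m$ they sum to $\Delta m$, which is a handy sanity check when comparing against the NPU result.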
DeformableConv2dKernel::ComputeBilinearInterpolation
offset is unused.

    template<bool modulated>
    __aicore__ inline void DeformableConv2dKernel<modulated>::ComputeBilinearInterpolation(uint32_t w,
        const LocalTensor<float>& offset, const LocalTensor<int32_t>& offsetInt, const LocalTensor<float>& feature,
        const LocalTensor<float>& weight, const LocalTensor<float>& offsetOutput)
    {

First, offsetOutput is cleared to zero; its shape is $k_h k_w C_i$.

        Duplicate<float, false>(offsetOutput, 0.f, MASK_PLACEHOLDER, kernelSize_ * valRptTimes_, 1, 8);
        uint8_t ping = 0;
        uint32_t kernelOffset = w * kernelSize_;
        SetFlag<HardEvent::V_MTE2>(0);
        SetFlag<HardEvent::V_MTE2>(1);
$x_o$ is passed in as w.
pw and ph are indices into the arrays.
gmOffset is the flattened offset of an input point.
SetFlag is the synchronization instruction between different pipelines within one core.
The Ascend C best practices recommend moving data in blocks that are as large as possible.

        #pragma bisheng auto_sync parallel
        for (uint32_t kIdx = 0; kIdx < kernelSize_; ++kIdx) {
            uint32_t pw = kIdx + kernelOffset;
            uint32_t ph = pw + alignedRowOffset_;
            int32_t w0 = offsetInt.GetValue(pw);
            int32_t h0 = offsetInt.GetValue(ph);
            int32_t w1 = w0 + 1;
            int32_t h1 = h0 + 1;
            uint32_t outOffset = kIdx * cIn_;
            uint32_t ftOffset = ping * featureOffset_;
            WaitFlag<HardEvent::V_MTE2>(ping);
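For each input point $(y, x)$, if $(y_1, x_1)$, $(y_1, x_2)$, $(y_2, x_1)$, and $(y_2, x_2)$ all lie inside the image, the 4 points are loaded in a single copy.
Axpy multiplies the source elements by a scalar and accumulates the products into the destination elements.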
            if (0 < h1 && h1 < hIn_) {
                if (0 < w1 && w1 < wIn_) {
                    uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
                    DataCopy(feature[ftOffset], xGm_[gmOffset], cpQuadValParams_);
                    SetFlag<HardEvent::MTE2_V>(copyEvt_);
                    WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
                        MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
                        MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
                        weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
                        weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                } else if (w1 == 0) {
                    uint64_t gmOffset = srcOffset_ + (h0 * wIn_) * cIn_;
                    DataCopy(feature[ftOffset + cIn_], xGm_[gmOffset], cpColDoubleValParams_);
                    SetFlag<HardEvent::MTE2_V>(copyEvt_);
                    WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
                        MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
                        weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                } else if (w1 == wIn_) {
                    uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
                    DataCopy(feature[ftOffset], xGm_[gmOffset], cpColDoubleValParams_);
                    SetFlag<HardEvent::MTE2_V>(copyEvt_);
                    WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
                        MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
                        weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                }
            } else if (h1 == 0) {
                if (0 < w1 && w1 < wIn_) {
                    uint64_t gmOffset = srcOffset_ + w0 * cIn_;
                    DataCopy(feature[ftOffset + 2 * cIn_], xGm_[gmOffset], cpRowDoubleValParams_);
                    SetFlag<HardEvent::MTE2_V>(copyEvt_);
                    WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
                        weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
                        weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                } else if (w1 == 0) {
                    uint64_t gmOffset = srcOffset_;
                    DataCopy(feature[ftOffset + 3 * cIn_], xGm_[gmOffset], cpOneValParams_);
                    SetFlag<HardEvent::MTE2_V>(copyEvt_);
                    WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
                        weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                } else if (w1 == wIn_) {
                    uint64_t gmOffset = srcOffset_ + w0 * cIn_;
                    DataCopy(feature[ftOffset + 2 * cIn_], xGm_[gmOffset], cpOneValParams_);
                    SetFlag<HardEvent::MTE2_V>(copyEvt_);
                    WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
                        weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                }
            } else if (h1 == hIn_) {
                if (0 < w1 && w1 < wIn_) {
                    uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
                    DataCopy(feature[ftOffset], xGm_[gmOffset], cpRowDoubleValParams_);
                    SetFlag<HardEvent::MTE2_V>(copyEvt_);
                    WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
                        MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
                        MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                } else if (w1 == 0) {
                    uint64_t gmOffset = srcOffset_ + (h0 * wIn_) * cIn_;
                    DataCopy(feature[ftOffset + cIn_], xGm_[gmOffset], cpOneValParams_);
                    SetFlag<HardEvent::MTE2_V>(copyEvt_);
                    WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
                        MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                } else if (w1 == wIn_) {
                    uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
                    DataCopy(feature[ftOffset], xGm_[gmOffset], cpOneValParams_);
                    SetFlag<HardEvent::MTE2_V>(copyEvt_);
                    WaitFlag<HardEvent::MTE2_V>(copyEvt_);
                    PipeBarrier<PIPE_V>();
                    Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
                        MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
                }
            }
            SetFlag<HardEvent::V_MTE2>(ping);
            ping = 1 - ping;
        }
The interpolated $k_h k_w C_i$ values are copied out in segments to the global-memory buffer offsetOutputGm_, whose shape is $G \times W_o \times \frac{k_h k_w C_i}{G}$.

        SetFlag<HardEvent::V_MTE3>(calEvt_);
        WaitFlag<HardEvent::V_MTE3>(calEvt_);
        for (uint32_t i = 0; i < groups_; ++i) {
            DataCopy(offsetOutputGm_[dstOffset_ + rowInPerGroup_ * i], offsetOutput[i * cInPerGroup_],
                cpOffsetOutParams_);
        }
        dstOffset_ += kwInPerGroup_;
        WaitFlag<HardEvent::V_MTE2>(0);
        WaitFlag<HardEvent::V_MTE2>(1);
    }
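The nine-way branch above boils down to deciding which of the four corners exist inside the image. A compact Python sketch of that decision follows; corners_to_load is a hypothetical helper, and it ignores how the kernel batches the loads into one, two, or four-point copies.

```python
def corners_to_load(h0, w0, h_in, w_in):
    """List the in-bounds corners among (h0, w0), (h0, w1), (h1, w0), (h1, w1).

    Mirrors the branch structure: a row or column pair is fully valid when
    0 < h1 < h_in, only the lower/right one when h1 == 0, only the
    upper/left one when h1 == h_in, and nothing otherwise.
    """
    def valid(lo, size):
        hi = lo + 1
        if 0 < hi < size:
            return [lo, hi]
        if hi == 0:
            return [hi]
        if hi == size:
            return [lo]
        return []

    return [(h, w) for h in valid(h0, h_in) for w in valid(w0, w_in)]


inside = corners_to_load(1, 1, 4, 4)
top_left = corners_to_load(-1, -1, 4, 4)
```

Corners that are skipped contribute nothing to the Axpy accumulation, which is equivalent to zero padding.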
DeformableConv2dKernel::ProcessCube
SetTensorB sets the right-hand matrix B of the matmul; the second argument marks B as transposed.
Batch Matmul is implemented with a loop: weight has shape $G \times \frac{C_o}{G} \times \frac{k_h k_w C_i}{G}$, im2col has shape $G \times W_o \times \frac{k_h k_w C_i}{G}$, and the output is $G \times \frac{C_o}{G} \times W_o$.
As a result, the multi-core output has shape $N \times H_o \times C_o \times W_o$.

    template<bool modulated>
    __aicore__ inline void DeformableConv2dKernel<modulated>::ProcessCube(uint32_t taskIdx)
    {
        uint64_t aOffset = 0;
        uint64_t bOffset = taskIdx * rowIn_;
        uint64_t cOffset = taskIdx * rowOut_;
        for (uint32_t i = 0; i < groups_; ++i) {
            mm_.SetTensorA(weightGm_[aOffset]);
            mm_.SetTensorB(offsetOutputGm_[bOffset], true);
            mm_.template IterateAll<false>(yGm_[cOffset]);
            aOffset += kernelPerGroup_;
            bOffset += rowInPerGroup_;
            cOffset += rowOutPerGroup_;
        }
    }
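The per-group product $A B^{T}$ described above can be written out in plain Python with nested lists. The function grouped_matmul_row is hypothetical and stands in for the loop of matmul calls for one output row.

```python
def grouped_matmul_row(weight, im2col):
    """weight: [G][C_o/G][K], im2col: [G][W_o][K] -> out: [G][C_o/G][W_o].

    K is k_h * k_w * C_i / G. For each group the result is weight_g @ im2col_g^T,
    which is why B is marked as transposed in SetTensorB.
    """
    out = []
    for w_g, col_g in zip(weight, im2col):
        out.append([[sum(a * b for a, b in zip(w_row, col_row))
                     for col_row in col_g]
                    for w_row in w_g])
    return out


# Hypothetical tiny case: G=1, C_o=2, K=2, W_o=3.
weight = [[[1, 2], [3, 4]]]
im2col = [[[1, 0], [0, 1], [1, 1]]]
out = grouped_matmul_row(weight, im2col)
```

Since $W_o$ ends up as the innermost dimension of each per-row result, stacking rows over $H_o$ gives the $N \times H_o \times C_o \times W_o$ layout that the host code later permutes back.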
DeformableConv2dV2Kernel::Process
DeformableConv2dV2Kernel::ProcessVector produces one row of the im2col matrix per call. Once cubeTileTaskCount_ rows have accumulated, DeformableConv2dV2Kernel::ProcessCube is called to perform the convolution.

    template<bool modulated>
    __aicore__ inline void DeformableConv2dV2Kernel<modulated>::Process()
    {
        for (int32_t taskIdx = start_; taskIdx < end_; taskIdx++) {
            ProcessVector(taskIdx);
            int32_t innerCubeTaskIdx = (taskIdx - start_) % cubeTileTaskCount_;
            bool startCubeFlag = (innerCubeTaskIdx == cubeTileTaskCount_ - 1) || (taskIdx == end_ - 1);
            if (startCubeFlag) {
                ProcessCube(taskIdx, innerCubeTaskIdx);
            }
        }
        mm_.End();
    }
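The batching condition is easy to check by simulation. In this Python sketch, cube_launch_points (a hypothetical name) lists every task index at which ProcessCube would fire, together with the number of im2col rows accumulated at that point.

```python
def cube_launch_points(start, end, tile):
    """Return (task_idx, rows_in_tile) for each ProcessCube launch.

    A launch happens when a tile of `tile` rows is full, or on the last
    task of the core so that a partial tile is still flushed.
    """
    launches = []
    for task in range(start, end):
        inner = (task - start) % tile
        if inner == tile - 1 or task == end - 1:
            launches.append((task, inner + 1))
    return launches


launches = cube_launch_points(start=0, end=10, tile=4)
```

With 10 tasks and a tile of 4, the last launch carries only 2 rows, which is why ProcessCube needs innerCubeTaskIdx as a second argument.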
DeformableConv2dV2Kernel::ProcessVector
Each call handles the input that corresponds to one point of the convolution output feature map, i.e. one row of the im2col matrix. The values inside one kernel window are flattened into a vector and written to img2colMatGm_.
taskIdx is decoded into the corresponding (n, h_out, w_out).

    template<bool modulated>
    __aicore__ inline void DeformableConv2dV2Kernel<modulated>::ProcessVector(uint32_t taskIdx)
    {
        int16_t batchIdx = taskIdx / (featureMapSize_);
        int16_t hOutIdx = (taskIdx % (featureMapSize_)) / wOut_;
        int16_t wOutIdx = taskIdx % wOut_;
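A minimal Python sketch of the same decoding, assuming that featureMapSize_ equals $H_o W_o$ (the snippet above does not show how it is initialized). The name decode_task is hypothetical.

```python
def decode_task(task_idx, h_out, w_out):
    """Split a flat task index into (batch, output row, output column).

    Assumption: feature_map_size is h_out * w_out.
    """
    feature_map_size = h_out * w_out
    batch_idx = task_idx // feature_map_size
    h_out_idx = (task_idx % feature_map_size) // w_out
    w_out_idx = task_idx % w_out
    return batch_idx, h_out_idx, w_out_idx


decoded = decode_task(27, h_out=4, w_out=5)
```

The last line works without first reducing modulo the feature-map size because that size is a multiple of w_out.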
Loads the $\Delta p$ and $\Delta m$ that belong to the current taskIdx into local memory. Each copy moves only 18 or 9 elements, which is too few.

        // CopyIn Offset
        DataCopy(copyInOffsetLocal_, offsetGm_[taskIdx * OFFSET_SIZE], OFFSET_ALIGNED_SIZE);
        SetFlag<HardEvent::MTE2_V>(copyInOffsetEventID);
        if (modulated) {
            DataCopy(maskLocal_, maskGm_[taskIdx * X_OFFSET_SIZE], X_OFFSET_ALIGNED_SIZE);
            SetFlag<HardEvent::MTE2_V>(copyInMaskEventID);
        }
        WaitFlag<HardEvent::MTE2_V>(copyInOffsetEventID);
The interleaved $(\Delta y, \Delta x)$ pairs are separated into the independent buffers xOffsetLocal_ and yOffsetLocal_.
Adding the convolution-window coordinates gives $p_n + \Delta p_n$.

        GatherMask(xOffsetLocal_, copyInOffsetLocal_, 1, true, maskForGatherMask_, {1, 1, 8, 0}, cnt_);
        GatherMask(yOffsetLocal_, copyInOffsetLocal_, 2, true, maskForGatherMask_, {1, 1, 8, 0}, cnt_);
        Add(xOffsetLocal_, xOffsetLocal_, constKHIdxLocal_, X_OFFSET_ALIGNED_SIZE);
        Add(yOffsetLocal_, yOffsetLocal_, constKWIdxLocal_, X_OFFSET_ALIGNED_SIZE);
The floating-point coordinates $(i + \Delta h_i, j + \Delta w_i)$ are floored to obtain the coordinates in the four directions needed for bilinear interpolation.
The fractional offsets are computed next.

        Floor(topPosLocal_, xOffsetLocal_, X_OFFSET_ALIGNED_SIZE);
        Floor(leftPosLocal_, yOffsetLocal_, X_OFFSET_ALIGNED_SIZE);
        Adds(bottomPosLocal_, topPosLocal_, 1.0f, X_OFFSET_ALIGNED_SIZE);
        Adds(rightPosLocal_, leftPosLocal_, 1.0f, X_OFFSET_ALIGNED_SIZE);
fracH and fracW are the per-direction interpolation weights $y - y_1$ and $x - x_1$.

        Sub(fracHLocal_, xOffsetLocal_, topPosLocal_, X_OFFSET_ALIGNED_SIZE);
        Sub(fracWLocal_, yOffsetLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE);
The kernel radius is subtracted from the output point's coordinates to find the start of the corresponding input region. This assumes $s_h = 1, s_w = 1$, yet the operator entry point does not enforce that condition.
The top-left coordinate $(h_0, w_0)$ of the convolution window is given by:

$$\begin{aligned} h_0 &= h_o \cdot s_h - p_h \\ w_0 &= w_o \cdot s_w - p_w \end{aligned}$$

Adding the relative coordinates yields the coordinates of every point in the window, $p + p_n + \Delta p_n$.
topPosLocal_ and leftPosLocal_ are $(y_1, x_1)$.

        // global position
        Adds(topPosLocal_, topPosLocal_, hOutIdx - kH_ / 2 + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);
        Adds(leftPosLocal_, leftPosLocal_, wOutIdx - kW_ / 2 + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);
Computes the flattened memory offsets of the 4 groups of interpolation points:

$$\begin{aligned} \mathrm{offset}_1 &= (y_1 \cdot W_i + x_1) C_i \\ \mathrm{offset}_2 &= (y_1 \cdot W_i + x_2) C_i \\ \mathrm{offset}_3 &= (y_2 \cdot W_i + x_1) C_i \\ \mathrm{offset}_4 &= (y_2 \cdot W_i + x_2) C_i \end{aligned}$$

The four variables topLeftOffsetLocal_, topRightOffsetLocal_, bottomLeftOffsetLocal_, and bottomRightOffsetLocal_ are contiguous in memory, so they can be processed with a single instruction.

        // global Offset
        Muls(topPosLocal_, topPosLocal_, wOut_ + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);
        Add(topLeftOffsetLocal_, topPosLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE); // global (h * wOut + w)
        Add(topRightOffsetLocal_, topPosLocal_, rightPosLocal_, X_OFFSET_ALIGNED_SIZE);
        Add(bottomLeftOffsetLocal_, bottomPosLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE);
        Add(bottomRightOffsetLocal_, bottomPosLocal_, rightPosLocal_, X_OFFSET_ALIGNED_SIZE);
        Muls(topLeftOffsetLocal_, topLeftOffsetLocal_, cIn_ + 0.0f, 4 * X_OFFSET_ALIGNED_SIZE);
        Adds(topLeftOffsetLocal_, topLeftOffsetLocal_, batchIdx * featureMapElementsSize_ + 0.0f,
            4 * X_OFFSET_ALIGNED_SIZE); // global offset
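For a single sampling point, the four flattened offsets can be checked with a short Python sketch. corner_offsets is a hypothetical helper; the parameters batch_idx and feature_map_elems model the final Adds and are assumptions about what batchIdx * featureMapElementsSize_ represents.

```python
def corner_offsets(y1, x1, w_in, c_in, batch_idx=0, feature_map_elems=0):
    """Flattened NHWC offsets of the corners (y1,x1), (y1,x2), (y2,x1), (y2,x2).

    Each offset is (y * W_i + x) * C_i, plus an optional per-batch base.
    """
    y2, x2 = y1 + 1, x1 + 1
    base = batch_idx * feature_map_elems
    return tuple(base + (y * w_in + x) * c_in
                 for y, x in ((y1, x1), (y1, x2), (y2, x1), (y2, x2)))


offsets = corner_offsets(y1=1, x1=2, w_in=4, c_in=8)
```

Note that the kernel multiplies by wOut_ where the formula uses $W_i$, so the two agree only when the input and output widths are equal.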
CompareScalar compares each element of a tensor with a scalar and writes the result to the corresponding bit of the output.
The four variables topPosLocal_, bottomPosLocal_, leftPosLocal_, and rightPosLocal_ are contiguous in memory, each of size X_OFFSET_ALIGNED_SIZE. The code simply hard-codes 64 as the length.
Evidently, because of address alignment constraints, the 36 valid elements are padded to 64.
inGlobalLocal_ has size IN_GLOBAL_BUF_SIZE * sizeof(uint32_t) and records, for the 4 groups of points, whether they are within bounds in each of the two directions.
inGlobalLocal_ has type uint32_t; each CompareScalar processes 64 elements and stores the result in the first two elements of the corresponding segment of inGlobalLocal_.
The comparisons are $0 \le y_1,\enspace 0 \le y_2,\enspace 0 \le x_1,\enspace 0 \le x_2$ and $y_1 < H_i,\enspace y_2 < H_i,\enspace x_1 < W_i,\enspace x_2 < W_i$.

        // in global flag
        CompareScalar(inGlobalLocal_.ReinterpretCast<uint8_t>(), topPosLocal_, 0.0f, CMPMODE::GE, 64);
        CompareScalar(inGlobalLocal_[8].ReinterpretCast<uint8_t>(), bottomPosLocal_, 0.0f, CMPMODE::GE, 64);
        CompareScalar(inGlobalLocal_[16].ReinterpretCast<uint8_t>(), leftPosLocal_, 0.0f, CMPMODE::GE, 64);
        CompareScalar(inGlobalLocal_[24].ReinterpretCast<uint8_t>(), rightPosLocal_, 0.0f, CMPMODE::GE, 64);
        CompareScalar(inGlobalLocal_[32].ReinterpretCast<uint8_t>(), topPosLocal_, featureMapSize_ + 0.0f,
            CMPMODE::LT, 64);
        CompareScalar(inGlobalLocal_[40].ReinterpretCast<uint8_t>(), bottomPosLocal_, featureMapSize_ + 0.0f,
            CMPMODE::LT, 64);
        CompareScalar(inGlobalLocal_[48].ReinterpretCast<uint8_t>(), leftPosLocal_, wOut_ + 0.0f, CMPMODE::LT, 64);
        CompareScalar(inGlobalLocal_[56].ReinterpretCast<uint8_t>(), rightPosLocal_, wOut_ + 0.0f, CMPMODE::LT, 64);
The lower-bound and upper-bound results are merged, giving $0 \le y_1 < H_i,\enspace 0 \le y_2 < H_i,\enspace 0 \le x_1 < W_i,\enspace 0 \le x_2 < W_i$.

        And(inGlobalLocal_[32].ReinterpretCast<uint16_t>(), inGlobalLocal_.ReinterpretCast<uint16_t>(),
            inGlobalLocal_[32].ReinterpretCast<uint16_t>(), 64);

Computes which $(y_1, x_1)$ and $(y_2, x_2)$ are valid.

        And(inGlobalLocal_.ReinterpretCast<uint16_t>(), inGlobalLocal_[32].ReinterpretCast<uint16_t>(),
            inGlobalLocal_[48].ReinterpretCast<uint16_t>(), 32); // TopLeft, BottomRight

Computes which $(y_1, x_2)$ and $(y_2, x_1)$ are valid.

        And(inGlobalLocal_[16].ReinterpretCast<uint16_t>(), inGlobalLocal_[32].ReinterpretCast<uint16_t>(),
            inGlobalLocal_[56].ReinterpretCast<uint16_t>(), 16); // TopRight
        And(inGlobalLocal_[24].ReinterpretCast<uint16_t>(), inGlobalLocal_[40].ReinterpretCast<uint16_t>(),
            inGlobalLocal_[48].ReinterpretCast<uint16_t>(), 16); // BottomLeft
Select picks elements according to the bit values of selMask (the selection mask).
Out-of-bounds positions in the 4 groups of points are set to -1.0f, so that the later copy can discard them or treat them as 0.

        Select(topLeftOffsetLocal_, inGlobalLocal_.ReinterpretCast<uint16_t>(), topLeftOffsetLocal_, -1.0f,
            SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
        Select(bottomRightOffsetLocal_, inGlobalLocal_[8].ReinterpretCast<uint16_t>(), bottomRightOffsetLocal_, -1.0f,
            SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
        Select(topRightOffsetLocal_, inGlobalLocal_[16].ReinterpretCast<uint16_t>(), topRightOffsetLocal_, -1.0f,
            SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
        Select(bottomLeftOffsetLocal_, inGlobalLocal_[24].ReinterpretCast<uint16_t>(), bottomLeftOffsetLocal_, -1.0f,
            SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
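The compare, And, and Select steps together amount to an element-wise bounds check followed by a sentinel write. The following Python sketch shows the intended effect on plain lists; mask_offsets is a hypothetical name, and the bit-packing of the comparison results is not modeled.

```python
def mask_offsets(offsets, ys, xs, h_in, w_in):
    """Replace the offset of every out-of-bounds (y, x) with the sentinel -1.0."""
    masked = []
    for off, y, x in zip(offsets, ys, xs):
        in_bounds = (0 <= y < h_in) and (0 <= x < w_in)
        masked.append(float(off) if in_bounds else -1.0)
    return masked


# Three hypothetical sampling corners on a 4x4 input: the first is valid,
# the second has y < 0, the third has x >= W_i.
masked = mask_offsets([0, 8, 16], ys=[0, -1, 3], xs=[0, 2, 4], h_in=4, w_in=4)
```

Using -1.0 as the sentinel is safe because every valid flattened offset is non-negative.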
A synchronization must be inserted so that the scalar pipeline waits for the vector pipeline.
oneSubFracHLocal_ and oneSubFracWLocal_ are contiguous in memory.
Computes the 1-D interpolation weights $y_2 - y$ and $x_2 - x$.

        SetFlag<HardEvent::V_S>(V_SEventID);
        WaitFlag<HardEvent::V_S>(V_SEventID);
        Muls(oneSubFracHLocal_, fracHLocal_, -1.0f, 2 * X_OFFSET_ALIGNED_SIZE);
        Adds(oneSubFracHLocal_, oneSubFracHLocal_, 1.0f, 2 * X_OFFSET_ALIGNED_SIZE); // 1-fracH, 1-fracW
The modulation weight is folded into the interpolation weights. The code multiplies only the two H-direction weights, giving $\Delta m (y - y_1)$ and $\Delta m (y_2 - y)$, so each of the four corner products contains $\Delta m$ exactly once.

        if (modulated) {
            WaitFlag<HardEvent::MTE2_V>(copyInMaskEventID);
            Mul(fracHLocal_, fracHLocal_, maskLocal_, X_OFFSET_ALIGNED_SIZE);
            Mul(oneSubFracHLocal_, oneSubFracHLocal_, maskLocal_, X_OFFSET_ALIGNED_SIZE);
        }
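Brcb takes an input tensor and, on each iteration, takes 8 of its values and fills them into 8 datablocks (32 bytes each) of the result tensor, one value per datablock.
Multiplying the interpolation coefficients by the input requires broadcasting along the lowest dimension. In the computation below the two operands differ in length, so each coefficient is broadcast to $\frac{C_i}{8}$ elements.
fracHBroadcastLocal_ occupies $9 \times \frac{C_i}{8} \times \mathrm{block}$.
brcbParams_ sets the element stride to $\frac{C_i}{64}$ blocks and the iteration stride to $\frac{C_i}{8}$ blocks. In other words, $C_i$ is split into eight equal parts; only the datablock at the start of each part holds valid data, and the other positions do not.
The span covered is $16 \times \frac{C_i}{64} \times \mathrm{block} = 2C_i$.
Each element of fracHLocal_ fills one datablock of fracHBroadcastLocal_; adjacent elements are $\frac{C_i}{64}$ datablocks apart (8 datablocks for $C_i = 512$).

        // Broadcast
        Brcb(fracHBroadcastLocal_, fracHLocal_, 2, brcbParams_);
        Brcb(fracWBroadcastLocal_, fracWLocal_, 2, brcbParams_);
        Brcb(oneSubFracHBroadcastLocal_, oneSubFracHLocal_, 2, brcbParams_);
        Brcb(oneSubFracWBroadcastLocal_, oneSubFracWLocal_, 2, brcbParams_);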
DATA_BLOCK_SIZE is 8, FOUR_CORNERS is 4, and X_OFFSET_SIZE is 9.
maskForBroadcast_ equals dataBlockPerInputChannel_ - DATA_BLOCK_SIZE.
A single Copy instruction broadcasts the data of the first datablock to the other blocks within $C_i$, giving a shape of $4 \times 9 \times C_i$.
The number of blocks copied per iteration is:

$$\begin{aligned} N &= \left\lceil \frac{\mathrm{Mask}}{8} \right\rceil \\ &= \left\lceil \frac{\frac{C_i}{8} - 8}{8} \right\rceil \\ &= \left\lceil \frac{C_i}{64} \right\rceil - 1 \end{aligned}$$

The srcRepeatSize and dstRepeatSize parameters are set to $\frac{C_i}{64}$.
In the first broadcast step adjacent elements are spaced $\frac{C_i}{64}$ apart, so each group of interpolation weights has a valid length of $\frac{9C_i}{8}$.

        Copy(fracHBroadcastLocal_[DATA_BLOCK_SIZE], fracHBroadcastLocal_, maskForBroadcast_,
            FOUR_CORNERS * X_OFFSET_SIZE, copyParams_);
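The block-count formula can be verified numerically for the two channel counts that reach the v2 kernel. In this Python sketch, copy_blocks_per_repeat is a hypothetical helper, and it assumes dataBlockPerInputChannel_ equals $\frac{C_i}{8}$ as the formula implies.

```python
import math

FLOATS_PER_BLOCK = 8  # one 32-byte datablock holds 8 float32 values


def copy_blocks_per_repeat(c_in):
    """Blocks written by each Copy repeat: ceil((C_i / 8 - 8) / 8)."""
    mask = c_in // FLOATS_PER_BLOCK - FLOATS_PER_BLOCK  # elements per repeat
    return math.ceil(mask / FLOATS_PER_BLOCK)


blocks_256 = copy_blocks_per_repeat(256)
blocks_512 = copy_blocks_per_repeat(512)
```

Together with the one datablock that Brcb already filled, each coefficient then covers $\frac{C_i}{64}$ datablocks, i.e. $\frac{C_i}{8}$ floats.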
DeformableConv2dV2Kernel::CopyInFeature loads the input and interpolates, based on topLeftOffsetLocal_ and fracHBroadcastLocal_.
The result in outFeatureLocal_ is then copied to global memory.

        CopyInFeature();
        SetFlag<HardEvent::V_MTE3>(copyOutEventID);
        WaitFlag<HardEvent::V_MTE3>(copyOutEventID);
        DataCopyPad(img2colMatGm_[taskIdx * elementsCountPerTask_], outFeatureLocal_,
            {1, static_cast<uint32_t>(elementsCountPerTask_ * FP32_BYTE_SIZE), 0, 0, 0});
    }
DeformableConv2dV2Kernel::CopyInFeature
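The function takes no parameters, which hides the variables it depends on.
Values such as topLeft0 are int32_t and should be compared with an integer rather than -1.0f.
The code is fully unrolled; it seems it could be written as a for loop, as in v1.
After the channels of the 9 input points are loaded, they are multiplied by the weights.
topLeftWeightLocal_ is $\Delta m \cdot w_{11} = \Delta m (y_2 - y)(x_2 - x)$.
Only the first $\frac{9C_i}{8}$ elements of topLeftWeightLocal_ are valid.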
    template<bool modulated>
    __aicore__ inline void DeformableConv2dV2Kernel<modulated>::CopyInFeature()
    {
        int32_t topLeft0 = topLeftOffsetLocal_.GetValue(0);
        int32_t topLeft1 = topLeftOffsetLocal_.GetValue(1);
        int32_t topLeft2 = topLeftOffsetLocal_.GetValue(2);
        int32_t topLeft3 = topLeftOffsetLocal_.GetValue(3);
        int32_t topLeft4 = topLeftOffsetLocal_.GetValue(4);
        int32_t topLeft5 = topLeftOffsetLocal_.GetValue(5);
        int32_t topLeft6 = topLeftOffsetLocal_.GetValue(6);
        int32_t topLeft7 = topLeftOffsetLocal_.GetValue(7);
        int32_t topLeft8 = topLeftOffsetLocal_.GetValue(8);
        (topLeft0 == -1.0f) ? Duplicate(topLeftFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
                              DataCopy(topLeftFeatureLocal_[0 * cIn_], xGm_[topLeft0], cIn_);
        (topLeft1 == -1.0f) ? Duplicate(topLeftFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
                              DataCopy(topLeftFeatureLocal_[1 * cIn_], xGm_[topLeft1], cIn_);
        (topLeft2 == -1.0f) ? Duplicate(topLeftFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
                              DataCopy(topLeftFeatureLocal_[2 * cIn_], xGm_[topLeft2], cIn_);
        (topLeft3 == -1.0f) ? Duplicate(topLeftFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
                              DataCopy(topLeftFeatureLocal_[3 * cIn_], xGm_[topLeft3], cIn_);
        (topLeft4 == -1.0f) ? Duplicate(topLeftFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
                              DataCopy(topLeftFeatureLocal_[4 * cIn_], xGm_[topLeft4], cIn_);
        (topLeft5 == -1.0f) ? Duplicate(topLeftFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
                              DataCopy(topLeftFeatureLocal_[5 * cIn_], xGm_[topLeft5], cIn_);
        (topLeft6 == -1.0f) ? Duplicate(topLeftFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
                              DataCopy(topLeftFeatureLocal_[6 * cIn_], xGm_[topLeft6], cIn_);
        (topLeft7 == -1.0f) ? Duplicate(topLeftFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
                              DataCopy(topLeftFeatureLocal_[7 * cIn_], xGm_[topLeft7], cIn_);
        (topLeft8 == -1.0f) ? Duplicate(topLeftFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
                              DataCopy(topLeftFeatureLocal_[8 * cIn_], xGm_[topLeft8], cIn_);
        Mul(topLeftWeightLocal_, oneSubFracHBroadcastLocal_, oneSubFracWBroadcastLocal_,
            9 * dataBlockPerInputChannel_);
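`Mul` sets `src1BlkStride` to 0, which turns the multiplication into a broadcast along the low dimension: each datablock of `topLeftWeightLocal_` is multiplied with 8 consecutive datablocks of `topLeftFeatureLocal_`.
`src1RepStride` is 1, so the weight operand advances by one datablock per repeat.
`repeatTimes_` equals $\frac{9C_i}{8\times 8}$, so $9C_i$ elements are processed in total.
To cover all 9 sampling points, the weights must be laid out in segments of length `ci/DATA_SIZE_PER_REPEAT`.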
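The stride behaviour described above can be modelled in plain Python. This is only a sketch of the addressing, not the Ascend C API: the function `strided_mul` is hypothetical, the tuple is assumed to follow the order (dstBlkStride, src0BlkStride, src1BlkStride, dstRepStride, src0RepStride, src1RepStride), and a datablock is assumed to hold 8 float32 values with 8 datablocks per repeat.

```python
BLOCK = 8              # float32 elements per 32-byte datablock (assumed)
BLOCKS_PER_REPEAT = 8  # datablocks processed by one repeat (assumed)

def strided_mul(src0, src1, repeat_times, params):
    """Model dst = src0 * src1 where all strides are counted in datablocks."""
    dst_blk, src0_blk, src1_blk, dst_rep, src0_rep, src1_rep = params
    dst = [0.0] * len(src0)
    for r in range(repeat_times):
        for b in range(BLOCKS_PER_REPEAT):
            d = (r * dst_rep + b * dst_blk) * BLOCK
            s0 = (r * src0_rep + b * src0_blk) * BLOCK
            # With src1_blk == 0 every block in a repeat reads the same weight block.
            s1 = (r * src1_rep + b * src1_blk) * BLOCK
            for e in range(BLOCK):
                dst[d + e] = src0[s0 + e] * src1[s1 + e]
    return dst

c_in = 128
points = 9
feature = [1.0] * (points * c_in)

# Each point owns c_in / 64 consecutive weight datablocks, all filled with its weight.
blocks_per_point = c_in // (BLOCK * BLOCKS_PER_REPEAT)
weight = []
for k in range(points):
    weight += [float(k + 1)] * (BLOCK * blocks_per_point)

repeat_times = points * c_in // (BLOCK * BLOCKS_PER_REPEAT)
out = strided_mul(feature, weight, repeat_times, (1, 1, 0, 8, 8, 1))
```

With $C_i = 128$ the weight buffer holds $\frac{9C_i}{8} = 144$ valid elements and 18 repeats cover all $9C_i$ feature values, so every channel of point `k` ends up scaled by that point's weight.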
```cpp
    SetFlag<HardEvent::MTE3_V>(MTE3_VEventID);
    WaitFlag<HardEvent::MTE3_V>(MTE3_VEventID);
    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    Mul(outFeatureLocal_, topLeftFeatureLocal_, topLeftWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});

    int32_t topRight0 = topRightOffsetLocal_.GetValue(0);
    int32_t topRight1 = topRightOffsetLocal_.GetValue(1);
    int32_t topRight2 = topRightOffsetLocal_.GetValue(2);
    int32_t topRight3 = topRightOffsetLocal_.GetValue(3);
    int32_t topRight4 = topRightOffsetLocal_.GetValue(4);
    int32_t topRight5 = topRightOffsetLocal_.GetValue(5);
    int32_t topRight6 = topRightOffsetLocal_.GetValue(6);
    int32_t topRight7 = topRightOffsetLocal_.GetValue(7);
    int32_t topRight8 = topRightOffsetLocal_.GetValue(8);

    (topRight0 == -1.0f) ? Duplicate(topRightFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[0 * cIn_], xGm_[topRight0], cIn_);
    (topRight1 == -1.0f) ? Duplicate(topRightFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[1 * cIn_], xGm_[topRight1], cIn_);
    (topRight2 == -1.0f) ? Duplicate(topRightFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[2 * cIn_], xGm_[topRight2], cIn_);
    (topRight3 == -1.0f) ? Duplicate(topRightFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[3 * cIn_], xGm_[topRight3], cIn_);
    (topRight4 == -1.0f) ? Duplicate(topRightFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[4 * cIn_], xGm_[topRight4], cIn_);
    (topRight5 == -1.0f) ? Duplicate(topRightFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[5 * cIn_], xGm_[topRight5], cIn_);
    (topRight6 == -1.0f) ? Duplicate(topRightFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[6 * cIn_], xGm_[topRight6], cIn_);
    (topRight7 == -1.0f) ? Duplicate(topRightFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[7 * cIn_], xGm_[topRight7], cIn_);
    (topRight8 == -1.0f) ? Duplicate(topRightFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
                           DataCopy(topRightFeatureLocal_[8 * cIn_], xGm_[topRight8], cIn_);

    Mul(topRightWeightLocal_, oneSubFracHBroadcastLocal_, fracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);
    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    MulAddDst(outFeatureLocal_, topRightFeatureLocal_, topRightWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});

    int32_t bottomLeft0 = bottomLeftOffsetLocal_.GetValue(0);
    int32_t bottomLeft1 = bottomLeftOffsetLocal_.GetValue(1);
    int32_t bottomLeft2 = bottomLeftOffsetLocal_.GetValue(2);
    int32_t bottomLeft3 = bottomLeftOffsetLocal_.GetValue(3);
    int32_t bottomLeft4 = bottomLeftOffsetLocal_.GetValue(4);
    int32_t bottomLeft5 = bottomLeftOffsetLocal_.GetValue(5);
    int32_t bottomLeft6 = bottomLeftOffsetLocal_.GetValue(6);
    int32_t bottomLeft7 = bottomLeftOffsetLocal_.GetValue(7);
    int32_t bottomLeft8 = bottomLeftOffsetLocal_.GetValue(8);

    (bottomLeft0 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[0 * cIn_], xGm_[bottomLeft0], cIn_);
    (bottomLeft1 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[1 * cIn_], xGm_[bottomLeft1], cIn_);
    (bottomLeft2 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[2 * cIn_], xGm_[bottomLeft2], cIn_);
    (bottomLeft3 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[3 * cIn_], xGm_[bottomLeft3], cIn_);
    (bottomLeft4 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[4 * cIn_], xGm_[bottomLeft4], cIn_);
    (bottomLeft5 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[5 * cIn_], xGm_[bottomLeft5], cIn_);
    (bottomLeft6 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[6 * cIn_], xGm_[bottomLeft6], cIn_);
    (bottomLeft7 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[7 * cIn_], xGm_[bottomLeft7], cIn_);
    (bottomLeft8 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
                             DataCopy(bottomLeftFeatureLocal_[8 * cIn_], xGm_[bottomLeft8], cIn_);

    Mul(bottomLeftWeightLocal_, oneSubFracWBroadcastLocal_, fracHBroadcastLocal_, 9 * dataBlockPerInputChannel_);
    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    MulAddDst(outFeatureLocal_, bottomLeftFeatureLocal_, bottomLeftWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});

    int32_t bottomRight0 = bottomRightOffsetLocal_.GetValue(0);
    int32_t bottomRight1 = bottomRightOffsetLocal_.GetValue(1);
    int32_t bottomRight2 = bottomRightOffsetLocal_.GetValue(2);
    int32_t bottomRight3 = bottomRightOffsetLocal_.GetValue(3);
    int32_t bottomRight4 = bottomRightOffsetLocal_.GetValue(4);
    int32_t bottomRight5 = bottomRightOffsetLocal_.GetValue(5);
    int32_t bottomRight6 = bottomRightOffsetLocal_.GetValue(6);
    int32_t bottomRight7 = bottomRightOffsetLocal_.GetValue(7);
    int32_t bottomRight8 = bottomRightOffsetLocal_.GetValue(8);

    (bottomRight0 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[0 * cIn_], xGm_[bottomRight0], cIn_);
    (bottomRight1 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[1 * cIn_], xGm_[bottomRight1], cIn_);
    (bottomRight2 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[2 * cIn_], xGm_[bottomRight2], cIn_);
    (bottomRight3 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[3 * cIn_], xGm_[bottomRight3], cIn_);
    (bottomRight4 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[4 * cIn_], xGm_[bottomRight4], cIn_);
    (bottomRight5 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[5 * cIn_], xGm_[bottomRight5], cIn_);
    (bottomRight6 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[6 * cIn_], xGm_[bottomRight6], cIn_);
    (bottomRight7 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[7 * cIn_], xGm_[bottomRight7], cIn_);
    (bottomRight8 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
                              DataCopy(bottomRightFeatureLocal_[8 * cIn_], xGm_[bottomRight8], cIn_);

    Mul(bottomRightWeightLocal_, fracHBroadcastLocal_, fracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);
    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
    MulAddDst(outFeatureLocal_, bottomRightFeatureLocal_, bottomRightWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});
}
```
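Stripped of synchronization and strides, `CopyInFeature` gathers $C_i$ values per corner and accumulates the four weighted corners. The sketch below is a minimal plain-Python model of that data flow, assuming an NHWC input flattened into a list and one scalar weight per sampling point; the helper names `gather` and `interpolate` are hypothetical and do not exist in the SDK.

```python
def gather(x_flat, offsets, c_in):
    """Copy c_in values per sampling point; an offset of -1 yields zeros."""
    out = []
    for off in offsets:
        if off == -1:
            out += [0.0] * c_in               # mirrors the Duplicate(..., 0.0f, cIn_) branch
        else:
            out += x_flat[off:off + c_in]     # mirrors the DataCopy from xGm_
    return out

def interpolate(x_flat, corner_offsets, corner_weights, c_in):
    """Accumulate feature * weight over the four corners (Mul, then 3x MulAddDst)."""
    n = len(corner_offsets[0]) * c_in
    acc = [0.0] * n
    for offsets, weights in zip(corner_offsets, corner_weights):
        feat = gather(x_flat, offsets, c_in)
        for i in range(n):
            acc[i] += feat[i] * weights[i // c_in]
    return acc

# 2x2 image with 2 channels in NHWC order: pixels (0,0), (0,1), (1,0), (1,1).
x_flat = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
c_in = 2
# Two sampling points; the second has its top row outside the image (-1).
corner_offsets = [[0, -1], [2, -1], [4, 0], [6, 2]]   # TL, TR, BL, BR
corner_weights = [[0.25, 0.25]] * 4
result = interpolate(x_flat, corner_offsets, corner_weights, c_in)
```

The first point averages all four pixels, while the second only receives contributions from its two valid corners because the `-1` offsets are replaced by zeros.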
DeformableConv2dV2Kernel::ProcessCube
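`innerCubeTaskIdx` is the index of the last element in the current batch. The starting index is assumed to be 0, so the number of im2col rows is `cubeTaskCount = innerCubeTaskIdx + 1`.
`elementsCountPerTask_` is $k_h k_w C_i$.
`aOffset` and `cOffset` are the starting offsets of the current core in the A and C matrices respectively.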
```cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dV2Kernel<modulated>::ProcessCube(uint32_t taskIdx, const int32_t& innerCubeTaskIdx)
{
    int32_t cubeTaskCount = innerCubeTaskIdx + 1;
    uint64_t aOffset = (taskIdx - innerCubeTaskIdx) * elementsCountPerTask_;
    uint64_t cOffset = (taskIdx - innerCubeTaskIdx) * cOut_;
```
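`SetTensorA` sets the left matrix A of the matrix multiplication.
`SetTensorB` sets the right matrix B.
`SetSingleShape` sets the single-core Matmul shape singleMIn, singleNIn, singleKIn, measured in elements.
`IterateAll` computes a C matrix of size singleCoreM * singleCoreN. The iteration order can be adjusted through the tiling parameter iterateOrder.
The im2col matrix (`img2colMatGm_`) has shape $128 \times k_h k_w C_i$, the weight has shape $C_o \times k_h k_w C_i$, and the output has shape $128 \times C_o$. The weight therefore enters the product transposed, which matches the `true` passed to `SetTensorB`.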
```cpp
    mm_.SetTensorA(img2colMatGm_[aOffset]);
    mm_.SetTensorB(weightGm_, true);
    mm_.SetSingleShape(cubeTaskCount, cOut_, elementsCountPerTask_);
    mm_.template IterateAll<false>(yGm_[cOffset]);
}
```
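The offset arithmetic and the transposed product in `ProcessCube` can be reproduced in a few lines of plain Python. This is a sketch under assumptions, not the Matmul API: `process_cube` is a hypothetical helper, both matrices are flat row-major lists, and `k` stands for `elementsCountPerTask_`.

```python
def process_cube(im2col, weight, task_idx, inner_idx, k, c_out):
    """Return (c_offset, rows) where rows = A[first:first+count] @ weight^T."""
    count = inner_idx + 1               # cubeTaskCount
    first = task_idx - inner_idx        # first task of this batch
    a_off = first * k                   # aOffset into the flat im2col buffer
    c_off = first * c_out               # cOffset into the flat output buffer
    rows = []
    for m in range(count):
        a_row = im2col[a_off + m * k: a_off + (m + 1) * k]
        # weight is stored as (c_out, k), so row n of weight is column n of B.
        rows.append([
            sum(a * b for a, b in zip(a_row, weight[n * k:(n + 1) * k]))
            for n in range(c_out)
        ])
    return c_off, rows

k, c_out = 2, 2
im2col = [1, 0,  0, 1,  1, 1,  2, 3]   # 4 tasks, k values each
weight = [1, 2,  3, 4]                 # shape (c_out, k)
c_off, rows = process_cube(im2col, weight, task_idx=3, inner_idx=1, k=k, c_out=c_out)
```

Here the batch ends at task 3 and contains 2 rows, so it starts at task 2: the A offset is 4 elements and the results are written at output offset 4.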