
VTA Study Notes

news 2025/9/29 4:29:20

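VTA Architecture Description

VTA Overview

VTA (Versatile Tensor Accelerator) is a generic deep learning accelerator built for fast and efficient dense linear algebra.

It is an open-source project; more detailed material is available at https://tvm.hyper.ai/docs/topic/vta.

These notes combine the VTA hardware guide with the relevant source code under 3rdparty/vta-hw/hardware/xilinx/ to fill in some details the guide leaves out, and use a convolution test to deepen the understanding of the VTA hardware architecture.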

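VTA Hardware Organization

[Figure: VTA hardware organization]

The VTA hardware is organized into three main modules: LOAD, COMPUTE and STORE.
The INSTRUCTION FETCH MODULE partially decodes every instruction read from DRAM and dispatches it to the matching command queue according to the rules below.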

// Push to appropriate instruction queue
if (opcode == VTA_OPCODE_STORE) {
  store_queue.write(raw_insn);
} else if (opcode == VTA_OPCODE_LOAD) {
  if (memory_type == VTA_MEM_ID_INP || memory_type == VTA_MEM_ID_WGT) {
    load_queue.write(raw_insn);
  } else {
    gemm_queue.write(raw_insn);
  }
} else {
  gemm_queue.write(raw_insn);
}

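All instructions with opcode=STORE are pushed to the STORE CMD Q.
opcode=STORE is always paired with memory_type=OUT: the STORE module writes the contents of the OUTPUT BUFFER back to DRAM.

All instructions with opcode=LOAD && memory_type=INP/WGT are pushed to the LOAD CMD Q.
The LOAD module copies data from DRAM into the INPUT BUFFER and the WEIGHT BUFFER.

All instructions with opcode=ALU/GEMM/FINISH, or with opcode=LOAD && memory_type=UOP/ACC, are pushed to the COMPUTE CMD Q.
memory_type=UOP corresponds to the MICRO-OP CACHE, which stores the address information of the operands used by ALU/GEMM operations.
memory_type=ACC corresponds to the REGISTER FILE, which (1) receives the bias read from DRAM and (2) acts as the accumulator that holds partial sums.
ALU and GEMM execution is controlled mainly by the MICRO-OP CACHE and by the loop fields of the instruction.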
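
The routing rules above can be condensed into a small stand-alone function. This is only a sketch: the enum names Opcode, MemType and Queue and the function route are hypothetical stand-ins for the VTA_OPCODE_* and VTA_MEM_ID_* constants, not the identifiers used in vta.cc.

```cpp
#include <cassert>

// Hypothetical stand-ins for the VTA opcode and memory-type constants.
enum class Opcode { Load, Store, Gemm, Finish, Alu };
enum class MemType { Uop, Wgt, Inp, Acc, Out };
enum class Queue { Load, Compute, Store };

// Mirrors the fetch-stage rule described above:
//   STORE                      -> STORE CMD Q
//   LOAD of INP or WGT         -> LOAD CMD Q
//   everything else            -> COMPUTE CMD Q
//   (GEMM, ALU, FINISH, and LOAD of UOP or ACC)
Queue route(Opcode opcode, MemType memory_type) {
  if (opcode == Opcode::Store) {
    return Queue::Store;
  }
  if (opcode == Opcode::Load &&
      (memory_type == MemType::Inp || memory_type == MemType::Wgt)) {
    return Queue::Load;
  }
  return Queue::Compute;
}
```

For example, route(Opcode::Load, MemType::Acc) yields Queue::Compute, which matches the fact that the bias load is handled by the COMPUTE module.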

Dataflow Execution

[Figure: dataflow execution]

/*! \brief Pop dependence token from load stage */
uint64_t pop_prev_dep   : 1;
/*! \brief Pop dependence token from store stage */
uint64_t pop_next_dep   : 1;
/*! \brief Push dependence token to load stage */
uint64_t push_prev_dep  : 1;
/*! \brief Push dependence token to store stage */
uint64_t push_next_dep  : 1;

VTA relies on four dependence queues (dep_queue) and on the four pop/push_prev/next bits carried by every instruction to coordinate the three modules.

/*
pop:  consume a token
push: produce a token
prev: the module before the current one, seen from the current module
next: the module after the current one, seen from the current module

load <---> compute <---> store
     <--- prev   next --->

1. pop_prev=0 pop_next=0: no token is required
2. pop_prev=0 pop_next=1: STORE must already have pushed a token to GEMM
3. pop_prev=1 pop_next=0: LOAD must already have pushed a token to GEMM
4. pop_prev=1 pop_next=1: both LOAD and STORE must already have pushed a token to GEMM
*/
if ((insn.pop_prev_dep && !l2g_dep_queue.empty() &&
     insn.pop_next_dep && !s2g_dep_queue.empty()) ||
    (!insn.pop_prev_dep && insn.pop_next_dep &&
     !s2g_dep_queue.empty()) ||
    (insn.pop_prev_dep && !l2g_dep_queue.empty() &&
     !insn.pop_next_dep) ||
    (!insn.pop_prev_dep && !insn.pop_next_dep))

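When the COMPUTE module runs, it first checks the instruction fields to see whether it has to pop tokens from the previous and the next module.
If a pop bit is 1 but the corresponding dependence queue holds no token, COMPUTE does not execute; it waits until LOAD or STORE finishes and pushes a token into that queue.
After COMPUTE has executed the last ALU or GEMM, it pushes tokens to STORE and LOAD, which triggers the STORE of the output data and the LOAD of the next input data.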
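
The token handshake can be modelled in software with ordinary queues. The sketch below is an illustration under assumptions, not the HLS implementation: DepFlags, DepQueues, can_issue and step_compute are hypothetical names, and std::queue<bool> stands in for hls::stream<bool>. The can_issue predicate is a compact form of the four-case condition shown above.

```cpp
#include <cassert>
#include <queue>

// Hypothetical subset of an instruction: only the four dependence bits.
struct DepFlags {
  bool pop_prev_dep;
  bool pop_next_dep;
  bool push_prev_dep;
  bool push_next_dep;
};

// The four dependence queues as seen from COMPUTE.
struct DepQueues {
  std::queue<bool> l2g;  // LOAD  -> COMPUTE
  std::queue<bool> s2g;  // STORE -> COMPUTE
  std::queue<bool> g2l;  // COMPUTE -> LOAD
  std::queue<bool> g2s;  // COMPUTE -> STORE
};

// True when every token the instruction wants to pop is already available.
bool can_issue(const DepFlags& f, const DepQueues& q) {
  if (f.pop_prev_dep && q.l2g.empty()) return false;
  if (f.pop_next_dep && q.s2g.empty()) return false;
  return true;
}

// Runs one COMPUTE instruction; returns false if it has to stall.
bool step_compute(const DepFlags& f, DepQueues& q) {
  if (!can_issue(f, q)) return false;
  if (f.pop_prev_dep) q.l2g.pop();
  if (f.pop_next_dep) q.s2g.pop();
  // ... the GEMM or ALU work would happen here ...
  if (f.push_prev_dep) q.g2l.push(true);
  if (f.push_next_dep) q.g2s.push(true);
  return true;
}
```

With pop_prev_dep=1 and an empty l2g queue, step_compute returns false; once a token is pushed into l2g the same call succeeds and, if push_next_dep=1, leaves one token in g2s for the STORE side.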

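Memory Allocation

VTA follows a parameterized design: the concrete parameter values used at compile time differ from one target device to another. The values listed below are those of the PYNQ-Z2 target.

1. Four dependence queues (dep_queue)
Each dependence queue has a depth of 256, and each token in a queue occupies 1 bit.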

// Dependence queues
hls::stream<bool> l2g_dep_queue;
PRAGMA_HLS(HLS stream depth=STREAM_IN_DEPTH variable=l2g_dep_queue)
hls::stream<bool> s2g_dep_queue;
PRAGMA_HLS(HLS stream depth=STREAM_IN_DEPTH variable=s2g_dep_queue)
hls::stream<bool> g2l_dep_queue;
PRAGMA_HLS(HLS stream depth=STREAM_IN_DEPTH variable=g2l_dep_queue)
hls::stream<bool> g2s_dep_queue;
PRAGMA_HLS(HLS stream depth=STREAM_IN_DEPTH variable=g2s_dep_queue)

2. Command queues of the three modules
Each command queue has a depth of 256, and each instruction occupies 128 bits.

// Instantiate physical instruction queues
hls::stream<insn_T> load_queue;
PRAGMA_HLS(HLS stream depth=STREAM_IN_DEPTH variable=load_queue)
hls::stream<insn_T> gemm_queue;
PRAGMA_HLS(HLS stream depth=STREAM_IN_DEPTH variable=gemm_queue)
hls::stream<insn_T> store_queue;
PRAGMA_HLS(HLS stream depth=STREAM_IN_DEPTH variable=store_queue)

3. Three SRAM buffers
INPUT BUFFER: width × rows × columns = 64 × 2048 × 2 bits
WEIGHT BUFFER: width × rows × columns = 64 × 1024 × 32 bits
OUTPUT BUFFER: width × rows × columns = 64 × 2048 × 2 bits

// Instantiate memories
bus_T inp_mem[VTA_INP_BUFF_DEPTH][INP_MAT_AXI_RATIO];
bus_T wgt_mem[VTA_WGT_BUFF_DEPTH][WGT_MAT_AXI_RATIO];
bus_T out_mem[VTA_ACC_BUFF_DEPTH][OUT_MAT_AXI_RATIO];

4. One MICRO-OP CACHE and one REGISTER FILE
MICRO-OP CACHE: width × columns = 32 × 8192 bits
REGISTER FILE: width × rows × columns = 64 × 2048 × 8 bits

// Micro-op storage
static uop_T uop_mem[VTA_UOP_BUFF_DEPTH];
// Accumulator storage
static bus_T acc_mem[VTA_ACC_BUFF_DEPTH][ACC_MAT_AXI_RATIO];
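
The buffer shapes listed above can be derived from a handful of log2 parameters. The sketch below is a worked calculation under assumptions: the constant values are assumed (they are chosen so that the results reproduce the figures quoted in this section), and the names BufferShape and shape are hypothetical helpers, not part of the VTA headers.

```cpp
#include <cassert>

// Assumed log2 parameters; chosen to reproduce the numbers quoted above.
constexpr int kLogInpWidth = 3;       // 8-bit inputs
constexpr int kLogWgtWidth = 3;       // 8-bit weights
constexpr int kLogAccWidth = 5;       // 32-bit accumulators
constexpr int kLogBatch = 0;          // BATCH = 1
constexpr int kLogBlock = 4;          // BLOCK_IN = BLOCK_OUT = 16
constexpr int kLogUopBuffBytes = 15;  // 32 KiB micro-op cache
constexpr int kLogInpBuffBytes = 15;  // 32 KiB input buffer
constexpr int kLogWgtBuffBytes = 18;  // 256 KiB weight buffer
constexpr int kLogAccBuffBytes = 17;  // 128 KiB register file
constexpr int kBusWidthBits = 64;     // width of one bus word
constexpr int kUopWidthBits = 32;     // width of one micro-op

struct BufferShape {
  int depth;       // number of memory elements (rows)
  int axi_ratio;   // bus words per memory element (columns)
  int elem_bytes;  // bytes per memory element
};

// log_elem_bits is log2 of the size of one memory element in bits.
constexpr BufferShape shape(int log_buff_bytes, int log_elem_bits) {
  return BufferShape{1 << (log_buff_bytes + 3 - log_elem_bits),
                     (1 << log_elem_bits) / kBusWidthBits,
                     (1 << log_elem_bits) / 8};
}

// One input element is BATCH x BLOCK_IN values, one weight element is
// BLOCK_OUT x BLOCK_IN values, one accumulator element is BATCH x BLOCK_OUT.
constexpr BufferShape kInp =
    shape(kLogInpBuffBytes, kLogInpWidth + kLogBatch + kLogBlock);
constexpr BufferShape kWgt =
    shape(kLogWgtBuffBytes, kLogWgtWidth + 2 * kLogBlock);
constexpr BufferShape kAcc =
    shape(kLogAccBuffBytes, kLogAccWidth + kLogBatch + kLogBlock);
constexpr int kUopDepth = (1 << (kLogUopBuffBytes + 3)) / kUopWidthBits;
```

Under these assumptions kInp comes out as 2048 rows of 2 bus words, kWgt as 1024 rows of 32 bus words, kAcc as 2048 rows of 8 bus words, and kUopDepth as 8192, in line with the sizes given above.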

Instruction Set Architecture (ISA)

[Figure: VTA instruction set architecture]

LOAD/STORE Instruction Breakdown

typedef struct {
  /*! \brief The instruction opcode */
  uint64_t opcode         : VTA_OPCODE_BIT_WIDTH;
  /*! \brief dep_flags */
  ……
  /*! \brief Source/destination SRAM for store/load instruction */
  uint64_t memory_type    : VTA_MEMOP_ID_BIT_WIDTH;
  /*! \brief SRAM base address (pointer to memory elem type) */
  uint64_t sram_base      : VTA_MEMOP_SRAM_ADDR_BIT_WIDTH;
  /*! \brief DRAM base address (pointer to memory elem type) */
  uint64_t dram_base      : VTA_MEMOP_DRAM_ADDR_BIT_WIDTH;
  /*! \brief 2D access pattern: y-size */
  uint64_t y_size         : VTA_MEMOP_SIZE_BIT_WIDTH;
  /*! \brief 2D access pattern: x-size (in terms of memory elements) */
  uint64_t x_size         : VTA_MEMOP_SIZE_BIT_WIDTH;
  /*! \brief 2D access pattern: x-stride (in terms of memory elements) */
  uint64_t x_stride       : VTA_MEMOP_STRIDE_BIT_WIDTH;
  /*! \brief 2D access pattern: start padding along y dimension */
  uint64_t y_pad_0        : VTA_MEMOP_PAD_BIT_WIDTH;
  /*! \brief 2D access pattern: end padding along y dimension */
  uint64_t y_pad_1        : VTA_MEMOP_PAD_BIT_WIDTH;
  /*! \brief 2D access pattern: start padding along x dimension */
  uint64_t x_pad_0        : VTA_MEMOP_PAD_BIT_WIDTH;
  /*! \brief 2D access pattern: end padding along x dimension */
  uint64_t x_pad_1        : VTA_MEMOP_PAD_BIT_WIDTH;
} VTAMemInsn;

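opcode = LOAD/STORE
memory_type = UOP/WGT/INP/ACC/OUT
y_size: number of block rows handled by one instruction
x_size: number of block columns handled by one instruction
x_stride: distance in DRAM between the starts of two consecutive rows, counted in blocks (memory elements)
y_pad_0/y_pad_1/x_pad_0/x_pad_1: padding at the start/end of the y and x dimensions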

GEMM Instruction Breakdown

typedef struct {
  /*! \brief The instruction opcode */
  uint64_t opcode         : VTA_OPCODE_BIT_WIDTH;
  /*! \brief dep_flags */
  ……
  /*! \brief Reset register */
  uint64_t reset_reg      : 1;
  /*! \brief Micro-op begin address */
  uint64_t uop_bgn        : VTA_LOG_UOP_BUFF_DEPTH;
  /*! \brief Micro-op end address */
  uint64_t uop_end        : VTA_LOG_UOP_BUFF_DEPTH + 1;
  /*! \brief Iterations in the outer uop execution loop */
  uint64_t iter_out       : VTA_LOOP_ITER_WIDTH;
  /*! \brief Iterations in the inner uop execution loop */
  uint64_t iter_in        : VTA_LOOP_ITER_WIDTH;
  /*! \brief Outer loop accumulator memory index factor */
  uint64_t dst_factor_out : VTA_LOG_ACC_BUFF_DEPTH;
  /*! \brief Inner loop accumulator memory index factor */
  uint64_t dst_factor_in  : VTA_LOG_ACC_BUFF_DEPTH;
  /*! \brief Outer loop input memory index factor */
  uint64_t src_factor_out : VTA_LOG_INP_BUFF_DEPTH;
  /*! \brief Inner loop input memory index factor */
  uint64_t src_factor_in  : VTA_LOG_INP_BUFF_DEPTH;
  /*! \brief Outer loop weight memory index factor */
  uint64_t wgt_factor_out : VTA_LOG_WGT_BUFF_DEPTH;
  /*! \brief Inner loop weight memory index factor */
  uint64_t wgt_factor_in  : VTA_LOG_WGT_BUFF_DEPTH;
} VTAGemInsn;

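opcode = GEMM/FINISH
reset_reg: whether the accumulator is cleared to zero on write-back instead of receiving the GEMM result
uop_bgn/end: range of micro-ops to iterate over
iter_out/in: iteration counts of the outer/inner loop
dst_factor_out/in: address step of the biases/outputs in the outer/inner loop
src_factor_out/in: address step of the inputs in the outer/inner loop
wgt_factor_out/in: address step of the weights in the outer/inner loop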

// Loop offset
acc_idx_T dst_offset_out = 0;
inp_idx_T src_offset_out = 0;
wgt_idx_T wgt_offset_out = 0;
// Outer Loop
EXE_OUT_LOOP: for (int it_out = 0; it_out < insn.iter_out; it_out++) {
  acc_idx_T dst_offset_in = dst_offset_out;
  inp_idx_T src_offset_in = src_offset_out;
  wgt_idx_T wgt_offset_in = wgt_offset_out;
  // Inner Loop
  EXE_IN_LOOP: for (int it_in = 0; it_in < insn.iter_in; it_in++) {
    // Iterate over micro op
    READ_GEMM_UOP: for (int upc = insn.uop_bgn; upc < insn.uop_end; upc++) {
      ……
    }
    // Update offsets
    dst_offset_in += insn.dst_factor_in;
    src_offset_in += insn.src_factor_in;
    wgt_offset_in += insn.wgt_factor_in;
  }
  // Update offsets
  dst_offset_out += insn.dst_factor_out;
  src_offset_out += insn.src_factor_out;
  wgt_offset_out += insn.wgt_factor_out;
}
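
To see which SRAM indices this loop nest touches, it can be unrolled in plain C++. The following is a minimal sketch, not the HLS code: GemmLoop, Uop, Access and expand are hypothetical names, the micro-op is reduced to three plain integers, and each access index is formed by adding the inner offset to the micro-op field.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified view of the loop fields of a GEMM instruction.
struct GemmLoop {
  int iter_out, iter_in;
  int dst_factor_out, dst_factor_in;
  int src_factor_out, src_factor_in;
  int wgt_factor_out, wgt_factor_in;
};

// Hypothetical micro-op: base indices into accumulator, input and weight memory.
struct Uop {
  int dst, src, wgt;
};

// One (dst_idx, src_idx, wgt_idx) triple per executed micro-op.
struct Access {
  int dst_idx, src_idx, wgt_idx;
};

std::vector<Access> expand(const GemmLoop& insn, const std::vector<Uop>& uops) {
  std::vector<Access> accesses;
  int dst_out = 0, src_out = 0, wgt_out = 0;
  for (int it_out = 0; it_out < insn.iter_out; ++it_out) {
    // The inner offsets restart from the current outer offsets.
    int dst_in = dst_out, src_in = src_out, wgt_in = wgt_out;
    for (int it_in = 0; it_in < insn.iter_in; ++it_in) {
      for (const Uop& u : uops) {
        accesses.push_back({u.dst + dst_in, u.src + src_in, u.wgt + wgt_in});
      }
      dst_in += insn.dst_factor_in;
      src_in += insn.src_factor_in;
      wgt_in += insn.wgt_factor_in;
    }
    dst_out += insn.dst_factor_out;
    src_out += insn.src_factor_out;
    wgt_out += insn.wgt_factor_out;
  }
  return accesses;
}
```

With a single micro-op {0, 0, 0}, iter_out=2, iter_in=2, dst factors (4, 1), src factors (4, 0) and wgt factors (0, 1), the function returns the four triples (0,0,0), (1,0,1), (4,4,0) and (5,4,1).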

ALU Instruction Breakdown

typedef struct {
  /*! \brief The instruction opcode */
  uint64_t opcode         : VTA_OPCODE_BIT_WIDTH;
  /*! \brief dep_flags */
  ……
  /*! \brief Reset register */
  uint64_t reset_reg      : 1;
  /*! \brief Micro-op begin address */
  uint64_t uop_bgn        : VTA_LOG_UOP_BUFF_DEPTH;
  /*! \brief Micro-op end address */
  uint64_t uop_end        : VTA_LOG_UOP_BUFF_DEPTH + 1;
  /*! \brief Iterations in the outer uop execution loop */
  uint64_t iter_out       : VTA_LOOP_ITER_WIDTH;
  /*! \brief Iterations in the inner uop execution loop */
  uint64_t iter_in        : VTA_LOOP_ITER_WIDTH;
  /*! \brief Outer loop accumulator memory destination index factor */
  uint64_t dst_factor_out : VTA_LOG_ACC_BUFF_DEPTH;
  /*! \brief Inner loop accumulator memory destination index factor */
  uint64_t dst_factor_in  : VTA_LOG_ACC_BUFF_DEPTH;
  /*! \brief Outer loop accumulator memory source index factor */
  uint64_t src_factor_out : VTA_LOG_ACC_BUFF_DEPTH;
  /*! \brief Inner loop accumulator memory source index factor */
  uint64_t src_factor_in  : VTA_LOG_ACC_BUFF_DEPTH;
  /*! \brief ALU opcode */
  uint64_t alu_opcode     : VTA_ALU_OPCODE_BIT_WIDTH;
  /*! \brief Use immediate is true */
  uint64_t use_imm        : 1;
  /*! \brief Immediate value: allow negative value */
  int64_t imm            : VTA_ALUOP_IMM_BIT_WIDTH;
} VTAAluInsn;

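opcode = ALU
uop_bgn/end: range of micro-ops to iterate over
iter_out/in: iteration counts of the outer/inner loop
dst_factor_out/in: address step of dst in the outer/inner loop
src_factor_out/in: address step of src in the outer/inner loop
alu_opcode = MIN/MAX/ADD/SHR
use_imm/imm: whether an immediate is used, and its value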

// Loop offset
acc_idx_T dst_offset_out = 0;
inp_idx_T src_offset_out = 0;
// Outer Loop
EXE_OUT_LOOP: for (int it_out = 0; it_out < insn.iter_out; it_out++) {
  acc_idx_T dst_offset_in = dst_offset_out;
  inp_idx_T src_offset_in = src_offset_out;
  // Inner Loop
  EXE_IN_LOOP: for (int it_in = 0; it_in < insn.iter_in; it_in++) {
    // Iterate over micro op
    READ_ALU_UOP: for (int upc = insn.uop_bgn; upc < insn.uop_end; upc++) {
      ……
    }
    // Update offsets
    dst_offset_in += insn.dst_factor_in;
    src_offset_in += insn.src_factor_in;
  }
  // Update offsets
  dst_offset_out += insn.dst_factor_out;
  src_offset_out += insn.src_factor_out;
}
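
The body elided inside the micro-op loop applies one of the four ALU opcodes to a pair of accumulator values. The sketch below shows one plausible reading of the alu_opcode, use_imm and imm fields; AluOp and alu_apply are hypothetical names, and the restriction of SHR to shift amounts between 0 and 31 is an assumption made to keep the example well defined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the MIN/MAX/ADD/SHR opcodes.
enum class AluOp { Min, Max, Add, Shr };

// Applies one ALU operation to a destination/source pair of accumulator
// values. When use_imm is set, the immediate replaces the source operand.
std::int32_t alu_apply(AluOp op, std::int32_t dst, std::int32_t src,
                       bool use_imm, std::int32_t imm) {
  const std::int32_t rhs = use_imm ? imm : src;
  switch (op) {
    case AluOp::Min:
      return std::min(dst, rhs);
    case AluOp::Max:
      return std::max(dst, rhs);
    case AluOp::Add:
      return dst + rhs;
    case AluOp::Shr:
      // Assumption of this sketch: only shift amounts in [0, 31] are handled.
      assert(rhs >= 0 && rhs < 32);
      return dst >> rhs;
  }
  return dst;
}
```

For instance, alu_apply(AluOp::Max, -5, 0, true, 0) gives 0, which is how a ReLU can be expressed as MAX with an immediate of zero, and alu_apply(AluOp::Shr, 64, 0, true, 3) gives 8.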

BLOCKED GEMM TEST

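The source file 3rdparty/vta-hw/hardware/xilinx/src/vta.cc is simulated with the testbench 3rdparty/vta-hw/hardware/xilinx/sim/vta_test.cc.

Loading data from DRAM to SRAM

[Figure: loading data from DRAM to SRAM]

LOAD has two variants: load_2d and load_pad_2d.

load_2d is mainly used to move weight data from DRAM to SRAM.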

template <typename DATA_T, int MAT_AXI_RATIO, int ELEM_BYTES>
void load_2d(……) {
#pragma HLS INLINE
  for (int y = 0; y < y_size; y++) {
    memcpy(&dst[sram_idx][0],
           (const DATA_T*) &src[dram_idx * MAT_AXI_RATIO],
           x_size * ELEM_BYTES);
// Force multiplications on this variable onto a LUT-based multiplier
// instead of the default DSP block
#pragma HLS RESOURCE variable = sram_idx core = Mul_LUT
    sram_idx += x_size;
    dram_idx += x_stride;
  }
}

load_pad_2d can insert padding on the fly during the transfer, as requested by the PAD fields of the instruction. It is used to move inputs and biases from DRAM to SRAM.

template <typename DATA_T, int MAT_AXI_RATIO, int ELEM_BYTES>
void load_pad_2d(……) {
#pragma HLS INLINE
  // Top padding
  reset_mem<DATA_T, MAT_AXI_RATIO>(sram_idx, y_offset_0, dst);
  for (int y = 0; y < y_size; y++) {
#pragma HLS PIPELINE
    // Left padding
    reset_mem<DATA_T, MAT_AXI_RATIO>(sram_idx, x_pad_0, dst);
    // The amount of data is counted in bytes: x_size * ELEM_BYTES bytes
    memcpy(&dst[sram_idx][0],
           (const DATA_T*) &src[dram_idx * MAT_AXI_RATIO],
           x_size * ELEM_BYTES);
    sram_idx += x_size;
    dram_idx += x_stride;
    // Right padding
    reset_mem<DATA_T, MAT_AXI_RATIO>(sram_idx, x_pad_1, dst);
  }
  // Bottom padding
  reset_mem<DATA_T, MAT_AXI_RATIO>(sram_idx, y_offset_1, dst);
}
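
The padded copy is easier to follow on plain arrays. The sketch below is a simplified software model, not the HLS routine: Load2D and load_pad_2d_sketch are hypothetical names, every memory element is reduced to one int, and the top and bottom padding are assumed to span full padded rows (y_pad × (x_pad_0 + x_size + x_pad_1) elements), which is how y_offset_0 and y_offset_1 are interpreted here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 2D load descriptor; all sizes are counted in memory elements.
struct Load2D {
  int y_size, x_size, x_stride;
  int y_pad_0, y_pad_1, x_pad_0, x_pad_1;
};

// Copies a y_size x x_size tile out of `dram` (row pitch x_stride) into
// `sram` starting at sram_idx and writes zeros for the padded border.
// Returns the next free SRAM index.
int load_pad_2d_sketch(const Load2D& d, const std::vector<int>& dram,
                       int dram_idx, std::vector<int>& sram, int sram_idx) {
  const int padded_width = d.x_pad_0 + d.x_size + d.x_pad_1;
  const int padded_height = d.y_pad_0 + d.y_size + d.y_pad_1;
  assert(sram_idx + padded_width * padded_height <=
         static_cast<int>(sram.size()));

  // Plays the role of reset_mem: zero `count` elements and advance sram_idx.
  auto zero_fill = [&](int count) {
    for (int i = 0; i < count; ++i) sram[sram_idx++] = 0;
  };

  zero_fill(d.y_pad_0 * padded_width);  // top padding rows
  for (int y = 0; y < d.y_size; ++y) {
    zero_fill(d.x_pad_0);               // left padding
    for (int x = 0; x < d.x_size; ++x) {
      sram[sram_idx++] = dram[dram_idx + x];
    }
    dram_idx += d.x_stride;             // jump to the next DRAM row
    zero_fill(d.x_pad_1);               // right padding
  }
  zero_fill(d.y_pad_1 * padded_width);  // bottom padding rows
  return sram_idx;
}
```

Loading a 2×2 tile with one element of padding on every side produces a 4×4 region in SRAM whose border is zero and whose centre holds the four copied elements.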

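1. Moving inputs from DRAM to the INPUT BUFFER

[Figure: moving inputs from DRAM to the INPUT BUFFER]

MAT_AXI_RATIO=2 ELEM_BYTES=16
y_size=64: 64 block rows are fetched
x_size=4: 4 block columns are fetched
x_stride=16: consecutive rows in DRAM are 16 block columns apart

Iteration 1 copies 4×16=64 bytes starting at src offset 0,
i.e. block1~block4 --> inp_mem[0]~inp_mem[3]

Iteration 2 copies 64 bytes starting at src offset dram_idx × MAT_AXI_RATIO = 16×2 = 32,
i.e. block17~block20 --> inp_mem[4]~inp_mem[7]
……
Iteration 64 copies 64 bytes starting at src offset 16×63×2=2016,
i.e. block1009~block1012 --> inp_mem[252]~inp_mem[255]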
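
The per-iteration arithmetic above follows directly from the load_2d loop and can be checked with a small helper. This is a sketch under assumptions: RowCopy and row_copy are hypothetical names, the iteration index is zero-based, and dram_base and sram_base are both taken to be 0.

```cpp
#include <cassert>

// Describes what one loop iteration of load_2d reads and writes.
struct RowCopy {
  int src_offset;  // index into the bus-word array: dram_idx * MAT_AXI_RATIO
  int sram_first;  // first SRAM row written
  int sram_last;   // last SRAM row written
  int bytes;       // x_size * ELEM_BYTES
};

// `iteration` is zero-based; dram_base and sram_base are assumed to be 0.
RowCopy row_copy(int iteration, int x_size, int x_stride, int mat_axi_ratio,
                 int elem_bytes) {
  assert(iteration >= 0);
  const int dram_idx = iteration * x_stride;  // advanced by x_stride per row
  const int sram_idx = iteration * x_size;    // advanced by x_size per row
  return RowCopy{dram_idx * mat_axi_ratio, sram_idx, sram_idx + x_size - 1,
                 x_size * elem_bytes};
}
```

For the inputs, row_copy(1, 4, 16, 2, 16) gives src offset 32 and SRAM rows 4 to 7, and row_copy(63, 4, 16, 2, 16) gives src offset 2016 and SRAM rows 252 to 255, each moving 64 bytes.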

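2. Moving weights from DRAM to the WEIGHT BUFFER

[Figure: moving weights from DRAM to the WEIGHT BUFFER]

MAT_AXI_RATIO=32 ELEM_BYTES=256
y_size=4: 4 block rows are fetched
x_size=4: 4 block columns are fetched
x_stride=16: consecutive rows in DRAM are 16 block columns apart

Iteration 1 copies 4×256=1024 bytes starting at src offset 0,
i.e. block1~block4 --> wgt_mem[0]~wgt_mem[3]

Iteration 2 copies 1024 bytes starting at src offset dram_idx × MAT_AXI_RATIO = 16×32 = 512,
i.e. block17~block20 --> wgt_mem[4]~wgt_mem[7]
……
Iteration 4 (the last one, since y_size=4) copies 1024 bytes starting at src offset 16×3×32=1536,
i.e. block49~block52 --> wgt_mem[12]~wgt_mem[15]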

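3. Moving biases from DRAM to the REGISTER FILE

[Figure: moving biases from DRAM to the REGISTER FILE]

MAT_AXI_RATIO=8 ELEM_BYTES=64
y_size=64: 64 block rows are fetched
x_size=4: 4 block columns are fetched
x_stride=16: consecutive rows in DRAM are 16 block columns apart

Iteration 1 copies 4×64=256 bytes starting at src offset 0,
i.e. block1~block4 --> acc_mem[0]~acc_mem[3]

Iteration 2 copies 256 bytes starting at src offset dram_idx × MAT_AXI_RATIO = 16×8 = 128,
i.e. block17~block20 --> acc_mem[4]~acc_mem[7]
……
Iteration 64 copies 256 bytes starting at src offset 16×63×8=8064,
i.e. block1009~block1012 --> acc_mem[252]~acc_mem[255]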

GEMM Operation

[Figure: GEMM operation]

Data in SRAM is read into tensor registers, one block at a time.

// Read in weight tensor
wgt_T w_tensor[VTA_BLOCK_OUT][VTA_BLOCK_IN];
read_tensor<bus_T, wgt_T, wgt_idx_T, VTA_BUS_WIDTH, VTA_WGT_WIDTH, VTA_BLOCK_OUT, VTA_BLOCK_IN>(wgt_idx, wgt_mem, w_tensor);
// Read in input tensor
inp_T i_tensor[VTA_BATCH][VTA_BLOCK_IN];
read_tensor<bus_T, inp_T, inp_idx_T, VTA_BUS_WIDTH, VTA_INP_WIDTH, VTA_BATCH, VTA_BLOCK_IN>(src_idx, inp_mem, i_tensor);
// Read in accum tensor
acc_T a_tensor[VTA_BATCH][VTA_BLOCK_OUT];
read_tensor<bus_T, acc_T, acc_idx_T, VTA_BUS_WIDTH, VTA_ACC_WIDTH, VTA_BATCH, VTA_BLOCK_OUT>(dst_idx, acc_mem, a_tensor);
// Output tensor
out_T o_tensor[VTA_BATCH][VTA_BLOCK_OUT];

The SRAM addresses to read from are determined by the micro-op kernel together with the address offset fields of the instruction.

// Decode indices
acc_idx_T dst_idx = uop.range(VTA_UOP_GEM_0_1, VTA_UOP_GEM_0_0) + dst_offset_in;
inp_idx_T src_idx = uop.range(VTA_UOP_GEM_1_1, VTA_UOP_GEM_1_0) + src_offset_in;
wgt_idx_T wgt_idx = uop.range(VTA_UOP_GEM_2_1, VTA_UOP_GEM_2_0) + wgt_offset_in;

The GEMM is then computed on the data held in the tensor registers.

// Inner GEMM loop
for (int b = 0; b < VTA_BATCH; b++) {
  for (int oc = 0; oc < VTA_BLOCK_OUT; oc++) {
    // Initialize the accumulator values
    acc_T accum = a_tensor[b][oc];
    // Dot product sum
    sum_T tmp = 0;
    // Inner matrix multiplication loop (input channel/feature)
    for (int ic = 0; ic < VTA_BLOCK_IN; ic++) {
      wgt_T w_elem = w_tensor[oc][ic];
      inp_T i_elem = i_tensor[b][ic];
      mul_T prod_dsp = i_elem * w_elem;
      tmp += (sum_T) prod_dsp;
    }
    // Update summation
    accum += (acc_T) tmp;
    // Write back result acc_mem
    a_tensor[b][oc] = insn.reset_reg ? (acc_T) 0 : accum;
    // And output vector
    o_tensor[b][oc] = (out_T) accum.range(VTA_OUT_WIDTH - 1, 0);
  }
}
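
The same block computation can be reproduced outside HLS with fixed-width integers. This is a minimal sketch under assumptions: the sizes kBatch=1 and kBlockIn=kBlockOut=16 and the 8-bit/32-bit element types are assumed, the names InpTensor, WgtTensor, AccTensor, OutTensor and gemm_block are hypothetical, and the ap_int range() truncation is approximated by keeping the low 8 bits of the accumulator.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kBatch = 1;      // assumed VTA_BATCH
constexpr int kBlockIn = 16;   // assumed VTA_BLOCK_IN
constexpr int kBlockOut = 16;  // assumed VTA_BLOCK_OUT

using InpTensor = std::array<std::array<std::int8_t, kBlockIn>, kBatch>;
using WgtTensor = std::array<std::array<std::int8_t, kBlockIn>, kBlockOut>;
using AccTensor = std::array<std::array<std::int32_t, kBlockOut>, kBatch>;
using OutTensor = std::array<std::array<std::int8_t, kBlockOut>, kBatch>;

// For every (b, oc): accum = acc[b][oc] + sum over ic of inp[b][ic] * wgt[oc][ic].
// acc receives accum (or 0 when reset_reg is set); out keeps the low 8 bits.
void gemm_block(const InpTensor& inp, const WgtTensor& wgt, bool reset_reg,
                AccTensor& acc, OutTensor& out) {
  for (int b = 0; b < kBatch; ++b) {
    for (int oc = 0; oc < kBlockOut; ++oc) {
      std::int32_t accum = acc[b][oc];
      for (int ic = 0; ic < kBlockIn; ++ic) {
        accum += static_cast<std::int32_t>(inp[b][ic]) *
                 static_cast<std::int32_t>(wgt[oc][ic]);
      }
      acc[b][oc] = reset_reg ? 0 : accum;
      out[b][oc] = static_cast<std::int8_t>(accum & 0xFF);
    }
  }
}
```

With all inputs equal to 1, all weights equal to 2 and all accumulator entries equal to 3, one call leaves 3 + 16×2 = 35 in every accumulator and output entry.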

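The figure below illustrates the computation performed by the code above:

[Figure: computation performed by the inner GEMM loop]

Each time one GEMM loop completes, the result is written back to SRAM.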

// Write the results back into accumulator
write_tensor<bus_T, acc_T, acc_idx_T, VTA_BUS_WIDTH, VTA_ACC_WIDTH, VTA_BATCH, VTA_BLOCK_OUT>(dst_idx, a_tensor, acc_mem);
// Write the results back in the output buffer
write_tensor<bus_T, out_T, acc_idx_T, VTA_BUS_WIDTH, VTA_OUT_WIDTH, VTA_BATCH, VTA_BLOCK_OUT>(dst_idx, o_tensor, out_mem);

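The smallest unit the GEMM loop operates on is a block. A single load brings in many blocks; the inner-loop computation between the different kinds of data is shown in the figure below.

[Figure: inner-loop computation between inputs, weights and biases]

The exact sequence of block convolutions is controlled by the src_idx, wgt_idx and dst_idx fields of the MICRO-OP CACHE.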

input×weight+bias
1×1+1 1×17+2 1×33+3 1×49+4
2×2+1 2×18+2 2×34+3 2×50+4
……
4×4+1 4×20+2 4×36+3 4×52+4
17×1+17 17×17+18 17×33+19 17×49+20
……
1009×1+1009 …… 1009×49+1012
……
1012×4+1009 …… 1012×52+1012

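Output Verification

In the test file, all three kinds of data are filled with random numbers, and the result of convolving them is used as the reference output.

[Figure: test data and reference output]

Each of the three kinds of data is split into 16 chunks. The GEMM described above is the computation for one chunk; the outer-loop computation over all chunks is shown in the figure below.

[Figure: outer-loop computation over all chunks]

The outer-loop convolution is controlled by the DRAM BASE field of the LOAD instruction.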

input×weight+bias
1×1+1 2×2+1 3×3+1 4×4+1
1×5+2 2×6+2 3×7+2 4×8+2
……
1×13+4 2×14+4 3×15+4 4×16+4
5×1+5 6×2+5 7×3+5 8×4+5
……
13×1+13 14×2+13 15×3+13 16×4+13
……
13×13+16 14×14+16 15×15+16 16×16+16

Finally, the computed results are compared one by one against the reference output to check that the computation is correct.

// Correctness checks
int err = 0;
for (int i = 0; i < batch; i++) {
  for (int j = 0; j < out_feat; j++) {
    if (outputs_ref[i][j] != outputs[i][j]) {
      err++;
    }
  }
}
