NVIDIA tool metrics analysis (ncu, nsys/torch profiler)
All of the analysis below uses the A100 hardware architecture as an example.
Theoretical Max Active Warps per SM: 64
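Register number: 512 32-bit registers per thread lane in each of the 4 SM sub-partitions (64K registers per SM in total); a single thread may use at most 255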
-
Theoretical Active Warps per SM [warp]: 512 // registers_per_thread * 4, capped at the maximum of 64; this defines the register-limited theoretical occupancy (active warps / 64)
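The formula above can be sketched in Python as follows. This is only a rough estimate under stated assumptions: the cap at 64 warps, the reading of the factor 4 as the number of SM sub-partitions, and all function and constant names are illustrative and not taken from any profiler API. It also ignores register allocation granularity, shared memory and block-count limits, which ncu does account for.

```python
# Rough sketch of the note's register-limited occupancy estimate for A100.
# All names here are hypothetical; only the numbers come from the note.
MAX_WARPS_PER_SM = 64      # Theoretical Max Active Warps per SM
REGS_PER_LANE = 512        # the note's "Register number"
SUBPARTITIONS_PER_SM = 4   # assumed meaning of the "* 4" factor


def theoretical_active_warps(registers_per_thread: int) -> int:
    """Register-limited active warps per SM, capped at the hardware maximum."""
    if registers_per_thread <= 0:
        raise ValueError("registers_per_thread must be positive")
    limited = REGS_PER_LANE // registers_per_thread * SUBPARTITIONS_PER_SM
    return min(MAX_WARPS_PER_SM, limited)


def theoretical_occupancy(registers_per_thread: int) -> float:
    """Fraction of the 64 warp slots that registers allow to be active."""
    return theoretical_active_warps(registers_per_thread) / MAX_WARPS_PER_SM


if __name__ == "__main__":
    for regs in (32, 40, 64, 128, 255):
        print(regs, theoretical_active_warps(regs), theoretical_occupancy(regs))
```

With this estimate, a kernel using 64 registers per thread can keep 32 warps active per SM, which is 50% theoretical occupancy.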
-
Waves Per SM (equal to waves per GPU): grid_size / a_wave_perf_GPU; a fractional part means the last wave is only partially filled, which is the tail effect
-
a_wave_perf_GPU: Theoretical Active Warps per SM // (block_size // 32) * 108, i.e. the number of resident blocks per SM multiplied by the 108 SMs of the A100
-
A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU
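As a minimal sketch of the wave calculation, assuming the theoretical active warps per SM is already known and passed in as a plain integer. The function names are hypothetical, and rounding warps per block up (instead of `block_size // 32`) is an assumption made here so that block sizes that are not a multiple of 32 are handled; for multiples of 32 the two agree.

```python
import math

NUM_SMS = 108  # SM count the note uses for A100


def blocks_per_wave(active_warps_per_sm: int, block_size: int,
                    num_sms: int = NUM_SMS) -> int:
    """Maximum number of blocks resident on the whole GPU at once."""
    # Assumption: a partially filled warp still occupies a full warp slot.
    warps_per_block = math.ceil(block_size / 32)
    blocks_per_sm = active_warps_per_sm // warps_per_block
    if blocks_per_sm == 0:
        raise ValueError("block does not fit on an SM under this estimate")
    return blocks_per_sm * num_sms


def waves_per_gpu(grid_size: int, active_warps_per_sm: int,
                  block_size: int) -> float:
    """Number of waves needed to run the grid (fractional part = tail)."""
    return grid_size / blocks_per_wave(active_warps_per_sm, block_size)


def tail_blocks(grid_size: int, active_warps_per_sm: int,
                block_size: int) -> int:
    """Blocks left for the final, partially filled wave (0 means no tail)."""
    return grid_size % blocks_per_wave(active_warps_per_sm, block_size)


if __name__ == "__main__":
    # 64 active warps per SM, 256 threads per block -> 8 blocks per SM.
    print(blocks_per_wave(64, 256))
    print(waves_per_gpu(1000, 64, 256), tail_blocks(1000, 64, 256))
```

In this example a grid of 1000 blocks needs a little more than one wave, so the second wave runs only the remaining blocks while most SMs sit idle.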
-
-
ncu/nsys/torch_profiler: computing the register file usage of a thread block (the shared memory usage of a thread block is also displayed):
registers_mem_used = block_size x registers_per_thread (counted in 32-bit registers)
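The register usage formula can be turned into a small helper, sketched below. The per-SM register file size of 65536 32-bit registers is an assumption stated here for A100, and the function names are hypothetical. The block-count estimate ignores allocation granularity and every limit other than registers.

```python
REGS_PER_SM = 65536       # assumed A100 register file: 64K 32-bit registers per SM
BYTES_PER_REGISTER = 4    # each register is 32 bits wide


def registers_per_block(block_size: int, registers_per_thread: int) -> int:
    """Registers one thread block occupies: block_size x registers_per_thread."""
    return block_size * registers_per_thread


def register_bytes_per_block(block_size: int, registers_per_thread: int) -> int:
    """The same usage expressed in bytes of register file."""
    return registers_per_block(block_size, registers_per_thread) * BYTES_PER_REGISTER


def blocks_per_sm_by_registers(block_size: int, registers_per_thread: int) -> int:
    """How many such blocks fit in one SM if registers were the only limit."""
    return REGS_PER_SM // registers_per_block(block_size, registers_per_thread)


if __name__ == "__main__":
    # block_size and registers_per_thread as reported by the profiler
    print(registers_per_block(256, 64))
    print(register_bytes_per_block(256, 64))
    print(blocks_per_sm_by_registers(256, 64))
```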
Hopper white paper (Chinese version)