[CUDA] Kernel Performance Profiling Tools
CUDA C Programming Notes
- the nvprof command
- the nsys command (Linux)
- the ncu command (Linux)
- the Nsight Compute GUI (Windows)
- the Nsight Systems GUI (Windows)
I haven't worked on CUDA code for a while, and I've mostly forgotten the profiling commands, so I'm using some free time today to write this material up for easy reference later.
nvprof command
On this machine nvprof simply reports an error: it cannot be used on devices with compute capability 8.0 or higher.
nvprof is not supported on devices with compute capability 8.0 and higher.
Use NVIDIA Nsight Systems for GPU tracing and CPU sampling and NVIDIA Nsight Compute for GPU profiling.
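Which tool applies therefore depends on the GPU's compute capability. A quick way to check it from code is to query the device properties; a minimal sketch (my own addition, not from the original program):

```cuda
// Minimal sketch (not from the original notes): query the compute capability
// to decide whether nvprof or the Nsight tools apply on this device.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    if (prop.major >= 8)
        printf("nvprof unsupported here; use nsys / ncu instead.\n");
    return 0;
}
```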
nsys命令(linux)
nsys profile --stats=true ./5-3reduceIntege
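For context, the binary runs a single reduction kernel named reduceGmem (the name shows up in the reports below). The notes don't include its source, so the following is only a hedged reconstruction of a typical in-place global-memory (interleaved-pair) reduction; the real kernel body may differ:

```cuda
// Hedged reconstruction: the actual reduceGmem source isn't in these notes.
// In-place interleaved-pair reduction entirely in global memory; assumes
// n is an exact multiple of blockDim.x * gridDim.x (true for the runs here).
__global__ void reduceGmem(int *g_idata, int *g_odata, unsigned int n) {
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + tid;
    int *idata = g_idata + blockIdx.x * blockDim.x;   // this block's slice

    if (idx >= n) return;

    // halve the active stride each step; adds go straight to global memory
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            idata[tid] += idata[tid + stride];
        __syncthreads();                              // wait for all adds
    }

    if (tid == 0) g_odata[blockIdx.x] = idata[0];     // per-block partial sum
}
```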
Result:
~/cudaC/unit5$ nsys profile --stats=true ./5-3reduceIntege
WARNING: CPU IP/backtrace sampling not supported, disabling.
Try the 'nsys status --environment' command to learn more.
WARNING: CPU context switch tracing not supported, disabling.
Try the 'nsys status --environment' command to learn more.
Collecting data...
/home/zyn/cudaC/unit5/./5-3reduceIntege starting reduction at device 0: NVIDIA GeForce RTX 3090 with array size 16777216 grid 131072 block 128
cpu reduce : 2139353471
reduceNeighboredGmem: 2139353471 <<<grid 131072 block 128>>>
Generating '/tmp/nsys-report-09fe.qdstrm'
[1/8] [========================100%] report2.nsys-rep
[2/8] [========================100%] report2.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/zyn/cudaC/unit5/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)            Name
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -------------  ----------------------
     63.8    1,537,834,150         92   16,715,588.6   10,119,573.0        1,146  262,165,057   33,433,226.8  poll
     25.9      623,869,881          2  311,934,940.5  311,934,940.5  123,751,122  500,118,759  266,132,108.3  pthread_cond_timedwait
     10.0      240,457,182        649      370,504.1       22,601.0        1,023   22,438,180    1,086,289.3  ioctl
      0.1        2,832,914         48       59,019.0        8,945.5        5,014      951,482      187,238.0  mmap64
      0.1        1,342,580         18       74,587.8       96,389.5       13,299      153,732       47,889.8  sem_timedwait
      0.0          818,329         18       45,462.7       10,248.0        1,957      202,878       72,523.7  mmap
      0.0          648,770          2      324,385.0      324,385.0      255,162      393,608       97,896.1  pthread_join
      0.0          270,074         12       22,506.2        6,589.0        2,312      197,373       55,204.8  munmap
      0.0          234,465          1      234,465.0      234,465.0      234,465      234,465            0.0  pthread_cond_wait
      0.0          222,957          3       74,319.0       72,590.0       51,931       98,436       23,300.7  pthread_create
      0.0          217,384         43        5,055.4        4,374.0        1,463       13,918        2,673.3  open64
      0.0          161,246         34        4,742.5        2,862.0        1,125       27,307        5,351.4  fopen
      0.0           50,583         20        2,529.2        2,806.5        1,026        3,555          868.4  read
      0.0           42,498         20        2,124.9        1,703.0        1,105        5,659        1,155.0  write
      0.0           36,204          1       36,204.0       36,204.0       36,204       36,204            0.0  fgets
      0.0           32,465         20        1,623.3        1,234.5        1,005        4,845          945.2  fclose
      0.0           22,702          2       11,351.0       11,351.0        5,941       16,761        7,650.9  fread
      0.0           21,733          6        3,622.2        4,046.5        1,314        6,126        1,777.9  open
      0.0           17,492          3        5,830.7        6,707.0        2,558        8,227        2,934.3  pipe2
      0.0           17,054          2        8,527.0        8,527.0        5,753       11,301        3,923.0  socket
      0.0           15,133          3        5,044.3        3,915.0        3,648        7,570        2,191.4  pthread_cond_broadcast
      0.0            9,767          1        9,767.0        9,767.0        9,767        9,767            0.0  connect
      0.0            6,405          1        6,405.0        6,405.0        6,405        6,405            0.0  pthread_kill
      0.0            3,596          2        1,798.0        1,798.0        1,653        1,943          205.1  fwrite
      0.0            3,048          1        3,048.0        3,048.0        3,048        3,048            0.0  bind
      0.0            2,277          1        2,277.0        2,277.0        2,277        2,277            0.0  pthread_cond_signal
      0.0            1,610          1        1,610.0        1,610.0        1,610        1,610            0.0  fcntl

[5/8] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)    Max (ns)    StdDev (ns)                Name
 --------  ---------------  ---------  ------------  ------------  ----------  ----------  ------------  ---------------------------------
     69.2       70,851,810          1  70,851,810.0  70,851,810.0  70,851,810  70,851,810           0.0  cudaDeviceReset
     24.1       24,712,531          2  12,356,265.5  12,356,265.5   3,096,192  21,616,339  13,095,721.5  cudaMemcpy
      3.3        3,399,421          1   3,399,421.0   3,399,421.0   3,399,421   3,399,421           0.0  cudaGetDeviceProperties_v2_v12000
      1.8        1,842,364          1   1,842,364.0   1,842,364.0   1,842,364   1,842,364           0.0  cudaLaunchKernel
      1.3        1,284,642          2     642,321.0     642,321.0     186,268   1,098,374     644,956.3  cudaFree
      0.3          291,849          2     145,924.5     145,924.5      68,123     223,726     110,027.9  cudaMalloc
      0.0            3,185          1       3,185.0       3,185.0       3,185       3,185           0.0  cuCtxSynchronize
      0.0            1,603          1       1,603.0       1,603.0       1,603       1,603           0.0  cuModuleGetLoadingMode

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  --------------------------------------
    100.0          231,714          1  231,714.0  231,714.0   231,714   231,714          0.0  reduceGmem(int *, int *, unsigned int)

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)           Operation
 --------  ---------------  -----  ------------  ------------  ----------  ----------  -----------  ----------------------------
     99.0       21,807,788      1  21,807,788.0  21,807,788.0  21,807,788  21,807,788          0.0  [CUDA memcpy Host-to-Device]
      1.0          217,121      1     217,121.0     217,121.0     217,121     217,121          0.0  [CUDA memcpy Device-to-Host]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
     67.109      1    67.109    67.109    67.109    67.109        0.000  [CUDA memcpy Host-to-Device]
      0.524      1     0.524     0.524     0.524     0.524        0.000  [CUDA memcpy Device-to-Host]

Generated:
    /home/zyn/cudaC/unit5/report2.nsys-rep
    /home/zyn/cudaC/unit5/report2.sqlite
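Note that the [3/8] nvtx_sum report above was SKIPPED because the binary contains no NVTX ranges. If you want named host-side regions to show up in nsys, you can annotate the code with NVTX. A small self-contained sketch (assuming the NVTX v3 header bundled with recent CUDA toolkits; the kernel and sizes are placeholders of my own):

```cuda
// Sketch: NVTX ranges so nsys's nvtx_sum report has data to show.
// Assumes the header-only NVTX v3 shipped with recent CUDA toolkits.
#include <cstdio>
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>

__global__ void touch(int *data) { data[threadIdx.x] += 1; }  // placeholder kernel

int main() {
    int h[128] = {0}, *d;
    cudaMalloc(&d, sizeof(h));

    nvtxRangePushA("H2D copy");          // named range, visible in nvtx_sum
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    nvtxRangePop();

    nvtxRangePushA("kernel");
    touch<<<1, 128>>>(d);
    cudaDeviceSynchronize();
    nvtxRangePop();

    cudaFree(d);
    return 0;
}
```

Since NVTX v3 is header-only, building with plain nvcc should suffice; after rebuilding, rerunning nsys profile --stats=true should populate the nvtx_sum table instead of skipping it.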
Observations:
① Each run of nsys profile generates two new files (here report2.nsys-rep and report2.sqlite), and the number in the file name auto-increments on every run.
② The kernel itself takes very little of the total time (~0.23 ms); far more goes to cudaDeviceReset (~70.9 ms) and the host-to-device copy (~21.8 ms).
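Observation ② can also be cross-checked without a profiler by timing the kernel with CUDA events. A sketch (reusing the hypothetical reduceGmem above; the data is zero-filled for brevity, since only the timing matters here):

```cuda
// Sketch: time the kernel with CUDA events instead of a profiler.
// Assumes the hypothetical reduceGmem sketch above is in the same file.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    unsigned int n = 1 << 24;                    // 16777216, as in the nsys run
    dim3 block(128), grid(n / block.x);          // grid 131072, block 128
    int *d_idata, *d_odata;
    cudaMalloc(&d_idata, n * sizeof(int));
    cudaMalloc(&d_odata, grid.x * sizeof(int));
    cudaMemset(d_idata, 0, n * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    reduceGmem<<<grid, block>>>(d_idata, d_odata, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                  // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("reduceGmem: %.3f ms\n", ms);         // nsys reported ~0.23 ms

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_idata);
    cudaFree(d_odata);
    return 0;
}
```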
ncu command (Linux)
Usage:
- Basic profiling (default sections, no extra metrics): ncu ./5-3reduceInteger
- Collect specific metrics (note: nvprof-era names like gld_efficiency/gst_efficiency are not recognized by ncu; the Nsight Compute equivalents are, e.g.): ncu --metrics smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct,smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct ./5-3reduceInteger
- Profile a specific kernel only: ncu --kernel-name reduceGmem ./5-3reduceInteger
- Collect the full section/metric set (not recommended; very slow and the output is huge): ncu --set full ./5-3reduceInteger
Result:
~/cudaC/unit5$ ncu ./5-3reduceInteger
==PROF== Connected to process 2380393 (/home/zyn/cudaC/unit5/5-3reduceInteger)
/home/zyn/cudaC/unit5/./5-3reduceInteger starting reduction at device 0: NVIDIA GeForce RTX 3090 with array size 4194304 grid 32768 block 128
cpu reduce : 534907410
==PROF== Profiling "reduceGmem" - 0: 0%....50%....100% - 8 passes
reduceNeighboredGmem: 534907410 <<<grid 32768 block 128>>>
==PROF== Disconnected from process 2380393
[2380393] 5-3reduceInteger@127.0.0.1
  reduceGmem(int *, int *, unsigned int) (32768, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.6

    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency          Ghz         9.49
    SM Frequency            Ghz         1.39
    Elapsed Cycles          cycle       106,926
    Memory Throughput       %           66.85
    DRAM Throughput         %           36.09
    Duration                us          76.80
    L1/TEX Cache Throughput %           35.68
    L2 Cache Throughput     %           66.85
    SM Active Cycles        cycle       105,584.51
    Compute (SM) Throughput %           22.44
    ----------------------- ----------- ------------

    OPT   Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the
          L2 bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes
          transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or
          whether there are values you can (re)compute.

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                      Metric Unit     Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                       128
    Function Cache Configuration                     CachePreferNone
    Grid Size                                        32,768
    Registers Per Thread             register/thread 18
    Shared Memory Configuration Size Kbyte           16.38
    Driver Shared Memory Per Block   Kbyte/block     1.02
    Dynamic Shared Memory Per Block  byte/block      0
    Static Shared Memory Per Block   byte/block      0
    # SMs                            SM              82
    Stack Size                                       1,024
    Threads                          thread          4,194,304
    # TPCs                                           41
    Enabled TPC IDs                                  all
    Uses Green Context                               0
    Waves Per SM                                     33.30
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                  block       16
    Block Limit Registers           block       21
    Block Limit Shared Mem          block       16
    Block Limit Warps               block       12
    Theoretical Active Warps per SM warp        48
    Theoretical Occupancy           %           100
    Achieved Occupancy              %           58.66
    Achieved Active Warps Per SM    warp        28.16
    ------------------------------- ----------- ------------

    OPT   Est. Local Speedup: 41.34%
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (58.7%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.

    Section: GPU and Memory Workload Distribution
    -------------------------- ----------- ------------
    Metric Name                Metric Unit Metric Value
    -------------------------- ----------- ------------
    Average DRAM Active Cycles cycle       262,944
    Total DRAM Elapsed Cycles  cycle       8,743,936
    Average L1 Active Cycles   cycle       105,584.51
    Total L1 Elapsed Cycles    cycle       8,760,638
    Average L2 Active Cycles   cycle       102,636.31
    Total L2 Elapsed Cycles    cycle       5,031,936
    Average SM Active Cycles   cycle       105,584.51
    Total SM Elapsed Cycles    cycle       8,760,638
    Average SMSP Active Cycles cycle       103,461.77
    Total SMSP Elapsed Cycles  cycle       35,042,552
    -------------------------- ----------- ------------
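The Occupancy section reports 100% theoretical but only 58.66% achieved occupancy. The theoretical figure can be reproduced in code with the CUDA occupancy API; a sketch (standInKernel is my placeholder with roughly reduceGmem's launch shape and resource usage):

```cuda
// Sketch: reproduce ncu's "Theoretical Occupancy" with the occupancy API.
// standInKernel is a hypothetical stand-in, not the real reduceGmem.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void standInKernel(int *g_idata, int *g_odata, unsigned int n) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) g_odata[idx % gridDim.x] = g_idata[idx];
}

int main() {
    int blockSize = 128, numBlocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocksPerSm, standInKernel, blockSize, /*dynamicSMem=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = numBlocksPerSm * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;  // 48 on CC 8.6

    printf("Blocks/SM: %d, theoretical occupancy: %.1f%%\n",
           numBlocksPerSm, 100.0 * activeWarps / maxWarps);
    return 0;
}
```

On the RTX 3090 above (CC 8.6, 1536 threads per SM) this should report 12 blocks/SM and 100%, matching ncu's Theoretical Active Warps per SM (48) and Theoretical Occupancy; the 58.66% achieved figure is a runtime effect that ncu attributes to warp scheduling overhead or load imbalance.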
Nsight Compute GUI (Windows)
Connect to the remote machine over SSH from the GUI, then just select the target application.
Nsight Systems GUI (Windows)
I haven't used it yet; to be filled in later.