[CUDA] Kernel Performance Profiling Tools
CUDA C Programming Notes
- the nvprof command
- the nsys command (Linux)
- the ncu command (Linux)
- the Nsight Compute GUI (Windows)
- the Nsight Systems GUI (Windows)
I haven't worked on CUDA code for a while, and I've mostly forgotten the profiling commands, so I'm using some free time today to write this material up for easy reference later.
nvprof command
On this machine nvprof simply reports an error: it cannot be used on devices with compute capability 8.0 or higher.
nvprof is not supported on devices with compute capability 8.0 and higher.
Use NVIDIA Nsight Systems for GPU tracing and CPU sampling and NVIDIA Nsight Compute for GPU profiling.
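Which tool applies therefore depends on the GPU's compute capability. A quick way to check it from code is to query the device properties; a minimal sketch (my own addition, not from the original program):

```cuda
// Minimal sketch (not from the original notes): query the compute capability
// to decide whether nvprof or the Nsight tools apply on this device.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    if (prop.major >= 8)
        printf("nvprof unsupported here; use nsys / ncu instead.\n");
    return 0;
}
```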
nsys命令(linux)
nsys profile --stats=true ./5-3reduceIntege
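For context, the binary runs a single reduction kernel named reduceGmem (the name shows up in the reports below). The notes don't include its source, so the following is only a hedged reconstruction of a typical in-place global-memory (interleaved-pair) reduction; the real kernel body may differ:

```cuda
// Hedged reconstruction: the actual reduceGmem source isn't in these notes.
// In-place interleaved-pair reduction entirely in global memory; assumes
// n is an exact multiple of blockDim.x * gridDim.x (true for the runs here).
__global__ void reduceGmem(int *g_idata, int *g_odata, unsigned int n) {
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + tid;
    int *idata = g_idata + blockIdx.x * blockDim.x;   // this block's slice

    if (idx >= n) return;

    // halve the active stride each step; adds go straight to global memory
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            idata[tid] += idata[tid + stride];
        __syncthreads();                              // wait for all adds
    }

    if (tid == 0) g_odata[blockIdx.x] = idata[0];     // per-block partial sum
}
```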
Result:
~/cudaC/unit5$ nsys profile --stats=true ./5-3reduceIntege
WARNING: CPU IP/backtrace sampling not supported, disabling.
Try the 'nsys status --environment' command to learn more.
WARNING: CPU context switch tracing not supported, disabling.
Try the 'nsys status --environment' command to learn more.
Collecting data...
/home/zyn/cudaC/unit5/./5-3reduceIntege starting reduction at device 0: NVIDIA GeForce RTX 3090 with array size 16777216 grid 131072 block 128
cpu reduce : 2139353471
reduceNeighboredGmem: 2139353471 <<<grid 131072 block 128>>>
Generating '/tmp/nsys-report-09fe.qdstrm'
[1/8] [========================100%] report2.nsys-rep
[2/8] [========================100%] report2.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/zyn/cudaC/unit5/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)            Name
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -------------  ----------------------
     63.8    1,537,834,150         92   16,715,588.6   10,119,573.0        1,146  262,165,057   33,433,226.8  poll
     25.9      623,869,881          2  311,934,940.5  311,934,940.5  123,751,122  500,118,759  266,132,108.3  pthread_cond_timedwait
     10.0      240,457,182        649      370,504.1       22,601.0        1,023   22,438,180    1,086,289.3  ioctl
      0.1        2,832,914         48       59,019.0        8,945.5        5,014      951,482      187,238.0  mmap64
      0.1        1,342,580         18       74,587.8       96,389.5       13,299      153,732       47,889.8  sem_timedwait
      0.0          818,329         18       45,462.7       10,248.0        1,957      202,878       72,523.7  mmap
      0.0          648,770          2      324,385.0      324,385.0      255,162      393,608       97,896.1  pthread_join
      0.0          270,074         12       22,506.2        6,589.0        2,312      197,373       55,204.8  munmap
      0.0          234,465          1      234,465.0      234,465.0      234,465      234,465            0.0  pthread_cond_wait
      0.0          222,957          3       74,319.0       72,590.0       51,931       98,436       23,300.7  pthread_create
      0.0          217,384         43        5,055.4        4,374.0        1,463       13,918        2,673.3  open64
      0.0          161,246         34        4,742.5        2,862.0        1,125       27,307        5,351.4  fopen
      0.0           50,583         20        2,529.2        2,806.5        1,026        3,555          868.4  read
      0.0           42,498         20        2,124.9        1,703.0        1,105        5,659        1,155.0  write
      0.0           36,204          1       36,204.0       36,204.0       36,204       36,204            0.0  fgets
      0.0           32,465         20        1,623.3        1,234.5        1,005        4,845          945.2  fclose
      0.0           22,702          2       11,351.0       11,351.0        5,941       16,761        7,650.9  fread
      0.0           21,733          6        3,622.2        4,046.5        1,314        6,126        1,777.9  open
      0.0           17,492          3        5,830.7        6,707.0        2,558        8,227        2,934.3  pipe2
      0.0           17,054          2        8,527.0        8,527.0        5,753       11,301        3,923.0  socket
      0.0           15,133          3        5,044.3        3,915.0        3,648        7,570        2,191.4  pthread_cond_broadcast
      0.0            9,767          1        9,767.0        9,767.0        9,767        9,767            0.0  connect
      0.0            6,405          1        6,405.0        6,405.0        6,405        6,405            0.0  pthread_kill
      0.0            3,596          2        1,798.0        1,798.0        1,653        1,943          205.1  fwrite
      0.0            3,048          1        3,048.0        3,048.0        3,048        3,048            0.0  bind
      0.0            2,277          1        2,277.0        2,277.0        2,277        2,277            0.0  pthread_cond_signal
      0.0            1,610          1        1,610.0        1,610.0        1,610        1,610            0.0  fcntl

[5/8] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)    Max (ns)    StdDev (ns)                Name
 --------  ---------------  ---------  ------------  ------------  ----------  ----------  ------------  ---------------------------------
     69.2       70,851,810          1  70,851,810.0  70,851,810.0  70,851,810  70,851,810           0.0  cudaDeviceReset
     24.1       24,712,531          2  12,356,265.5  12,356,265.5   3,096,192  21,616,339  13,095,721.5  cudaMemcpy
      3.3        3,399,421          1   3,399,421.0   3,399,421.0   3,399,421   3,399,421           0.0  cudaGetDeviceProperties_v2_v12000
      1.8        1,842,364          1   1,842,364.0   1,842,364.0   1,842,364   1,842,364           0.0  cudaLaunchKernel
      1.3        1,284,642          2     642,321.0     642,321.0     186,268   1,098,374     644,956.3  cudaFree
      0.3          291,849          2     145,924.5     145,924.5      68,123     223,726     110,027.9  cudaMalloc
      0.0            3,185          1       3,185.0       3,185.0       3,185       3,185           0.0  cuCtxSynchronize
      0.0            1,603          1       1,603.0       1,603.0       1,603       1,603           0.0  cuModuleGetLoadingMode

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  --------------------------------------
    100.0          231,714          1  231,714.0  231,714.0   231,714   231,714          0.0  reduceGmem(int *, int *, unsigned int)

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)           Operation
 --------  ---------------  -----  ------------  ------------  ----------  ----------  -----------  ----------------------------
     99.0       21,807,788      1  21,807,788.0  21,807,788.0  21,807,788  21,807,788          0.0  [CUDA memcpy Host-to-Device]
      1.0          217,121      1     217,121.0     217,121.0     217,121     217,121          0.0  [CUDA memcpy Device-to-Host]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
     67.109      1    67.109    67.109    67.109    67.109        0.000  [CUDA memcpy Host-to-Device]
      0.524      1     0.524     0.524     0.524     0.524        0.000  [CUDA memcpy Device-to-Host]

Generated:
    /home/zyn/cudaC/unit5/report2.nsys-rep
    /home/zyn/cudaC/unit5/report2.sqlite
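Note that the [3/8] nvtx_sum report above was SKIPPED because the binary contains no NVTX ranges. If you want named host-side regions to show up in nsys, you can annotate the code with NVTX. A small self-contained sketch (assuming the NVTX v3 header bundled with recent CUDA toolkits; the kernel and sizes are placeholders of my own):

```cuda
// Sketch: NVTX ranges so nsys's nvtx_sum report has data to show.
// Assumes the header-only NVTX v3 shipped with recent CUDA toolkits.
#include <cstdio>
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>

__global__ void touch(int *data) { data[threadIdx.x] += 1; }  // placeholder kernel

int main() {
    int h[128] = {0}, *d;
    cudaMalloc(&d, sizeof(h));

    nvtxRangePushA("H2D copy");          // named range, visible in nvtx_sum
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    nvtxRangePop();

    nvtxRangePushA("kernel");
    touch<<<1, 128>>>(d);
    cudaDeviceSynchronize();
    nvtxRangePop();

    cudaFree(d);
    return 0;
}
```

Since NVTX v3 is header-only, building with plain nvcc should suffice; after rebuilding, rerunning nsys profile --stats=true should populate the nvtx_sum table instead of skipping it.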
Observations:
① Each run of nsys profile generates two new files (here report2.nsys-rep and report2.sqlite), and the number in the file name auto-increments on every run.
② The kernel itself takes very little of the total time (~0.23 ms); far more goes to cudaDeviceReset (~70.9 ms) and the host-to-device copy (~21.8 ms).
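Observation ② can also be cross-checked without a profiler by timing the kernel with CUDA events. A sketch (reusing the hypothetical reduceGmem above; the data is zero-filled for brevity, since only the timing matters here):

```cuda
// Sketch: time the kernel with CUDA events instead of a profiler.
// Assumes the hypothetical reduceGmem sketch above is in the same file.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    unsigned int n = 1 << 24;                    // 16777216, as in the nsys run
    dim3 block(128), grid(n / block.x);          // grid 131072, block 128
    int *d_idata, *d_odata;
    cudaMalloc(&d_idata, n * sizeof(int));
    cudaMalloc(&d_odata, grid.x * sizeof(int));
    cudaMemset(d_idata, 0, n * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    reduceGmem<<<grid, block>>>(d_idata, d_odata, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                  // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("reduceGmem: %.3f ms\n", ms);         // nsys reported ~0.23 ms

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_idata);
    cudaFree(d_odata);
    return 0;
}
```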
ncu command (Linux)
Usage:
- Basic profiling (default sections, no extra metrics): ncu ./5-3reduceInteger
- Collect specific metrics (note: nvprof-era names like gld_efficiency/gst_efficiency are not recognized by ncu; the Nsight Compute equivalents are, e.g.): ncu --metrics smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct,smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct ./5-3reduceInteger
- Profile a specific kernel only: ncu --kernel-name reduceGmem ./5-3reduceInteger
- Collect the full section/metric set (not recommended; very slow and the output is huge): ncu --set full ./5-3reduceInteger
Result:
~/cudaC/unit5$ ncu ./5-3reduceInteger
==PROF== Connected to process 2380393 (/home/zyn/cudaC/unit5/5-3reduceInteger)
/home/zyn/cudaC/unit5/./5-3reduceInteger starting reduction at device 0: NVIDIA GeForce RTX 3090 with array size 4194304 grid 32768 block 128
cpu reduce : 534907410
==PROF== Profiling "reduceGmem" - 0: 0%....50%....100% - 8 passes
reduceNeighboredGmem: 534907410 <<<grid 32768 block 128>>>
==PROF== Disconnected from process 2380393
[2380393] 5-3reduceInteger@127.0.0.1
  reduceGmem(int *, int *, unsigned int) (32768, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.6

    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency          Ghz         9.49
    SM Frequency            Ghz         1.39
    Elapsed Cycles          cycle       106,926
    Memory Throughput       %           66.85
    DRAM Throughput         %           36.09
    Duration                us          76.80
    L1/TEX Cache Throughput %           35.68
    L2 Cache Throughput     %           66.85
    SM Active Cycles        cycle       105,584.51
    Compute (SM) Throughput %           22.44
    ----------------------- ----------- ------------

    OPT   Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the
          L2 bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes
          transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or
          whether there are values you can (re)compute.

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                      Metric Unit     Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                       128
    Function Cache Configuration                     CachePreferNone
    Grid Size                                        32,768
    Registers Per Thread             register/thread 18
    Shared Memory Configuration Size Kbyte           16.38
    Driver Shared Memory Per Block   Kbyte/block     1.02
    Dynamic Shared Memory Per Block  byte/block      0
    Static Shared Memory Per Block   byte/block      0
    # SMs                            SM              82
    Stack Size                                       1,024
    Threads                          thread          4,194,304
    # TPCs                                           41
    Enabled TPC IDs                                  all
    Uses Green Context                               0
    Waves Per SM                                     33.30
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                  block       16
    Block Limit Registers           block       21
    Block Limit Shared Mem          block       16
    Block Limit Warps               block       12
    Theoretical Active Warps per SM warp        48
    Theoretical Occupancy           %           100
    Achieved Occupancy              %           58.66
    Achieved Active Warps Per SM    warp        28.16
    ------------------------------- ----------- ------------

    OPT   Est. Local Speedup: 41.34%
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (58.7%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.

    Section: GPU and Memory Workload Distribution
    -------------------------- ----------- ------------
    Metric Name                Metric Unit Metric Value
    -------------------------- ----------- ------------
    Average DRAM Active Cycles cycle       262,944
    Total DRAM Elapsed Cycles  cycle       8,743,936
    Average L1 Active Cycles   cycle       105,584.51
    Total L1 Elapsed Cycles    cycle       8,760,638
    Average L2 Active Cycles   cycle       102,636.31
    Total L2 Elapsed Cycles    cycle       5,031,936
    Average SM Active Cycles   cycle       105,584.51
    Total SM Elapsed Cycles    cycle       8,760,638
    Average SMSP Active Cycles cycle       103,461.77
    Total SMSP Elapsed Cycles  cycle       35,042,552
    -------------------------- ----------- ------------
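The Occupancy section reports 100% theoretical but only 58.66% achieved occupancy. The theoretical figure can be reproduced in code with the CUDA occupancy API; a sketch (standInKernel is my placeholder with roughly reduceGmem's launch shape and resource usage):

```cuda
// Sketch: reproduce ncu's "Theoretical Occupancy" with the occupancy API.
// standInKernel is a hypothetical stand-in, not the real reduceGmem.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void standInKernel(int *g_idata, int *g_odata, unsigned int n) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) g_odata[idx % gridDim.x] = g_idata[idx];
}

int main() {
    int blockSize = 128, numBlocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocksPerSm, standInKernel, blockSize, /*dynamicSMem=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = numBlocksPerSm * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;  // 48 on CC 8.6

    printf("Blocks/SM: %d, theoretical occupancy: %.1f%%\n",
           numBlocksPerSm, 100.0 * activeWarps / maxWarps);
    return 0;
}
```

On the RTX 3090 above (CC 8.6, 1536 threads per SM) this should report 12 blocks/SM and 100%, matching ncu's Theoretical Active Warps per SM (48) and Theoretical Occupancy; the 58.66% achieved figure is a runtime effect that ncu attributes to warp scheduling overhead or load imbalance.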
Nsight Compute GUI (Windows)
Connect to the remote machine over SSH from the GUI, then just select the target application.
Nsight Systems GUI (Windows)
I haven't used it yet; to be filled in later.