
[CUDA] Kernel Performance Profiling Tools

CUDA C Programming Notes

  • nvprof command
  • nsys command (Linux)
  • ncu command (Linux)
  • Nsight Compute GUI (Windows)
  • Nsight Systems GUI (Windows)

I have not worked on CUDA code for quite a while, and I have mostly forgotten the profiling commands, so I am using some spare time to catch up on this topic and write it down for future reference.

nvprof command

It fails with an error: nvprof cannot be used on devices with compute capability 8.0 or higher.

nvprof is not supported on devices with compute capability 8.0 and higher.                   
Use NVIDIA Nsight Systems for GPU tracing and CPU sampling and NVIDIA Nsight Compute for GPU profiling.
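
So on Ampere and newer GPUs (such as the RTX 3090 used below, CC 8.6) the Nsight tools have to be used instead. If you are unsure which case your device falls into, the compute capability can be queried with the runtime API; the sketch below is my own minimal example (the file name check_cc.cu and the build line are assumptions, not part of the original notes):

// check_cc.cu -- print each device's compute capability to see whether
// nvprof still applies (CC < 8.0) or Nsight Systems/Compute is required.
// Build (assumed): nvcc check_cc.cu -o check_cc
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s, compute capability %d.%d -> %s\n",
               dev, prop.name, prop.major, prop.minor,
               prop.major >= 8 ? "use nsys/ncu" : "nvprof still works");
    }
    return 0;
}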

nsys command (Linux)

nsys profile --stats=true ./5-3reduceIntege

Result:

~/cudaC/unit5$ nsys profile --stats=true ./5-3reduceIntege
WARNING: CPU IP/backtrace sampling not supported, disabling.
Try the 'nsys status --environment' command to learn more.
WARNING: CPU context switch tracing not supported, disabling.
Try the 'nsys status --environment' command to learn more.
Collecting data...
/home/zyn/cudaC/unit5/./5-3reduceIntege starting reduction at device 0: NVIDIA GeForce RTX 3090 with array size 16777216 grid 131072 block 128
cpu reduce         : 2139353471
reduceNeighboredGmem: 2139353471 <<<grid 131072 block 128>>>
Generating '/tmp/nsys-report-09fe.qdstrm'
[1/8] [========================100%] report2.nsys-rep
[2/8] [========================100%] report2.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/zyn/cudaC/unit5/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

Time (%)  Total Time (ns)  Num Calls       Avg (ns)       Med (ns)     Min (ns)     Max (ns)    StdDev (ns)  Name
--------  ---------------  ---------  -------------  -------------  -----------  -----------  -------------  ----------------------
    63.8    1,537,834,150         92   16,715,588.6   10,119,573.0        1,146  262,165,057   33,433,226.8  poll
    25.9      623,869,881          2  311,934,940.5  311,934,940.5  123,751,122  500,118,759  266,132,108.3  pthread_cond_timedwait
    10.0      240,457,182        649      370,504.1       22,601.0        1,023   22,438,180    1,086,289.3  ioctl
     0.1        2,832,914         48       59,019.0        8,945.5        5,014      951,482      187,238.0  mmap64
     0.1        1,342,580         18       74,587.8       96,389.5       13,299      153,732       47,889.8  sem_timedwait
     0.0          818,329         18       45,462.7       10,248.0        1,957      202,878       72,523.7  mmap
     0.0          648,770          2      324,385.0      324,385.0      255,162      393,608       97,896.1  pthread_join
     0.0          270,074         12       22,506.2        6,589.0        2,312      197,373       55,204.8  munmap
     0.0          234,465          1      234,465.0      234,465.0      234,465      234,465            0.0  pthread_cond_wait
     0.0          222,957          3       74,319.0       72,590.0       51,931       98,436       23,300.7  pthread_create
     0.0          217,384         43        5,055.4        4,374.0        1,463       13,918        2,673.3  open64
     0.0          161,246         34        4,742.5        2,862.0        1,125       27,307        5,351.4  fopen
     0.0           50,583         20        2,529.2        2,806.5        1,026        3,555          868.4  read
     0.0           42,498         20        2,124.9        1,703.0        1,105        5,659        1,155.0  write
     0.0           36,204          1       36,204.0       36,204.0       36,204       36,204            0.0  fgets
     0.0           32,465         20        1,623.3        1,234.5        1,005        4,845          945.2  fclose
     0.0           22,702          2       11,351.0       11,351.0        5,941       16,761        7,650.9  fread
     0.0           21,733          6        3,622.2        4,046.5        1,314        6,126        1,777.9  open
     0.0           17,492          3        5,830.7        6,707.0        2,558        8,227        2,934.3  pipe2
     0.0           17,054          2        8,527.0        8,527.0        5,753       11,301        3,923.0  socket
     0.0           15,133          3        5,044.3        3,915.0        3,648        7,570        2,191.4  pthread_cond_broadcast
     0.0            9,767          1        9,767.0        9,767.0        9,767        9,767            0.0  connect
     0.0            6,405          1        6,405.0        6,405.0        6,405        6,405            0.0  pthread_kill
     0.0            3,596          2        1,798.0        1,798.0        1,653        1,943          205.1  fwrite
     0.0            3,048          1        3,048.0        3,048.0        3,048        3,048            0.0  bind
     0.0            2,277          1        2,277.0        2,277.0        2,277        2,277            0.0  pthread_cond_signal
     0.0            1,610          1        1,610.0        1,610.0        1,610        1,610            0.0  fcntl

[5/8] Executing 'cuda_api_sum' stats report

Time (%)  Total Time (ns)  Num Calls      Avg (ns)      Med (ns)    Min (ns)    Max (ns)   StdDev (ns)  Name
--------  ---------------  ---------  ------------  ------------  ----------  ----------  ------------  ---------------------------------
    69.2       70,851,810          1  70,851,810.0  70,851,810.0  70,851,810  70,851,810           0.0  cudaDeviceReset
    24.1       24,712,531          2  12,356,265.5  12,356,265.5   3,096,192  21,616,339  13,095,721.5  cudaMemcpy
     3.3        3,399,421          1   3,399,421.0   3,399,421.0   3,399,421   3,399,421           0.0  cudaGetDeviceProperties_v2_v12000
     1.8        1,842,364          1   1,842,364.0   1,842,364.0   1,842,364   1,842,364           0.0  cudaLaunchKernel
     1.3        1,284,642          2     642,321.0     642,321.0     186,268   1,098,374     644,956.3  cudaFree
     0.3          291,849          2     145,924.5     145,924.5      68,123     223,726     110,027.9  cudaMalloc
     0.0            3,185          1       3,185.0       3,185.0       3,185       3,185           0.0  cuCtxSynchronize
     0.0            1,603          1       1,603.0       1,603.0       1,603       1,603           0.0  cuModuleGetLoadingMode

[6/8] Executing 'cuda_gpu_kern_sum' stats report

Time (%)  Total Time (ns)  Instances   Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  ---------  ---------  --------  --------  -----------  --------------------------------------
   100.0          231,714          1  231,714.0  231,714.0   231,714   231,714          0.0  reduceGmem(int *, int *, unsigned int)

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

Time (%)  Total Time (ns)  Count      Avg (ns)      Med (ns)    Min (ns)    Max (ns)  StdDev (ns)  Operation
--------  ---------------  -----  ------------  ------------  ----------  ----------  -----------  ----------------------------
    99.0       21,807,788      1  21,807,788.0  21,807,788.0  21,807,788  21,807,788          0.0  [CUDA memcpy Host-to-Device]
     1.0          217,121      1     217,121.0     217,121.0     217,121     217,121          0.0  [CUDA memcpy Device-to-Host]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)  Operation
----------  -----  --------  --------  --------  --------  -----------  ----------------------------
    67.109      1    67.109    67.109    67.109    67.109        0.000  [CUDA memcpy Host-to-Device]
     0.524      1     0.524     0.524     0.524     0.524        0.000  [CUDA memcpy Device-to-Host]

Generated:
    /home/zyn/cudaC/unit5/report2.nsys-rep
    /home/zyn/cudaC/unit5/report2.sqlite

Observations:
① Every run of nsys profile generates two new files, report2.nsys-rep and report2.sqlite, and the sequence number in the file names is automatically incremented on later runs.
② The kernel itself accounts for only a small share of the total time; far more is spent in the device reset (cudaDeviceReset) and in the host-device data transfers.
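
A side note on the skipped nvtx_sum report above: nsys only produces it when the program emits NVTX ranges. A minimal sketch of how that could look, assuming the nvToolsExt header shipped with the CUDA toolkit and linking with -lnvToolsExt (the kernel, range names, and file name here are placeholders of my own, not the author's code):

// nvtx_demo.cu -- wrap host-side phases in NVTX ranges so that
// 'nsys profile --stats=true' has data for its nvtx_sum report.
// Build (assumed): nvcc nvtx_demo.cu -o nvtx_demo -lnvToolsExt
#include <nvToolsExt.h>
#include <cuda_runtime.h>

// placeholder kernel standing in for the real reduction kernel
__global__ void touch(int *data, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1;
}

int main()
{
    const unsigned int n = 1 << 20;
    int *d_data = nullptr;

    nvtxRangePushA("allocate+init");      // shows up as a named range in nsys
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));
    nvtxRangePop();

    nvtxRangePushA("kernel");
    touch<<<(n + 127) / 128, 128>>>(d_data, n);
    cudaDeviceSynchronize();
    nvtxRangePop();

    cudaFree(d_data);
    cudaDeviceReset();
    return 0;
}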

ncu command (Linux)

How to use the command:

  1. Basic profiling (no metrics): ncu ./5-3reduceInteger
  2. Profiling with selected metrics (the metric names come before the executable): ncu --metrics gld_efficiency,gst_efficiency ./5-3reduceInteger (note: gld_efficiency/gst_efficiency are the old nvprof metric names; recent ncu versions use their own metric names, which can be listed with ncu --query-metrics)
  3. Profiling a specific kernel: ncu --kernel-name reduceGmem ./5-3reduceInteger
  4. Collecting all metrics (not recommended, the output is huge): ncu --metrics all ./5-3reduceInteger
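
For context, the kernel these commands profile is reduceGmem(int *, int *, unsigned int). The author's source is not shown in this post, so the following is only a sketch of what a global-memory reduction kernel with that signature typically looks like (a simple in-place tree reduction; the real implementation may differ, e.g. it may be unrolled):

// Sketch of a block-level reduction operating directly on global memory.
// Each block reduces its own blockDim.x-element slice of g_idata and writes
// one partial sum to g_odata[blockIdx.x]; the host then sums the partials.
// Assumes n is an exact multiple of blockDim.x (as in the example:
// 32768 blocks x 128 threads) and that blockDim.x is a power of two.
__global__ void reduceGmem(int *g_idata, int *g_odata, unsigned int n)
{
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // pointer to this block's slice of the input
    int *idata = g_idata + blockIdx.x * blockDim.x;

    if (idx >= n) return;  // no thread exits early when n is a multiple of blockDim.x

    // in-place tree reduction in global memory
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            idata[tid] += idata[tid + stride];

        __syncthreads();   // all threads of the block reach this barrier
    }

    // thread 0 writes the block's partial result
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}

Every iteration of the loop reads and writes global memory, which is consistent with the ncu output below flagging memory as more heavily utilized than compute.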

Result:

~/cudaC/unit5$ ncu ./5-3reduceInteger
==PROF== Connected to process 2380393 (/home/zyn/cudaC/unit5/5-3reduceInteger)
/home/zyn/cudaC/unit5/./5-3reduceInteger starting reduction at device 0: NVIDIA GeForce RTX 3090 with array size 4194304 grid 32768 block 128
cpu reduce         : 534907410
==PROF== Profiling "reduceGmem" - 0: 0%....50%....100% - 8 passes
reduceNeighboredGmem: 534907410 <<<grid 32768 block 128>>>
==PROF== Disconnected from process 2380393
[2380393] 5-3reduceInteger@127.0.0.1
  reduceGmem(int *, int *, unsigned int) (32768, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.6

    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         9.49
    SM Frequency                    Ghz         1.39
    Elapsed Cycles                cycle      106,926
    Memory Throughput                 %        66.85
    DRAM Throughput                   %        36.09
    Duration                         us        76.80
    L1/TEX Cache Throughput           %        35.68
    L2 Cache Throughput               %        66.85
    SM Active Cycles              cycle   105,584.51
    Compute (SM) Throughput           %        22.44
    ----------------------- ----------- ------------

    OPT   Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the
          L2 bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes
          transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or
          whether there are values you can (re)compute.

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                   128
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                 32,768
    Registers Per Thread             register/thread              18
    Shared Memory Configuration Size           Kbyte           16.38
    Driver Shared Memory Per Block       Kbyte/block            1.02
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               0
    # SMs                                         SM              82
    Stack Size                                                 1,024
    Threads                                   thread       4,194,304
    # TPCs                                                        41
    Enabled TPC IDs                                              all
    Uses Green Context                                             0
    Waves Per SM                                               33.30
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block           21
    Block Limit Shared Mem                block           16
    Block Limit Warps                     block           12
    Theoretical Active Warps per SM        warp           48
    Theoretical Occupancy                     %          100
    Achieved Occupancy                        %        58.66
    Achieved Active Warps Per SM           warp        28.16
    ------------------------------- ----------- ------------

    OPT   Est. Local Speedup: 41.34%
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (58.7%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.

    Section: GPU and Memory Workload Distribution
    -------------------------- ----------- ------------
    Metric Name                Metric Unit Metric Value
    -------------------------- ----------- ------------
    Average DRAM Active Cycles       cycle      262,944
    Total DRAM Elapsed Cycles        cycle    8,743,936
    Average L1 Active Cycles         cycle   105,584.51
    Total L1 Elapsed Cycles          cycle    8,760,638
    Average L2 Active Cycles         cycle   102,636.31
    Total L2 Elapsed Cycles          cycle    5,031,936
    Average SM Active Cycles         cycle   105,584.51
    Total SM Elapsed Cycles          cycle    8,760,638
    Average SMSP Active Cycles       cycle   103,461.77
    Total SMSP Elapsed Cycles        cycle   35,042,552
    -------------------------- ----------- ------------
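
The Occupancy section above reports 100% theoretical occupancy (48 warps per SM) but only about 58.7% achieved. The theoretical figure can also be checked programmatically with the CUDA occupancy API; the sketch below is my own illustration and uses a placeholder kernel body rather than the real reduceGmem:

// occupancy_check.cu -- ask the runtime how many blocks of a kernel can be
// resident per SM at block size 128 and convert that to an occupancy figure,
// for comparison with ncu's "Theoretical Occupancy".
// Build (assumed): nvcc occupancy_check.cu -o occupancy_check
#include <cstdio>
#include <cuda_runtime.h>

// placeholder body; the real reduceGmem would be used instead
__global__ void reduceGmem(int *g_idata, int *g_odata, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) g_odata[blockIdx.x] = g_idata[idx];
}

int main()
{
    const int blockSize = 128;

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, reduceGmem,
                                                  blockSize, 0 /* dynamic smem */);

    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("theoretical occupancy: %.1f%% (%d of %d warps per SM)\n",
           100.0 * activeWarps / maxWarps, activeWarps, maxWarps);
    return 0;
}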

Nsight Compute GUI (Windows)

Just connect to the remote machine over SSH and select the target there.

Nsight Systems GUI (Windows)

Have not used it yet; to be filled in later.
