Lecture 44: NVIDIA Profiling (in progress)
Background
A CUDA expert shared a video with me; I'm taking some time to digest it.
Video Outline Summary
- **Introduction & Format:** The talk focuses on using NVIDIA Nsight Compute for kernel profiling. Magnus, who has 15 years of experience with profiling tools, leads the session, showing mostly live profiling traces after a brief slide overview.
- **Tool Overview:** Jackson provides a quick overview of NVIDIA's developer tools, including debuggers (CUDA GDB), profilers (Nsight Systems, Nsight Compute), and correctness checkers (Compute Sanitizer).
- **Profiling Workflow:** The recommended workflow starts with Nsight Systems for a high-level platform overview, then moves to Nsight Compute for detailed kernel analysis.
- **Nsight Systems:** Nsight Systems is a timeline-based tool for system-level profiling, showing CPU-GPU interactions, memory transfers, and some GPU-wide metrics.
- **Nsight Compute Introduction (Magnus):** Nsight Compute is a low-level kernel profiler that gathers as much data as possible about a specific kernel, or a small set of kernels, for detailed optimization. The data includes source-line- and assembly-line-level analysis.
- **Tool Operation:** Nsight Compute works as a launcher, intercepting interactions between the application and the CUDA driver. No application modification is needed, but optional annotations (NVTX) and compiler flags (line info) can enhance profiling (see the first sketch after this list).
- **Overhead and Design:** There is some overhead due to interception, but the focus is on minimizing the impact on kernel execution during profiling. Replay passes are used to collect more data than can be gathered in a single run, given hardware limitations.
- **Profiling Collection and Replay:** Nsight Compute replays the kernel multiple times, collecting different metrics on each pass, enabling comprehensive data gathering. The process includes saving/restoring GPU state and optionally locking clocks to ensure consistency.
- **Context Saving Details:** Memory that the kernel can access is copied. Ideally, this copy is done on the GPU (device-to-device), but it can fall back to system memory or even disk if needed.
- **Live Demo (Simple Vector Add):** Magnus demonstrates the tool using a simple vector addition kernel to show expected results and basic features (see the sketch after this list).
- **Summary Page:** The summary page shows an overview, including duration, throughput, register usage, grid size, and rule-based recommendations.
- **Details Page:** This page provides a detailed breakdown of metrics, organized into sections (e.g., Speed of Light, Memory Workload Analysis, Scheduler Statistics, Warp State Statistics). Each section contains tables, charts, and rule-based analysis.
- **Memory Chart:** Visualizes the memory hierarchy, showing data flow and bottlenecks.
- **Bottleneck Table:** Shows the top bottlenecks, sorted, helping to understand the impact of fixing an issue.
- **Scheduler Statistics:** Shows warp scheduler utilization, active/eligible warps, and stall reasons.
- **Warp State Statistics:** Provides a breakdown of stall reasons (e.g., long scoreboard for memory dependencies).
- **Instruction Statistics:** Displays a histogram of low-level assembly instructions.
- **Occupancy:** Shows the standard occupancy chart and calculator (the underlying formula is given after this list).
- **Source Page:** Correlates high-level code (CUDA C or Python with Numba) with low-level assembly instructions, displaying metrics at the source-line and assembly-line levels. Shows assembly dependencies.
- **Live Demo (Image Sharpening with Numba):** A more complex example, a Numba-based image sharpening kernel, is profiled (see the sketch after this list).
- **Rule Output:** Suggestions from the rule system are displayed.
- **Diffing Reports:** The tool allows comparing reports (before and after code changes) to see the impact of optimizations, with visual cues and percentage changes.
- **Live Demo (Tensor Core Operation):** A kernel using tensor core operations on a GB100 (Blackwell) is profiled.
- **Tensor Core Roofline:** Shows the specific tensor core operations used and their performance relative to peak.
- **TMA Engine:** The TMA engine is used to copy from device (global) memory to shared memory.
- **Python Report Interface:** Nsight Compute provides a Python API for accessing report data, allowing custom post-processing and analysis (see the sketch after this list).
- **Extensibility:** Sections and rules within Nsight Compute are defined in text files and can be modified or extended by users.
- **Additional Resources:** Links to documentation, forums, and training videos are provided.
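
For the Tool Operation item above, here is a minimal sketch of the optional hooks, assuming the `nvtx` pip package and a Numba version that supports `lineinfo=True` (the talk's own examples may use CUDA C with `nvcc -lineinfo` instead); the kernel and range names are illustrative.

```python
# Optional hooks that improve Nsight Compute output (sketch, not from the talk).
import numpy as np
import nvtx                      # NVIDIA's NVTX annotation package (pip install nvtx)
from numba import cuda

# lineinfo=True embeds source-line tables so the Source page can correlate
# Python lines with SASS (the Numba analogue of `nvcc -lineinfo`).
@cuda.jit(lineinfo=True)
def scale(out, x, a):
    i = cuda.grid(1)
    if i < x.size:
        out[i] = a * x[i]

x = cuda.to_device(np.arange(1 << 20, dtype=np.float32))
out = cuda.device_array_like(x)
threads = 256
blocks = (x.size + threads - 1) // threads

# NVTX ranges are optional annotations; ncu can filter on them,
# e.g. `ncu --nvtx --nvtx-include "scale/" ...`.
with nvtx.annotate("scale", color="green"):
    scale[blocks, threads](out, x, np.float32(2.0))

# No application changes are strictly required; a typical launch is:
#   ncu --set full -o report python this_script.py
```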
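
For the vector-add demo, an equivalent kernel sketch (the talk's version may be CUDA C; a Numba variant profiles the same way under ncu):

```python
# Simple vector-add kernel, analogous to the first live demo (sketch).
import numpy as np
from numba import cuda

@cuda.jit(lineinfo=True)
def vec_add(c, a, b):
    i = cuda.grid(1)
    if i < c.size:
        c[i] = a[i] + b[i]

n = 1 << 24
a = cuda.to_device(np.random.rand(n).astype(np.float32))
b = cuda.to_device(np.random.rand(n).astype(np.float32))
c = cuda.device_array_like(a)

threads = 256
blocks = (n + threads - 1) // threads
vec_add[blocks, threads](c, a, b)
```

On the Details page such a kernel typically reports as memory-bound: DRAM throughput close to its Speed of Light value and low compute-pipe utilization.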
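
For the Occupancy item, these are the standard CUDA definitions behind the chart (general background, not specific to the talk), where the maximum is the number of resident warps an SM supports and the theoretical limit comes from whichever of registers, shared memory, or block size is most restrictive:

```latex
\text{theoretical occupancy} = \frac{\min\left(W_{\text{regs}},\, W_{\text{smem}},\, W_{\text{blocks}}\right)}{W_{\max}},
\qquad
\text{achieved occupancy} = \frac{\overline{W_{\text{active}}}}{W_{\max}}
```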
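
For the Numba image-sharpening demo, a sketch in the same spirit (the exact kernel from the talk is not reproduced here; this uses a standard 3x3 sharpen stencil, and all names are illustrative):

```python
# Image-sharpening kernel in the spirit of the second live demo (sketch).
import numpy as np
from numba import cuda

@cuda.jit(lineinfo=True)
def sharpen(dst, src):
    x, y = cuda.grid(2)
    h, w = src.shape
    if 1 <= x < h - 1 and 1 <= y < w - 1:
        # 3x3 sharpen stencil: center weight 5, direct neighbours -1.
        dst[x, y] = (5.0 * src[x, y]
                     - src[x - 1, y] - src[x + 1, y]
                     - src[x, y - 1] - src[x, y + 1])

img = cuda.to_device(np.random.rand(2048, 2048).astype(np.float32))
out = cuda.device_array_like(img)
threads = (16, 16)
blocks = ((img.shape[0] + threads[0] - 1) // threads[0],
          (img.shape[1] + threads[1] - 1) // threads[1])
sharpen[blocks, threads](out, img)
```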
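
For the Python report interface, a sketch using the `ncu_report` module that ships with Nsight Compute (under extras/python); the method names follow the documented API but may differ slightly between versions, and the report file name is illustrative:

```python
# Post-process an Nsight Compute report from Python (sketch).
import ncu_report

ctx = ncu_report.load_report("report.ncu-rep")   # file produced by `ncu -o report ...`
for r in range(ctx.num_ranges()):
    rng = ctx.range_by_idx(r)
    for a in range(rng.num_actions()):
        action = rng.action_by_idx(a)            # one profiled kernel launch
        dur = action.metric_by_name("gpu__time_duration.sum")
        print(action.name(), dur.value())
```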
Key Slide Summary (the slides are not provided on GitHub)
Open Questions & Things to Resolve
1. How do libraries (cuDNN/cuBLAS) support tracing?
2. What is the difference between profiling with a single pass and with multiple passes?
- See the explanation in Figure 5 for why multiple passes are used. A plausible explanation: to minimize the observer effect on GPU performance, not too many PMU counters can be read at once without distorting the actual run, so the kernel is run multiple times and each pass collects a different set of performance counters.
- What are the single-pass metrics?
3. Why lower the sampling frequency for large, compute-intensive kernels?
4. How can the tool intercept the application's execution? NVTX instrumentation.
5. What is host interception overhead?
6. How can CPU-side data be collected without affecting GPU performance?
7. How can you tell that the workload is CPU-bound?
8. How do you correlate the source language, the assembly, and the performance metrics? (-linetable/lineinfo)
9. What exactly does the Replay Pass in Figure 6 look like?
- When the data volume is large, is it stored in GPU memory or CPU memory? (Is this configurable?)