Lecture 44: NVIDIA Profiling (in progress)
Background
A CUDA expert shared a video with me; I'm taking some time to digest it.
Video Outline Summary
- **Introduction & Format:** The talk focuses on using NVIDIA Nsight Compute for kernel profiling. Magnus, who has 15 years of experience with profiling tools, leads the session, showing mostly live profiling traces after a brief slide overview.
- **Tool Overview:** Jackson provides a quick overview of NVIDIA's developer tools, including debuggers (CUDA GDB), profilers (Nsight Systems, Nsight Compute), and correctness checkers (Compute Sanitizer).
- **Profiling Workflow:** The recommended workflow starts with Nsight Systems for a high-level platform overview, then moves to Nsight Compute for detailed kernel analysis.
- **Nsight Systems:** Nsight Systems is a timeline-based tool for system-level profiling, showing CPU-GPU interactions, memory transfers, and some GPU-wide metrics.
- **Nsight Compute Introduction (Magnus):** Nsight Compute is a low-level kernel profiler that gathers as much data as possible about a specific kernel, or a small set of kernels, for detailed optimization. The data includes source-line- and assembly-line-level analysis.
- **Tool Operation:** Nsight Compute works as a launcher, intercepting interactions between the application and the CUDA driver. No application modification is needed, but optional annotations (NVTX) and compiler flags (line info) can enhance profiling (see the first sketch after this list).
- **Overhead and Design:** There is some overhead due to interception, but the focus is on minimizing the impact on kernel execution during profiling. Replay passes are used to collect more data than can be gathered in a single run, given hardware limitations.
- **Profiling Collection and Replay:** Nsight Compute replays the kernel multiple times, collecting different metrics on each pass, enabling comprehensive data gathering. The process includes saving/restoring GPU state and optionally locking clocks to ensure consistency.
- **Context Saving Details:** Memory that the kernel can access is copied. Ideally, this copy is done on the GPU (device-to-device), but it can fall back to system memory or even disk if needed.
- **Live Demo (Simple Vector Add):** Magnus demonstrates the tool using a simple vector addition kernel to show expected results and basic features (see the sketch after this list).
- **Summary Page:** The summary page shows an overview, including duration, throughput, register usage, grid size, and rule-based recommendations.
- **Details Page:** This page provides a detailed breakdown of metrics, organized into sections (e.g., Speed of Light, Memory Workload Analysis, Scheduler Statistics, Warp State Statistics). Each section contains tables, charts, and rule-based analysis.
- **Memory Chart:** Visualizes the memory hierarchy, showing data flow and bottlenecks.
- **Bottleneck Table:** Shows the top bottlenecks, sorted, helping to understand the impact of fixing an issue.
- **Scheduler Statistics:** Shows warp scheduler utilization, active/eligible warps, and stall reasons.
- **Warp State Statistics:** Provides a breakdown of stall reasons (e.g., long scoreboard for memory dependencies).
- **Instruction Statistics:** Displays a histogram of low-level assembly instructions.
- **Occupancy:** Shows the standard occupancy chart and calculator (the underlying formula is given after this list).
- **Source Page:** Correlates high-level code (CUDA C or Python with Numba) with low-level assembly instructions, displaying metrics at the source-line and assembly-line levels. Shows assembly dependencies.
- **Live Demo (Image Sharpening with Numba):** A more complex example, a Numba-based image sharpening kernel, is profiled (see the sketch after this list).
- **Rule Output:** Suggestions from the rule system are displayed.
- **Diffing Reports:** The tool allows comparing reports (before and after code changes) to see the impact of optimizations, with visual cues and percentage changes.
- **Live Demo (Tensor Core Operation):** A kernel using tensor core operations on a GB100 (Blackwell) is profiled.
- **Tensor Core Roofline:** Shows the specific tensor core operations used and their performance relative to peak.
- **TMA Engine:** The TMA engine is used to copy from device (global) memory to shared memory.
- **Python Report Interface:** Nsight Compute provides a Python API for accessing report data, allowing custom post-processing and analysis (see the sketch after this list).
- **Extensibility:** Sections and rules within Nsight Compute are defined in text files and can be modified or extended by users.
- **Additional Resources:** Links to documentation, forums, and training videos are provided.
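
For the Tool Operation item above, here is a minimal sketch of the optional hooks, assuming the `nvtx` pip package and a Numba version that supports `lineinfo=True` (the talk's own examples may use CUDA C with `nvcc -lineinfo` instead); the kernel and range names are illustrative.

```python
# Optional hooks that improve Nsight Compute output (sketch, not from the talk).
import numpy as np
import nvtx                      # NVIDIA's NVTX annotation package (pip install nvtx)
from numba import cuda

# lineinfo=True embeds source-line tables so the Source page can correlate
# Python lines with SASS (the Numba analogue of `nvcc -lineinfo`).
@cuda.jit(lineinfo=True)
def scale(out, x, a):
    i = cuda.grid(1)
    if i < x.size:
        out[i] = a * x[i]

x = cuda.to_device(np.arange(1 << 20, dtype=np.float32))
out = cuda.device_array_like(x)
threads = 256
blocks = (x.size + threads - 1) // threads

# NVTX ranges are optional annotations; ncu can filter on them,
# e.g. `ncu --nvtx --nvtx-include "scale/" ...`.
with nvtx.annotate("scale", color="green"):
    scale[blocks, threads](out, x, np.float32(2.0))

# No application changes are strictly required; a typical launch is:
#   ncu --set full -o report python this_script.py
```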
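
For the vector-add demo, an equivalent kernel sketch (the talk's version may be CUDA C; a Numba variant profiles the same way under ncu):

```python
# Simple vector-add kernel, analogous to the first live demo (sketch).
import numpy as np
from numba import cuda

@cuda.jit(lineinfo=True)
def vec_add(c, a, b):
    i = cuda.grid(1)
    if i < c.size:
        c[i] = a[i] + b[i]

n = 1 << 24
a = cuda.to_device(np.random.rand(n).astype(np.float32))
b = cuda.to_device(np.random.rand(n).astype(np.float32))
c = cuda.device_array_like(a)

threads = 256
blocks = (n + threads - 1) // threads
vec_add[blocks, threads](c, a, b)
```

On the Details page such a kernel typically reports as memory-bound: DRAM throughput close to its Speed of Light value and low compute-pipe utilization.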
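
For the Occupancy item, these are the standard CUDA definitions behind the chart (general background, not specific to the talk), where the maximum is the number of resident warps an SM supports and the theoretical limit comes from whichever of registers, shared memory, or block size is most restrictive:

```latex
\text{theoretical occupancy} = \frac{\min\left(W_{\text{regs}},\, W_{\text{smem}},\, W_{\text{blocks}}\right)}{W_{\max}},
\qquad
\text{achieved occupancy} = \frac{\overline{W_{\text{active}}}}{W_{\max}}
```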
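
For the Numba image-sharpening demo, a sketch in the same spirit (the exact kernel from the talk is not reproduced here; this uses a standard 3x3 sharpen stencil, and all names are illustrative):

```python
# Image-sharpening kernel in the spirit of the second live demo (sketch).
import numpy as np
from numba import cuda

@cuda.jit(lineinfo=True)
def sharpen(dst, src):
    x, y = cuda.grid(2)
    h, w = src.shape
    if 1 <= x < h - 1 and 1 <= y < w - 1:
        # 3x3 sharpen stencil: center weight 5, direct neighbours -1.
        dst[x, y] = (5.0 * src[x, y]
                     - src[x - 1, y] - src[x + 1, y]
                     - src[x, y - 1] - src[x, y + 1])

img = cuda.to_device(np.random.rand(2048, 2048).astype(np.float32))
out = cuda.device_array_like(img)
threads = (16, 16)
blocks = ((img.shape[0] + threads[0] - 1) // threads[0],
          (img.shape[1] + threads[1] - 1) // threads[1])
sharpen[blocks, threads](out, img)
```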
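
For the Python report interface, a sketch using the `ncu_report` module that ships with Nsight Compute (under extras/python); the method names follow the documented API but may differ slightly between versions, and the report file name is illustrative:

```python
# Post-process an Nsight Compute report from Python (sketch).
import ncu_report

ctx = ncu_report.load_report("report.ncu-rep")   # file produced by `ncu -o report ...`
for r in range(ctx.num_ranges()):
    rng = ctx.range_by_idx(r)
    for a in range(rng.num_actions()):
        action = rng.action_by_idx(a)            # one profiled kernel launch
        dur = action.metric_by_name("gpu__time_duration.sum")
        print(action.name(), dur.value())
```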
Key Slide Summary (the slides are not provided on GitHub)
Open Questions & Things to Resolve
1. How do libraries (cuDNN/cuBLAS) support tracing?
2. What is the difference between profiling with a single pass and with multiple passes?
- See the explanation in Figure 5 for why multiple passes are used. A plausible explanation: to minimize the observer effect on GPU performance, not too many PMU counters can be read at once without distorting the actual run, so the kernel is run multiple times and each pass collects a different set of performance counters.
- What are the single-pass metrics?
3. Why lower the sampling frequency for large, compute-intensive kernels?
4. How can the tool intercept the application's execution? NVTX instrumentation.
5. What is host interception overhead?
6. How can CPU-side data be collected without affecting GPU performance?
7. How can you tell that the workload is CPU-bound?
8. How do you correlate the source language, the assembly, and the performance metrics? (-linetable/lineinfo)
9. What exactly does the Replay Pass in Figure 6 look like?
- When the data volume is large, is it stored in GPU memory or CPU memory? (Is this configurable?)