
Lecture 44: NVIDIA Profiling (unfinished)

Background

A CUDA expert shared a video with me, so I am taking some time to digest it.

Video outline summary

  • Introduction & Format: The talk focuses on using NVIDIA Nsight Compute for kernel profiling. Magnus, who has 15 years of experience with profiling tools, leads the session, showing mostly live profiling traces after a brief slide overview.
  • Tool Overview: Jackson gives a quick overview of NVIDIA's developer tools, including debuggers (CUDA GDB), profilers (Nsight Systems, Nsight Compute), and correctness checkers (Compute Sanitizer).
  • Profiling Workflow: The recommended workflow starts with Nsight Systems for a high-level platform overview, then moves to Nsight Compute for detailed kernel analysis.
  • Nsight Systems: Nsight Systems is a timeline-based tool for system-level profiling, showing CPU-GPU interactions, memory transfers, and some GPU-wide metrics.
  • Nsight Compute Introduction (Magnus): Nsight Compute is a low-level kernel profiler that gathers as much data as possible about a specific kernel, or a small set of kernels, to guide detailed optimization. The data includes analysis at the source-line and assembly-line level.
  • Tool Operation: Nsight Compute works as a launcher, intercepting interactions between the application and the CUDA driver. No application modification is needed, but optional annotations (NVTX) and compiler flags (line info) can enhance profiling.
  • Overhead and Design: Interception adds some overhead, but the design focuses on minimizing the impact on kernel execution during profiling. Because hardware limits how much can be measured in a single run, replay passes are used to collect more data.
  • Profiling Collection and Replay: Nsight Compute replays the kernel multiple times and collects different metrics on each pass, enabling comprehensive data gathering. The process includes saving and restoring GPU state and optionally locking clocks to ensure consistency.
  • Context Saving Details: Memory that the kernel can access is copied. Ideally this copy is done on the GPU (device-to-device), but it can fall back to system memory or even disk if needed.
  • Live Demo (Simple Vector Add): Magnus demonstrates the tool on a simple vector addition kernel to show expected results and basic features.
  • Summary Page: The summary page shows an overview, including duration, throughput, register usage, grid size, and rule-based recommendations.
  • Details Page: This page provides a detailed breakdown of metrics, organized into sections (e.g., Speed of Light, Memory Workload Analysis, Scheduler Statistics, Warp State Statistics). Each section contains tables, charts, and rule-based analysis.
  • Memory Chart: Visualizes the memory hierarchy, showing data flow and bottlenecks.
  • Bottleneck Table: Lists the top bottlenecks in sorted order, helping to estimate the impact of fixing each issue.
  • Scheduler Statistics: Shows warp scheduler utilization, active/eligible warps, and stall reasons.
  • Warp State Statistics: Provides a breakdown of stall reasons (e.g., long scoreboard for memory dependencies).
  • Instruction Statistics: Displays a histogram of low-level assembly instructions.
  • Occupancy: Shows the standard occupancy chart and calculator.
  • Source Page: Correlates high-level code (CUDA C or Python with Numba) with low-level assembly instructions, displaying metrics at the source-line and assembly-line levels. It also shows dependencies between assembly instructions.
  • Live Demo (Image Sharpening with Numba): A more complex example, a Numba-based image sharpening kernel, is profiled.
  • Rule Output: Suggestions from the rule system are displayed.
  • Diffing Reports: The tool can compare reports (before and after code changes) to show the impact of optimizations, with visual cues and percentage changes.
  • Live Demo (Tensor Core Operation): A kernel using tensor core operations on a GB100 (Blackwell) is profiled.
  • Tensor Core Roofline: Shows the specific tensor core operations used and their performance relative to peak.
  • TMA Engine: The TMA (Tensor Memory Accelerator) engine is used to copy data from device memory to shared memory.
  • Python Report Interface: Nsight Compute provides a Python API for accessing the report data, allowing custom post-processing and analysis.
  • Extensibility: Sections and rules within Nsight Compute are defined in text files and can be modified or extended by users.
  • Additional Resources: Links to documentation, forums, and training videos are provided.
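
The recommended workflow in the outline (Nsight Systems first, then Nsight Compute on a chosen kernel) can be sketched as two command lines. The snippet below only builds the argument lists and does not run the tools. The application path `./vector_add`, the kernel name `vectorAdd`, and the report names are hypothetical placeholders, not taken from the talk.

```python
import shlex


def nsys_command(app, report="timeline", traces=("cuda", "nvtx")):
    """Build an Nsight Systems command line for a system-level timeline."""
    return ["nsys", "profile", "-o", report,
            "--trace=" + ",".join(traces)] + list(app)


def ncu_command(app, report="kernel", kernel_name=None, launch_count=None):
    """Build an Nsight Compute command line for detailed kernel analysis."""
    cmd = ["ncu", "--set", "full", "-o", report]
    if kernel_name is not None:
        # Restrict profiling to kernels with this name.
        cmd += ["--kernel-name", kernel_name]
    if launch_count is not None:
        # Stop after profiling this many kernel launches.
        cmd += ["--launch-count", str(launch_count)]
    return cmd + list(app)


if __name__ == "__main__":
    app = ["./vector_add"]  # hypothetical application binary
    # Step 1: high-level platform overview.
    print(shlex.join(nsys_command(app)))
    # Step 2: drill into one kernel (hypothetical name) found in step 1.
    print(shlex.join(ncu_command(app, kernel_name="vectorAdd", launch_count=1)))
```

Limiting the second step to a single kernel and a single launch keeps the replay cost small, which matters because each profiled launch is replayed several times.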

Key slide summary (the slides are not provided on GitHub)

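[Figure 1: slide image not available]
[Figure 2: slide image not available]

[Figure 3: slide image not available]

[Figure 4: slide image not available]

[Figure 5: slide image not available]
[Figure 6: slide image not available]
[Figure 7: slide image not available]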

Curiosities & open questions

1. How is tracing supported inside libraries (cuDNN/cuBLAS)?

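2. What is the difference between single-pass and multi-pass profiling?

  • See Figure 5 for the reason behind multiple passes. One plausible explanation: to keep the observed GPU performance as undisturbed as possible, only a limited number of PMU counters can be read at once without affecting the actual run. The kernel is therefore executed several times, and each pass collects a different set of performance metrics.
  • What is a single-pass metric?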
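
The multi-pass idea above can be sketched as a toy simulation. It is not how Nsight Compute is implemented. It assumes each metric needs exactly one hardware counter and that only `counters_per_pass` counters can be read per run. The metric names and the `toy_kernel` function are made up for illustration.

```python
def plan_passes(metrics, counters_per_pass):
    """Split the requested metrics into passes that fit the counter budget."""
    if counters_per_pass < 1:
        raise ValueError("counters_per_pass must be at least 1")
    return [metrics[i:i + counters_per_pass]
            for i in range(0, len(metrics), counters_per_pass)]


def profile_with_replay(kernel, initial_state, metrics, counters_per_pass):
    """Run the kernel once per pass, restoring its inputs before each run."""
    results = {}
    for selected in plan_passes(metrics, counters_per_pass):
        # Restore the saved state so every pass sees identical inputs.
        state = dict(initial_state)
        counters = kernel(state)
        # Only the counters selected for this pass are "readable".
        for name in selected:
            results[name] = counters[name]
    return results


def toy_kernel(state):
    """Hypothetical kernel: mutates its input and reports fake counters."""
    state["x"] += 1  # in-place update is why a restore is needed
    return {"cycles": 100, "dram_bytes": 4096 * state["x"],
            "inst_executed": 32, "l2_hits": 7}


if __name__ == "__main__":
    wanted = ["cycles", "dram_bytes", "inst_executed", "l2_hits"]
    print(plan_passes(wanted, 3))
    print(profile_with_replay(toy_kernel, {"x": 1}, wanted, 3))
```

Because the toy kernel updates its input in place, dropping the restore step would make later passes measure a different workload than the first. This is the consistency problem that saving and restoring GPU state addresses.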

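3. Why lower the sampling frequency? Is it for large, compute-intensive kernels?

4. Why can the tool intercept what the application is doing? (NVTX instrumentation.)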
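
According to the outline, interception happens between the application and the CUDA driver, and NVTX is an optional annotation layer on top of it. A minimal sketch of such annotations follows, assuming the `nvtx` Python package is installed. If it is not, the snippet falls back to a no-op so it still runs. The range names and the `saxpy` function are illustrative only.

```python
import contextlib

try:
    import nvtx  # NVIDIA's Python NVTX bindings, if installed
    annotate = nvtx.annotate
except ImportError:
    def annotate(message=None, **kwargs):
        """Fallback: a context manager that does nothing."""
        return contextlib.nullcontext()


def saxpy(a, xs, ys):
    """CPU stand-in for the work that would be launched on the GPU."""
    return [a * x + y for x, y in zip(xs, ys)]


def run():
    # Each named range shows up on the profiler timeline when one is attached.
    with annotate("prepare_inputs"):
        xs = [0.0, 1.0, 2.0, 3.0]
        ys = [1.0] * 4
    with annotate("saxpy"):
        return saxpy(2.0, xs, ys)


if __name__ == "__main__":
    print(run())
```

Without a profiler attached the ranges cost very little, so they can usually stay in the code.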

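5. What is host interception overhead?

6. How can CPU-side data be collected without affecting GPU performance?

7. How can one determine that the workload is CPU-bound?

8. How are the source language, the assembly, and the performance metrics correlated? (-linetable/lineinfo)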

9. How exactly do the replay passes in Figure 6 work?

  • When the data volume is large, is it saved to GPU memory or CPU memory? (Is this configurable?)
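
The outline states that the save/restore copy is ideally device-to-device and can fall back to system memory or even disk. That preference order can be sketched as below. The function name, the capacity dictionary, and the selection rule are assumptions for illustration. The actual policy, and whether it is configurable, remains an open question here.

```python
def choose_backing_store(bytes_needed, free_bytes):
    """Pick the first location, in preference order, with enough free space.

    free_bytes maps a location name to its free capacity in bytes.
    The order device -> host -> disk follows the outline's description.
    """
    for location in ("device", "host", "disk"):
        if bytes_needed <= free_bytes.get(location, 0):
            return location
    raise MemoryError("no location can hold the saved kernel memory")


if __name__ == "__main__":
    free = {"device": 2 * 1024**3, "host": 16 * 1024**3, "disk": 500 * 1024**3}
    print(choose_backing_store(1 * 1024**3, free))   # fits on the device
    print(choose_backing_store(8 * 1024**3, free))   # falls back to host memory
```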
