sglang / PyTorch NCCL hang analysis
An sglang deployment was hanging. Attaching cuda-gdb to the stuck worker showed that the hang was inside an NCCL kernel:
(cuda-gdb) info cuda kernels
  Kernel Parent Dev    Grid Status  SMs Mask                              GridDim  BlockDim  Invocation
*      0      -   1 3424489 Active  0x0000000000000000000000000000ff00ff  (16,1,1) (544,1,1) ncclDevKernel_ReduceScatter_Sum_bf16_RING_LL()
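
The stuck kernel is NCCL's bf16 reduce-scatter, i.e. the device side of a torch.distributed reduce-scatter call (in sglang this typically comes from the tensor-parallel layers). The sketch below is only an illustration of the call that launches this kernel, with made-up tensor sizes and a plain process-group setup (run e.g. via torchrun); it is not sglang's actual code. It also shows the usual way such a kernel ends up spinning forever: one rank stops issuing the matching collective, so the other ranks wait inside the kernel indefinitely.

# repro_sketch.py -- illustrative only; sizes and setup are assumptions
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

# bf16 reduce-scatter: this is what shows up on the GPU as
# ncclDevKernel_ReduceScatter_Sum_bf16_RING_LL.
x = torch.ones(world * 1024, dtype=torch.bfloat16, device="cuda")
out = torch.empty(1024, dtype=torch.bfloat16, device="cuda")

# If any rank skips this call (or issues a different collective / different
# sizes), the NCCL kernel on the remaining ranks never completes -- which is
# exactly what an "Active" kernel making no progress in cuda-gdb looks like.
dist.reduce_scatter_tensor(out, x, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
dist.destroy_process_group()
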
By default, however, no call stack is printed. Setting the following environment variables makes NCCL log debug information and makes PyTorch record the call stack of each collective:
export NCCL_DEBUG=INFO
export TORCH_NCCL_TRACE_BUFFER_SIZE=40960
export TORCH_NCCL_TRACE_CPP_STACK=true
export TORCH_NCCL_DUMP_ON_TIMEOUT=true
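
With these set, NCCL prints INFO-level logs and PyTorch's NCCL flight recorder keeps the most recent collectives (with their stacks) and dumps them when the watchdog times out. In recent PyTorch versions the dump is a pickled dict written to a per-rank file whose prefix is controlled by TORCH_NCCL_DEBUG_INFO_TEMP_FILE (default /tmp/nccl_trace_rank_). The sketch below reads such a dump; the file path and the entry keys ("entries", "profiling_name", "state", "frames") are assumptions that may vary across PyTorch versions.

# read_fr_dump.py -- minimal sketch; path and dict layout are assumptions
import pickle

# hypothetical rank-0 dump produced by TORCH_NCCL_DUMP_ON_TIMEOUT
with open("/tmp/nccl_trace_rank_0", "rb") as f:
    trace = pickle.load(f)

# Each entry records one collective: its name, sequence id, state
# (scheduled / started / completed) and, with TORCH_NCCL_TRACE_CPP_STACK=true,
# the captured call stack. Comparing the last entries across ranks shows
# which rank stopped issuing collectives.
for entry in trace.get("entries", []):
    print(entry.get("profiling_name"), entry.get("state"))
    for frame in entry.get("frames", [])[:3]:
        print("   ", frame)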