[Pitfall Notes] nvidia-smi detects the GPUs, but torch.cuda.is_available() throws an error: the fix that finally worked
❓ Problem description
On a multi-GPU server, running the following command:
nvidia-smi
shows GPU0–GPU2 normally, and GPU3 is also reported, although its state is clearly abnormal:
Unable to determine the device handle for GPU3: 0000:E3:00.0: Unknown Error
Sat May 24 17:55:27 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03 Driver Version: 575.51.03 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:17:00.0 Off | N/A |
| 41% 33C P8 33W / 350W | 15MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:65:00.0 Off | N/A |
| 48% 31C P8 25W / 350W | 15MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 Off | 00000000:CA:00.0 Off | N/A |
| 42% 34C P8 28W / 350W | 15MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2406 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2406 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2406 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
However, running the following in Python:
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
returns this instead:
UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
Initial troubleshooting
✅ The GPUs exist: nvidia-smi works normally
✅ PyTorch is the GPU build, not a +cpu wheel, and the same code had run successfully in this environment before (a quick check is sketched right after this list)
❌ torch.cuda fails to initialize
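A minimal way to confirm the second point, assuming the usual PyTorch wheel naming: a CPU-only wheel reports a version ending in +cpu and prints None for torch.version.cuda.
# GPU builds print a CUDA version (e.g. 12.x); +cpu builds print None
python -c "import torch; print(torch.__version__, torch.version.cuda)"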
Suspicion: one of the GPUs (GPU3 here) is in a bad hardware or driver state, and that alone is enough to make CUDA driver initialization fail for the whole process.
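Before touching the driver, one optional way to probe this suspicion (a sketch, not part of the fix below; it may not help if the driver state is already corrupted) is to hide the suspect card before PyTorch initializes CUDA:
# Expose only the healthy cards; PCI_BUS_ID ordering makes the indices match nvidia-smi
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2 python -c "import torch; print(torch.cuda.is_available())"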
Final solution: repair the CUDA driver state without rebooting
Install and run nvidia-modprobe:
sudo apt install nvidia-modprobe
sudo nvidia-modprobe
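nvidia-modprobe loads the NVIDIA kernel modules and (re)creates the /dev/nvidia* character device files; listing them is a quick sanity check that they exist again:
ls -l /dev/nvidia*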
Unload and reload the NVIDIA kernel module:
# Unload the UVM driver module
sudo rmmod nvidia_uvm
# Reload it
sudo modprobe nvidia_uvm
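If rmmod refuses because nvidia_uvm is in use, some process is still holding a CUDA context and has to exit first; fuser -v /dev/nvidia-uvm will show which one (an extra check, not required in the original steps). Afterwards, confirm the module is back and the driver still responds:
lsmod | grep nvidia
nvidia-smi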
Final verification: success 🎉
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
True
NVIDIA GeForce RTX 3090