How to configure and use NVIDIA GPUs in Kubernetes
0. Install the driver dependencies
0.1 Install CUDA
# Reference: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0
0.2 Install the driver
# Reference: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network
sudo apt-get install -y cuda-drivers
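Before moving on, it can help to confirm that the driver is actually usable (a reboot may be needed after installing it). A minimal sketch, assuming nvidia-smi was installed together with the driver; the helper name driver_version is made up for illustration and is not part of any NVIDIA tooling:

```shell
# Hypothetical helper: print the installed driver version, or fail if
# nvidia-smi is missing (for example, because the driver install did not finish).
driver_version() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: is the driver installed?" >&2
        return 1
    fi
    # --query-gpu and --format are standard nvidia-smi options; the output has
    # one line per GPU, so keep only the first.
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1
}
```

Calling driver_version on the GPU node should print a single version string; an error here means the later steps will not find a GPU either.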
1. Install the NVIDIA Container Toolkit
# Reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
sudo apt-get update && sudo apt-get install -y --no-install-recommends \
    curl \
    gnupg2
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.0-1
sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
Configure containerd and restart it
sudo nvidia-ctk runtime configure --runtime=containerd
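# By default, nvidia-ctk creates a drop-in configuration file at /etc/containerd/conf.d/99-nvidia.toml and modifies (or creates) /etc/containerd/config.toml so that its imports option picks up the drop-in file. The drop-in file is what lets containerd use the NVIDIA container runtime.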
sudo systemctl restart containerd
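To confirm that the configure step took effect, you can look for a runtime entry named nvidia in the containerd configuration. A rough sketch: has_nvidia_runtime is a hypothetical helper name, the default paths are the two files mentioned above, and the check assumes the runtime is declared in a TOML table whose name ends in runtimes.nvidia:

```shell
# Hypothetical helper: succeed if any of the given containerd config files
# declares a runtime named "nvidia" (a TOML table ending in runtimes.nvidia).
has_nvidia_runtime() {
    if [ "$#" -eq 0 ]; then
        # Defaults: the main config and the drop-in file written by nvidia-ctk.
        set -- /etc/containerd/config.toml /etc/containerd/conf.d/99-nvidia.toml
    fi
    # -s hides errors for files that do not exist; -q only sets the exit code.
    grep -qs 'runtimes\.nvidia' "$@"
}
```

Usage: has_nvidia_runtime && echo "nvidia runtime is registered"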
2. Configure the NVIDIA Kubernetes plugin
Reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
2.1 Create a RuntimeClass
The RuntimeClass is referenced from nvidia-device-plugin.yml (see 2.2).
cat <<EOF | kubectl create -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
Alternatively:
sudo nvidia-ctk runtime configure --runtime=containerd --nvidia-set-as-default # make the NVIDIA runtime the default
sudo systemctl restart containerd
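Which of the two approaches is in effect can be read from the merged containerd configuration. A sketch, assuming `containerd config dump` is available on the node (it may need root) and that the default runtime shows up under the key default_runtime_name; default_runtime is a made-up helper name:

```shell
# Hypothetical helper: print containerd's default runtime name as seen in the
# merged configuration (main config plus imported drop-in files).
default_runtime() {
    containerd config dump 2>/dev/null \
        | grep -m1 'default_runtime_name' \
        | cut -d= -f2 \
        | tr -d " \"'"    # strip spaces and either style of TOML quotes
}
```

If default_runtime prints nvidia, pods get the NVIDIA runtime without any extra field. If it prints something else (typically runc), pods have to name the RuntimeClass from 2.1 explicitly.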
2.2 Create the nvidia-device-plugin
Option 1:
# Note: this requires the NVIDIA runtime to be the default: nvidia-ctk runtime configure --runtime=containerd --nvidia-set-as-default
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml
Option 2:
# Download the YAML file
wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml
# Add the field runtimeClassName: nvidia to the YAML file
For example:
apiVersion: apps/v1
kind: DaemonSet
...
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  ...
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
...
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      runtimeClassName: nvidia ## add it here
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
        name: nvidia-device-plugin-ctr
Apply it:
kubectl create -f nvidia-device-plugin.yml
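The manual edit in option 2 can also be scripted. A minimal sketch, assuming the downloaded manifest contains a priorityClassName line (as in the excerpt above) and does not already set runtimeClassName; add_runtime_class is a hypothetical helper, not something shipped with the plugin:

```shell
# Hypothetical helper: read a manifest on stdin and write it to stdout with
# "runtimeClassName: <name>" added right after the priorityClassName line,
# reusing that line's indentation so the YAML nesting stays the same.
add_runtime_class() {
    awk -v rc="${1:-nvidia}" '
        { print }
        /^[ \t]*priorityClassName:/ {
            # RLENGTH is the width of the leading whitespace on this line.
            match($0, /^[ \t]*/)
            printf "%sruntimeClassName: %s\n", substr($0, 1, RLENGTH), rc
        }
    '
}
```

Usage: add_runtime_class nvidia < nvidia-device-plugin.yml | kubectl create -f -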
3. Verify
# 1. Inspect the nvidia-device-plugin pod
kubectl describe pod nvidia-device-plugin-daemonset-sm24n -n kube-system
Output:
Name: nvidia-device-plugin-daemonset-sm24n
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Runtime Class Name: nvidia
...
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  27s   default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-sm24n to master
  Normal  Pulled     26s   kubelet            Container image "nvcr.io/nvidia/k8s-device-plugin:v0.17.1" already present on machine
  Normal  Created    26s   kubelet            Created container nvidia-device-plugin-ctr
  Normal  Started    26s   kubelet            Started container nvidia-device-plugin-ctr
# 2. Check whether the node now exposes the NVIDIA resource
kubectl describe node master
Output:
Name: master
Roles: control-plane
Labels: beta.kubernetes.io/arch=amd64
        beta.kubernetes.io/os=linux
        feature.node.kubernetes.io/cpu-cpuid.ADX=true
        feature.node.kubernetes.io/cpu-cpuid.AESNI=true
        feature.node.kubernetes.io/cpu-cpuid.AVX=true
        feature.node.kubernetes.io/cpu-cpuid.AVX2=true
....
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                2100m (6%)   1900m (5%)
  memory             3088Mi (9%)  8696Mi (27%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  nvidia.com/gpu     0            0            # the NVIDIA resource entry
# 3. If the GPU is available, run the official test workload on it
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
# Check the result with kubectl logs
kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
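The node check in step 2 can be made scriptable. Note that the "Allocated resources" row only shows how many GPUs pods have requested; whether a GPU can be scheduled at all depends on the node's allocatable count. A sketch using kubectl's jsonpath output, with made-up helper names node_gpu_count and node_has_gpu:

```shell
# Hypothetical helper: print how many nvidia.com/gpu devices the node reports
# as allocatable; prints nothing if the resource is not registered.
node_gpu_count() {
    # The dot in "nvidia.com" has to be escaped inside the jsonpath expression.
    kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Succeed only when the node advertises at least one GPU.
node_has_gpu() {
    count=$(node_gpu_count "$1")
    [ -n "$count" ] && [ "$count" -gt 0 ] 2>/dev/null
}
```

Usage: node_has_gpu master || echo "no GPU registered on master yet"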
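4. Common problems
nvidia-device-plugin finds no usable GPU
kubectl describe on the nvidia-device-plugin pod reports that no usable GPU was found.
If the driver and the container toolkit are both installed correctly, the cause is usually the container runtime: the plugin pod is not being started with the NVIDIA runtime.
Fix it by creating a RuntimeClass, or by passing --nvidia-set-as-default to nvidia-ctk; see step 2.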
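The same applies to workload pods: if you took the RuntimeClass route without making the NVIDIA runtime the default, the pod that requests the GPU has to name the RuntimeClass as well. A sketch of the test pod from step 3 with that field added; gpu_pod_manifest is a hypothetical helper, and the class name is assumed to be nvidia as created in 2.1:

```shell
# Hypothetical helper: print the test pod manifest, pinned to a RuntimeClass
# (first argument, default "nvidia").
gpu_pod_manifest() {
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  runtimeClassName: ${1:-nvidia}
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
}
```

Usage: gpu_pod_manifest nvidia | kubectl apply -f -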
gpu-pod reports an error
kubectl logs gpu-pod
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]
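Cause: a version mismatch. The cuda-sample:vectoradd-cuda12.5.0 image is built against CUDA 12.5, and the installed driver only supports an older CUDA version. Upgrade the driver, or use a sample image built for a CUDA version the driver supports.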
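This mismatch can be caught before the pod is created by comparing the CUDA version in the image tag with the highest CUDA version the driver supports. A rough sketch with made-up helper names; it assumes the image tag embeds the version as cudaX.Y.Z, that nvidia-smi prints a "CUDA Version: X.Y" field in its banner, and that sort -V (GNU coreutils) is available:

```shell
# Hypothetical helper: extract "major.minor" of the CUDA version from an image
# reference whose tag embeds it, e.g. ...:vectoradd-cuda12.5.0 gives 12.5
image_cuda_version() {
    printf '%s\n' "$1" | sed -n 's/.*cuda\([0-9][0-9.]*\).*/\1/p' | cut -d. -f1,2
}

# Hypothetical helper: highest CUDA version the installed driver supports,
# taken from the "CUDA Version: X.Y" field in the nvidia-smi banner.
driver_cuda_version() {
    nvidia-smi | sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p' | head -n1
}

# Succeed if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}
```

Usage: version_ge "$(driver_cuda_version)" "$(image_cuda_version nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0)" || echo "driver too old for this image"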
