
How to configure and use NVIDIA GPUs in k8s

0. Install the driver dependencies

0.1 Install CUDA

# Reference: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0

0.2 Install the driver

# Reference: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network
sudo apt-get install -y cuda-drivers
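
After the driver packages are installed, a reboot is usually needed before the kernel module loads. The sketch below is one way to confirm the driver is usable on a node. The function name check_driver is made up for illustration; nvidia-smi and its --query-gpu option ship with the driver.

```shell
# check_driver: print GPU name and driver version, or explain why it cannot.
# (check_driver is a hypothetical helper name, not part of any NVIDIA tool.)
check_driver() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: the driver is not installed or not on PATH"
    return 1
  fi
  # Prints one CSV line per GPU: name, driver version
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}
```

Run check_driver on every GPU node before continuing. If it fails, the later steps will not find a GPU either.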

1. Install the NVIDIA Container Toolkit

# Reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
sudo apt-get update && sudo apt-get install -y --no-install-recommends \
    curl \
    gnupg2
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.0-1
sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}

Configure and restart containerd

sudo nvidia-ctk runtime configure --runtime=containerd
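# By default, nvidia-ctk creates a drop-in config file, /etc/containerd/conf.d/99-nvidia.toml,
# and modifies (or creates) /etc/containerd/config.toml so that its imports option is updated to match.
# The drop-in file is what lets containerd use the NVIDIA container runtime.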
sudo systemctl restart containerd
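
To confirm that the restart picked up the new configuration, the merged containerd config can be searched for a runtime entry named nvidia. This is a minimal sketch: has_nvidia_runtime is a hypothetical helper name, and it assumes the runtime appears under a table whose name contains containerd.runtimes.nvidia, which may differ on unusual setups.

```shell
# has_nvidia_runtime: read a containerd config on stdin and succeed if it
# defines a runtime named "nvidia". (Hypothetical helper name.)
has_nvidia_runtime() {
  grep -q 'containerd\.runtimes\.nvidia'
}

# Usage on a node:
#   sudo containerd config dump | has_nvidia_runtime && echo "nvidia runtime configured"
```

Reading the output of containerd config dump rather than config.toml itself means the imported drop-in file is included in the check.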

2. Configure the NVIDIA k8s device plugin

Reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

2.1 Create a RuntimeClass

The RuntimeClass is referenced from nvidia-device-plugin.yml in step 2.2.

cat <<EOF | kubectl create -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

Alternatively, make nvidia the default runtime:

sudo nvidia-ctk runtime configure --runtime=containerd --nvidia-set-as-default # use the nvidia runtime by default
sudo systemctl restart containerd
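
Which of the two routes is in effect can be checked from the node. The sketch below extracts default_runtime_name from a containerd config; default_runtime is a hypothetical helper name, and the parsing assumes the usual key = "value" layout of the dumped config.

```shell
# default_runtime: read a containerd config on stdin and print the value of
# default_runtime_name, if present. (Hypothetical helper name.)
default_runtime() {
  grep -m1 'default_runtime_name' | cut -d= -f2 | tr -d " \"'"
}

# Usage on a node:
#   sudo containerd config dump | default_runtime
```

After running nvidia-ctk with --nvidia-set-as-default this should print nvidia. If you took the RuntimeClass route instead, kubectl get runtimeclass nvidia should list the object created above.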

2.2 Deploy nvidia-device-plugin

Option 1:

# Note: this requires nvidia to be the default runtime: nvidia-ctk runtime configure --runtime=containerd --nvidia-set-as-default
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml

Option 2:

# Download the yaml file
wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml

# Add the field runtimeClassName: nvidia to the yaml file
For example:
apiVersion: apps/v1
kind: DaemonSet
...
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
    spec:                            # pod spec inside the template
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
...
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      runtimeClassName: nvidia     ## add it here
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
        name: nvidia-device-plugin-ctr

Apply it:

kubectl create -f nvidia-device-plugin.yml
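
If the DaemonSet was already created from the unmodified upstream manifest, a merge patch is one way to add the field without editing the file. This is a sketch, not the method the plugin documentation prescribes: the helper names are hypothetical, and the DaemonSet name nvidia-device-plugin-daemonset in kube-system is inferred from the pod name shown in the verification step.

```shell
# patch_runtime_class: set runtimeClassName on the device plugin DaemonSet.
# (Hypothetical helper name; DaemonSet name and namespace are assumptions
# based on the pod name nvidia-device-plugin-daemonset-xxxxx in kube-system.)
patch_runtime_class() {
  kubectl -n kube-system patch daemonset nvidia-device-plugin-daemonset \
    --type merge \
    -p '{"spec":{"template":{"spec":{"runtimeClassName":"nvidia"}}}}'
}

# wait_for_plugin: block until the plugin pods are rolled out, or time out.
wait_for_plugin() {
  kubectl -n kube-system rollout status daemonset/nvidia-device-plugin-daemonset --timeout=120s
}
```

Calling patch_runtime_class followed by wait_for_plugin restarts the plugin pods with the nvidia RuntimeClass and returns once they are ready.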

3. Verify

# 1. Inspect the nvidia-device-plugin pod
kubectl describe pod nvidia-device-plugin-daemonset-sm24n -n kube-system
Result:
Name:                 nvidia-device-plugin-daemonset-sm24n
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
...
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  27s   default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-sm24n to master
  Normal  Pulled     26s   kubelet            Container image "nvcr.io/nvidia/k8s-device-plugin:v0.17.1" already present on machine
  Normal  Created    26s   kubelet            Created container nvidia-device-plugin-ctr
  Normal  Started    26s   kubelet            Started container nvidia-device-plugin-ctr

# 2. Check whether the node now exposes the nvidia resource
kubectl describe node master
Result:
Name:               master
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
....
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                2100m (6%)   1900m (5%)
  memory             3088Mi (9%)  8696Mi (27%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  nvidia.com/gpu     0            0             # the nvidia resource

# 3. If the GPU is available, run the official sample pod to exercise it
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

# Check the result with logs
kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
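
The node and pod checks above can be condensed into two one-line queries. The helper names below are hypothetical. Note that the first one reports the allocatable GPU count per node, which is related to but not the same as the "Allocated resources" table shown by kubectl describe.

```shell
# list_gpu_nodes: print each node with its allocatable nvidia.com/gpu count.
# The dot in "nvidia.com" is escaped so kubectl treats it as part of the key.
list_gpu_nodes() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# gpu_pod_phase: print the phase of the sample pod from step 3.
gpu_pod_phase() {
  kubectl get pod gpu-pod -o 'jsonpath={.status.phase}'
}
```

A node without a working plugin shows <none> in the GPU column. Because the sample pod uses restartPolicy: Never, gpu_pod_phase should settle on Succeeded once the vector addition has passed; the pod can then be removed with kubectl delete pod gpu-pod.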

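4. Common problems

nvidia-device-plugin finds no available GPU

kubectl describe on the nvidia-device-plugin pod reports that no usable GPU was found.
If the driver and the container runtime are both installed correctly, the cause is usually the runtime configuration: the plugin pod is not being started with the nvidia runtime.

Fix it by creating a RuntimeClass, or by passing --nvidia-set-as-default to nvidia-ctk. See step 2.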
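
Before changing the configuration, it can help to see what the plugin itself reports and which RuntimeClass its pods actually run with. A minimal sketch with hypothetical helper names; the label name=nvidia-device-plugin-ds is the one used in the DaemonSet manifest from step 2.2.

```shell
# plugin_logs: print recent logs from all device plugin pods.
plugin_logs() {
  kubectl -n kube-system logs -l name=nvidia-device-plugin-ds --tail=50
}

# plugin_runtime_class: show the RuntimeClass each plugin pod was started with.
plugin_runtime_class() {
  kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds \
    -o 'custom-columns=POD:.metadata.name,RUNTIMECLASS:.spec.runtimeClassName'
}
```

If plugin_runtime_class shows <none> and nvidia is not the default containerd runtime, the plugin container runs under the ordinary runtime, which matches the symptom described above.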

gpu-pod fails with an error

kubectl logs gpu-pod
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]

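This is a version problem: the driver on the node is too old for the CUDA runtime in the image (cuda-sample:vectoradd-cuda12.5.0). Either upgrade the driver or use a sample image built for an older CUDA version.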
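
A rough way to spot this mismatch in advance is to compare the "CUDA Version" that nvidia-smi prints in its header with the CUDA version in the image tag. The sketch below does that with two hypothetical helpers. Treat the result as a hint only: CUDA's minor-version compatibility may let some older drivers within the same major version run the image anyway.

```shell
# driver_cuda_version: read `nvidia-smi` output on stdin and print the
# highest CUDA version the driver reports. (Hypothetical helper name.)
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# version_ge A B: succeed if version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a node, for the vectoradd-cuda12.5.0 image:
#   version_ge "$(nvidia-smi | driver_cuda_version)" 12.5 || echo "driver may be too old"
```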
