
OpenShift AI - NVIDIA GPU Sharing and Partitioning Technologies Supported on OpenShift (2)

OpenShift / RHEL / DevSecOps Series Index
Note: the steps in this article have been verified on OpenShift 4.18.

Table of Contents

  • Environment
  • Deploying the Test Application
  • Time-slicing
  • MPS
  • MIG
  • References

Environment

The OpenShift cluster used in this article has 2 worker nodes, each with one NVIDIA L4 card, so the cluster has 2 physical GPUs in total.
Note that the NVIDIA L4 does not support MIG. If you have a MIG-capable GPU, refer to the official NVIDIA documentation to configure it.
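Before choosing a sharing strategy, it can help to list the GPU inventory and MIG capability up front. The command below is a minimal sketch; it assumes the NVIDIA GPU Operator and GPU Feature Discovery are already installed, so the nvidia.com/* node labels exist.

$ oc get node -l nvidia.com/gpu.present=true \
    -L nvidia.com/gpu.product -L nvidia.com/gpu.count -L nvidia.com/mig.capable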

Deploying the Test Application

  1. Review https://github.com/liuxiaoyu-git/gpu-partitioning-guide/blob/main/gpu-sharing-tests/base/nvidia-plugin-test-deployment.yaml and confirm that each pod of the Deployment requests 1 GPU and runs a process named dcgmproftester12 (a simplified sketch of such a Deployment follows the commands below).
  2. Run the following commands to deploy the GPU test application nvidia-plugin-test.
git clone https://github.com/liuxiaoyu-git/gpu-partitioning-guide && cd gpu-partitioning-guide/
oc new-project demo
oc apply -k gpu-sharing-tests/overlays/default/
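For orientation, a Deployment that requests one GPU per pod generally looks like the sketch below. This is not the manifest from the repository; the image and command line are assumptions made for illustration, and the essential part is only the nvidia.com/gpu resource limit.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvidia-plugin-test
  labels:
    app: nvidia-plugin-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nvidia-plugin-test
  template:
    metadata:
      labels:
        app: nvidia-plugin-test
    spec:
      containers:
      - name: dcgmproftester
        # Assumed image; the repository's manifest defines the real one.
        image: nvcr.io/nvidia/cloud-native/dcgm:latest
        # Assumed arguments: run a compute-heavy dcgmproftester12 workload for a long time.
        command: ["/bin/sh", "-c"]
        args: ["/usr/bin/dcgmproftester12 --no-dcgm-validation -t 1004 -d 3600"]
        resources:
          limits:
            nvidia.com/gpu: 1    # each pod asks the device plugin for one GPU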

Time-slicing

  1. Scale the nvidia-plugin-test Deployment to 2 pods. Since the cluster has 2 physical GPUs, both pods run normally.
$ oc scale deployment/nvidia-plugin-test --replicas=2 -n demo
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-5ctlb   1/1     Running   0          10m
nvidia-plugin-test-bc49df6b8-d5hf7   1/1     Running   0          10m
  2. Scale the nvidia-plugin-test Deployment to 4 pods. Since the cluster only has 2 physical GPUs, only 2 of the pods can run; the other 2 remain in the Pending state.
$ oc scale deployment/nvidia-plugin-test --replicas=4 -n demo
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-5ctlb   1/1     Running   0          10m
nvidia-plugin-test-bc49df6b8-d5hf7   1/1     Running   0          10m
nvidia-plugin-test-bc49df6b8-k8kbc   0/1     Pending   0          109s
nvidia-plugin-test-bc49df6b8-mqnwz   0/1     Pending   0          109s
  3. The pod events show FailedScheduling caused by "Insufficient nvidia.com/gpu" (an alternative way to view the same events is sketched after the output below).
$ oc get event -n demo | grep pod/$(oc get pod --field-selector=status.phase=Pending -ojsonpath={.items[0].metadata.name})
5s          Warning   FailedScheduling           pod/nvidia-plugin-test-bc49df6b8-grjzj    0/5 nodes are available: 2 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
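The same scheduling events can also be read from the tail of oc describe pod for any Pending pod; a sketch using the same field selector:

$ oc describe pod -n demo \
    $(oc get pod -n demo --field-selector=status.phase=Pending -o jsonpath='{.items[0].metadata.name}') | tail -n 5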
  4. Enable the time-slicing strategy for the GPUs, splitting each GPU in two, and confirm that the ClusterPolicy's device plugin config is now time-sliced (an equivalent manual patch is sketched after the config output below).
$ oc apply -k gpu-sharing-instance/instance/overlays/time-sliced-2
$ oc get ClusterPolicy gpu-cluster-policy -ojsonpath={.spec.devicePlugin} | jq
{"config": {"default": "time-sliced","name": "device-plugin-config"},"enabled": true
}$ oc get cm device-plugin-config -n nvidia-gpu-operator -oyaml
...
data:
  no-time-sliced: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 0
  time-sliced: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
...
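If the kustomize overlays from the repository are not used, roughly the same result can be obtained by creating the ConfigMap shown above and pointing the ClusterPolicy at its time-sliced entry. The patch below is a sketch of that idea; the field names match the devicePlugin config printed above, but the overlay may set additional options.

$ oc patch clusterpolicy gpu-cluster-policy --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "device-plugin-config", "default": "time-sliced"}}}}'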
  5. Confirm that all 4 pods are now running.
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-5ctlb   1/1     Running   0          10m
nvidia-plugin-test-bc49df6b8-d5hf7   1/1     Running   0          10m
nvidia-plugin-test-bc49df6b8-k8kbc   1/1     Running   0          109s
nvidia-plugin-test-bc49df6b8-mqnwz   1/1     Running   0          109s
  6. Check the GPU-related labels and status of the worker nodes. Confirm that nvidia.com/gpu.product is NVIDIA-L4-SHARED, nvidia.com/gpu.replicas is 2, nvidia.com/gpu.sharing-strategy is time-slicing, and that the GPU Capacity, Allocatable and Allocated resources are all 2 (a more compact label query is sketched after the output below).
$ oc describe node -l node-role.kubernetes.io/worker | egrep 'Name:|Capacity|nvidia.com/|Allocatable:|Allocated resources'
Name:               ip-10-0-36-231.us-east-2.compute.internal
                    nvidia.com/cuda.driver-version.full=550.144.03
                    nvidia.com/cuda.driver-version.major=550
                    nvidia.com/cuda.driver-version.minor=144
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=550
                    nvidia.com/cuda.driver.minor=144
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.4
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=4
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=4
                    nvidia.com/gfd.timestamp=1746009841
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=g6.4xlarge
                    nvidia.com/gpu.memory=23034
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-L4-SHARED
                    nvidia.com/gpu.replicas=2
                    nvidia.com/gpu.sharing-strategy=time-slicing
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=false
                    nvidia.com/vgpu.present=false
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:     2
Allocatable:
  nvidia.com/gpu:     2
Allocated resources:
  nvidia.com/gpu     2              2
Name:               ip-10-0-9-1.us-east-2.compute.internal
                    nvidia.com/cuda.driver-version.full=550.144.03
                    nvidia.com/cuda.driver-version.major=550
                    nvidia.com/cuda.driver-version.minor=144
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=550
                    nvidia.com/cuda.driver.minor=144
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.4
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=4
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=4
                    nvidia.com/gfd.timestamp=1746009846
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=g6.4xlarge
                    nvidia.com/gpu.memory=23034
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-L4-SHARED
                    nvidia.com/gpu.replicas=2
                    nvidia.com/gpu.sharing-strategy=time-slicing
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=false
                    nvidia.com/vgpu.present=false
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:     2
Allocatable:
  nvidia.com/gpu:     2
Allocated resources:
  nvidia.com/gpu     2              2
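The three labels called out above can also be printed as columns, which is easier to scan than grepping the full describe output; a sketch:

$ oc get node -l nvidia.com/gpu.present=true \
    -L nvidia.com/gpu.product -L nvidia.com/gpu.replicas -L nvidia.com/gpu.sharing-strategy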
  7. In the nvidia-driver pod on each of the 2 worker nodes, check the GPU status and confirm that each GPU is running 2 dcgmproftester12 processes (a compact per-process query is sketched after the output below).
$ oc exec -n nvidia-gpu-operator $(oc get pod -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver -o jsonpath="{.items[0].metadata.name}") -- nvidia-smi
Wed Apr 30 10:47:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:35:00.0 Off |                    0 |
| N/A   79C    P0             71W /   72W |     591MiB /  23034MiB |     99%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     39955      C   /usr/bin/dcgmproftester12                     582MiB |
|    0   N/A  N/A     40746      C   /usr/bin/dcgmproftester12                     582MiB |
+-----------------------------------------------------------------------------------------+
$ oc exec -n nvidia-gpu-operator $(oc get pod -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver -o jsonpath="{.items[1].metadata.name}") -- nvidia-smi
Wed Apr 30 10:47:47 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:35:00.0 Off |                    0 |
| N/A   81C    P0             71W /   72W |     591MiB /  23034MiB |     99%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     73271      C   /usr/bin/dcgmproftester12                     582MiB |
|    0   N/A  N/A     74911      C   /usr/bin/dcgmproftester12                     582MiB |
+-----------------------------------------------------------------------------------------+
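To check both driver pods in one pass, the per-process view can also be queried non-interactively. The loop below is a sketch that uses standard nvidia-smi query fields:

$ for p in $(oc get pod -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver -o name); do
    oc exec -n nvidia-gpu-operator "$p" -- nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
  done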
  8. Scale the nvidia-plugin-test Deployment to 6 pods and confirm that 2 of them are Pending because no GPU resources are left for them.
$ oc scale deployment/nvidia-plugin-test --replicas=6 -n demo
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-2gm4j   0/1     Pending   0          4m31s
nvidia-plugin-test-bc49df6b8-7gv2w   0/1     Pending   0          4m31s
nvidia-plugin-test-bc49df6b8-5ctlb   1/1     Running   0          24m
nvidia-plugin-test-bc49df6b8-d5hf7   1/1     Running   0          24m
nvidia-plugin-test-bc49df6b8-k8kbc   1/1     Running   0          15m
nvidia-plugin-test-bc49df6b8-mqnwz   1/1     Running   0          15m
  9. Change the GPU time-slicing strategy to split each GPU into four (a sketch of the expected config change follows the command).
$ oc apply -k gpu-sharing-instance/instance/overlays/time-sliced-4
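Based on the time-sliced-2 config shown earlier, the only material difference in the time-sliced-4 overlay should be the replica count. The snippet below is an assumption about what that overlay produces, not a copy of it:

time-sliced: |-
  version: v1
  sharing:
    timeSlicing:
      resources:
      - name: nvidia.com/gpu
        replicas: 4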
  10. After toggling the ClusterPolicy's /spec/devicePlugin/enabled off and back on, confirm that all 6 pods are Running (a way to watch the device plugin roll out is sketched after the output below).
$ oc patch ClusterPolicy gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/devicePlugin/enabled", "value": false}]'
$ oc patch ClusterPolicy gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/devicePlugin/enabled", "value": true}]'
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-2gm4j   1/1     Running   0          6m21s
nvidia-plugin-test-bc49df6b8-7gv2w   1/1     Running   0          6m21s
nvidia-plugin-test-bc49df6b8-5ctlb   1/1     Running   0          26m
nvidia-plugin-test-bc49df6b8-d5hf7   1/1     Running   0          26m
nvidia-plugin-test-bc49df6b8-k8kbc   1/1     Running   0          17m
nvidia-plugin-test-bc49df6b8-mqnwz   1/1     Running   0          17m
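Toggling the flag recreates the device plugin DaemonSet so that it re-reads the config. To watch the rollout and confirm the new replica count on the nodes, something along these lines can be used (the DaemonSet name below is the one the GPU Operator normally creates; adjust it if your cluster differs):

$ oc rollout status ds/nvidia-device-plugin-daemonset -n nvidia-gpu-operator
$ oc get node -l nvidia.com/gpu.present=true -L nvidia.com/gpu.replicas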

MPS

  1. Scale the nvidia-plugin-test Deployment down to 2 pods.
$ oc scale deployment/nvidia-plugin-test --replicas=2 -n demo
$ oc get pod -n demo
NAME                                 READY   STATUS        RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-sl2kh   1/1     Running       0          30m
nvidia-plugin-test-bc49df6b8-zppf8   1/1     Running       0          37m
  2. Enable the MPS strategy for the GPUs, splitting each GPU in two. Note: MPS uses a DaemonSet as its control process, so observe how that DaemonSet behaves on the two worker nodes once MPS is enabled (a way to see which nodes its pods land on is sketched after the output below).
$ oc get daemonset nvidia-device-plugin-mps-control-daemon -n nvidia-gpu-operator
NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                         AGE
nvidia-device-plugin-mps-control-daemon         0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true                                                  73m
$ oc apply -k gpu-sharing-instance/instance/overlays/mps-2
configmap/device-plugin-config configured
configmap/nvidia-dcgm-exporter-dashboard-c7bf99fb7g unchanged
clusterpolicy.nvidia.com/gpu-cluster-policy configured
$ oc get daemonset nvidia-device-plugin-mps-control-daemon -n nvidia-gpu-operator
NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                         AGE
nvidia-device-plugin-mps-control-daemon         2         2         2       2            2           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true                                                  7h54m
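To confirm that one MPS control pod ends up on each worker node, listing the pods with -o wide shows their nodes; a sketch that filters by name, since the pod labels are not shown above:

$ oc get pod -n nvidia-gpu-operator -o wide | grep mps-control-daemon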
  3. Confirm that the ClusterPolicy's device plugin config is now mps-sliced (an equivalent manual patch is sketched after the config output below).
$ oc get ClusterPolicy gpu-cluster-policy -ojsonpath={.spec.devicePlugin} | jq
{"config": {"default": "mps-sliced","name": "device-plugin-config"},"enabled": true
}$ oc get cm device-plugin-config -n nvidia-gpu-operator -oyaml
...
data:
  mps-sliced: |-
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
  no-mps-sliced: |-
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 0
...
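As with time-slicing, the overlay essentially points the ClusterPolicy at the mps-sliced entry of the ConfigMap. A patch along these lines (a sketch) should have the same effect without kustomize:

$ oc patch clusterpolicy gpu-cluster-policy --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "device-plugin-config", "default": "mps-sliced"}}}}'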
  4. Check the GPU-related labels and status of the worker nodes. Confirm that nvidia.com/gpu.product is NVIDIA-L4-SHARED, nvidia.com/gpu.replicas is 2, nvidia.com/gpu.sharing-strategy is mps and nvidia.com/mps.capable is true. GPU Capacity and Allocatable are both 2, while Allocated resources shows 1 on each node because only one test pod is currently scheduled per node.
$ oc describe node -l node-role.kubernetes.io/worker | egrep 'Name:|Capacity|nvidia.com/|Allocatable:|Allocated resources'
Name:               ip-10-0-36-231.us-east-2.compute.internal
                    nvidia.com/cuda.driver-version.full=550.144.03
                    nvidia.com/cuda.driver-version.major=550
                    nvidia.com/cuda.driver-version.minor=144
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=550
                    nvidia.com/cuda.driver.minor=144
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.4
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=4
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=4
                    nvidia.com/gfd.timestamp=1746005265
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=g6.4xlarge
                    nvidia.com/gpu.memory=23034
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-L4-SHARED
                    nvidia.com/gpu.replicas=2
                    nvidia.com/gpu.sharing-strategy=mps
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=true
                    nvidia.com/vgpu.present=false
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:     2
Allocatable:
  nvidia.com/gpu:     2
Allocated resources:
  nvidia.com/gpu     1              1
Name:               ip-10-0-9-1.us-east-2.compute.internal
                    nvidia.com/cuda.driver-version.full=550.144.03
                    nvidia.com/cuda.driver-version.major=550
                    nvidia.com/cuda.driver-version.minor=144
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=550
                    nvidia.com/cuda.driver.minor=144
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.4
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=4
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=4
                    nvidia.com/gfd.timestamp=1746005260
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=g6.4xlarge
                    nvidia.com/gpu.memory=23034
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-L4-SHARED
                    nvidia.com/gpu.replicas=2
                    nvidia.com/gpu.sharing-strategy=mps
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=true
                    nvidia.com/vgpu.present=false
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:     2
Allocatable:
  nvidia.com/gpu:     2
Allocated resources:
  nvidia.com/gpu     1              1
  5. Scale the nvidia-plugin-test Deployment to 4 pods and confirm that all of them run normally.
$ oc scale deployment/nvidia-plugin-test --replicas=4 -n demo
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-7876r   1/1     Running   0          6s
nvidia-plugin-test-bc49df6b8-sl2kh   1/1     Running   0          34m
nvidia-plugin-test-bc49df6b8-w2z9g   1/1     Running   0          6s
nvidia-plugin-test-bc49df6b8-zppf8   1/1     Running   0          41m
  6. In the nvidia-driver pod on each worker node, check the GPU status and confirm that besides the dcgmproftester12 processes there is also an nvidia-cuda-mps-server process (a note on the MPS client processes follows the output below).
$ oc exec -n nvidia-gpu-operator $(oc get pod -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver -o jsonpath="{.items[0].metadata.name}") -- nvidia-smi
Wed Apr 30 10:08:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:35:00.0 Off |                    0 |
| N/A   80C    P0             72W /   72W |    1033MiB /  23034MiB |    100%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     85231      C   nvidia-cuda-mps-server                         28MiB |
|    0   N/A  N/A     99955    M+C   /usr/bin/dcgmproftester12                     498MiB |
|    0   N/A  N/A     99958    M+C   /usr/bin/dcgmproftester12                     498MiB |
+-----------------------------------------------------------------------------------------+
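The M+C type in the process list marks MPS client processes, which reach the GPU through the shared nvidia-cuda-mps-server rather than owning their own GPU context. A quick way to see what MPS-related settings a client container received is to grep its environment; this is only a sketch, and which variables show up depends on the device plugin version:

$ oc exec -n demo deploy/nvidia-plugin-test -- env | grep -i mps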
  7. Scale the nvidia-plugin-test Deployment to 6 pods and confirm that 2 of them are in the Pending state.
$ oc scale deployment/nvidia-plugin-test --replicas=6 -n demo
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-7876r   1/1     Running   0          6s
nvidia-plugin-test-bc49df6b8-sl2kh   1/1     Running   0          34m
nvidia-plugin-test-bc49df6b8-w2z9g   1/1     Running   0          6s
nvidia-plugin-test-bc49df6b8-zppf8   1/1     Running   0          41m
nvidia-plugin-test-bc49df6b8-h6jj5   0/1     Pending   0          1m55s
nvidia-plugin-test-bc49df6b8-plm5n   0/1     Pending   0          1m55s
  8. The pod events show FailedScheduling caused by "Insufficient nvidia.com/gpu".
$ oc get event -n demo | grep pod/$(oc get pod --field-selector=status.phase=Pending -ojsonpath={.items[0].metadata.name})
3m24s       Warning   FailedScheduling    pod/nvidia-plugin-test-bc49df6b8-cps8h     0/5 nodes are available: 2 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
  9. Change the GPU MPS strategy to split each GPU into four.
$ oc apply -k gpu-sharing-instance/instance/overlays/mps-4
  10. After toggling the ClusterPolicy's /spec/devicePlugin/enabled off and back on, confirm that all 6 pods are Running.
$ oc patch ClusterPolicy gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/devicePlugin/enabled", "value": false}]'
$ oc patch ClusterPolicy gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/devicePlugin/enabled", "value": true}]'
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-7876r   1/1     Running   0          7m26s
nvidia-plugin-test-bc49df6b8-h6jj5   1/1     Running   0          4m55s
nvidia-plugin-test-bc49df6b8-plm5n   1/1     Running   0          4m55s
nvidia-plugin-test-bc49df6b8-sl2kh   1/1     Running   0          42m
nvidia-plugin-test-bc49df6b8-w2z9g   1/1     Running   0          7m26s
nvidia-plugin-test-bc49df6b8-zppf8   1/1     Running   0          49m

MIG

See:
MIG Support in OpenShift Container Platform
The benefits of dynamic GPU slicing in OpenShift

References

https://github.com/liuxiaoyu-git/gpu-partitioning-guide
https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/time-slicing-gpus-in-openshift.html
