OpenShift AI - NVIDIA GPU Sharing and Partitioning Technologies Supported by OpenShift (Part 2)
Note: the steps in this article have been verified on OpenShift 4.18.
Table of Contents
- Environment
- Deploying the Test Application
- Time-slicing
- MPS
- MIG
- References
Environment
The OpenShift cluster used here has 2 worker nodes, each with one NVIDIA L4 card, for a total of 2 physical GPUs.
Note that the NVIDIA L4 does not support MIG. If you have a MIG-capable GPU, refer to the NVIDIA documentation for MIG configuration.
Deploying the Test Application
- Review https://github.com/liuxiaoyu-git/gpu-partitioning-guide/blob/main/gpu-sharing-tests/base/nvidia-plugin-test-deployment.yaml and confirm that each pod of the Deployment requests 1 GPU and runs a process named dcgmproftester12 (a sketch of the relevant part of the manifest follows the commands below).
- Run the following commands to deploy the GPU test application nvidia-plugin-test.
git clone https://github.com/liuxiaoyu-git/gpu-partitioning-guide && cd gpu-partitioning-guide/
oc new-project demo
oc apply -k gpu-sharing-tests/overlays/default/
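For orientation, the essential part of that manifest is the container's GPU resource request. The following is only a simplified sketch reconstructed from the description above, not a copy of the file in the repository; the image reference and the dcgmproftester12 arguments are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvidia-plugin-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nvidia-plugin-test
  template:
    metadata:
      labels:
        app: nvidia-plugin-test
    spec:
      containers:
      - name: nvidia-plugin-test
        image: <dcgm-test-image>                 # placeholder: image that ships dcgmproftester12
        command: ["/usr/bin/dcgmproftester12"]   # load generator named in the article
        args: ["--no-dcgm-validation", "-t", "1004", "-d", "300"]   # illustrative arguments
        resources:
          limits:
            nvidia.com/gpu: 1                    # each pod requests exactly 1 GPU
With nvidia.com/gpu: 1 in the limits, each replica consumes one schedulable GPU resource, which is exactly what the scaling tests below exercise.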
Time-slicing
- Scale the nvidia-plugin-test deployment to 2 pods. Since the cluster has 2 physical GPUs, both pods can run normally.
$ oc scale deployment/nvidia-plugin-test --replicas=2 -n demo
$ oc get pod -n demo
NAME READY STATUS RESTARTS AGE
nvidia-plugin-test-bc49df6b8-5ctlb 1/1 Running 0 10m
nvidia-plugin-test-bc49df6b8-d5hf7 1/1 Running 0 10m
- Scale the deployment to 4 pods. Because the cluster has only 2 physical GPUs, only 2 pods can run; the other 2 stay in Pending.
$ oc scale deployment/nvidia-plugin-test --replicas=4 -n demo
$ oc get pod -n demo
NAME READY STATUS RESTARTS AGE
nvidia-plugin-test-bc49df6b8-5ctlb 1/1 Running 0 10m
nvidia-plugin-test-bc49df6b8-d5hf7 1/1 Running 0 10m
nvidia-plugin-test-bc49df6b8-k8kbc 0/1 Pending 0 109s
nvidia-plugin-test-bc49df6b8-mqnwz 0/1 Pending 0 109s
- The events of a Pending pod show that scheduling failed (FailedScheduling) due to "Insufficient nvidia.com/gpu".
$ oc get event -n demo | grep pod/$(oc get pod --field-selector=status.phase=Pending -ojsonpath={.items[0].metadata.name})
5s Warning FailedScheduling pod/nvidia-plugin-test-bc49df6b8-grjzj 0/5 nodes are available: 2 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
- Enable the time-slicing strategy so that each GPU is split in two, then confirm that the ClusterPolicy device plugin now uses the time-sliced config (a sketch of the resulting ClusterPolicy fragment follows the output below).
$ oc apply -k gpu-sharing-instance/instance/overlays/time-sliced-2
$ oc get ClusterPolicy gpu-cluster-policy -ojsonpath={.spec.devicePlugin} | jq
{
  "config": {
    "default": "time-sliced",
    "name": "device-plugin-config"
  },
  "enabled": true
}
$ oc get cm device-plugin-config -n nvidia-gpu-operator -oyaml
...
data:
  no-time-sliced: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 0
  time-sliced: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
...
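Reading the two outputs together: the time-sliced-2 overlay updates the device-plugin-config ConfigMap and points the ClusterPolicy's device plugin at its time-sliced entry. A minimal sketch of the corresponding ClusterPolicy fragment, reconstructed from the jq output above rather than copied from the repository:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  devicePlugin:
    enabled: true
    config:
      name: device-plugin-config    # ConfigMap that holds the sharing configurations
      default: time-sliced          # key in the ConfigMap applied to nodes by default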
- Confirm that all 4 pods are now running.
$ oc get pod -n demo
NAME READY STATUS RESTARTS AGE
nvidia-plugin-test-bc49df6b8-5ctlb 1/1 Running 0 10m
nvidia-plugin-test-bc49df6b8-d5hf7 1/1 Running 0 10m
nvidia-plugin-test-bc49df6b8-k8kbc 1/1 Running 0 109s
nvidia-plugin-test-bc49df6b8-mqnwz 1/1 Running 0 109s
- Check the GPU-related labels and resources on the worker nodes. Confirm that nvidia.com/gpu.product is NVIDIA-L4-SHARED, nvidia.com/gpu.replicas is 2, nvidia.com/gpu.sharing-strategy is time-slicing, and that the GPU Capacity, Allocatable, and Allocated resources on each node are all 2.
$ oc describe node -l node-role.kubernetes.io/worker | egrep 'Name:|Capacity|nvidia.com/|Allocatable:|Allocated resources'
Name:               ip-10-0-36-231.us-east-2.compute.internal
                    nvidia.com/cuda.driver-version.full=550.144.03
                    nvidia.com/cuda.driver-version.major=550
                    nvidia.com/cuda.driver-version.minor=144
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=550
                    nvidia.com/cuda.driver.minor=144
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.4
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=4
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=4
                    nvidia.com/gfd.timestamp=1746009841
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=g6.4xlarge
                    nvidia.com/gpu.memory=23034
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-L4-SHARED
                    nvidia.com/gpu.replicas=2
                    nvidia.com/gpu.sharing-strategy=time-slicing
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=false
                    nvidia.com/vgpu.present=false
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:  2
Allocatable:
  nvidia.com/gpu:  2
Allocated resources:
  nvidia.com/gpu   2           2
Name:               ip-10-0-9-1.us-east-2.compute.internal
                    nvidia.com/cuda.driver-version.full=550.144.03
                    nvidia.com/cuda.driver-version.major=550
                    nvidia.com/cuda.driver-version.minor=144
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=550
                    nvidia.com/cuda.driver.minor=144
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.4
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=4
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=4
                    nvidia.com/gfd.timestamp=1746009846
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=g6.4xlarge
                    nvidia.com/gpu.memory=23034
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-L4-SHARED
                    nvidia.com/gpu.replicas=2
                    nvidia.com/gpu.sharing-strategy=time-slicing
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=false
                    nvidia.com/vgpu.present=false
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:  2
Allocatable:
  nvidia.com/gpu:  2
Allocated resources:
  nvidia.com/gpu   2           2
- Exec into the nvidia-driver pod on each of the two worker nodes to check GPU status. Confirm that each GPU is running 2 dcgmproftester12 processes.
$ oc exec -n nvidia-gpu-operator $(oc get pod -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver -o jsonpath="{.items[0].metadata.name}") -- nvidia-smi
Wed Apr 30 10:47:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03 Driver Version: 550.144.03 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:35:00.0 Off | 0 |
| N/A 79C P0 71W / 72W | 591MiB / 23034MiB | 99% E. Process |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 39955 C /usr/bin/dcgmproftester12 582MiB |
| 0 N/A N/A 40746 C /usr/bin/dcgmproftester12 582MiB |
+-----------------------------------------------------------------------------------------+
$ oc exec -n nvidia-gpu-operator $(oc get pod -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver -o jsonpath="{.items[1].metadata.name}") -- nvidia-smi
Wed Apr 30 10:47:47 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03 Driver Version: 550.144.03 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:35:00.0 Off | 0 |
| N/A 81C P0 71W / 72W | 591MiB / 23034MiB | 99% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 73271 C /usr/bin/dcgmproftester12 582MiB |
| 0 N/A N/A 74911 C /usr/bin/dcgmproftester12 582MiB |
+-----------------------------------------------------------------------------------------+
- Scale the deployment to 6 pods and confirm that 2 of them stay in Pending because they cannot obtain GPU resources.
$ oc scale deployment/nvidia-plugin-test --replicas=6 -n demo
$ oc get pod -n demo
NAME READY STATUS RESTARTS AGE
nvidia-plugin-test-bc49df6b8-2gm4j 0/1 Pending 0 4m31s
nvidia-plugin-test-bc49df6b8-7gv2w 0/1 Pending 0 4m31s
nvidia-plugin-test-bc49df6b8-5ctlb 1/1 Running 0 24m
nvidia-plugin-test-bc49df6b8-d5hf7 1/1 Running 0 24m
nvidia-plugin-test-bc49df6b8-k8kbc 1/1 Running 0 15m
nvidia-plugin-test-bc49df6b8-mqnwz 1/1 Running 0 15m
- Change the time-slicing strategy so that each GPU is split into four (a sketch of the expected config entry follows the command).
$ oc apply -k gpu-sharing-instance/instance/overlays/time-sliced-4
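Judging from the time-sliced-2 ConfigMap shown earlier, this overlay should simply raise replicas to 4. A sketch of the expected time-sliced entry (an assumption based on the pattern above, not verified against the repository):
time-sliced: |-
  version: v1
  sharing:
    timeSlicing:
      resources:
      - name: nvidia.com/gpu
        replicas: 4    # each physical GPU is now advertised as 4 nvidia.com/gpu resources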
- Toggle /spec/devicePlugin/enabled in the ClusterPolicy off and back on to restart the device plugin, then confirm that all 6 pods are Running.
$ oc patch ClusterPolicy gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/devicePlugin/enabled", "value": false}]'
$ oc patch ClusterPolicy gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/devicePlugin/enabled", "value": true}]'
$ oc get pod -n demo
NAME READY STATUS RESTARTS AGE
nvidia-plugin-test-bc49df6b8-2gm4j 1/1 Running 0 6m21s
nvidia-plugin-test-bc49df6b8-7gv2w 1/1 Running 0 6m21s
nvidia-plugin-test-bc49df6b8-5ctlb 1/1 Running 0 26m
nvidia-plugin-test-bc49df6b8-d5hf7 1/1 Running 0 26m
nvidia-plugin-test-bc49df6b8-k8kbc 1/1 Running 0 17m
nvidia-plugin-test-bc49df6b8-mqnwz 1/1 Running 0 17m
MPS
- Scale the nvidia-plugin-test deployment back down to 2 pods.
$ oc scale deployment/nvidia-plugin-test --replicas=2 -n demo
$ oc get pod -n demo
NAME READY STATUS RESTARTS AGE
nvidia-plugin-test-bc49df6b8-sl2kh 1/1 Running 0 30m
nvidia-plugin-test-bc49df6b8-zppf8 1/1 Running 0 37m
- Enable the MPS strategy so that each GPU is split in two. Note: MPS relies on a control daemon that runs as a DaemonSet, so observe how that DaemonSet behaves on the two worker nodes after MPS is enabled.
$ oc get daemonset nvidia-device-plugin-mps-control-daemon -n nvidia-gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
nvidia-device-plugin-mps-control-daemon 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true 73m
$ oc apply -k gpu-sharing-instance/instance/overlays/mps-2
configmap/device-plugin-config configured
configmap/nvidia-dcgm-exporter-dashboard-c7bf99fb7g unchanged
clusterpolicy.nvidia.com/gpu-cluster-policy configured
$ oc get daemonset nvidia-device-plugin-mps-control-daemon -n nvidia-gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
nvidia-device-plugin-mps-control-daemon 2 2 2 2 2 nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true 7h54m
- Confirm that the ClusterPolicy device plugin now uses the mps-sliced config.
$ oc get ClusterPolicy gpu-cluster-policy -ojsonpath={.spec.devicePlugin} | jq
{"config": {"default": "mps-sliced","name": "device-plugin-config"},"enabled": true
}$ oc get cm device-plugin-config -n nvidia-gpu-operator -oyaml
...
data:
  mps-sliced: |-
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
  no-mps-sliced: |-
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 0
...
- Check the GPU-related labels and resources on the worker nodes. Confirm that nvidia.com/gpu.product is NVIDIA-L4-SHARED, nvidia.com/gpu.replicas is 2, and nvidia.com/gpu.sharing-strategy is mps. GPU Capacity and Allocatable on each node are 2, while Allocated resources shows 1 because only 2 test pods (one per node) are running at this point.
$ oc describe node -l node-role.kubernetes.io/worker | egrep 'Name:|Capacity|nvidia.com/|Allocatable:|Allocated resources'
Name:               ip-10-0-36-231.us-east-2.compute.internal
                    nvidia.com/cuda.driver-version.full=550.144.03
                    nvidia.com/cuda.driver-version.major=550
                    nvidia.com/cuda.driver-version.minor=144
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=550
                    nvidia.com/cuda.driver.minor=144
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.4
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=4
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=4
                    nvidia.com/gfd.timestamp=1746005265
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=g6.4xlarge
                    nvidia.com/gpu.memory=23034
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-L4-SHARED
                    nvidia.com/gpu.replicas=2
                    nvidia.com/gpu.sharing-strategy=mps
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=true
                    nvidia.com/vgpu.present=false
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:  2
Allocatable:
  nvidia.com/gpu:  2
Allocated resources:
  nvidia.com/gpu   1           1
Name:               ip-10-0-9-1.us-east-2.compute.internal
                    nvidia.com/cuda.driver-version.full=550.144.03
                    nvidia.com/cuda.driver-version.major=550
                    nvidia.com/cuda.driver-version.minor=144
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=550
                    nvidia.com/cuda.driver.minor=144
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.4
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=4
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=4
                    nvidia.com/gfd.timestamp=1746005260
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=g6.4xlarge
                    nvidia.com/gpu.memory=23034
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-L4-SHARED
                    nvidia.com/gpu.replicas=2
                    nvidia.com/gpu.sharing-strategy=mps
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=true
                    nvidia.com/vgpu.present=false
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:  2
Allocatable:
  nvidia.com/gpu:  2
Allocated resources:
  nvidia.com/gpu   1           1
- Scale the deployment to 4 pods and confirm that all of them run normally.
$ oc scale deployment/nvidia-plugin-test --replicas=4 -n demo
$ oc get pod -n demo
NAME READY STATUS RESTARTS AGE
nvidia-plugin-test-bc49df6b8-7876r 1/1 Running 0 6s
nvidia-plugin-test-bc49df6b8-sl2kh 1/1 Running 0 34m
nvidia-plugin-test-bc49df6b8-w2z9g 1/1 Running 0 6s
nvidia-plugin-test-bc49df6b8-zppf8 1/1 Running 0 41m
- Exec into the nvidia-driver pod on each worker node to check GPU status. Confirm that, in addition to the dcgmproftester12 processes, an nvidia-cuda-mps-server process is also running.
$ oc exec -n nvidia-gpu-operator $(oc get pod -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver -o jsonpath="{.items[0].metadata.name}") -- nvidia-smi
Wed Apr 30 10:08:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03 Driver Version: 550.144.03 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:35:00.0 Off | 0 |
| N/A 80C P0 72W / 72W | 1033MiB / 23034MiB | 100% E. Process |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 85231 C nvidia-cuda-mps-server 28MiB |
| 0 N/A N/A 99955 M+C /usr/bin/dcgmproftester12 498MiB |
| 0 N/A N/A 99958 M+C /usr/bin/dcgmproftester12 498MiB |
+-----------------------------------------------------------------------------------------+
- Scale the deployment to 6 pods and confirm that 2 pods are in Pending.
$ oc scale deployment/nvidia-plugin-test --replicas=6 -n demo
$ oc get pod -n demo
NAME READY STATUS RESTARTS AGE
nvidia-plugin-test-bc49df6b8-7876r 1/1 Running 0 6s
nvidia-plugin-test-bc49df6b8-sl2kh 1/1 Running 0 34m
nvidia-plugin-test-bc49df6b8-w2z9g 1/1 Running 0 6s
nvidia-plugin-test-bc49df6b8-zppf8 1/1 Running 0 41m
nvidia-plugin-test-bc49df6b8-h6jj5 0/1 Pending 0 1m55s
nvidia-plugin-test-bc49df6b8-plm5n 0/1 Pending 0 1m55s
- The events of a Pending pod again show FailedScheduling due to "Insufficient nvidia.com/gpu".
$ oc get event -n demo | grep pod/$(oc get pod --field-selector=status.phase=Pending -ojsonpath={.items[0].metadata.name})
3m24s Warning FailedScheduling pod/nvidia-plugin-test-bc49df6b8-cps8h 0/5 nodes are available: 2 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
- Change the MPS strategy so that each GPU is split into four (a sketch of the expected config entry follows the command).
$ oc apply -k gpu-sharing-instance/instance/overlays/mps-4
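By analogy with the mps-sliced entry shown earlier, this overlay should set replicas to 4. A sketch of the expected entry (an assumption based on the pattern above, not verified against the repository):
mps-sliced: |-
  version: v1
  sharing:
    mps:
      resources:
      - name: nvidia.com/gpu
        replicas: 4    # up to 4 MPS clients share each physical GPU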
- Toggle /spec/devicePlugin/enabled in the ClusterPolicy off and back on, then confirm that all 6 pods are Running.
$ oc patch ClusterPolicy gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/devicePlugin/enabled", "value": false}]'
$ oc patch ClusterPolicy gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/devicePlugin/enabled", "value": true}]'
$ oc get pod -n demo
NAME READY STATUS RESTARTS AGE
nvidia-plugin-test-bc49df6b8-7876r 1/1 Running 0 7m26s
nvidia-plugin-test-bc49df6b8-h6jj5 1/1 Running 0 4m55s
nvidia-plugin-test-bc49df6b8-plm5n 1/1 Running 0 4m55s
nvidia-plugin-test-bc49df6b8-sl2kh 1/1 Running 0 42m
nvidia-plugin-test-bc49df6b8-w2z9g 1/1 Running 0 7m26s
nvidia-plugin-test-bc49df6b8-zppf8 1/1 Running 0 49m
MIG
See:
- MIG Support in OpenShift Container Platform
- The benefits of dynamic GPU slicing in OpenShift
References
https://github.com/liuxiaoyu-git/gpu-partitioning-guide
https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/time-slicing-gpus-in-openshift.html