
《OpenShift / RHEL / DevSecOps Series Index》
Note: this article has been verified in an OpenShift 4.18 environment.

Table of Contents

  • Environment
  • Deploying the Test Application
  • Time-slicing
  • MPS
  • MIG
  • References

Environment

The OpenShift cluster used in this article has 2 worker nodes, each with one NVIDIA L4 card, so the cluster has 2 physical GPUs in total.
Note that the NVIDIA L4 does not support MIG; if you have a MIG-capable GPU, refer to the NVIDIA documentation for the corresponding configuration.
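If you are not sure whether a GPU in your cluster is MIG-capable, the nvidia.com/mig.capable label set by GPU Feature Discovery (it also appears in the node output later in this article) gives a quick answer, for example:
$ oc get node -l node-role.kubernetes.io/worker -L nvidia.com/mig.capable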

Deploying the Test Application

  1. Review https://github.com/liuxiaoyu-git/gpu-partitioning-guide/blob/main/gpu-sharing-tests/base/nvidia-plugin-test-deployment.yaml and confirm that the pod in the Deployment requests 1 GPU and runs a process named dcgmproftester12 (a pared-down sketch of such a Deployment follows the commands below).
  2. Run the following commands to deploy the GPU test application nvidia-plugin-test.
git clone https://github.com/liuxiaoyu-git/gpu-partitioning-guide && cd gpu-partitioning-guide/
oc new-project demo
oc apply -k gpu-sharing-tests/overlays/default/
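For reference only, the sketch below shows roughly what such a Deployment looks like. The manifest in the repository is authoritative; the image tag and the dcgmproftester12 arguments here are illustrative assumptions, not copied from it.
# Sketch only -- the actual manifest lives in the repository linked above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvidia-plugin-test
  namespace: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nvidia-plugin-test
  template:
    metadata:
      labels:
        app: nvidia-plugin-test
    spec:
      containers:
      - name: nvidia-plugin-test
        image: nvcr.io/nvidia/cloud-native/dcgm:3.3.0-1-ubuntu22.04   # illustrative image tag
        command: ["/usr/bin/dcgmproftester12"]
        args: ["--no-dcgm-validation", "-t", "1004", "-d", "3600"]     # assumed load-generation arguments
        resources:
          limits:
            nvidia.com/gpu: 1                                          # the pod requests exactly one GPU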

Time-slicing

  1. Scale the nvidia-plugin-test Deployment to 2 pods. Because the cluster has 2 physical GPUs, both pods run normally.
$ oc scale deployment/nvidia-plugin-test --replicas=2 -n demo
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-5ctlb   1/1     Running   0          10m
nvidia-plugin-test-bc49df6b8-d5hf7   1/1     Running   0          10m
  2. Scale the nvidia-plugin-test Deployment to 4 pods. Because the cluster has only 2 physical GPUs, only 2 pods run normally while the other 2 stay Pending.
$ oc scale deployment/nvidia-plugin-test --replicas=4 -n demo
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-5ctlb   1/1     Running   0          10m
nvidia-plugin-test-bc49df6b8-d5hf7   1/1     Running   0          10m
nvidia-plugin-test-bc49df6b8-k8kbc   0/1     Pending   0          109s
nvidia-plugin-test-bc49df6b8-mqnwz   0/1     Pending   0          109s
  3. The pod events show that scheduling failed (FailedScheduling) because of "Insufficient nvidia.com/gpu".
$ oc get event -n demo | grep pod/$(oc get pod --field-selector=status.phase=Pending -ojsonpath={.items[0].metadata.name})
5s          Warning   FailedScheduling           pod/nvidia-plugin-test-bc49df6b8-grjzj    0/5 nodes are available: 2 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
  4. Enable the time-slicing policy to split each GPU in two, then confirm that the ClusterPolicy now points at the time-sliced profile.
$ oc apply -k gpu-sharing-instance/instance/overlays/time-sliced-2
$ oc get ClusterPolicy gpu-cluster-policy -ojsonpath={.spec.devicePlugin} | jq
{
  "config": {
    "default": "time-sliced",
    "name": "device-plugin-config"
  },
  "enabled": true
}
$ oc get cm device-plugin-config -n nvidia-gpu-operator -oyaml
...
data:
  no-time-sliced: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 0
  time-sliced: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
...
  5. Confirm that all 4 pods are now Running.
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-5ctlb   1/1     Running   0          10m
nvidia-plugin-test-bc49df6b8-d5hf7   1/1     Running   0          10m
nvidia-plugin-test-bc49df6b8-k8kbc   1/1     Running   0          109s
nvidia-plugin-test-bc49df6b8-mqnwz   1/1     Running   0          109s
  6. Inspect the GPU-related configuration and status on the worker nodes. Confirm that nvidia.com/gpu.product is NVIDIA-L4-SHARED, nvidia.com/gpu.replicas is 2, nvidia.com/gpu.sharing-strategy is time-slicing, and that the GPU Capacity, Allocatable and Allocated resources are all 2 (a narrower label query is sketched after the output).
$ oc describe node -l node-role.kubernetes.io/worker | egrep 'Name:|Capacity|nvidia.com/|Allocatable:|Allocated resources'
Name:               ip-10-0-36-231.us-east-2.compute.internal
                    nvidia.com/cuda.driver-version.full=550.144.03
                    nvidia.com/cuda.driver-version.major=550
                    nvidia.com/cuda.driver-version.minor=144
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=550
                    nvidia.com/cuda.driver.minor=144
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.4
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=4
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=4
                    nvidia.com/gfd.timestamp=1746009841
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=g6.4xlarge
                    nvidia.com/gpu.memory=23034
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-L4-SHARED
                    nvidia.com/gpu.replicas=2
                    nvidia.com/gpu.sharing-strategy=time-slicing
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=false
                    nvidia.com/vgpu.present=false
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:     2
Allocatable:
  nvidia.com/gpu:     2
Allocated resources:
  nvidia.com/gpu     2              2
Name:               ip-10-0-9-1.us-east-2.compute.internal
                    nvidia.com/cuda.driver-version.full=550.144.03
                    nvidia.com/cuda.driver-version.major=550
                    nvidia.com/cuda.driver-version.minor=144
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=550
                    nvidia.com/cuda.driver.minor=144
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.4
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=4
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=4
                    nvidia.com/gfd.timestamp=1746009846
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=g6.4xlarge
                    nvidia.com/gpu.memory=23034
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-L4-SHARED
                    nvidia.com/gpu.replicas=2
                    nvidia.com/gpu.sharing-strategy=time-slicing
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=false
                    nvidia.com/vgpu.present=false
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:     2
Allocatable:
  nvidia.com/gpu:     2
Allocated resources:
  nvidia.com/gpu     2              2
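If you only want the sharing-related labels rather than the full describe output, the same information can be read with label columns, for example:
$ oc get node -l node-role.kubernetes.io/worker -L nvidia.com/gpu.product -L nvidia.com/gpu.replicas -L nvidia.com/gpu.sharing-strategy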
  7. Open a shell in the nvidia-driver pod on each of the 2 worker nodes and check the GPU status; confirm that each GPU is running 2 dcgmproftester12 processes.
$ oc exec -n nvidia-gpu-operator $(oc get pod -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver -o jsonpath="{.items[0].metadata.name}") -- nvidia-smi
Wed Apr 30 10:47:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:35:00.0 Off |                    0 |
| N/A   79C    P0             71W /   72W |     591MiB /  23034MiB |     99%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     39955      C   /usr/bin/dcgmproftester12                     582MiB |
|    0   N/A  N/A     40746      C   /usr/bin/dcgmproftester12                     582MiB |
+-----------------------------------------------------------------------------------------+
$ oc exec -n nvidia-gpu-operator $(oc get pod -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver -o jsonpath="{.items[1].metadata.name}") -- nvidia-smi
Wed Apr 30 10:47:47 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:35:00.0 Off |                    0 |
| N/A   81C    P0             71W /   72W |     591MiB /  23034MiB |     99%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     73271      C   /usr/bin/dcgmproftester12                     582MiB |
|    0   N/A  N/A     74911      C   /usr/bin/dcgmproftester12                     582MiB |
+-----------------------------------------------------------------------------------------+
  8. Scale the nvidia-plugin-test Deployment to 6 pods and confirm that 2 pods stay Pending because no GPU resources are available.
$ oc scale deployment/nvidia-plugin-test --replicas=6 -n demo
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-2gm4j   0/1     Pending   0          4m31s
nvidia-plugin-test-bc49df6b8-7gv2w   0/1     Pending   0          4m31s
nvidia-plugin-test-bc49df6b8-5ctlb   1/1     Running   0          24m
nvidia-plugin-test-bc49df6b8-d5hf7   1/1     Running   0          24m
nvidia-plugin-test-bc49df6b8-k8kbc   1/1     Running   0          15m
nvidia-plugin-test-bc49df6b8-mqnwz   1/1     Running   0          15m
  9. Change the time-slicing policy to split each GPU four ways (the expected config entry is sketched after the command).
$ oc apply -k gpu-sharing-instance/instance/overlays/time-sliced-4
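Assuming the time-sliced-4 overlay differs from time-sliced-2 only in the replica count, the resulting entry in the device-plugin-config ConfigMap should mirror the one shown earlier:
# Expected (assumed) content of the time-sliced profile after applying the time-sliced-4 overlay
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4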
  10. Toggle /spec/devicePlugin/enabled in the ClusterPolicy off and back on, then confirm that all 6 pods are Running.
$ oc patch ClusterPolicy gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/devicePlugin/enabled", "value": false}]'
$ oc patch ClusterPolicy gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/devicePlugin/enabled", "value": true}]'
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-2gm4j   1/1     Running   0          6m21s
nvidia-plugin-test-bc49df6b8-7gv2w   1/1     Running   0          6m21s
nvidia-plugin-test-bc49df6b8-5ctlb   1/1     Running   0          26m
nvidia-plugin-test-bc49df6b8-d5hf7   1/1     Running   0          26m
nvidia-plugin-test-bc49df6b8-k8kbc   1/1     Running   0          17m
nvidia-plugin-test-bc49df6b8-mqnwz   1/1     Running   0          17m

MPS

  1. Scale the nvidia-plugin-test Deployment back down to 2 pods.
$ oc scale deployment/nvidia-plugin-test --replicas=2 -n demo
$ oc get pod -n demo
NAME                                 READY   STATUS        RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-sl2kh   1/1     Running       0          30m
nvidia-plugin-test-bc49df6b8-zppf8   1/1     Running       0          37m
  2. Enable the MPS policy to split each GPU in two. Note: MPS uses a control daemon managed by a DaemonSet, so watch how that DaemonSet behaves on the two worker nodes once MPS is enabled (a quick way to see where its pods land is sketched after the output).
$ oc get daemonset nvidia-device-plugin-mps-control-daemon -n nvidia-gpu-operator
NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                         AGE
nvidia-device-plugin-mps-control-daemon         0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true                                                  73m
$ oc apply -k gpu-sharing-instance/instance/overlays/mps-2
configmap/device-plugin-config configured
configmap/nvidia-dcgm-exporter-dashboard-c7bf99fb7g unchanged
clusterpolicy.nvidia.com/gpu-cluster-policy configured
$ oc get daemonset nvidia-device-plugin-mps-control-daemon -n nvidia-gpu-operator
NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                         AGE
nvidia-device-plugin-mps-control-daemon         2         2         2       2            2           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true                                                  7h54m
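To see where the MPS control daemon pods are scheduled (one per GPU worker node is expected), list them with their node placement:
$ oc get pod -n nvidia-gpu-operator -o wide | grep mps-control-daemon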
  3. Confirm that the ClusterPolicy now points at the mps-sliced profile.
$ oc get ClusterPolicy gpu-cluster-policy -ojsonpath={.spec.devicePlugin} | jq
{
  "config": {
    "default": "mps-sliced",
    "name": "device-plugin-config"
  },
  "enabled": true
}
$ oc get cm device-plugin-config -n nvidia-gpu-operator -oyaml
...
data:
  mps-sliced: |-
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
  no-mps-sliced: |-
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 0
...
  4. Inspect the GPU-related configuration and status on the worker nodes. Confirm that nvidia.com/gpu.product is NVIDIA-L4-SHARED, nvidia.com/gpu.replicas is 2, nvidia.com/gpu.sharing-strategy is mps, and that the GPU Capacity and Allocatable are both 2 (with only 2 test pods running at this point, each node shows 1 GPU allocated).
$ oc describe node -l node-role.kubernetes.io/worker | egrep 'Name:|Capacity|nvidia.com/|Allocatable:|Allocated resources'
Name:               ip-10-0-36-231.us-east-2.compute.internal
                    nvidia.com/cuda.driver-version.full=550.144.03
                    nvidia.com/cuda.driver-version.major=550
                    nvidia.com/cuda.driver-version.minor=144
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=550
                    nvidia.com/cuda.driver.minor=144
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.4
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=4
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=4
                    nvidia.com/gfd.timestamp=1746005265
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=g6.4xlarge
                    nvidia.com/gpu.memory=23034
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-L4-SHARED
                    nvidia.com/gpu.replicas=2
                    nvidia.com/gpu.sharing-strategy=mps
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=true
                    nvidia.com/vgpu.present=false
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:     2
Allocatable:
  nvidia.com/gpu:     2
Allocated resources:
  nvidia.com/gpu     1              1
Name:               ip-10-0-9-1.us-east-2.compute.internal
                    nvidia.com/cuda.driver-version.full=550.144.03
                    nvidia.com/cuda.driver-version.major=550
                    nvidia.com/cuda.driver-version.minor=144
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=550
                    nvidia.com/cuda.driver.minor=144
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.4
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=4
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=4
                    nvidia.com/gfd.timestamp=1746005260
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=g6.4xlarge
                    nvidia.com/gpu.memory=23034
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-L4-SHARED
                    nvidia.com/gpu.replicas=2
                    nvidia.com/gpu.sharing-strategy=mps
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=true
                    nvidia.com/vgpu.present=false
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:     2
Allocatable:
  nvidia.com/gpu:     2
Allocated resources:
  nvidia.com/gpu     1              1
  5. Scale the nvidia-plugin-test Deployment to 4 pods and confirm that all of them run normally.
$ oc scale deployment/nvidia-plugin-test --replicas=4 -n demo
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-7876r   1/1     Running   0          6s
nvidia-plugin-test-bc49df6b8-sl2kh   1/1     Running   0          34m
nvidia-plugin-test-bc49df6b8-w2z9g   1/1     Running   0          6s
nvidia-plugin-test-bc49df6b8-zppf8   1/1     Running   0          41m
  6. Open a shell in the nvidia-driver pod on each worker node and check the GPU status; confirm that besides the dcgmproftester12 processes there is also an nvidia-cuda-mps-server process.
$ oc exec -n nvidia-gpu-operator $(oc get pod -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver -o jsonpath="{.items[0].metadata.name}") -- nvidia-smi
Wed Apr 30 10:08:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:35:00.0 Off |                    0 |
| N/A   80C    P0             72W /   72W |    1033MiB /  23034MiB |    100%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     85231      C   nvidia-cuda-mps-server                         28MiB |
|    0   N/A  N/A     99955    M+C   /usr/bin/dcgmproftester12                     498MiB |
|    0   N/A  N/A     99958    M+C   /usr/bin/dcgmproftester12                     498MiB |
+-----------------------------------------------------------------------------------------+
  7. Scale the nvidia-plugin-test Deployment to 6 pods and confirm that 2 pods are Pending.
$ oc scale deployment/nvidia-plugin-test --replicas=6 -n demo
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-7876r   1/1     Running   0          6s
nvidia-plugin-test-bc49df6b8-sl2kh   1/1     Running   0          34m
nvidia-plugin-test-bc49df6b8-w2z9g   1/1     Running   0          6s
nvidia-plugin-test-bc49df6b8-zppf8   1/1     Running   0          41m
nvidia-plugin-test-bc49df6b8-h6jj5   0/1     Pending   0          1m55s
nvidia-plugin-test-bc49df6b8-plm5n   0/1     Pending   0          1m55s
  8. The pod events show that scheduling failed (FailedScheduling) because of "Insufficient nvidia.com/gpu".
$ oc get event -n demo | grep pod/$(oc get pod --field-selector=status.phase=Pending -ojsonpath={.items[0].metadata.name})
3m24s       Warning   FailedScheduling    pod/nvidia-plugin-test-bc49df6b8-cps8h     0/5 nodes are available: 2 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
  9. Change the MPS policy to split each GPU four ways (the expected config entry is sketched after the command).
$ oc apply -k gpu-sharing-instance/instance/overlays/mps-4
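Assuming the mps-4 overlay only raises the replica count, the mps-sliced entry in the device-plugin-config ConfigMap should now look like this:
# Expected (assumed) content of the mps-sliced profile after applying the mps-4 overlay
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 4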
  10. Toggle /spec/devicePlugin/enabled in the ClusterPolicy off and back on, then confirm that all 6 pods are Running.
$ oc patch ClusterPolicy gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/devicePlugin/enabled", "value": false}]'
$ oc patch ClusterPolicy gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/devicePlugin/enabled", "value": true}]'
$ oc get pod -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
nvidia-plugin-test-bc49df6b8-7876r   1/1     Running   0          7m26s
nvidia-plugin-test-bc49df6b8-h6jj5   1/1     Running   0          4m55s
nvidia-plugin-test-bc49df6b8-plm5n   1/1     Running   0          4m55s
nvidia-plugin-test-bc49df6b8-sl2kh   1/1     Running   0          42m
nvidia-plugin-test-bc49df6b8-w2z9g   1/1     Running   0          7m26s
nvidia-plugin-test-bc49df6b8-zppf8   1/1     Running   0          49m

MIG

See:
《MIG Support in OpenShift Container Platform 》
《The benefits of dynamic GPU slicing in OpenShift》
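As a rough sketch only (the L4 used in this article cannot run it): on a MIG-capable GPU such as an A100, a MIG layout is typically selected by labelling the node, after which the operator's MIG manager partitions the GPU. The profile name below comes from the operator's default mig-parted configuration and is an assumption for illustration.
$ oc label node <mig-capable-node> nvidia.com/mig.config=all-1g.5gb --overwrite
$ oc get node <mig-capable-node> -L nvidia.com/mig.config.state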

References

https://github.com/liuxiaoyu-git/gpu-partitioning-guide
https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/time-slicing-gpus-in-openshift.html
