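Overall steps:
- Deploy the dcgm-exporter DaemonSet and Service, making sure the Service has the correct labels and port.
- Create a ServiceMonitor that selects the dcgm-exporter Service and names the port to scrape.
- Check the Prometheus targets page to confirm that dcgm-exporter is discovered and scraped.
- If the target does not appear, adjust the Prometheus RBAC rules or network policies so that Prometheus can reach the exporter.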
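The RBAC part of the last step can be checked before changing anything. The following is a minimal sketch, assuming the Prometheus service account is prometheus-k8s in the monitoring namespace (the kube-prometheus default; adjust it if your deployment differs). The helper name can_discover is hypothetical, and it only probes the list verb, although Prometheus also needs get and watch.

```shell
# Hypothetical helper: check whether a service account may list the
# resources Prometheus uses for Service discovery in a namespace.
#   $1 = namespace to check
#   $2 = service account as "namespace:name" (assumed value, adjust as needed)
can_discover() {
  ns="$1"
  sa="system:serviceaccount:$2"
  for resource in services endpoints pods; do
    # "kubectl auth can-i" exits non-zero when the action is not allowed
    if ! kubectl auth can-i list "$resource" --namespace "$ns" --as "$sa" >/dev/null 2>&1; then
      echo "missing: list $resource in $ns"
      return 1
    fi
  done
  echo "ok"
}
```

Calling can_discover monitoring monitoring:prometheus-k8s prints ok when all three resources can be listed, or names the first resource that cannot.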
1. Deploy dcgm-exporter
Create a dcgm-exporter.yaml file containing the DaemonSet and the Service:
# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring  # assumes the same namespace as kube-prometheus
  labels:
    app: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"  # run only on nodes that have a GPU
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvidia/dcgm-exporter:3.3.4-3.6.12-ubuntu22.04
          ports:
            - containerPort: 9400
          resources:
            limits:
              nvidia.com/gpu: 1  # allocate 1 GPU
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter  # the ServiceMonitor selects the Service by this label
spec:
  selector:
    app: dcgm-exporter
  ports:
    - name: metrics
      port: 9400
      targetPort: 9400
Apply the configuration:
kubectl apply -f dcgm-exporter.yaml
2. Create the ServiceMonitor resource
Create a dcgm-service-monitor.yaml file that defines how the metrics are scraped:
# dcgm-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring  # must be a namespace that the Prometheus Operator watches
  labels:
    release: kube-prometheus  # adjust to the labels your kube-prometheus deployment selects on
spec:
  jobLabel: app  # name of the Service label whose value becomes the job name
  endpoints:
    - port: metrics  # name of the Service port
      interval: 15s
      path: /metrics  # metrics path
  selector:
    matchLabels:
      app: dcgm-exporter  # matches the Service's labels
  namespaceSelector:
    matchNames:
      - monitoring  # namespace where the Service lives
Apply the configuration:
kubectl apply -f dcgm-service-monitor.yaml
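Whether the ServiceMonitor is picked up depends on its labels matching the serviceMonitorSelector of the Prometheus custom resource. The sketch below prints both side by side so they can be compared by eye. The helper name show_selector_and_labels is hypothetical, and it assumes the Prometheus resource and the ServiceMonitor live in the same namespace.

```shell
# Hypothetical helper: print the Prometheus serviceMonitorSelector next to
# the labels of a ServiceMonitor.
#   $1 = namespace (assumed to hold both resources)
#   $2 = ServiceMonitor name
show_selector_and_labels() {
  echo "Prometheus serviceMonitorSelector:"
  kubectl get prometheus -n "$1" -o jsonpath='{.items[*].spec.serviceMonitorSelector}'
  echo
  echo "ServiceMonitor labels:"
  kubectl get servicemonitor "$2" -n "$1" -o jsonpath='{.metadata.labels}'
  echo
}
```

For the manifests in this post the call would be show_selector_and_labels monitoring dcgm-exporter. An empty selector ({}) generally means every ServiceMonitor in the watched namespaces is selected, in which case the release label is not required.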
3. Verify the configuration
Check the Pod status:
kubectl get pods -n monitoring -l app=dcgm-exporter
View the Service and its Endpoints:
kubectl get svc,ep -n monitoring dcgm-exporter
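If the Service has endpoints, the exporter can also be queried directly, which separates exporter problems from Prometheus discovery problems. A minimal sketch, assuming the Service is forwarded locally in another terminal with kubectl -n monitoring port-forward svc/dcgm-exporter 9400; the helper name count_dcgm_series is hypothetical.

```shell
# Hypothetical helper: count the sample lines whose metric name starts with
# DCGM_FI_ in Prometheus text-format output read from stdin.
# Comment lines (# HELP / # TYPE) are not counted because they start with "#".
count_dcgm_series() {
  grep -c '^DCGM_FI_'
}
```

With the port-forward running, curl -s http://localhost:9400/metrics | count_dcgm_series prints the number of DCGM samples; a result of 0 suggests the exporter is not seeing any GPU.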
Access the Prometheus UI:
1. Port-forward the Prometheus service:
kubectl --namespace monitoring port-forward svc/prometheus-operated 9090
2. Open http://localhost:9090/targets in a browser and confirm that the dcgm-exporter target is in the UP state.
Query GPU metrics:
In Prometheus, enter DCGM_FI_DEV_GPU_UTIL and verify that the metric exists.
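The same check can be scripted against the Prometheus HTTP API instead of the UI. This is a rough sketch, assuming the port-forward to localhost:9090 from the previous step is still running; the helper name metric_exists is hypothetical, and it matches the JSON response with grep rather than parsing it properly.

```shell
# Hypothetical helper: succeed if an instant query returns at least one series.
#   $1 = PromQL expression
metric_exists() {
  # /api/v1/query returns "result":[] when nothing matches and
  # "result":[{...}] when at least one series is found
  curl -s --get "http://localhost:9090/api/v1/query" \
    --data-urlencode "query=$1" | grep -q '"result":\[{'
}
```

For example, metric_exists DCGM_FI_DEV_GPU_UTIL && echo found prints found once GPU utilization is being scraped. Passing 'up{job="dcgm-exporter"} == 1' checks the target state in the same way, assuming the job name resolves to dcgm-exporter as in the manifests above.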