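Manual Deployment of the Prometheus Monitoring System on Kubernetes: Installation Guide
Preface: the "Hands-on example" configurations in this guide can be split up and applied piece by piece as needed.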
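I. Environment Preparation
- Kubernetes cluster: make sure a Kubernetes cluster (version ≥ 1.20) is deployed and the kubectl tool is configured.
- Image registry: confirm that harbor.fq.com/prometheus/node-exporter:v1.8.2 and the Prometheus-related images are available in the private registry.
- Namespace: the default namespace is used unless stated otherwise; it can be changed to monitoring as needed (the namespace field in every YAML file must be updated to match).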
II. Create RBAC Permissions
Goal: grant Prometheus permission to access the Kubernetes API.
1. Create the ServiceAccount
# prometheus-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
secrets:
  - name: prometheus-token
Explanation:
- The ServiceAccount prometheus is the identity Prometheus authenticates with.
- The secrets field references a Secret (prometheus-token) that stores the access credentials.
2. Create the ClusterRole
# prometheus-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/proxy", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
Explanation:
- Grants Prometheus read access to nodes, services, Pods, and related resources.
- Allows reading the /metrics endpoint (a non-resource URL).
3. Create the ClusterRoleBinding
# prometheus-clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: default
roleRef:
  kind: ClusterRole
  name: prometheus
  apiGroup: rbac.authorization.k8s.io
Explanation:
- Binds the prometheus ClusterRole to the prometheus ServiceAccount so that the permissions take effect.
4. Generate the ServiceAccount Token
# prometheus-token.yaml
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-token
  annotations:
    kubernetes.io/service-account.name: prometheus
type: kubernetes.io/service-account-token
Apply the RBAC configuration:
kubectl apply -f prometheus-serviceaccount.yaml
kubectl apply -f prometheus-clusterrole.yaml
kubectl apply -f prometheus-clusterrolebinding.yaml
kubectl apply -f prometheus-token.yaml
☆ Hands-on example
cat prometheus-rabc0227.yaml
---
# 1. Create the monitoring namespace
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
# 2. Create the ServiceAccount used by Prometheus
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
# 3. Create the ClusterRole that defines the Prometheus permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - configmaps
    verbs: ["get"]
  - apiGroups: [""]
    resources:
      - nodes/proxy
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
# 4. Bind the ClusterRole to the ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
---
III. Deploy Node Exporter
Goal: run Node Exporter on every node to collect node resource metrics.
# node-exporter-daemonset.yml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      containers:
        - name: node-exporter
          image: harbor.fq.com/prometheus/node-exporter:v1.8.2
          args:
            - --path.rootfs=/host
          volumeMounts:
            - name: rootfs
              mountPath: /host
      volumes:
        - name: rootfs
          hostPath:
            path: /
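Explanation:
- DaemonSet ensures that one Node Exporter Pod runs on every node.
- hostNetwork: true uses the node's network, exposing node metrics directly on the node IP.
- hostPath mounts the root filesystem so that node-level data can be collected.
Deploy command: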
kubectl apply -f node-exporter-daemonset.yml
☆ Hands-on example
cat node-exporter-daemonset.yml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring             # use the "monitoring" namespace
  labels:
    k8s-app: node-exporter
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
  template:
    metadata:
      labels:
        k8s-app: node-exporter
      annotations:
        prometheus.io/scrape: "true"   # allow Prometheus to scrape this Pod
        prometheus.io/port: "9100"     # Node Exporter port
    spec:
      hostNetwork: true               # let the Pod use the host network
      hostPID: true                   # allow access to the host PID namespace
      tolerations:
        - effect: NoSchedule          # allow scheduling onto tainted nodes
          operator: Exists
        - effect: NoExecute
          operator: Exists
      securityContext:
        runAsNonRoot: true            # avoid running as root
        runAsUser: 65534              # run as the nobody user
      containers:
        - name: node-exporter
          image: harbor.fq.com/prometheus/node-exporter:v1.8.2   # replace with a trusted image address
          args:
            - --path.rootfs=/host/root   # rootfs path
            - --path.procfs=/host/proc   # procfs path
            - --path.sysfs=/host/sys     # sysfs path
            - --no-collector.wifi        # disable the WiFi collector
            - --no-collector.hwmon       # disable the hardware monitoring collector
          ports:
            - containerPort: 9100
              protocol: TCP
          resources:                  # resource requests and limits
            requests:
              memory: "30Mi"
              cpu: "100m"
            limits:
              memory: "50Mi"
              cpu: "200m"
          volumeMounts:               # mount host directories
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: rootfs
              mountPath: /host/root
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    k8s-app: node-exporter
  annotations:
    prometheus.io/scrape: 'true'    # allow Prometheus to scrape
    prometheus.io/port: '9100'      # scrape port
spec:
  selector:
    k8s-app: node-exporter
  ports:
    - name: metrics
      port: 9100
      protocol: TCP
      targetPort: 9100
  type: ClusterIP                   # reachable only from inside the cluster
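IV. Deploy Prometheus
Goal: deploy the main Prometheus service and configure scrape rules and persistent storage.
1. Create the persistent volume (PV/PVC)
Depending on the storage type available in the cluster (for example NFS, Local PV, or cloud storage), create a PVC and mount it into Prometheus.
Example (adjust to the actual environment):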
# prometheus-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
2. Create the Prometheus Deployment
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:v2.42.0
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config-volume
              mountPath: /etc/prometheus
            - name: data-volume
              mountPath: /prometheus
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: data-volume
          persistentVolumeClaim:
            claimName: prometheus-data
☆ Hands-on example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring             # target namespace
  labels:
    app: prometheus
spec:
  replicas: 1                       # a single instance is typical; use remote storage to improve availability
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus   # ServiceAccount used for RBAC access
      containers:
        - name: prometheus
          image: harbor.fq.com/prometheus/prometheus:v3.1.0   # image from the private registry
          args:
            - --config.file=/etc/prometheus/prometheus.yml    # Prometheus configuration file
            - --storage.tsdb.path=/prometheus                 # where TSDB data is stored
            - --web.console.templates=/etc/prometheus/consoles
            - --web.console.libraries=/etc/prometheus/console_libraries
          ports:
            - containerPort: 9090     # Prometheus web UI port
          resources:                  # cap CPU and memory to avoid exhausting the node
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "1"
              memory: "2Gi"
          volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus                  # configuration mount point
            - name: prometheus-storage
              mountPath: /prometheus                      # TSDB data path
            - name: file-sd
              mountPath: /apps/prometheus/file-sd.yaml    # file-based service discovery targets
              subPath: file-sd.yaml                       # mount only the file, not a whole directory
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config   # Prometheus configuration from the ConfigMap
        - name: prometheus-storage
          # persistentVolumeClaim:    # use a PVC for persistent storage in production
          #   claimName: prometheus-pvc
          emptyDir: {}                # an empty directory is acceptable for test environments
        - name: file-sd
          hostPath:
            path: /root/file-sd.yaml  # discovery file located on the host
            type: File
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  type: NodePort                    # LoadBalancer or Ingress is recommended in production
  ports:
    - port: 9090
      targetPort: 9090
      nodePort: 30090               # web UI reachable through this NodePort
  selector:
    app: prometheus
3. Create the Prometheus ConfigMap
# prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']
    rule_files:
      - '/etc/prometheus/alert_rules.yml'
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'node-exporter'
        static_configs:
          - targets: ['node-exporter:9100']
      - job_name: 'cadvisor'
        static_configs:
          - targets: ['cadvisor:8080']
      - job_name: 'pushgateway'
        static_configs:
          - targets: ['pushgateway:9091']
      - job_name: 'node-linux'
        static_configs:
          - targets: ['10.255.209.40:9100']
      # kubeconfig_file is omitted in every Kubernetes job: no kubeconfig exists under the
      # ServiceAccount mount, and in-cluster discovery uses the Pod's ServiceAccount automatically.
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      # Node Exporter serves plain HTTP on 9100, so this job keeps the default http scheme.
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: '(.*):10250'          # default kubelet port on Kubernetes nodes
            replacement: '${1}:9100'     # Node Exporter listening port
            target_label: __address__
            action: replace
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - kube-system
                - default
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (.+)
          # Every annotated Pod is scraped on port 9090.
          - source_labels: [__meta_kubernetes_pod_ip]
            action: replace
            target_label: __address__
            regex: (.+)
            replacement: ${1}:9090
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_service_name
Apply the configuration:
kubectl apply -f prometheus-pvc.yaml
kubectl apply -f prometheus-configmap.yaml
kubectl apply -f prometheus-deployment.yaml
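The kubernetes-service-endpoints job keeps only Services annotated with prometheus.io/scrape and reads the scheme, path, and port from sibling annotations. As a rough sketch, a Service that this job would pick up might look like the following. The application name my-app and port 8080 are placeholders that do not appear in the original setup. Since the job defaults to scheme: https, the sketch sets prometheus.io/scheme to http explicitly for a plain-HTTP exporter.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                       # hypothetical application name
  namespace: default
  labels:
    app: my-app
  annotations:
    prometheus.io/scrape: "true"     # matched by the keep rule
    prometheus.io/scheme: "http"     # copied into __scheme__
    prometheus.io/path: "/metrics"   # copied into __metrics_path__
    prometheus.io/port: "8080"       # joined with the endpoint host into __address__
spec:
  selector:
    app: my-app
  ports:
    - name: metrics
      port: 8080                     # assumed metrics port
      targetPort: 8080
```

Annotation values must be quoted strings; an unquoted true or 8080 is rejected by the Kubernetes API because annotations are string-valued.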
☆ Hands-on example
cat prometheus-configmap0227.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      scrape_timeout: 10s   # timeout so that a scrape cannot hang
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']
    rule_files:
      - '/etc/prometheus/alert_rules.yml'
    scrape_configs:
      # Prometheus's own metrics
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      # Node Exporter metrics
      - job_name: 'node-exporter'
        static_configs:
          - targets: ['node-exporter:9100']
      # cAdvisor metrics
      - job_name: 'cadvisor'
        static_configs:
          - targets: ['cadvisor:8080']
      # Pushgateway metrics
      - job_name: 'pushgateway'
        static_configs:
          - targets: ['pushgateway:9091']
      # Node Exporter metrics from one specific node
      - job_name: 'node-linux'
        static_configs:
          - targets: ['10.255.209.40:9100']
      # Kubernetes API server metrics
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true   # turn off in production and configure the correct CA certificate
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      # Kubernetes node metrics (through Node Exporter)
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: '(.*):10250'
            replacement: '${1}:9100'   # replace the kubelet port with the Node Exporter port
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      # Kubernetes Pod metrics
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          # __address__ is listed first so the regex can join the Pod host with the annotated port
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
      # Kubernetes Service endpoint metrics
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        #scheme: https
        #tls_config:
        #  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        #  insecure_skip_verify: true   # turn off in production and configure the correct CA certificate
        #bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_service_name
      - job_name: 'kubernetes-nginx-endpoints'   # job name
        kubernetes_sd_configs:
          - role: endpoints   # discover Kubernetes Endpoints automatically
        relabel_configs:
          # scrape only Services annotated with `prometheus.io/scrape: "true"`
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          # override the scrape scheme (http or https)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          # override the metrics path (defaults to /metrics)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          # override the scrape address and port
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          # map Kubernetes labels to Prometheus labels
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          # add the Kubernetes namespace label
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          # add the Kubernetes Service name label
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_service_name
          # add the Kubernetes Pod name label
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
          # add the Kubernetes node name label
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: kubernetes_node_name
        # To scrape HTTPS endpoints, uncomment the following settings
        # scheme: https
        # tls_config:
        #   ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        #   insecure_skip_verify: true   # turn off in production and configure the correct CA certificate
        # bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      - job_name: 'kube-state-metrics'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - kube-system
                - monitoring
                - default
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
            action: keep
            regex: kube-state-metrics
          - source_labels: [__meta_kubernetes_endpoint_port_name]
            action: keep
            regex: http-metrics
        metrics_path: /metrics
        scheme: http
      - job_name: "file_sd"
        file_sd_configs:
          - files:
              - /apps/prometheus/file-sd.yaml
            refresh_interval: 1m
      - job_name: 'redis'
        kubernetes_sd_configs:
          - role: endpoints   # discover services from Kubernetes Endpoints
        relabel_configs:
          # scrape only services annotated with `prometheus.io/scrape: "true"`
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          # keep only endpoints backed by Pods whose name contains "redis"
          - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
            action: keep
            regex: Pod;(.*redis.*)
          # rewrite the target address to the Pod IP and the Redis Exporter port (9121)
          - source_labels: [__meta_kubernetes_pod_ip]
            action: replace
            target_label: __address__
            replacement: $1:9121
          # add the Service's app label
          - source_labels: [__meta_kubernetes_service_label_app]
            action: replace
            target_label: app
          # add the Kubernetes namespace label
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          # add the Kubernetes Service name label
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          # add the Kubernetes Pod name label
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          # add the Kubernetes node name label
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: node
          # add the instance label (distinguishes individual Redis instances)
          - source_labels: [__meta_kubernetes_pod_ip]
            action: replace
            target_label: instance
      - job_name: 'mysql'
        kubernetes_sd_configs:
          - role: endpoints   # discover services from Kubernetes Endpoints
        relabel_configs:
          # scrape only services annotated with `prometheus.io/scrape: "true"`
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          # keep only endpoints backed by Pods whose name contains "mysql-exporter"
          - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
            action: keep
            regex: Pod;(.*mysql-exporter.*)
          # rewrite the target address to the Pod IP and the MySQL Exporter port (9104)
          - source_labels: [__meta_kubernetes_pod_ip]
            action: replace
            target_label: __address__
            replacement: $1:9104
          # add the Service's app label
          - source_labels: [__meta_kubernetes_service_label_app]
            action: replace
            target_label: app
          # add the Kubernetes namespace label
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          # add the Kubernetes Service name label
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          # add the Kubernetes Pod name label
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          # add the Kubernetes node name label
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: node
          # add the instance label (distinguishes individual MySQL instances)
          - source_labels: [__meta_kubernetes_pod_ip]
            action: replace
            target_label: instance
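The file_sd job reads its targets from /apps/prometheus/file-sd.yaml, which the hands-on Deployment mounts from /root/file-sd.yaml on the host, but the contents of that file are not shown. A minimal sketch of what it could contain follows. The target address is borrowed from the node-linux job purely for illustration, and the label names and values are hypothetical.

```yaml
# /root/file-sd.yaml on the node, seen by Prometheus as /apps/prometheus/file-sd.yaml
- targets:
    - "10.255.209.40:9100"   # illustrative; reuses the address from the node-linux job
  labels:
    env: "test"              # hypothetical label attached to every target in this group
    role: "node"             # hypothetical label
```

Each list entry is one target group: every address under targets receives the labels of that group. Prometheus watches the file for changes and, as a fallback, re-reads it at the configured refresh_interval (1m here).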
4. Expose the Prometheus Service
# prometheus-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  type: NodePort
  ports:
    - port: 9090
      targetPort: 9090
      nodePort: 30090
  selector:
    app: prometheus
Apply the Service:
kubectl apply -f prometheus-service.yaml
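The hands-on Service manifest notes that LoadBalancer or Ingress is preferable to NodePort in production. A possible Ingress for the prometheus Service is sketched below. It assumes that an NGINX Ingress controller with the class name nginx is installed, and prometheus.example.com is a placeholder hostname; neither is part of the original setup.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus
  namespace: default                  # must match the namespace of the prometheus Service
spec:
  ingressClassName: nginx             # assumption: an Ingress controller with this class exists
  rules:
    - host: prometheus.example.com    # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus      # the Service defined in prometheus-service.yaml
                port:
                  number: 9090
```

With an Ingress in front, the Service type could be reduced to ClusterIP so that Prometheus is no longer exposed on every node's port 30090.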
V. Verify the Deployment
- Check Pod status:
  kubectl get pods -l app=prometheus -n default
  kubectl get pods -n kube-system -l app=node-exporter
  Expected output: all Pods are in the Running state.
- Access the Prometheus UI:
  Open http://<NodeIP>:30090 in a browser to reach the Prometheus console.
  - On the Status > Targets page, confirm that the kubernetes-nodes and kubernetes-pods jobs are UP.
  - Query up{job="kubernetes-nodes"} to verify that metrics are being scraped.
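VI. Troubleshooting Common Issues
- Permission problems
  - Example error: Failed to list *v1.Pod: forbidden
  - Fix: check that the ClusterRoleBinding is bound to the correct ServiceAccount and namespace.
- Node Exporter not starting
  - Check that the DaemonSet is scheduled on all nodes and that the image pulls without errors.
- Prometheus cannot scrape metrics
  - Check that scrape_configs in the Prometheus configuration points to the correct port (Node Exporter listens on 9100 by default).
  - Verify network connectivity from inside the Prometheus Pod: kubectl exec -it prometheus-pod -- wget -qO- http://<NodeIP>:9100/metrics (the upstream prom/prometheus image is BusyBox-based and ships wget rather than curl; use curl if your image provides it).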
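VII. Next Steps
- Configure Alertmanager: add alerting rules and integrate Alertmanager to deliver alert notifications.
- Improve persistent storage: use a highly available storage solution (such as Ceph or Longhorn) to protect the data.
- Monitoring dashboards: deploy Grafana, add Prometheus as a data source, and configure dashboards.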
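The Prometheus configuration already lists /etc/prometheus/alert_rules.yml under rule_files, but the guide does not show the file itself. A minimal sketch of one alerting rule is given below, assuming the kubernetes-nodes job from the configuration above. The group name, alert name, 5-minute threshold, and severity value are illustrative choices, not taken from the original.

```yaml
# alert_rules.yml
groups:
  - name: node-availability            # hypothetical group name
    rules:
      - alert: NodeExporterDown        # hypothetical alert name
        expr: up{job="kubernetes-nodes"} == 0
        for: 5m                        # assumed threshold before the alert fires
        labels:
          severity: critical           # assumed severity label
        annotations:
          summary: "Node Exporter target {{ $labels.instance }} is down"
          description: "Prometheus has failed to scrape {{ $labels.instance }} for 5 minutes."
```

One way to load it is to add the file as a second key named alert_rules.yml in the prometheus-config ConfigMap. That ConfigMap is mounted at /etc/prometheus, so the rule file appears at the path that rule_files expects.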