[Cloud Platform Monitoring] Prometheus Monitoring Platform Deployment and Application
Table of Contents
- Prometheus Monitoring System
- Overview
- TSDB Storage Engine Characteristics
- Core Features
- Ecosystem Components
- Workflow
- Limitations
- Deploying Prometheus
- 1. Prometheus Server Deployment
- 2. Deploying Exporters
- 3. Deploying Grafana
- 4. Service Discovery
- End-to-End Guide: Deploying Prometheus and Grafana on a Kubernetes Cluster
- 1. Environment Preparation
- 2. Deploying Node Exporter
- Function: collect node resource metrics (CPU, memory, disk, etc.)
- Steps
- 3. Deploying Prometheus
- Function: time-series database + alerting rule engine
- Steps
- 4. Deploying Grafana
- Function: visualize monitoring data
- Steps
- 5. Deploying Alertmanager (email alerts)
- Function: receive alerts from Prometheus and send notifications
- Steps
- 6. Key Configuration Notes
- 7. Verification and Testing
- Summary
- Configuration
- Deploying Prometheus Outside the Kubernetes Cluster with API Server-Based Service Discovery
- Steps
- Create RBAC Authorization
- Obtain the ServiceAccount Token
- Configure Prometheus
- Verification
- Key Points
Prometheus Monitoring System
Overview
- Definition: an open-source service monitoring system and time-series database that provides a generic data model plus efficient collection, storage, and query interfaces.
- Core component: Prometheus Server
  - Data collection: periodically pulls metrics over HTTP from statically configured or service-discovered targets (pull model, every 15 seconds by default).
  - Data storage: samples are buffered in memory and flushed to disk (every 2 hours by default), with a WAL (write-ahead log) for crash recovery.
  - Alert handling: evaluates metrics against alerting rules and sends fired alerts to Alertmanager.
- Data sources:
  - Exporter: deployed on the monitored host, exposes an HTTP endpoint from which Prometheus scrapes metrics.
  - Pushgateway: receives metrics pushed by short-lived jobs and holds them for Prometheus to scrape (see the sketch after this list).
- Service discovery: supports static configuration or dynamic discovery (e.g. Kubernetes API integration).
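As a quick illustration of the push path mentioned above, this is a minimal sketch of pushing a one-off metric from a short-lived batch job; the Pushgateway address, job name, and metric name are placeholders, not part of this deployment:

```bash
# Push a single sample to a Pushgateway, grouped under job="backup_job".
# Prometheus then scrapes the Pushgateway like any other target.
echo "backup_duration_seconds 42" | \
  curl --data-binary @- http://<Pushgateway_IP>:9091/metrics/job/backup_job
```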
TSDB Storage Engine Characteristics
- Suited for:
  - Massive volumes of time-series data with predominantly sequential (append-only) writes.
  - Data ordered by time and rarely updated.
  - Deletion in blocks (by time range) rather than random deletes.
  - Large data volumes read mostly sequentially, where caching brings little benefit.
Core Features
- Multi-dimensional data model: a time series is identified by a metric name plus a set of key-value labels.
- Built-in time-series database (TSDB): efficient local storage for short-term data (15 days by default).
- PromQL query language: supports complex multi-dimensional queries.
- Hybrid collection model:
  - HTTP pull (scrape) by default.
  - Push supported via the Pushgateway.
- Dynamic service discovery: integrates with Kubernetes and other systems to manage scrape targets automatically.
- Visualization: integrates seamlessly with Grafana for dashboards.
Ecosystem Components

| Component | Function |
|---|---|
| Prometheus Server | Core component: data collection, storage, alerting-rule evaluation, and PromQL queries. |
| Client Library | SDKs for instrumenting applications with built-in metrics (Go, Java, and other languages). |
| Exporters | Collect metrics from third-party systems and expose them in Prometheus format. Common exporters: Node Exporter (host resources: CPU/memory/disk), cAdvisor (container resources), kube-state-metrics (state of Kubernetes objects such as Pods and Deployments), Blackbox Exporter (network probing: HTTP/TCP/ICMP). |
| Service Discovery | Dynamically discovers scrape targets (Kubernetes, Consul, DNS, and more). |
| Alertmanager | Alert routing, deduplication, and silencing; supports email, DingTalk, WeCom, and other notification channels. |
| Pushgateway | Temporarily stores metrics pushed by short-lived jobs for Prometheus to scrape. |
| Grafana | Visualization platform that uses Prometheus as a data source and provides rich dashboards. |
Workflow
- Data collection: Prometheus Server obtains targets from service discovery or static configuration and scrapes metrics from their exporters.
- Data storage: scraped samples are written to local storage through the TSDB.
- Alert generation: alerting rules are evaluated; fired alerts are sent to Alertmanager.
- Alert notification: Alertmanager delivers notifications to the configured receivers, such as email, DingTalk, or WeCom.
- Data querying: monitoring data can be queried with PromQL through the built-in web UI (or the HTTP API, see the sketch after this list).
- Data visualization: Grafana uses Prometheus as a data source and renders the metrics as dashboards.
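For example, the same PromQL you would type into the web UI can be issued against the HTTP API; a minimal sketch, with host and metric names illustrative (the CPU metric assumes Node Exporter is deployed as in the later sections):

```bash
# Instant query: which scrape targets are currently up?
curl -s 'http://<Prometheus_IP>:9090/api/v1/query' --data-urlencode 'query=up'

# Instant query: per-node CPU usage over the last 5 minutes
curl -s 'http://<Prometheus_IP>:9090/api/v1/query' \
  --data-urlencode 'query=100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'
```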
Limitations
- Data scope: suited to metrics monitoring, not to storing logs or events.
- Long-term storage: local storage is designed for short retention; external solutions such as InfluxDB are needed for long-term storage.
- Clustering: native clustering is limited; high availability can be achieved with Thanos or Cortex.

References:
- Official site: prometheus.io
- GitHub: github.com/prometheus
- Node Exporter metrics reference: link
Deploying Prometheus
1. Prometheus Server Deployment
- Install Prometheus
  - Upload and extract the Prometheus package:

```bash
cd /opt/
tar xf prometheus-2.35.0.linux-amd64.tar.gz
mv prometheus-2.35.0.linux-amd64 /usr/local/prometheus
```

  - Configuration file prometheus.yml explained:

```yaml
global:
  scrape_interval: 15s          # scrape interval
  evaluation_interval: 15s      # rule evaluation interval
  scrape_timeout: 10s           # scrape timeout
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
```

- Configure the systemd unit
  - Create prometheus.service:

```bash
cat > /usr/lib/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml \
  --storage.tsdb.path=/usr/local/prometheus/data/ \
  --storage.tsdb.retention.time=15d \
  --web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
```

- Start Prometheus

```bash
systemctl start prometheus
systemctl enable prometheus
netstat -natp | grep :9090
```

  - Visit http://<Prometheus_IP>:9090 to open the web UI.
2. Deploying Exporters
- Node Exporter (host-level monitoring)
  - Install Node Exporter:

```bash
cd /opt/
tar xf node_exporter-1.3.1.linux-amd64.tar.gz
mv node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin
```

  - Create the systemd unit:

```bash
cat > /usr/lib/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.ntp \
  --collector.mountstats \
  --collector.systemd \
  --collector.tcpstat
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
```

  - Start Node Exporter:

```bash
systemctl start node_exporter
systemctl enable node_exporter
netstat -natp | grep :9100
```

  - Add the Node Exporter targets to the Prometheus configuration:

```yaml
- job_name: nodes
  metrics_path: "/metrics"
  static_configs:
    - targets:
        - 192.168.80.30:9100
        - 192.168.80.11:9100
        - 192.168.80.12:9100
      labels:
        service: kubernetes
```

  - Reload the configuration:

```bash
curl -X POST http://192.168.80.30:9090/-/reload
```
- MySQL Exporter
  - Install MySQL Exporter:

```bash
cd /opt/
tar xf mysqld_exporter-0.14.0.linux-amd64.tar.gz
mv mysqld_exporter-0.14.0.linux-amd64/mysqld_exporter /usr/local/bin/
```

  - Grant the required MySQL privileges:

```sql
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost' IDENTIFIED BY 'abc123';
```

  - Start MySQL Exporter (the credentials file and systemd unit it relies on are sketched after this list):

```bash
systemctl start mysqld_exporter
systemctl enable mysqld_exporter
netstat -natp | grep :9104
```

  - Add the MySQL Exporter target to the Prometheus configuration:

```yaml
- job_name: mysqld
  metrics_path: "/metrics"
  static_configs:
    - targets:
        - 192.168.80.15:9104
      labels:
        service: mysqld
```
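The steps above assume a mysqld_exporter credentials file and systemd unit already exist; if not, a minimal sketch (the .my.cnf path is an assumption, adjust to your environment) looks like this:

```bash
# Credentials the exporter uses to connect to MySQL (matches the GRANT above)
mkdir -p /usr/local/mysqld_exporter
cat > /usr/local/mysqld_exporter/.my.cnf <<'EOF'
[client]
user=exporter
password=abc123
EOF

# systemd unit pointing the exporter at that credentials file
cat > /usr/lib/systemd/system/mysqld_exporter.service <<'EOF'
[Unit]
Description=mysqld_exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
```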
- Nginx Exporter
  - Build Nginx with the nginx-module-vts module:

```bash
./configure --prefix=/usr/local/nginx \
  --add-module=/usr/local/nginx-module-vts
make && make install
```

  - Configure Nginx:

```nginx
http {
    vhost_traffic_status_zone;
    vhost_traffic_status_filter_by_host on;

    server {
        listen 8080;
        location /status {
            vhost_traffic_status_display;
            vhost_traffic_status_display_format html;
        }
    }
}
```

  - Start the Nginx Exporter (a sketch of its systemd unit follows this list):

```bash
systemctl start nginx-exporter
systemctl enable nginx-exporter
netstat -natp | grep :9913
```

  - Add the Nginx Exporter target to the Prometheus configuration:

```yaml
- job_name: nginx
  metrics_path: "/metrics"
  static_configs:
    - targets:
        - 192.168.80.15:9913
      labels:
        service: nginx
```
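The nginx-exporter service started above refers to an nginx-vts-exporter binary whose installation is not shown; a minimal unit sketch under that assumption (binary path and VTS status URL are assumptions) might look like:

```bash
cat > /usr/lib/systemd/system/nginx-exporter.service <<'EOF'
[Unit]
Description=nginx-vts-exporter
After=network.target

[Service]
Type=simple
# Read the VTS JSON status page exposed on port 8080 above
ExecStart=/usr/local/bin/nginx-vts-exporter -nginx.scrape_uri=http://localhost:8080/status/format/json
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
```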
3. Deploying Grafana
- Install Grafana

```bash
yum install -y grafana-7.4.0-1.x86_64.rpm
systemctl start grafana-server
systemctl enable grafana-server
netstat -natp | grep :3000
```

  - Visit http://<Grafana_IP>:3000; the default credentials are admin/admin.
- Configure the data source
  - Add a Prometheus data source with the URL http://<Prometheus_IP>:9090.
- Import dashboards
  - Download a template from Grafana Dashboards and import it into Grafana.
4. Service Discovery
- File-based service discovery
  - Create a target file:

```yaml
- targets:
    - 192.168.80.30:9100
    - 192.168.80.15:9100
  labels:
    app: node-exporter
    job: node
```

  - Update the Prometheus configuration:

```yaml
- job_name: nodes
  file_sd_configs:
    - files:
        - targets/node*.yaml
      refresh_interval: 2m
```

- Consul-based service discovery
  - Start Consul:

```bash
consul agent -server -bootstrap -ui \
  -data-dir=/usr/local/consul/data \
  -bind=192.168.80.14 -client=0.0.0.0 \
  -node=consul-server01 &
```

  - Register the service (one way to load this definition is sketched after this list):

```json
{
  "services": [
    {
      "id": "node_exporter-node01",
      "name": "node01",
      "address": "192.168.80.30",
      "port": 9100,
      "tags": ["nodes"],
      "checks": [{
        "http": "http://192.168.80.30:9100/metrics",
        "interval": "5s"
      }]
    }
  ]
}
```

  - Update the Prometheus configuration:

```yaml
- job_name: nodes
  consul_sd_configs:
    - server: 192.168.80.14:8500
      tags:
        - nodes
      refresh_interval: 2m
```
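The service definition above uses Consul's configuration-file format; a sketch of loading it, assuming the agent is (re)started with a config directory such as /usr/local/consul/config:

```bash
# Place the definition in the agent's config directory, then reload the agent
mkdir -p /usr/local/consul/config
cp nodes.json /usr/local/consul/config/
consul reload

# Confirm the service is registered
curl -s http://192.168.80.14:8500/v1/catalog/services
```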
End-to-End Guide: Deploying Prometheus and Grafana on a Kubernetes Cluster
1. Environment Preparation
- Cluster nodes:
  - Control-plane node / master01: 192.168.80.10
  - Worker node / node01: 192.168.80.11
  - Worker node / node02: 192.168.80.12
- Namespace: monitor-sa (for the monitoring components)
2. Deploying Node Exporter
Function: collect node resource metrics (CPU, memory, disk, etc.)
Steps:
- Create the namespace:

```bash
kubectl create ns monitor-sa
```

- Deploy the DaemonSet:

```yaml
# node-export.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor-sa
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:v0.16.0
        ports:
        - containerPort: 9100
        args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        volumeMounts:
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
```

```bash
kubectl apply -f node-export.yaml
```

- Verify:

```bash
kubectl get pods -n monitor-sa -o wide
curl http://<NodeIP>:9100/metrics   # check that metrics are exposed
```
3. Deploying Prometheus
Function: time-series database + alerting rule engine
Steps:
- Create the ServiceAccount and RBAC binding:

```bash
kubectl create serviceaccount monitor -n monitor-sa
kubectl create clusterrolebinding monitor-clusterrolebinding \
  --clusterrole=cluster-admin \
  --serviceaccount=monitor-sa:monitor
```

- Create the Prometheus configuration ConfigMap:

```yaml
# prometheus-cfg.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: prometheus-config
  namespace: monitor-sa
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
```

```bash
kubectl apply -f prometheus-cfg.yaml
```

- Deploy Prometheus:

```yaml
# prometheus-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor-sa
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: prom/prometheus:v2.35.0
        args:
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
      volumes:
      - name: config
        configMap:
          name: prometheus-config
```

```bash
kubectl apply -f prometheus-deploy.yaml
```

- Create a Service to expose the port:

```yaml
# prometheus-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor-sa
spec:
  type: NodePort
  ports:
  - port: 9090
    nodePort: 31000
  selector:
    app: prometheus
```

```bash
kubectl apply -f prometheus-svc.yaml
```

- Access the web UI at http://<NodeIP>:31000 (see the quick check below).
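A quick way to confirm the deployment before moving on; a sketch, where the NodePort and namespace follow the manifests above:

```bash
kubectl get pods,svc -n monitor-sa
# The built-in /-/healthy endpoint returns HTTP 200 once the server is up
curl -i http://<NodeIP>:31000/-/healthy
```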
4. Deploying Grafana
Function: visualize monitoring data
Steps:
- Deploy Grafana:

```yaml
# grafana.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitor-sa
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:9.0.0
        ports:
        - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitor-sa
spec:
  type: NodePort
  ports:
  - port: 3000
    nodePort: 32000
  selector:
    app: grafana
```

```bash
kubectl apply -f grafana.yaml
```

- Configure the data source:
  - Visit http://<NodeIP>:32000; the default credentials are admin/admin.
  - Add a Prometheus data source: http://prometheus.monitor-sa.svc:9090
- Import dashboards:
  - Search Grafana Dashboards for a template (such as 3119 or 315) and import its JSON.
5. Deploying Alertmanager (email alerts)
Function: receive alerts from Prometheus and send notifications
Steps:
- Create the Alertmanager configuration:

```yaml
# alertmanager-cm.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitor-sa
data:
  alertmanager.yml: |
    global:
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_from: 'your-email@qq.com'
      smtp_auth_username: 'your-email@qq.com'
      smtp_auth_password: 'your-smtp-token'   # QQ mail SMTP authorization code
    route:
      group_by: [alertname]
      receiver: 'default'
    receivers:
    - name: 'default'
      email_configs:
      - to: 'alert-receiver@example.com'
```

```bash
kubectl apply -f alertmanager-cm.yaml
```

- Deploy Alertmanager:

```yaml
# alertmanager-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitor-sa
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.24.0
        args:
        - --config.file=/etc/alertmanager/alertmanager.yml
        volumeMounts:
        - name: config
          mountPath: /etc/alertmanager
      volumes:
      - name: config
        configMap:
          name: alertmanager
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitor-sa
spec:
  type: NodePort
  ports:
  - port: 9093
    nodePort: 30066
  selector:
    app: alertmanager
```

```bash
kubectl apply -f alertmanager-deploy.yaml
```

- Configure Prometheus alerting rules (see the example rule file after this list):

```yaml
# Add to prometheus.yml in prometheus-cfg.yaml
rule_files:
- "alert-rules.yml"
```
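The referenced alert-rules.yml is not shown in the original steps; a minimal sketch (alert name, threshold, and duration are illustrative) that fires when a node's exporter stops responding could look like the following. It would need to be mounted into the Prometheus container alongside prometheus.yml, and the alerting: block of prometheus.yml must also point at the alertmanager Service.

```yaml
# alert-rules.yml (illustrative)
groups:
- name: node-alerts
  rules:
  - alert: NodeExporterDown
    expr: up{job="kubernetes-node"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Node exporter on {{ $labels.instance }} is down"
      description: "The target has been unreachable for more than 1 minute."
```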
6. Key Configuration Notes
- Service discovery:
  - Native Kubernetes support: kubernetes_sd_configs automatically discovers nodes, Pods, and Services.
  - Relabel configuration: rewrites labels so that discovered targets are scraped correctly.
- Alert routing:
  - Grouping and throttling: use group_by and repeat_interval to avoid alert floods (see the route sketch below).
  - Email templates: custom HTML templates can make alert emails easier to read.
- High-availability suggestions:
  - Prometheus federation: aggregate data across clusters.
  - Thanos/Cortex: long-term storage and query optimization.
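As a sketch of the routing options mentioned above (values are illustrative and would be merged into the alertmanager.yml from step 5):

```yaml
route:
  group_by: [alertname, namespace]   # batch related alerts into one notification
  group_wait: 30s                    # wait before sending the first notification of a group
  group_interval: 5m                 # wait before sending updates for an existing group
  repeat_interval: 4h                # re-send an unresolved alert at most every 4 hours
  receiver: 'default'
```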
7. Verification and Testing
- Trigger a test alert (see the command sketch after this list):
  - Manually stop node-exporter on one node and watch the target's state on the Prometheus Targets page.
  - Check the Alertmanager UI (http://<NodeIP>:30066) to confirm the alert arrives.
- Email delivery:
  - Make sure the mailbox settings are correct, and check the spam folder.
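One possible way to run that test from the command line; a sketch that temporarily removes the node-exporter DaemonSet (since DaemonSet Pods restart automatically if only deleted) and watches state through the Prometheus HTTP API:

```bash
# Temporarily remove the node-exporter DaemonSet so its targets go DOWN
kubectl delete -f node-export.yaml

# Watch target health and active alerts
curl -s http://<NodeIP>:31000/api/v1/targets | grep -o '"health":"[^"]*"'
curl -s http://<NodeIP>:31000/api/v1/alerts

# Restore the exporters after the test
kubectl apply -f node-export.yaml
```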
Summary
Following the steps above, you can quickly build a full Prometheus + Grafana + Alertmanager monitoring and alerting stack on a Kubernetes cluster. The key points are namespace isolation (monitor-sa), RBAC authorization for the Prometheus ServiceAccount, Kubernetes-native service discovery with relabeling, and Alertmanager routing for email notifications.
Configuration
A complete Prometheus scrape-configuration ConfigMap (namespace prom) for in-cluster service discovery:
kind: ConfigMap
apiVersion: v1
metadata:
labels:
app: prometheus
name: prometheus-config
namespace: prom
data:
prometheus.yml: |
# A scrape configuration for running Prometheus on a Kubernetes cluster.
# This uses separate scrape configs for cluster components (i.e. API server, node)
# and services to allow each to use different authentication configs.
#
# Kubernetes labels will be added as Prometheus labels on metrics via the
# `labelmap` relabeling action.
#
# If you are using Kubernetes 1.7.2 or earlier, please take note of the comments
# for the kubernetes-cadvisor job; you will need to edit or remove this job.
# Scrape config for API servers.
#
# Kubernetes exposes API servers as endpoints to the default/kubernetes
# service so this uses `endpoints` role and uses relabelling to only keep
# the endpoints associated with the default/kubernetes service using the
# default named port `https`. This works for single API server deployments as
# well as HA API server deployments.
global:
scrape_interval: 15s
scrape_timeout: 10s
evaluation_interval: 1m
scrape_configs:
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
# Default to scraping over https. If required, just disable this or change to
# `http`.
scheme: https
# This TLS & bearer token file config is used to connect to the actual scrape
# endpoints for cluster components. This is separate to discovery auth
# configuration because discovery & scraping are two separate concerns in
# Prometheus. The discovery auth config is automatic if Prometheus runs inside
# the cluster. Otherwise, more config options have to be provided within the
# <kubernetes_sd_config>.
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# If your node certificates are self-signed or use a different CA to the
# master CA, then disable certificate verification below. Note that
# certificate verification is an integral part of a secure infrastructure
# so this should only be disabled in a controlled environment. You can
# disable certificate verification by uncommenting the line below.
#
# insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
# Keep only the default/kubernetes service endpoints for the https port. This
# will add targets for each API server which Kubernetes adds an endpoint to
# the default/kubernetes service.
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Scrape config for nodes (kubelet).
#
# Rather than connecting directly to the node, the scrape is proxied though the
# Kubernetes apiserver. This means it will work if Prometheus is running out of
# cluster, or can't connect to nodes for some other reason (e.g. because of
# firewalling).
- job_name: 'kubernetes-nodes'
# Default to scraping over https. If required, just disable this or change to
# `http`.
scheme: https
# This TLS & bearer token file config is used to connect to the actual scrape
# endpoints for cluster components. This is separate to discovery auth
# configuration because discovery & scraping are two separate concerns in
# Prometheus. The discovery auth config is automatic if Prometheus runs inside
# the cluster. Otherwise, more config options have to be provided within the
# <kubernetes_sd_config>.
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Scrape config for Kubelet cAdvisor.
#
# This is required for Kubernetes 1.7.3 and later, where cAdvisor metrics
# (those whose names begin with 'container_') have been removed from the
# Kubelet metrics endpoint. This job scrapes the cAdvisor endpoint to
# retrieve those metrics.
#
# In Kubernetes 1.7.0-1.7.2, these metrics are only exposed on the cAdvisor
# HTTP endpoint; use "replacement: /api/v1/nodes/${1}:4194/proxy/metrics"
# in that case (and ensure cAdvisor's HTTP server hasn't been disabled with
# the --cadvisor-port=0 Kubelet flag).
#
# This job is not necessary and should be removed in Kubernetes 1.6 and
# earlier versions, or it will cause the metrics to be scraped twice.
- job_name: 'kubernetes-cadvisor'
# Default to scraping over https. If required, just disable this or change to
# `http`.
scheme: https
# This TLS & bearer token file config is used to connect to the actual scrape
# endpoints for cluster components. This is separate to discovery auth
# configuration because discovery & scraping are two separate concerns in
# Prometheus. The discovery auth config is automatic if Prometheus runs inside
# the cluster. Otherwise, more config options have to be provided within the
# <kubernetes_sd_config>.
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
# Scrape config for service endpoints.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/scrape`: Only scrape services that have a value of `true`
# * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
# to set this to `https` & most likely set the `tls_config` of the scrape config.
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: If the metrics are exposed on a different port to the
# service then set this appropriately.
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
# Example scrape config for pods
#
# The relabeling allows the actual pod scrape endpoint to be configured via the
# following annotations:
#
# * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
# pod's declared ports (default is a port-free target if none are declared).
- job_name: 'kubernetes-pods'
# if you want to use metrics on jobs, set the below field to
# true to prevent Prometheus from setting the `job` label
# automatically.
honor_labels: false
kubernetes_sd_configs:
- role: pod
# skip verification so you can do HTTPS to pods
tls_config:
insecure_skip_verify: true
# make sure your labels are in order
relabel_configs:
# these labels tell Prometheus to automatically attach source
# pod and namespace information to each collected sample, so
# that they'll be exposed in the custom metrics API automatically.
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: pod
# these labels tell Prometheus to look for
# prometheus.io/{scrape,path,port} annotations to configure
# how to scrape
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (.+)
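As a usage example of the annotation-driven discovery described in the kubernetes-service-endpoints and kubernetes-pods jobs above (a sketch; the Service name, selector, port, and metrics path are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-app
  annotations:
    prometheus.io/scrape: "true"    # opt this Service's endpoints into scraping
    prometheus.io/path: "/metrics"  # override if the metrics path is not /metrics
    prometheus.io/port: "8080"      # scrape this port instead of the service port
spec:
  selector:
    app: example-app
  ports:
  - port: 80
    targetPort: 8080
```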
Deploying Prometheus Outside the Kubernetes Cluster with API Server-Based Service Discovery
- Scenario: Prometheus runs outside the Kubernetes cluster but needs to monitor resources inside it.
- Challenge: an out-of-cluster Prometheus cannot discover in-cluster resources directly; service discovery has to go through the Kubernetes API Server.
Steps
Create RBAC Authorization
- Create the Namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
```

- Create the ServiceAccount:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: outside-prometheus
  namespace: monitoring
```

- Create the ClusterRole:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: outside-prometheus
rules:
- apiGroups: [""]
  resources: ["nodes", "services", "endpoints", "pods", "nodes/proxy"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["configmaps", "nodes/metrics"]
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
```

- Create the ClusterRoleBinding:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: outside-prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: outside-prometheus
subjects:
- kind: ServiceAccount
  name: outside-prometheus
  namespace: monitoring
```

- Apply the RBAC manifests:

```bash
kubectl apply -f rbac.yaml
```
Obtain the ServiceAccount Token
- Extract the token:

```bash
TOKEN=$(kubectl get secret $(kubectl -n monitoring get secret | awk '/outside-prometheus/{print $1}') \
  -n monitoring -o jsonpath={.data.token} | base64 -d)
echo $TOKEN
```

- Save the token on the Prometheus node:

```bash
echo $TOKEN > /usr/local/prometheus/kubernetes-api-token
```

- Copy the Kubernetes CA certificate to the Prometheus node:

```bash
scp /etc/kubernetes/pki/ca.crt <prometheus_node>:/usr/local/prometheus/
```
Configure Prometheus
- Edit the Prometheus configuration file:

```yaml
scrape_configs:
# API Server discovery
- job_name: 'kubernetes-apiserver'
  kubernetes_sd_configs:
  - role: endpoints
    api_server: https://192.168.80.10:6443        # API Server address
    tls_config:
      ca_file: /usr/local/prometheus/ca.crt       # Kubernetes CA certificate
    authorization:
      credentials_file: /usr/local/prometheus/kubernetes-api-token   # token file
  scheme: https
  tls_config:
    ca_file: /usr/local/prometheus/ca.crt
  authorization:
    credentials_file: /usr/local/prometheus/kubernetes-api-token
  relabel_configs:
  - source_labels: ["__meta_kubernetes_namespace", "__meta_kubernetes_endpoints_name", "__meta_kubernetes_endpoint_port_name"]
    regex: default;kubernetes;https
    action: keep
# Node discovery
- job_name: 'kubernetes-nodes'
  kubernetes_sd_configs:
  - role: node
    api_server: https://192.168.80.10:6443
    tls_config:
      ca_file: /usr/local/prometheus/ca.crt
    authorization:
      credentials_file: /usr/local/prometheus/kubernetes-api-token
  relabel_configs:
  - source_labels: ["__address__"]
    regex: (.*):10250
    action: replace
    target_label: __address__
    replacement: $1:9100
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
```

- Reload the Prometheus configuration (a validation sketch follows):

```bash
curl -X POST http://localhost:9090/-/reload
```
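Before reloading, it is worth validating the file; promtool ships in the Prometheus tarball extracted earlier (a sketch, assuming the paths used in this deployment):

```bash
# Check syntax of the config file and any referenced rule files; expect SUCCESS
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
# Then trigger the hot reload shown above.
```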
Verification
- Open the Prometheus web UI:
  - Go to http://<Prometheus_IP>:9090/targets and check that the targets are in the UP state.
- Check service discovery:
  - Confirm that Prometheus discovers the cluster's nodes, Pods, Services, and other resources correctly.
Key Points
- RBAC authorization:
  - Grant Prometheus sufficient permissions to read the relevant resources through the Kubernetes API Server.
- Token and CA certificate:
  - The token authenticates Prometheus; the CA certificate verifies the API Server's TLS certificate.
- Relabel configuration:
  - relabel_configs rewrites labels so that Prometheus scrapes the discovered targets correctly.
- Service discovery roles:
  - role: endpoints discovers the Endpoints of Kubernetes Services.
  - role: node discovers the cluster's nodes.

Notes:
- Network connectivity: the Prometheus node must be able to reach the Kubernetes API Server.
- Security: keep the token and CA certificate safe and avoid leaking them.
- Performance: in large clusters, service discovery and scraping need tuning.