[Cloud Platform Monitoring] Prometheus Monitoring Platform Deployment and Application
Table of Contents
- Prometheus Monitoring System
- Overview
- TSDB Storage Engine Characteristics
- Core Features
- Ecosystem Components
- Workflow
- Limitations
- Deploying Prometheus
- 1. Prometheus Server Deployment
- 2. Deploying Exporters
- 3. Deploying Grafana
- 4. Service Discovery
- End-to-End Guide: Deploying Prometheus and Grafana on a Kubernetes Cluster
- 1. Environment Preparation
- 2. Deploying Node Exporter
- Function: collect node resource metrics (CPU, memory, disk, etc.)
- Steps
- 3. Deploying Prometheus
- Function: time-series database + alerting rule engine
- Steps
- 4. Deploying Grafana
- Function: visualize monitoring data
- Steps
- 5. Deploying Alertmanager (email alerts)
- Function: receive alerts from Prometheus and send notifications
- Steps
- 6. Key Configuration Notes
- 7. Verification and Testing
- Summary
- Configuration
- Deploying Prometheus Outside the Kubernetes Cluster with API Server-Based Service Discovery
- Steps
- Create RBAC Authorization
- Obtain the ServiceAccount Token
- Configure Prometheus
- Verification
- Key Points
Prometheus Monitoring System
Overview
- Definition: an open-source service monitoring system and time-series database that provides a generic data model plus efficient collection, storage, and query interfaces.
- Core component: Prometheus Server
  - Data collection: periodically pulls metrics over HTTP from statically configured or service-discovered targets (pull model, every 15 seconds by default).
  - Data storage: samples are buffered in memory and flushed to disk (every 2 hours by default), with a WAL (write-ahead log) for crash recovery.
  - Alert handling: evaluates metrics against alerting rules and sends fired alerts to Alertmanager.
- Data sources:
  - Exporter: deployed on the monitored host, exposes an HTTP endpoint from which Prometheus scrapes metrics.
  - Pushgateway: receives metrics pushed by short-lived jobs and holds them for Prometheus to scrape (see the sketch after this list).
- Service discovery: supports static configuration or dynamic discovery (e.g. Kubernetes API integration).
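As a quick illustration of the push path mentioned above, this is a minimal sketch of pushing a one-off metric from a short-lived batch job; the Pushgateway address, job name, and metric name are placeholders, not part of this deployment:

```bash
# Push a single sample to a Pushgateway, grouped under job="backup_job".
# Prometheus then scrapes the Pushgateway like any other target.
echo "backup_duration_seconds 42" | \
  curl --data-binary @- http://<Pushgateway_IP>:9091/metrics/job/backup_job
```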
TSDB Storage Engine Characteristics
- Suited for:
  - Massive volumes of time-series data with predominantly sequential (append-only) writes.
  - Data ordered by time and rarely updated.
  - Deletion in blocks (by time range) rather than random deletes.
  - Large data volumes read mostly sequentially, where caching brings little benefit.
Core Features
- Multi-dimensional data model: a time series is identified by a metric name plus a set of key-value labels.
- Built-in time-series database (TSDB): efficient local storage for short-term data (15 days by default).
- PromQL query language: supports complex multi-dimensional queries.
- Hybrid collection model:
  - HTTP pull (scrape) by default.
  - Push supported via the Pushgateway.
- Dynamic service discovery: integrates with Kubernetes and other systems to manage scrape targets automatically.
- Visualization: integrates seamlessly with Grafana for dashboards.
Ecosystem Components

| Component | Function |
|---|---|
| Prometheus Server | Core component: data collection, storage, alerting-rule evaluation, and PromQL queries. |
| Client Library | SDKs for instrumenting applications with built-in metrics (Go, Java, and other languages). |
| Exporters | Collect metrics from third-party systems and expose them in Prometheus format. Common exporters: Node Exporter (host resources: CPU/memory/disk), cAdvisor (container resources), kube-state-metrics (state of Kubernetes objects such as Pods and Deployments), Blackbox Exporter (network probing: HTTP/TCP/ICMP). |
| Service Discovery | Dynamically discovers scrape targets (Kubernetes, Consul, DNS, and more). |
| Alertmanager | Alert routing, deduplication, and silencing; supports email, DingTalk, WeCom, and other notification channels. |
| Pushgateway | Temporarily stores metrics pushed by short-lived jobs for Prometheus to scrape. |
| Grafana | Visualization platform that uses Prometheus as a data source and provides rich dashboards. |
Workflow
- Data collection: Prometheus Server obtains targets from service discovery or static configuration and scrapes metrics from their exporters.
- Data storage: scraped samples are written to local storage through the TSDB.
- Alert generation: alerting rules are evaluated; fired alerts are sent to Alertmanager.
- Alert notification: Alertmanager delivers notifications to the configured receivers, such as email, DingTalk, or WeCom.
- Data querying: monitoring data can be queried with PromQL through the built-in web UI (or the HTTP API, see the sketch after this list).
- Data visualization: Grafana uses Prometheus as a data source and renders the metrics as dashboards.
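For example, the same PromQL you would type into the web UI can be issued against the HTTP API; a minimal sketch, with host and metric names illustrative (the CPU metric assumes Node Exporter is deployed as in the later sections):

```bash
# Instant query: which scrape targets are currently up?
curl -s 'http://<Prometheus_IP>:9090/api/v1/query' --data-urlencode 'query=up'

# Instant query: per-node CPU usage over the last 5 minutes
curl -s 'http://<Prometheus_IP>:9090/api/v1/query' \
  --data-urlencode 'query=100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'
```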
Limitations
- Data scope: suited to metrics monitoring, not to storing logs or events.
- Long-term storage: local storage is designed for short retention; external solutions such as InfluxDB are needed for long-term storage.
- Clustering: native clustering is limited; high availability can be achieved with Thanos or Cortex.

References:
- Official site: prometheus.io
- GitHub: github.com/prometheus
- Node Exporter metrics reference: link
Deploying Prometheus
1. Prometheus Server Deployment
- Install Prometheus
  - Upload and extract the Prometheus package:

```bash
cd /opt/
tar xf prometheus-2.35.0.linux-amd64.tar.gz
mv prometheus-2.35.0.linux-amd64 /usr/local/prometheus
```

  - Configuration file prometheus.yml explained:

```yaml
global:
  scrape_interval: 15s          # scrape interval
  evaluation_interval: 15s      # rule evaluation interval
  scrape_timeout: 10s           # scrape timeout
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
```

- Configure the systemd unit
  - Create prometheus.service:

```bash
cat > /usr/lib/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml \
  --storage.tsdb.path=/usr/local/prometheus/data/ \
  --storage.tsdb.retention.time=15d \
  --web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
```

- Start Prometheus

```bash
systemctl start prometheus
systemctl enable prometheus
netstat -natp | grep :9090
```

  - Visit http://<Prometheus_IP>:9090 to open the web UI.
2. Deploying Exporters
- Node Exporter (host-level monitoring)
  - Install Node Exporter:

```bash
cd /opt/
tar xf node_exporter-1.3.1.linux-amd64.tar.gz
mv node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin
```

  - Create the systemd unit:

```bash
cat > /usr/lib/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.ntp \
  --collector.mountstats \
  --collector.systemd \
  --collector.tcpstat
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
```

  - Start Node Exporter:

```bash
systemctl start node_exporter
systemctl enable node_exporter
netstat -natp | grep :9100
```

  - Add the Node Exporter targets to the Prometheus configuration:

```yaml
- job_name: nodes
  metrics_path: "/metrics"
  static_configs:
    - targets:
        - 192.168.80.30:9100
        - 192.168.80.11:9100
        - 192.168.80.12:9100
      labels:
        service: kubernetes
```

  - Reload the configuration:

```bash
curl -X POST http://192.168.80.30:9090/-/reload
```
- MySQL Exporter
  - Install MySQL Exporter:

```bash
cd /opt/
tar xf mysqld_exporter-0.14.0.linux-amd64.tar.gz
mv mysqld_exporter-0.14.0.linux-amd64/mysqld_exporter /usr/local/bin/
```

  - Grant the required MySQL privileges:

```sql
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost' IDENTIFIED BY 'abc123';
```

  - Start MySQL Exporter (the credentials file and systemd unit it relies on are sketched after this list):

```bash
systemctl start mysqld_exporter
systemctl enable mysqld_exporter
netstat -natp | grep :9104
```

  - Add the MySQL Exporter target to the Prometheus configuration:

```yaml
- job_name: mysqld
  metrics_path: "/metrics"
  static_configs:
    - targets:
        - 192.168.80.15:9104
      labels:
        service: mysqld
```
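The steps above assume a mysqld_exporter credentials file and systemd unit already exist; if not, a minimal sketch (the .my.cnf path is an assumption, adjust to your environment) looks like this:

```bash
# Credentials the exporter uses to connect to MySQL (matches the GRANT above)
mkdir -p /usr/local/mysqld_exporter
cat > /usr/local/mysqld_exporter/.my.cnf <<'EOF'
[client]
user=exporter
password=abc123
EOF

# systemd unit pointing the exporter at that credentials file
cat > /usr/lib/systemd/system/mysqld_exporter.service <<'EOF'
[Unit]
Description=mysqld_exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
```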
- Nginx Exporter
  - Build Nginx with the nginx-module-vts module:

```bash
./configure --prefix=/usr/local/nginx \
  --add-module=/usr/local/nginx-module-vts
make && make install
```

  - Configure Nginx:

```nginx
http {
    vhost_traffic_status_zone;
    vhost_traffic_status_filter_by_host on;

    server {
        listen 8080;
        location /status {
            vhost_traffic_status_display;
            vhost_traffic_status_display_format html;
        }
    }
}
```

  - Start the Nginx Exporter (a sketch of its systemd unit follows this list):

```bash
systemctl start nginx-exporter
systemctl enable nginx-exporter
netstat -natp | grep :9913
```

  - Add the Nginx Exporter target to the Prometheus configuration:

```yaml
- job_name: nginx
  metrics_path: "/metrics"
  static_configs:
    - targets:
        - 192.168.80.15:9913
      labels:
        service: nginx
```
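The nginx-exporter service started above refers to an nginx-vts-exporter binary whose installation is not shown; a minimal unit sketch under that assumption (binary path and VTS status URL are assumptions) might look like:

```bash
cat > /usr/lib/systemd/system/nginx-exporter.service <<'EOF'
[Unit]
Description=nginx-vts-exporter
After=network.target

[Service]
Type=simple
# Read the VTS JSON status page exposed on port 8080 above
ExecStart=/usr/local/bin/nginx-vts-exporter -nginx.scrape_uri=http://localhost:8080/status/format/json
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
```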
3. Deploying Grafana
- Install Grafana

```bash
yum install -y grafana-7.4.0-1.x86_64.rpm
systemctl start grafana-server
systemctl enable grafana-server
netstat -natp | grep :3000
```

  - Visit http://<Grafana_IP>:3000; the default credentials are admin/admin.
- Configure the data source
  - Add a Prometheus data source with the URL http://<Prometheus_IP>:9090.
- Import dashboards
  - Download a template from Grafana Dashboards and import it into Grafana.
4. Service Discovery
- File-based service discovery
  - Create a target file:

```yaml
- targets:
    - 192.168.80.30:9100
    - 192.168.80.15:9100
  labels:
    app: node-exporter
    job: node
```

  - Update the Prometheus configuration:

```yaml
- job_name: nodes
  file_sd_configs:
    - files:
        - targets/node*.yaml
      refresh_interval: 2m
```

- Consul-based service discovery
  - Start Consul:

```bash
consul agent -server -bootstrap -ui \
  -data-dir=/usr/local/consul/data \
  -bind=192.168.80.14 -client=0.0.0.0 \
  -node=consul-server01 &
```

  - Register the service (one way to load this definition is sketched after this list):

```json
{
  "services": [
    {
      "id": "node_exporter-node01",
      "name": "node01",
      "address": "192.168.80.30",
      "port": 9100,
      "tags": ["nodes"],
      "checks": [{
        "http": "http://192.168.80.30:9100/metrics",
        "interval": "5s"
      }]
    }
  ]
}
```

  - Update the Prometheus configuration:

```yaml
- job_name: nodes
  consul_sd_configs:
    - server: 192.168.80.14:8500
      tags:
        - nodes
      refresh_interval: 2m
```
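The service definition above uses Consul's configuration-file format; a sketch of loading it, assuming the agent is (re)started with a config directory such as /usr/local/consul/config:

```bash
# Place the definition in the agent's config directory, then reload the agent
mkdir -p /usr/local/consul/config
cp nodes.json /usr/local/consul/config/
consul reload

# Confirm the service is registered
curl -s http://192.168.80.14:8500/v1/catalog/services
```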
End-to-End Guide: Deploying Prometheus and Grafana on a Kubernetes Cluster
1. Environment Preparation
- Cluster nodes:
  - Control-plane node / master01: 192.168.80.10
  - Worker node / node01: 192.168.80.11
  - Worker node / node02: 192.168.80.12
- Namespace: monitor-sa (for the monitoring components)
2. Deploying Node Exporter
Function: collect node resource metrics (CPU, memory, disk, etc.)
Steps:
- Create the namespace:

```bash
kubectl create ns monitor-sa
```

- Deploy the DaemonSet:

```yaml
# node-export.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor-sa
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:v0.16.0
        ports:
        - containerPort: 9100
        args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        volumeMounts:
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
```

```bash
kubectl apply -f node-export.yaml
```

- Verify:

```bash
kubectl get pods -n monitor-sa -o wide
curl http://<NodeIP>:9100/metrics   # check that metrics are exposed
```
3. Deploying Prometheus
Function: time-series database + alerting rule engine
Steps:
- Create the ServiceAccount and RBAC binding:

```bash
kubectl create serviceaccount monitor -n monitor-sa
kubectl create clusterrolebinding monitor-clusterrolebinding \
  --clusterrole=cluster-admin \
  --serviceaccount=monitor-sa:monitor
```

- Create the Prometheus configuration ConfigMap:

```yaml
# prometheus-cfg.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: prometheus-config
  namespace: monitor-sa
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
```

```bash
kubectl apply -f prometheus-cfg.yaml
```

- Deploy Prometheus:

```yaml
# prometheus-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor-sa
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: prom/prometheus:v2.35.0
        args:
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
      volumes:
      - name: config
        configMap:
          name: prometheus-config
```

```bash
kubectl apply -f prometheus-deploy.yaml
```

- Create a Service to expose the port:

```yaml
# prometheus-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor-sa
spec:
  type: NodePort
  ports:
  - port: 9090
    nodePort: 31000
  selector:
    app: prometheus
```

```bash
kubectl apply -f prometheus-svc.yaml
```

- Access the web UI at http://<NodeIP>:31000 (see the quick check below).
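A quick way to confirm the deployment before moving on; a sketch, where the NodePort and namespace follow the manifests above:

```bash
kubectl get pods,svc -n monitor-sa
# The built-in /-/healthy endpoint returns HTTP 200 once the server is up
curl -i http://<NodeIP>:31000/-/healthy
```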
4. Deploying Grafana
Function: visualize monitoring data
Steps:
- Deploy Grafana:

```yaml
# grafana.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitor-sa
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:9.0.0
        ports:
        - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitor-sa
spec:
  type: NodePort
  ports:
  - port: 3000
    nodePort: 32000
  selector:
    app: grafana
```

```bash
kubectl apply -f grafana.yaml
```

- Configure the data source:
  - Visit http://<NodeIP>:32000; the default credentials are admin/admin.
  - Add a Prometheus data source: http://prometheus.monitor-sa.svc:9090
- Import dashboards:
  - Search Grafana Dashboards for a template (such as 3119 or 315) and import its JSON.
5. Deploying Alertmanager (email alerts)
Function: receive alerts from Prometheus and send notifications
Steps:
- Create the Alertmanager configuration:

```yaml
# alertmanager-cm.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitor-sa
data:
  alertmanager.yml: |
    global:
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_from: 'your-email@qq.com'
      smtp_auth_username: 'your-email@qq.com'
      smtp_auth_password: 'your-smtp-token'   # QQ mail SMTP authorization code
    route:
      group_by: [alertname]
      receiver: 'default'
    receivers:
    - name: 'default'
      email_configs:
      - to: 'alert-receiver@example.com'
```

```bash
kubectl apply -f alertmanager-cm.yaml
```

- Deploy Alertmanager:

```yaml
# alertmanager-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitor-sa
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.24.0
        args:
        - --config.file=/etc/alertmanager/alertmanager.yml
        volumeMounts:
        - name: config
          mountPath: /etc/alertmanager
      volumes:
      - name: config
        configMap:
          name: alertmanager
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitor-sa
spec:
  type: NodePort
  ports:
  - port: 9093
    nodePort: 30066
  selector:
    app: alertmanager
```

```bash
kubectl apply -f alertmanager-deploy.yaml
```

- Configure Prometheus alerting rules (see the example rule file after this list):

```yaml
# Add to prometheus.yml in prometheus-cfg.yaml
rule_files:
- "alert-rules.yml"
```
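The referenced alert-rules.yml is not shown in the original steps; a minimal sketch (alert name, threshold, and duration are illustrative) that fires when a node's exporter stops responding could look like the following. It would need to be mounted into the Prometheus container alongside prometheus.yml, and the alerting: block of prometheus.yml must also point at the alertmanager Service.

```yaml
# alert-rules.yml (illustrative)
groups:
- name: node-alerts
  rules:
  - alert: NodeExporterDown
    expr: up{job="kubernetes-node"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Node exporter on {{ $labels.instance }} is down"
      description: "The target has been unreachable for more than 1 minute."
```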
6. Key Configuration Notes
- Service discovery:
  - Native Kubernetes support: kubernetes_sd_configs automatically discovers nodes, Pods, and Services.
  - Relabel configuration: rewrites labels so that discovered targets are scraped correctly.
- Alert routing:
  - Grouping and throttling: use group_by and repeat_interval to avoid alert floods (see the route sketch below).
  - Email templates: custom HTML templates can make alert emails easier to read.
- High-availability suggestions:
  - Prometheus federation: aggregate data across clusters.
  - Thanos/Cortex: long-term storage and query optimization.
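As a sketch of the routing options mentioned above (values are illustrative and would be merged into the alertmanager.yml from step 5):

```yaml
route:
  group_by: [alertname, namespace]   # batch related alerts into one notification
  group_wait: 30s                    # wait before sending the first notification of a group
  group_interval: 5m                 # wait before sending updates for an existing group
  repeat_interval: 4h                # re-send an unresolved alert at most every 4 hours
  receiver: 'default'
```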
7. Verification and Testing
- Trigger a test alert (see the command sketch after this list):
  - Manually stop node-exporter on one node and watch the target's state on the Prometheus Targets page.
  - Check the Alertmanager UI (http://<NodeIP>:30066) to confirm the alert arrives.
- Email delivery:
  - Make sure the mailbox settings are correct, and check the spam folder.
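One possible way to run that test from the command line; a sketch that temporarily removes the node-exporter DaemonSet (since DaemonSet Pods restart automatically if only deleted) and watches state through the Prometheus HTTP API:

```bash
# Temporarily remove the node-exporter DaemonSet so its targets go DOWN
kubectl delete -f node-export.yaml

# Watch target health and active alerts
curl -s http://<NodeIP>:31000/api/v1/targets | grep -o '"health":"[^"]*"'
curl -s http://<NodeIP>:31000/api/v1/alerts

# Restore the exporters after the test
kubectl apply -f node-export.yaml
```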
Summary
Following the steps above, you can quickly build a full Prometheus + Grafana + Alertmanager monitoring and alerting stack on a Kubernetes cluster. The key points are namespace isolation (monitor-sa), RBAC authorization for the Prometheus ServiceAccount, Kubernetes-native service discovery with relabeling, and Alertmanager routing for email notifications.
Configuration
A complete Prometheus scrape-configuration ConfigMap (namespace prom) for in-cluster service discovery:
kind: ConfigMap
apiVersion: v1
metadata:
labels:
app: prometheus
name: prometheus-config
namespace: prom
data:
prometheus.yml: |
# A scrape configuration for running Prometheus on a Kubernetes cluster.
# This uses separate scrape configs for cluster components (i.e. API server, node)
# and services to allow each to use different authentication configs.
#
# Kubernetes labels will be added as Prometheus labels on metrics via the
# `labelmap` relabeling action.
#
# If you are using Kubernetes 1.7.2 or earlier, please take note of the comments
# for the kubernetes-cadvisor job; you will need to edit or remove this job.
# Scrape config for API servers.
#
# Kubernetes exposes API servers as endpoints to the default/kubernetes
# service so this uses `endpoints` role and uses relabelling to only keep
# the endpoints associated with the default/kubernetes service using the
# default named port `https`. This works for single API server deployments as
# well as HA API server deployments.
global:
scrape_interval: 15s
scrape_timeout: 10s
evaluation_interval: 1m
scrape_configs:
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
# Default to scraping over https. If required, just disable this or change to
# `http`.
scheme: https
# This TLS & bearer token file config is used to connect to the actual scrape
# endpoints for cluster components. This is separate to discovery auth
# configuration because discovery & scraping are two separate concerns in
# Prometheus. The discovery auth config is automatic if Prometheus runs inside
# the cluster. Otherwise, more config options have to be provided within the
# <kubernetes_sd_config>.
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# If your node certificates are self-signed or use a different CA to the
# master CA, then disable certificate verification below. Note that
# certificate verification is an integral part of a secure infrastructure
# so this should only be disabled in a controlled environment. You can
# disable certificate verification by uncommenting the line below.
#
# insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
# Keep only the default/kubernetes service endpoints for the https port. This
# will add targets for each API server which Kubernetes adds an endpoint to
# the default/kubernetes service.
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Scrape config for nodes (kubelet).
#
# Rather than connecting directly to the node, the scrape is proxied though the
# Kubernetes apiserver. This means it will work if Prometheus is running out of
# cluster, or can't connect to nodes for some other reason (e.g. because of
# firewalling).
- job_name: 'kubernetes-nodes'
# Default to scraping over https. If required, just disable this or change to
# `http`.
scheme: https
# This TLS & bearer token file config is used to connect to the actual scrape
# endpoints for cluster components. This is separate to discovery auth
# configuration because discovery & scraping are two separate concerns in
# Prometheus. The discovery auth config is automatic if Prometheus runs inside
# the cluster. Otherwise, more config options have to be provided within the
# <kubernetes_sd_config>.
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Scrape config for Kubelet cAdvisor.
#
# This is required for Kubernetes 1.7.3 and later, where cAdvisor metrics
# (those whose names begin with 'container_') have been removed from the
# Kubelet metrics endpoint. This job scrapes the cAdvisor endpoint to
# retrieve those metrics.
#
# In Kubernetes 1.7.0-1.7.2, these metrics are only exposed on the cAdvisor
# HTTP endpoint; use "replacement: /api/v1/nodes/${1}:4194/proxy/metrics"
# in that case (and ensure cAdvisor's HTTP server hasn't been disabled with
# the --cadvisor-port=0 Kubelet flag).
#
# This job is not necessary and should be removed in Kubernetes 1.6 and
# earlier versions, or it will cause the metrics to be scraped twice.
- job_name: 'kubernetes-cadvisor'
# Default to scraping over https. If required, just disable this or change to
# `http`.
scheme: https
# This TLS & bearer token file config is used to connect to the actual scrape
# endpoints for cluster components. This is separate to discovery auth
# configuration because discovery & scraping are two separate concerns in
# Prometheus. The discovery auth config is automatic if Prometheus runs inside
# the cluster. Otherwise, more config options have to be provided within the
# <kubernetes_sd_config>.
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
# Scrape config for service endpoints.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/scrape`: Only scrape services that have a value of `true`
# * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
# to set this to `https` & most likely set the `tls_config` of the scrape config.
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: If the metrics are exposed on a different port to the
# service then set this appropriately.
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
# Example scrape config for pods
#
# The relabeling allows the actual pod scrape endpoint to be configured via the
# following annotations:
#
# * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
# pod's declared ports (default is a port-free target if none are declared).
- job_name: 'kubernetes-pods'
# if you want to use metrics on jobs, set the below field to
# true to prevent Prometheus from setting the `job` label
# automatically.
honor_labels: false
kubernetes_sd_configs:
- role: pod
# skip verification so you can do HTTPS to pods
tls_config:
insecure_skip_verify: true
# make sure your labels are in order
relabel_configs:
# these labels tell Prometheus to automatically attach source
# pod and namespace information to each collected sample, so
# that they'll be exposed in the custom metrics API automatically.
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: pod
# these labels tell Prometheus to look for
# prometheus.io/{scrape,path,port} annotations to configure
# how to scrape
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (.+)
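As a usage example of the annotation-driven discovery described in the kubernetes-service-endpoints and kubernetes-pods jobs above (a sketch; the Service name, selector, port, and metrics path are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-app
  annotations:
    prometheus.io/scrape: "true"    # opt this Service's endpoints into scraping
    prometheus.io/path: "/metrics"  # override if the metrics path is not /metrics
    prometheus.io/port: "8080"      # scrape this port instead of the service port
spec:
  selector:
    app: example-app
  ports:
  - port: 80
    targetPort: 8080
```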
Deploying Prometheus Outside the Kubernetes Cluster with API Server-Based Service Discovery
- Scenario: Prometheus runs outside the Kubernetes cluster but needs to monitor resources inside it.
- Challenge: an out-of-cluster Prometheus cannot discover in-cluster resources directly; service discovery has to go through the Kubernetes API Server.
Steps
Create RBAC Authorization
- Create the Namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
```

- Create the ServiceAccount:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: outside-prometheus
  namespace: monitoring
```

- Create the ClusterRole:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: outside-prometheus
rules:
- apiGroups: [""]
  resources: ["nodes", "services", "endpoints", "pods", "nodes/proxy"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["configmaps", "nodes/metrics"]
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
```

- Create the ClusterRoleBinding:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: outside-prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: outside-prometheus
subjects:
- kind: ServiceAccount
  name: outside-prometheus
  namespace: monitoring
```

- Apply the RBAC manifests:

```bash
kubectl apply -f rbac.yaml
```
Obtain the ServiceAccount Token
- Extract the token:

```bash
TOKEN=$(kubectl get secret $(kubectl -n monitoring get secret | awk '/outside-prometheus/{print $1}') \
  -n monitoring -o jsonpath={.data.token} | base64 -d)
echo $TOKEN
```

- Save the token on the Prometheus node:

```bash
echo $TOKEN > /usr/local/prometheus/kubernetes-api-token
```

- Copy the Kubernetes CA certificate to the Prometheus node:

```bash
scp /etc/kubernetes/pki/ca.crt <prometheus_node>:/usr/local/prometheus/
```
Configure Prometheus
- Edit the Prometheus configuration file:

```yaml
scrape_configs:
# API Server discovery
- job_name: 'kubernetes-apiserver'
  kubernetes_sd_configs:
  - role: endpoints
    api_server: https://192.168.80.10:6443        # API Server address
    tls_config:
      ca_file: /usr/local/prometheus/ca.crt       # Kubernetes CA certificate
    authorization:
      credentials_file: /usr/local/prometheus/kubernetes-api-token   # token file
  scheme: https
  tls_config:
    ca_file: /usr/local/prometheus/ca.crt
  authorization:
    credentials_file: /usr/local/prometheus/kubernetes-api-token
  relabel_configs:
  - source_labels: ["__meta_kubernetes_namespace", "__meta_kubernetes_endpoints_name", "__meta_kubernetes_endpoint_port_name"]
    regex: default;kubernetes;https
    action: keep
# Node discovery
- job_name: 'kubernetes-nodes'
  kubernetes_sd_configs:
  - role: node
    api_server: https://192.168.80.10:6443
    tls_config:
      ca_file: /usr/local/prometheus/ca.crt
    authorization:
      credentials_file: /usr/local/prometheus/kubernetes-api-token
  relabel_configs:
  - source_labels: ["__address__"]
    regex: (.*):10250
    action: replace
    target_label: __address__
    replacement: $1:9100
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
```

- Reload the Prometheus configuration (a validation sketch follows):

```bash
curl -X POST http://localhost:9090/-/reload
```
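Before reloading, it is worth validating the file; promtool ships in the Prometheus tarball extracted earlier (a sketch, assuming the paths used in this deployment):

```bash
# Check syntax of the config file and any referenced rule files; expect SUCCESS
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
# Then trigger the hot reload shown above.
```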
Verification
- Open the Prometheus web UI:
  - Go to http://<Prometheus_IP>:9090/targets and check that the targets are in the UP state.
- Check service discovery:
  - Confirm that Prometheus discovers the cluster's nodes, Pods, Services, and other resources correctly.
Key Points
- RBAC authorization:
  - Grant Prometheus sufficient permissions to read the relevant resources through the Kubernetes API Server.
- Token and CA certificate:
  - The token authenticates Prometheus; the CA certificate verifies the API Server's TLS certificate.
- Relabel configuration:
  - relabel_configs rewrites labels so that Prometheus scrapes the discovered targets correctly.
- Service discovery roles:
  - role: endpoints discovers the Endpoints of Kubernetes Services.
  - role: node discovers the cluster's nodes.

Notes:
- Network connectivity: the Prometheus node must be able to reach the Kubernetes API Server.
- Security: keep the token and CA certificate safe and avoid leaking them.
- Performance: in large clusters, service discovery and scraping need tuning.