Prometheus + Grafana + AlertManager: Complete Installation Walkthrough

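Table of Contents

  • 1. Overview
  • 2. Installing software on the monitored server
    • 2.1 Installing Docker and Docker-Compose
    • 2.2 Containerized deployment of base software and exporters
  • 3. Installing the core components on the monitoring server
    • 3.1 Installing Prometheus
      • 3.1.1 Installation steps
      • 3.1.2 Editing prometheus.yml
      • 3.1.3 Adding alerting rule files
      • 3.1.4 Custom systemd unit
      • 3.1.5 Starting Prometheus
      • 3.1.6 Accessing the Prometheus web UI
      • 3.1.7 Checking that targets are loaded
      • 3.1.8 Checking that rules are loaded
      • 3.1.9 Notes
    • 3.2 Installing Grafana
      • 3.2.1 Installation steps
      • 3.2.2 Custom systemd unit
      • 3.2.3 Starting Grafana
      • 3.2.4 Accessing the Grafana web UI
      • 3.2.5 Adding the Prometheus data source
      • 3.2.6 Adding a dashboard (server monitoring)
      • 3.2.7 Adding a dashboard (container monitoring)
      • 3.2.8 Adding Java monitoring (JVM dashboard)
      • 3.2.9 Adding MySQL monitoring (MySQL dashboard)
      • 3.2.10 Adding Nginx monitoring (Nginx dashboard)
      • 3.2.11 Adding Redis monitoring (Redis dashboard)
      • 3.2.12 Adding black-box monitoring (black-box dashboard)
    • 3.3 Installing AlertManager
      • 3.3.1 Installation steps
      • 3.3.2 Editing alertmanager.yml
      • 3.3.3 Email notification configuration
        • 3.3.3.1 Obtaining a 163 mail authorization code
        • 3.3.3.2 Defining a custom email template (optional)
      • 3.3.4 WeCom (Enterprise WeChat) notification configuration
        • 3.3.4.1 Obtaining the WeCom bot webhook
        • 3.3.4.2 Configuring the WeCom bot alert notification service
      • 3.3.5 Custom notification service configuration (Spring Boot example)
        • 3.3.5.1 Editing pom.xml
        • 3.3.5.2 Editing application.yml
        • 3.3.5.3 Adding the webhook endpoint
      • 3.3.6 Custom systemd unit
      • 3.3.7 Starting AlertManager
      • 3.3.8 Alert notification examples
        • 3.3.8.1 Email alert example
        • 3.3.8.2 WeCom alert example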

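1. Overview

Server resources:

| Server name | IP | CPU | Memory | Description |
|---|---|---|---|---|
| prometheus | 192.168.25.41 | 1 | 1.9G | Runs Prometheus, Grafana and AlertManager. Plays the role of the monitoring server. |
| prometheus-monitor-node | 192.168.25.42 | 1 | 1.9G | Runs some base software and application services. Plays the role of the monitored server. |

Software installed on the monitoring server (192.168.25.41):

| IP | Port | Software | Version | Install method | Package name | Notes |
|---|---|---|---|---|---|---|
| 192.168.25.41 | 9090 | prometheus | 3.1.0 | Binary | prometheus-3.1.0.linux-amd64.tar.gz | Core monitoring component |
| 192.168.25.41 | 3000 | grafana | 11.5.1 | Binary | grafana-enterprise-11.5.1.linux-amd64.tar.gz | Graphical display of monitoring data |
| 192.168.25.41 | 9093 | alertmanager | 0.28.0 | Binary | alertmanager-0.28.0.linux-amd64.tar.gz | Alert notifications |
| 192.168.25.41 | 9100 | node_exporter | 1.8.2 | Binary | node_exporter-1.8.2.linux-amd64.tar.gz | (Optional) Collects server resource metrics and exposes them to Prometheus |

Software installed on the monitored server (192.168.25.42):

| IP | Port | Software | Version | Install method | Notes |
|---|---|---|---|---|---|
| 192.168.25.42 | - | docker | 24.0.2 | Binary | Runs the other software and services |
| 192.168.25.42 | - | docker-compose | v2.5.0 | Binary | |
| 192.168.25.42 | 80 | nginx | 1.27.4 | Container | Nginx service |
| 192.168.25.42 | 6379 | redis | 6.2.17 | Container | Redis service |
| 192.168.25.42 | 3306 | mysql | 8.0.37 | Container | MySQL service |
| 192.168.25.42 | 8081 | java-web-demo | 0.0.4 | Container | Java application service (the author's private image) |
| 192.168.25.42 | 8080 | cadvisor | v0.33.0 | Container | Collects Docker metrics |
| 192.168.25.42 | 9091 | pushgateway | v1.11.0 | Container | Clients push metrics to Pushgateway, and Prometheus scrapes them from it |
| 192.168.25.42 | 9100 | node_exporter | v1.5.0 | Container | Collects server metrics |
| 192.168.25.42 | 9113 | nginx_exporter | 0.11 | Container | Collects Nginx metrics |
| 192.168.25.42 | 9121 | redis_exporter | latest (image is untagged in the compose file) | Container | Collects Redis metrics |
| 192.168.25.42 | 9104 | mysqld-exporter | latest (image is untagged in the compose file) | Container | Collects MySQL metrics (also works with MariaDB) |
| 192.168.25.42 | 9115 | blackbox_exporter | v0.25.0 | Container | Probes external endpoints over HTTP, TCP, ICMP, etc. |

Note: software download page: https://prometheus.io/download/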

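2. Installing software on the monitored server

2.1 Installing Docker and Docker-Compose

Reference installation guide: Offline installation of Docker 24 on CentOS 7.9 (pitfall-free edition)

Alternatively, use the one-click deployment script directly: docker-v1.24

2.2 Containerized deployment of base software and exporters

Create a docker-compose.yaml file with the following content: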

version: '3.3'

volumes:
  prometheus_data: {}
  grafana_data: {}

networks:
  monitoring:
    driver: bridge

services:
  nginx:
    image: nginx:1.27.4
    container_name: nginx
    restart: always
    ports:
      - 80:80
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /root/default.conf:/etc/nginx/conf.d/default.conf
    environment:
      - TZ=Asia/Shanghai

  redis:
    image: redis:6.2.17
    container_name: redis
    command: redis-server --requirepass 123456 --maxmemory 512mb
    restart: always
    #volumes:
    #  - /data/redis/data:/data
    ports:
      - 6379:6379

  db:
    image: mysql:8.0.37
    restart: always
    container_name: mysql
    environment:
      TZ: Asia/Shanghai
      LANG: en_US.UTF-8
      MYSQL_ROOT_PASSWORD: 123456
    command:
      --default-authentication-plugin=mysql_native_password
      --character-set-server=utf8mb4
      --collation-server=utf8mb4_general_ci
      --lower_case_table_names=1
      --performance_schema=1
      --sql-mode=""
      --skip-log-bin
    volumes:
      - /data/mysql/conf:/etc/mysql/conf.d # configuration file mount
      - /data/mysql/data:/var/lib/mysql # data file mount
    ports:
      - 3306:3306

  java-web-demo:
    image: java-web-demo:0.0.4
    container_name: springboot-web-demo
    restart: always
    ports:
      - 8081:8080

  pushgateway:
    image: prom/pushgateway:v1.11.0
    container_name: pushgateway
    restart: always
    ports:
      - "9091:9091"

  cadvisor:
    image: google/cadvisor:v0.33.0
    container_name: cadvisor
    restart: always
    privileged: true
    ports:
      - 8080:8080
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    environment:
      - TZ=Asia/Shanghai
    networks:
      - monitoring

  node_exporter:
    image: prom/node-exporter:v1.5.0
    container_name: node-exporter
    restart: always
    ports:
      - 9100:9100
    network_mode: "host"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    environment:
      TZ: Asia/Shanghai
    command:
      - '--web.listen-address=:9100'
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - "--path.rootfs=/rootfs"
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker)($$|/)'

  nginx_exporter:
    image: nginx/nginx-prometheus-exporter:0.11
    container_name: nginx_exporter
    hostname: nginx_exporter
    command:
      - '-nginx.scrape-uri=http://192.168.25.42/stub_status'
    restart: always
    ports:
      - "9113:9113"

  redis_exporter:
    image: oliver006/redis_exporter
    container_name: redis_exporter
    restart: always
    environment:
      REDIS_ADDR: "192.168.25.42:6379"
      REDIS_PASSWORD: 123456
    ports:
      - "9121:9121"

  mysqld-exporter:
    image: prom/mysqld-exporter
    container_name: mysqld-exporter
    restart: always
    command:
      - '--collect.info_schema.processlist'
      - '--collect.info_schema.innodb_metrics'
      - '--collect.info_schema.tablestats'
      - '--collect.info_schema.tables'
      - '--collect.info_schema.userstats'
      - '--collect.engine_innodb_status'
      - '--config.my-cnf=/my.cnf'
    volumes:
      - /root/my.cnf:/my.cnf
    ports:
      - 9104:9104

  blackbox_exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox_exporter
    restart: always
    volumes:
      - /data/blackbox_exporter:/etc/blackbox_exporter
    ports:
      - 9115:9115

The volume mounts above map specific host files into the containers, so the corresponding files and directories must exist on the host. They are listed below.

Full content of the Nginx default.conf file:

[root@prometheus-monitor-node ~]# cat default.conf
server {
    listen       80;
    listen  [::]:80;
    server_name  localhost;

    #access_log  /var/log/nginx/host.access.log  main;

    location / {
        root   /usr/share/nginx/html;
        index  index.html index.htm;
    }

    location /stub_status {
        stub_status on;
        access_log off;
        #allow <IP of nginx_exporter>;
        allow 0.0.0.0/0;
        deny all;
    }

    #error_page  404              /404.html;

    # redirect server error pages to the static page /50x.html
    #error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   /usr/share/nginx/html;
    }
}

Full content of the MySQL my.cnf file (read by mysqld-exporter):

[root@prometheus-monitor-node ~]# cat my.cnf 
[client]
host=192.168.25.42
user=exporter
password=123456
port=3306

Full content of the blackbox_exporter config.yml file:

  • Note: /data/blackbox_exporter contains only this one config.yml file and nothing else.

mkdir -p /data/blackbox_exporter
[root@prometheus-monitor-node ~]# cat /data/blackbox_exporter/config.yml
modules:
  http_2xx:
    prober: http
    http:
      method: GET
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
        - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5

Start the services:

docker-compose -f docker-compose.yaml up -d

The end result looks like this:

[root@prometheus-monitor-node ~]# docker ps
CONTAINER ID   IMAGE                                  COMMAND                   CREATED          STATUS                    PORTS                                                  NAMES
e45f0d5ece57   google/cadvisor:v0.33.0                "/usr/bin/cadvisor -…"   32 minutes ago   Up 32 minutes (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp              cadvisor
394ca6fc8915   prom/pushgateway:v1.11.0               "/bin/pushgateway"        3 days ago       Up 55 minutes             0.0.0.0:9091->9091/tcp, :::9091->9091/tcp              pushgateway
f5e38277ceca   prom/blackbox-exporter:v0.25.0         "/bin/blackbox_expor…"   7 days ago       Up 55 minutes             0.0.0.0:9115->9115/tcp, :::9115->9115/tcp              blackbox_exporter
1e38cd21cf0e   nginx:1.27.4                           "/docker-entrypoint.…"   7 days ago       Up 55 minutes             0.0.0.0:80->80/tcp, :::80->80/tcp                      nginx
bdc5b8b4b4f4   java-web-demo:0.0.4                    "java -jar app.jar"       9 days ago       Up 55 minutes             0.0.0.0:8081->8080/tcp, :::8081->8080/tcp              springboot-web-demo
e0b0e18151ea   prom/mysqld-exporter                   "/bin/mysqld_exporte…"   10 days ago      Up 55 minutes             0.0.0.0:9104->9104/tcp, :::9104->9104/tcp              mysqld-exporter
1528db64726b   redis:6.2.17                           "docker-entrypoint.s…"   10 days ago      Up 55 minutes             0.0.0.0:6379->6379/tcp, :::6379->6379/tcp              redis
82eb7094818c   mysql:8.0.37                           "docker-entrypoint.s…"   10 days ago      Up 55 minutes             0.0.0.0:3306->3306/tcp, :::3306->3306/tcp, 33060/tcp   mysql
452c7fb8c129   oliver006/redis_exporter               "/redis_exporter"         10 days ago      Up 55 minutes             0.0.0.0:9121->9121/tcp, :::9121->9121/tcp              redis_exporter
c63f2666c72b   nginx/nginx-prometheus-exporter:0.11   "/usr/bin/nginx-prom…"   10 days ago      Up 55 minutes             0.0.0.0:9113->9113/tcp, :::9113->9113/tcp              nginx_exporter
50ca88f777b8   prom/node-exporter:v1.5.0              "/bin/node_exporter …"   10 days ago      Up 55 minutes                                                                    node-exporter

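Note 1: list containers with docker: docker ps
Note 2: restart a container with docker: docker restart <container name>
Note 3: follow a container's logs with docker: docker logs -f <container name>
Note 4: list containers with docker-compose: docker-compose ps
Note 5: restart or re-create a specific service with docker-compose: docker-compose -f docker-compose.yaml up -d <service name>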
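
Once the containers are up, one quick way to confirm that each exporter is actually serving metrics is to request its metrics endpoint. The following is only a minimal sketch and not part of the original setup: the function names exporter_endpoints and check_exporters are hypothetical, the ports are the ones published in the compose file above, and the /metrics path is assumed to be each exporter's default (the Spring Boot demo is queried at /actuator/prometheus, the path used in prometheus.yml).

```shell
# Print one metrics URL per line for the services defined in docker-compose.yaml.
# $1 = host running the containers (defaults to the monitored server in this guide).
exporter_endpoints() {
  local host="${1:-192.168.25.42}"
  local port
  # node_exporter, cadvisor, nginx_exporter, redis_exporter, mysqld-exporter,
  # blackbox_exporter and pushgateway; /metrics is assumed for all of them.
  for port in 9100 8080 9113 9121 9104 9115 9091; do
    echo "http://${host}:${port}/metrics"
  done
  # The Spring Boot demo exposes its metrics through Actuator instead.
  echo "http://${host}:8081/actuator/prometheus"
}

# Request every endpoint and report OK/FAIL; returns non-zero if any request fails.
check_exporters() {
  local url failed=0
  for url in $(exporter_endpoints "$1"); do
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}
```

After pasting the functions into a shell, run check_exporters 192.168.25.42 from any machine that can reach the monitored server. A FAIL line usually points at a container that is not running or a port blocked by a firewall.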

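3. Installing the core components on the monitoring server

3.1 Installing Prometheus

3.1.1 Installation steps

# Download the package
wget https://github.com/prometheus/prometheus/releases/download/v3.1.0/prometheus-3.1.0.linux-amd64.tar.gz
# Create the installation directory
mkdir -p /opt/software/prometheus/
# Extract into the installation directory
tar -zxvf prometheus-3.1.0.linux-amd64.tar.gz -C /opt/software/prometheus/
# Rename the extracted directory
mv /opt/software/prometheus/prometheus-3.1.0.linux-amd64 /opt/software/prometheus/prometheus

This is the directory layout after all configuration is finished (refer back to it if you get lost in the later steps):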

[root@prometheus prometheus]# tree /opt/software/prometheus/prometheus
├── data
├── EOF
├── LICENSE
├── NOTICE
├── prometheus
├── prometheus.yml
├── prometheus.yml_bak
├── promtool
├── rules
│   ├── alert.yml
│   ├── blackbox_exporter.yml
│   ├── docker.yml
│   ├── mysqld.yml
│   ├── nginx.yml
│   ├── node-exporter.yml
│   ├── redis.yml
│   └── springboot.yml
└── targets
    ├── blackbox_http.yml
    ├── blackbox_icmp.yml
    ├── blackbox_tcp.yml
    ├── pushgateway.yml
    ├── springboot.yml
    └── targets.yml

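3.1.2 Editing prometheus.yml

Note: prometheus.yml can be configured in two ways: static configuration, or file-based service discovery (dynamic). Feel free to try both; the second is recommended.

Approach 1 (static configuration; Prometheus must be reloaded manually after every change):

Edit prometheus.yml so that its full content is: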

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.25.41:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "alertmanager"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.41:9093"]

  - job_name: "node_exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.41:9100"]
        labels:
          instance: prometheus-server
          myCusLabel: myCusVal
      - targets: ["192.168.25.42:9100"]
        labels:
          instance: monitored-server-42
          myCusLabel: myCusVal

  - job_name: "cadvisor"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:8080"]
        labels:
          instance: monitored-server-42
          myCusLabel: myCusVal

  - job_name: "nginx-exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:9113"]
        labels:
          instance: monitored-server-42
          myCusLabel: myCusVal

  - job_name: "redis-exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:9121"]
        labels:
          instance: monitored-server-42
          myCusLabel: myCusVal

  - job_name: "mysql-exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:9104"]
        labels:
          instance: monitored-server-42
          myCusLabel: myCusVal

  - job_name: "springboot-demo"
    scrape_interval: 15s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ["192.168.25.42:8081"]
        labels:
          instance: monitored-server-42
          myCusLabel: myCusVal
          mytype: springboot

  - job_name: "pushgateway"
    scrape_interval: 15s
    honor_labels: true  # with this set, labels in the pushed metrics are not overwritten by same-named labels of the pushgateway target
    static_configs:
      - targets: ["192.168.25.42:9091"]
        labels:
          instance: monitored-server-42
          myCusLabel: myCusVal

  - job_name: "blackbox_http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.baidu.com
          - https://www.jd.com
        labels:
          company: "external-company"
          project: "external-project"
          env: "dev"
      - targets:
          - http://192.168.25.42:8081/
        labels:
          describe: springboot-web-demo application
          company: "internal-company"
          project: "internal-project"
          env: "test"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115

  - job_name: "blackbox_tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - 192.168.25.42:22
          - 192.168.25.41:9090
        labels:
          company: "internal-company"
          project: "test-project"
          env: "test"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115

  # ICMP (ping) check
  - job_name: "blackbox_icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 192.168.25.42
          - 192.168.25.200  # an IP that does not exist
        labels:
          company: "internal-company"
          project: "test-project"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115

Approach 2 (file-based service discovery; changes to the target files are picked up automatically):

Edit prometheus.yml so that its full content is:

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.25.41:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "file-target"
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/targets.yml

  - job_name: "file-sd-pushgateway"
    scrape_interval: 15s
    honor_labels: true  # when true, labels in the pushed metrics are not overwritten by same-named labels of the pushgateway target
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/pushgateway.yml

  - job_name: "file-application"
    scrape_interval: 15s
    metrics_path: '/actuator/prometheus'
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/springboot.yml

  - job_name: "file-blackbox_http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/blackbox_http.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - source_labels: [__param_target]
        target_label: __param_myparam
        replacement: myparamVal
      - target_label: __address__
        replacement: 192.168.25.42:9115
      # Labels of the form __xx__ are hidden by default. The two lines below would copy
      # each __xx__ label to an xx label so that it becomes visible to users.
      #- regex: "__(.*)__"
      #  action: labelmap

  - job_name: "file-blackbox_tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/blackbox_tcp.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115

  # ICMP (ping) check
  - job_name: "file-blackbox_icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/blackbox_icmp.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115

Create the targets directory, which holds the file-based discovery files:

mkdir -p /opt/software/prometheus/prometheus/targets/

Add /opt/software/prometheus/prometheus/targets/targets.yml with the following content:

- targets: ["192.168.25.41:9090"]
  labels:
    job: prometheus
- targets: ["192.168.25.41:9093"]
  labels:
    job: alertmanager
- targets: ["192.168.25.41:9100"]
  labels:
    job: node_exporter
    instance: prometheus-server
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9100"]
  labels:
    job: node_exporter
    instance: monitored-server-42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:8080"]
  labels:
    job: cadvisor
    instance: monitored-server-42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9113"]
  labels:
    job: nginx-exporter
    instance: monitored-server-42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9121"]
  labels:
    job: redis-exporter
    instance: monitored-server-42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9104"]
  labels:
    job: mysql-exporter
    instance: monitored-server-42
    myCusLabel: myCusVal

Add /opt/software/prometheus/prometheus/targets/springboot.yml with the following content:

- targets: ["192.168.25.42:8081"]
  labels:
    job: java-application
    instance: monitored-server-42
    myCusLabel: myCusVal
    mytype: springboot

Add /opt/software/prometheus/prometheus/targets/pushgateway.yml with the following content:

- targets: ["192.168.25.42:9091"]
  labels:
    job: pushgateway
    instance: monitored-server-42
    myCusLabel: myCusVal

Add /opt/software/prometheus/prometheus/targets/blackbox_http.yml with the following content:

- targets:
    - https://www.baidu.com
    - https://www.jd.com
  labels:
    job: blackbox_http
    company: "external-company"
    project: "external-project"
    env: "dev"
- targets:
    - http://192.168.25.42:8081/
  labels:
    job: blackbox_http
    describe: springboot-web-demo application
    company: "internal-company"
    project: "internal-project"
    env: "dev"

Add /opt/software/prometheus/prometheus/targets/blackbox_tcp.yml with the following content. (The original listing here repeated the ICMP file; the host:port targets below are taken from the blackbox_tcp job of Approach 1.)

- targets:
    - 192.168.25.42:22    # monitored endpoint
    - 192.168.25.41:9090  # monitored endpoint
  labels:
    job: blackbox_tcp
    company: "internal-company"
    project: "test-project"
    env: "test"

Add /opt/software/prometheus/prometheus/targets/blackbox_icmp.yml with the following content:

- targets:
    - 192.168.25.42   # monitored IP
    - 192.168.25.200  # monitored IP
  labels:
    job: blackbox_icmp
    company: "internal-company"
    project: "test-project"
    env: "dev"
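
Because every entry in these target files follows the same shape, new targets can be appended with a small helper instead of being typed by hand. The sketch below is only an illustration: emit_target is a hypothetical function name, and it handles just the job and instance labels used throughout this guide.

```shell
# Print one file_sd target group in the YAML layout used by the target files.
# $1 = address (host:port), $2 = job label, $3 = instance label (optional).
emit_target() {
  local address="$1" job="$2" instance="$3"
  # The leading dash is passed as an argument so printf does not parse it as an option.
  printf '%s targets: ["%s"]\n' '-' "$address"
  printf '  labels:\n'
  printf '    job: %s\n' "$job"
  if [ -n "$instance" ]; then
    printf '    instance: %s\n' "$instance"
  fi
}
```

For example, emit_target 192.168.25.43:9100 node_exporter monitored-server-43 >> /opt/software/prometheus/prometheus/targets/targets.yml would register one more node_exporter (the .43 host is a made-up example). With file-based discovery, Prometheus picks up the new entry without a restart.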

3.1.3 Adding alerting rule files

Create the rules directory:

mkdir /opt/software/prometheus/prometheus/rules/

Add /opt/software/prometheus/prometheus/rules/alert.yml with the following content:

groups:
  - name: Prometheus alert
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Service unavailable, instance: {{ $labels.instance }}"
          description: "Service {{ $labels.job }} is down"

Add /opt/software/prometheus/prometheus/rules/node-exporter.yml with the following content:

groups:
  - name: node-exporter
    rules:
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host out of memory, instance: {{ $labels.instance }}"
          description: "Available memory < 10%, current value: {{ $value }}"
      - alert: HostMemoryUnderMemoryPressure
        expr: rate(node_vmstat_pgmajfault[1m]) > 1000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host under memory pressure, instance: {{ $labels.instance }}"
          description: "The node is under heavy memory pressure. High rate of major page faults, current value: {{ $value }}"
      - alert: HostUnusualNetworkThroughputIn
        expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual inbound network throughput, instance: {{ $labels.instance }}"
          description: "Inbound network traffic > 100 MB/s, current value: {{ $value }}"
      - alert: HostUnusualNetworkThroughputOut
        expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual outbound network throughput, instance: {{ $labels.instance }}"
          description: "Outbound network traffic > 100 MB/s, current value: {{ $value }}"
      - alert: HostUnusualDiskReadRate
        expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk read rate, instance: {{ $labels.instance }}"
          description: "Disk reads > 50 MB/s, current value: {{ $value }}"
      - alert: HostUnusualDiskWriteRate
        expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk write rate, instance: {{ $labels.instance }}"
          description: "Disk writes > 50 MB/s, current value: {{ $value }}"
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host out of disk space, instance: {{ $labels.instance }}"
          description: "Free disk space < 10%, current value: {{ $value }}"
      - alert: HostDiskWillFillIn24Hours
        expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Disk will fill within 24 hours, instance: {{ $labels.instance }}"
          description: "At the current write rate the disk is expected to fill within 24 hours, current value: {{ $value }}"
      - alert: HostOutOfInodes
        expr: node_filesystem_files_free{mountpoint ="/"} / node_filesystem_files{mountpoint="/"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/"} == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host out of inodes, instance: {{ $labels.instance }}"
          description: "Free inodes < 10%, current value: {{ $value }}"
      - alert: HostUnusualDiskReadLatency
        expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk read latency, instance: {{ $labels.instance }}"
          description: "Disk read latency > 100ms, current value: {{ $value }}"
      - alert: HostUnusualDiskWriteLatency
        expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk write latency, instance: {{ $labels.instance }}"
          description: "Disk write latency > 100ms, current value: {{ $value }}"
      - alert: high_load
        expr: node_load1 > 4
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "1-minute load average too high, instance: {{ $labels.instance }}"
          description: "1-minute load average > 4 for 2 minutes. Current value: {{ $value }}"
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load, instance: {{ $labels.instance }}"
          description: "CPU usage > 80%, current value: {{ $value }}"
      - alert: HostCpuStealNoisyNeighbor
        expr: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal CPU steal, instance: {{ $labels.instance }}"
          description: "CPU steal > 10%. A noisy neighbor is hurting VM performance, or a spot instance may be out of credit. Current value: {{ $value }}"
      - alert: HostSwapIsFillingUp
        expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Swap usage abnormal, instance: {{ $labels.instance }}"
          description: "Swap usage > 80%"
      - alert: HostNetworkReceiveErrors
        expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Network receive errors, instance: {{ $labels.instance }}"
          description: "Interface {{ $labels.device }} had a receive error rate above 0.01 over the last 2 minutes, current value: {{ $value }}"
      - alert: HostNetworkTransmitErrors
        expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Network transmit errors, instance: {{ $labels.instance }}"
          description: "Interface {{ $labels.device }} had a transmit error rate above 0.01 over the last 2 minutes, current value: {{ $value }}"
      - alert: HostNetworkInterfaceSaturated
        expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8 < 10000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Network interface saturated, instance: {{ $labels.instance }}"
          description: "Interface {{ $labels.device }} is overloaded, current value: {{ $value }}"
      - alert: HostConntrackLimit
        expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Conntrack table nearly full, instance: {{ $labels.instance }}"
          description: "Conntrack usage is above 80% of the limit, current ratio: {{ $value }}"
      - alert: HostClockSkew
        expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Clock skew detected, instance: {{ $labels.instance }}"
          description: "Clock skew detected, the clock is out of sync. Value: {{ $value }}"
      - alert: HostClockNotSynchronising
        expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Clock not synchronising, instance: {{ $labels.instance }}"
          description: "The clock is not synchronising"
      - alert: NodeFileDescriptorLimit
        expr: node_filefd_allocated / node_filefd_maximum * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "The kernel is expected to exhaust its file descriptor limit soon"
          description: "{{ $labels.instance }} has allocated more than 80% of its file descriptor limit, current value: {{ $value }}"

Add /opt/software/prometheus/prometheus/rules/docker.yml with the following content:

groups:
  - name: DockerContainers
    rules:
      - alert: ContainerKilled
        expr: time() - container_last_seen > 60
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Docker container killed, container: {{ $labels.instance }}"
          description: "A container has disappeared, seconds since it was last seen: {{ $value }}"
      # This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.
      - alert: ContainerAbsent
        expr: absent(container_last_seen)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No containers, container: {{ $labels.instance }}"
          description: "No container has been seen for 5 minutes, value: {{ $value }}"
      - alert: ContainerCpuUsage
        expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) > 300
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Container CPU usage alert, container: {{ $labels.instance }}"
          description: "Container CPU usage is above 300%, current value: {{ $value }}"
      - alert: ContainerMemoryUsage
        expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Container memory usage alert, container: {{ $labels.instance }}"
          description: "Container memory usage is above 80%, current value: {{ $value }}"
      - alert: ContainerVolumeIoUsage
        expr: (sum(container_fs_io_current{name!=""}) BY (instance, name) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Container volume I/O usage alert, container: {{ $labels.instance }}"
          description: "Container volume I/O usage is above 80%, current value: {{ $value }}"
      - alert: ContainerHighThrottleRate
        expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Container throttling alert, container: {{ $labels.instance }}"
          description: "The container is being throttled, current value: {{ $value }}"

Add /opt/software/prometheus/prometheus/rules/mysqld.yml with the following content:

groups:
  - name: MySQL
    rules:
      - alert: MysqlDown
        expr: mysql_up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "MySQL down, instance: {{ $labels.instance }}"
          description: "mysqld-exporter cannot connect to MySQL, current state: {{ $value }}"
      - alert: MysqlTooManyConnections
        expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Too many MySQL connections, instance: {{ $labels.instance }}"
          description: "MySQL connections > 80% of max_connections, current value: {{ $value }}"
      - alert: MysqlHighThreadsRunning
        expr: max_over_time(mysql_global_status_threads_running[1m]) > 20
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Too many running MySQL threads, instance: {{ $labels.instance }}"
          description: "MySQL running threads > 20, currently running: {{ $value }}"
      - alert: MysqlSlowQueries
        expr: increase(mysql_global_status_slow_queries[2m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "MySQL slow query alert, instance: {{ $labels.instance }}"
          description: "MySQL logged {{ $value }} new slow queries in the last 2 minutes"
      # MySQL InnoDB log writes are stalling
      - alert: MysqlInnodbLogWaits
        expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "MySQL InnoDB log waits, instance: {{ $labels.instance }}"
          description: "MySQL InnoDB log writes are stalling, current value: {{ $value }}"
      - alert: MysqlRestarted
        expr: mysql_global_status_uptime < 60
        for: 0m
        labels:
          severity: info
        annotations:
          summary: "MySQL restarted, instance: {{ $labels.instance }}"
          description: "MySQL was restarted less than one minute ago"
      - alert: RowLockCurrentWaits
        expr: mysql_global_status_innodb_row_lock_current_waits > 0
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "MySQL row lock waits, instance: {{ $labels.instance }}"
          description: "There are currently {{ $value }} row lock waits"

Add /opt/software/prometheus/prometheus/rules/nginx.yml with the following content:

groups:
  - name: nginx
    rules:
      # Alert for any instance that has been unreachable for more than 30 seconds
      - alert: NginxDown
        expr: nginx_up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Nginx unavailable, instance: {{ $labels.instance }}"
          description: "{{ $labels.job }}: nginx is down"

Add /opt/software/prometheus/prometheus/rules/redis.yml with the following content:

groups:
  - name: redis
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Redis down, instance: {{ $labels.instance }}"
          description: "The Redis instance is down"
      - alert: RedisMissingBackup
        expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Redis backup missing, instance: {{ $labels.instance }}"
          description: "Redis has not been backed up for 24 hours"
      - alert: RedisOutOfConfiguredMaxmemory
        expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Redis is running out of configured maxmemory, instance: {{ $labels.instance }}"
          description: "Redis memory usage is above 90% of the configured maxmemory"
      - alert: RedisTooManyConnections
        expr: redis_connected_clients > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Too many Redis connections, instance: {{ $labels.instance }}"
          description: "Current number of Redis connections: {{ $value }}"
      - alert: RedisNotEnoughConnections
        expr: redis_connected_clients < 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Not enough Redis connections, instance: {{ $labels.instance }}"
          description: "Current number of Redis connections: {{ $value }}"
      - alert: RedisRejectedConnections
        expr: increase(redis_rejected_connections_total[1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Redis rejected connections, instance: {{ $labels.instance }}"
          description: "Some connections to Redis were rejected: {{ $value }}"

Add /opt/software/prometheus/prometheus/rules/springboot.yml with the following content:

groups:
  - name: SpringBoot
    rules:
      - alert: SpringBootErrorEvents
        expr: increase(logback_events_total{level="error"}[3m]) > 0
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Spring Boot error events, instance: {{ $labels.instance }}"
          description: "{{ $value }} new error events in the last 3 minutes"

Add /opt/software/prometheus/prometheus/rules/blackbox_exporter.yml with the following content:

groups:
  - name: Blackbox
    rules:
      - alert: BlackboxProbeFailed
        expr: probe_success == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Black-box probe failed: {{ $labels.instance }}"
          description: "The black-box probe failed, current value: {{ $value }}"
      - alert: BlackboxSlowProbe
        expr: avg_over_time(probe_duration_seconds[1m]) > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Slow request: {{ $labels.instance }}"
          description: "The request took longer than 1 second, value: {{ $value }}"
      - alert: BlackboxProbeHttpFailure
        expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "HTTP status code check failed: {{ $labels.instance }}"
          description: "The HTTP status code is not in the 200-399 range, current status code: {{ $value }}"
      - alert: BlackboxSslCertificateWillExpireSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Certificate expiring soon: {{ $labels.instance }}"
          description: "The SSL certificate expires in less than 30 days, seconds remaining: {{ $value }}"
      - alert: BlackboxSslCertificateWillExpireSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Certificate expiring soon: {{ $labels.instance }}"
          description: "The SSL certificate expires in less than 3 days, seconds remaining: {{ $value }}"
      - alert: BlackboxSslCertificateExpired
        expr: probe_ssl_earliest_cert_expiry - time() <= 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Certificate expired: {{ $labels.instance }}"
          description: "The SSL certificate has already expired; please confirm whether it is still in use"

3.1.4 Custom systemd unit

cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
Type=simple
User=root
Group=root
Restart=on-failure
ExecStart=/opt/software/prometheus/prometheus/prometheus \
  --config.file=/opt/software/prometheus/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/software/prometheus/prometheus/data \
  --storage.tsdb.retention.time=60d \
  --web.enable-lifecycle

[Install]
WantedBy=multi-user.target
EOF

加载自定义的systemctl配置

systemctl daemon-reload

3.1.5 Starting Prometheus

# Start Prometheus
systemctl start prometheus
# Start Prometheus automatically on boot
systemctl enable prometheus

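3.1.6 Accessing the Prometheus UI

Open in a browser: http://192.168.25.41:9090/

[Screenshot: Prometheus UI]

3.1.7 Checking that the targets are loaded

[Screenshots: Prometheus targets page]

3.1.8 Checking that the rules are loaded

[Screenshots: Prometheus rules page]

3.1.9 Notes

Note 1: curl -X POST http://192.168.25.41:9090/-/reload reloads the Prometheus configuration at runtime (this works because the service is started with --web.enable-lifecycle).
Note 2: View the Prometheus logs with: journalctl -u prometheus.service
Note 3: Prometheus UI address: http://192.168.25.41:9090/
Note 4: Check the configuration file syntax with: ./promtool check config prometheus.yml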

3.2 Installing Grafana

3.2.1 Installation steps

Grafana download page: https://grafana.com/grafana/download/11.5.1?platform=linux

# Download the package
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-11.5.1.linux-amd64.tar.gz
# Extract to the target directory
tar -zxvf grafana-enterprise-11.5.1.linux-amd64.tar.gz -C /opt/software/prometheus/
# Rename
mv /opt/software/prometheus/grafana-v11.5.1 /opt/software/prometheus/grafana

3.2.2 Custom systemd service

cat > /etc/systemd/system/grafana-server.service << 'EOF'
[Unit]
Description=Grafana server
Documentation=http://docs.grafana.org

[Service]
Type=simple
User=root
Group=root
Restart=on-failure
ExecStart=/opt/software/prometheus/grafana/bin/grafana-server \
  --config=/opt/software/prometheus/grafana/conf/defaults.ini \
  --homepath=/opt/software/prometheus/grafana

[Install]
WantedBy=multi-user.target
EOF

Reload the systemd configuration so the new unit is picked up:

systemctl daemon-reload

3.2.3 Starting Grafana

# Start Grafana
systemctl start grafana-server
# Start Grafana automatically on boot
systemctl enable grafana-server

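3.2.4 Accessing the Grafana UI

Open in a browser: http://192.168.25.41:3000/
Default username/password: admin/admin
The default credentials are defined in /opt/software/prometheus/grafana/conf/defaults.ini

[Screenshot: Grafana login page]
Enter the username and password to log in.
[Screenshot]

3.2.5 Adding the Prometheus data source

[Screenshots: adding the data source]
Then keep clicking through to the next step until the data source is created.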
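As an alternative to clicking through the UI, the data source can also be created through Grafana's HTTP API (POST /api/datasources). The following is only a sketch: the function names and the data source name are made up for this example, and it assumes basic authentication with an admin account is still enabled.

```shell
# datasource_json NAME PROMETHEUS_URL
# Prints the JSON body that describes a Prometheus data source.
datasource_json() {
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' "$1" "$2"
}

# create_datasource GRAFANA_URL USER:PASSWORD NAME PROMETHEUS_URL
# Sends the data source definition to Grafana.
create_datasource() {
  curl -fsS -X POST "$1/api/datasources" \
    -u "$2" \
    -H 'Content-Type: application/json' \
    -d "$(datasource_json "$3" "$4")"
}
```

With the addresses used in this article, the call would be: create_datasource http://192.168.25.41:3000 admin:admin Prometheus http://192.168.25.41:9090 (use the new password if it was changed after the first login).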

3.2.6 Adding a dashboard (server monitoring)

Node Exporter Full dashboard on grafana.com: https://grafana.com/grafana/dashboards/1860-node-exporter-full/

Direct download link: https://grafana.com/api/dashboards/1860/revisions/33/download
[Screenshots: importing the dashboard]
Final result:
[Screenshots]

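3.2.7 Adding a dashboard (container monitoring)

Container monitoring dashboard on grafana.com: https://grafana.com/grafana/dashboards/11600-docker-container/

Direct download link: https://grafana.com/api/dashboards/11600/revisions/1/download

After downloading the JSON file, import it as a dashboard. The result looks like this:
[Screenshot]

3.2.8 Adding Java monitoring (JVM dashboard)

JVM monitoring dashboard on grafana.com: https://grafana.com/grafana/dashboards/4701-jvm-micrometer/

Direct download link: https://grafana.com/api/dashboards/4701/revisions/10/download

After downloading the JSON file, import it as a dashboard. The result looks like this:
[Screenshot]

3.2.9 Adding MySQL monitoring (MySQL dashboard)

MySQL monitoring dashboard on grafana.com: https://grafana.com/grafana/dashboards/7362-mysql-overview/

Direct download link: https://grafana.com/api/dashboards/7362/revisions/5/download

After downloading the JSON file, import it as a dashboard. The result looks like this:
[Screenshot]

3.2.10 Adding Nginx monitoring (Nginx dashboard)

Nginx monitoring dashboard on grafana.com: https://grafana.com/grafana/dashboards/12708-nginx/

Direct download link: https://grafana.com/api/dashboards/12708/revisions/1/download

After downloading the JSON file, import it as a dashboard. The result looks like this:
[Screenshot]

3.2.11 Adding Redis monitoring (Redis dashboard)

Redis monitoring dashboard on grafana.com: https://grafana.com/grafana/dashboards/11835-redis-dashboard-for-prometheus-redis-exporter-helm-stable-redis-ha

Direct download link: https://grafana.com/api/dashboards/11835/revisions/1/download

After downloading the JSON file, import it as a dashboard. The result looks like this:
[Screenshot]

3.2.12 Adding blackbox monitoring (blackbox dashboard)

Blackbox monitoring dashboard on grafana.com: https://grafana.com/grafana/dashboards/9965-1-blackbox-exporter-dashboard-20220412/

Direct download link: https://grafana.com/api/dashboards/9965/revisions/3/download

After downloading the JSON file, import it as a dashboard. The result looks like this:
[Screenshot]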

3.3 Installing AlertManager

3.3.1 Installation steps

Download the package from the official site:

# Download the package
wget https://github.com/prometheus/alertmanager/releases/download/v0.28.0/alertmanager-0.28.0.linux-amd64.tar.gz
# Extract to the target directory
tar -zxvf alertmanager-0.28.0.linux-amd64.tar.gz -C /opt/software/prometheus/
# Rename
mv /opt/software/prometheus/alertmanager-0.28.0.linux-amd64 /opt/software/prometheus/alertmanager

3.3.2 Editing alertmanager.yml

vim /opt/software/prometheus/alertmanager/alertmanager.yml

Full content:

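global:
  # SMTP server of 163 Mail
  smtp_smarthost: 'smtp.163.com:465'
  # Sender address
  smtp_from: 'xxxxxx@163.com'
  # SMTP user name, i.e. your mailbox address
  smtp_auth_username: 'xxxxxx@163.com'
  # SMTP password (the authorization code)
  smtp_auth_password: 'xxxxxxxxx'
  # TLS requirement; false turns it off
  smtp_require_tls: false

# Template files
templates:
  - '/opt/software/prometheus/alertmanager/template/*.tmpl'

route:
  group_by: ['alertname']
  # When an alert arrives, wait group_wait (10s) to see whether more alerts follow, and send them together
  group_wait: 10s
  # After a notification has been sent, newly arriving alerts of the same group wait group_interval before being sent
  group_interval: 10s
  # If a notification has been sent and the problem is not resolved, wait repeat_interval before sending it again
  repeat_interval: 1m
  # Default receiver. This parameter is required and must match one of the receiver names below
  #receiver: 'email'     # email notification
  #receiver: 'wechat'    # WeCom (Enterprise WeChat) notification
  receiver: 'web.hook'   # API (webhook) notification

receivers:
  - name: 'email'
    email_configs:
      # Recipient 1
      - to: 'xxxx@163.com'
        # Also notify when the alert is resolved
        send_resolved: true
        # Email template to use
        html: '{{ template "email.html" .}}'
      # Recipient 2
      - to: 'jxf1905216170@foxmail.com'
        # Also notify when the alert is resolved
        send_resolved: true
        # Email template to use
        html: '{{ template "email.html" .}}'
  - name: "wechat"
    webhook_configs:
      # URL exposed by the WeCom notification service, covered later
      - url: 'http://192.168.25.41:8999/webhook?key=xxxxxx'
        send_resolved: true
  - name: 'web.hook'
    webhook_configs:
      # API endpoint exposed by a user-defined service that processes the alert content
      - url: 'http://192.168.25.2:8080/webhook'

# Inhibit rule. Meaning: if an alert with severity "critical" (the source alert) is firing and another alert
# with severity "warning" (the target alert) exists, and both alerts have the same values for the alertname,
# dev and instance labels, the "warning" alert is neither sent nor shown.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']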
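A typo in alertmanager.yml is easy to miss, so it may be worth validating the file before (re)loading it. The sketch below assumes the amtool binary that ships in the Alertmanager tarball; the function name reload_alertmanager is made up for this example, and the reload step only makes sense once the Alertmanager service from 3.3.7 is running.

```shell
# reload_alertmanager AMTOOL CONFIG_FILE ALERTMANAGER_URL
# Validates CONFIG_FILE with "amtool check-config" and, only if that
# succeeds, asks the running Alertmanager to re-read its configuration.
reload_alertmanager() {
  "$1" check-config "$2" || return 1
  curl -fsS -X POST "$3/-/reload"
}
```

With the paths used in this article, the call would be: reload_alertmanager /opt/software/prometheus/alertmanager/amtool /opt/software/prometheus/alertmanager/alertmanager.yml http://192.168.25.41:9093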

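3.3.3 Email notification settings

3.3.3.1 Getting a 163 Mail authorization code

Location: Settings > POP3/SMTP/IMAP > Add authorization code
[Screenshots]

3.3.3.2 Defining a custom email template (optional)

Note: the custom email template is optional. If you do not define one, the default email template is used.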

mkdir /opt/software/prometheus/alertmanager/template/
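The article does not show the template file itself, so the following is only a minimal sketch of what it could look like. The template name "email.html" comes from alertmanager.yml; the file name email.tmpl (anything matching *.tmpl in the template directory would do), the function name write_email_template and the table layout are assumptions for illustration.

```shell
# write_email_template TEMPLATE_DIR
# Creates TEMPLATE_DIR/email.tmpl, which defines the "email.html"
# template referenced by the email receiver in alertmanager.yml.
write_email_template() {
  mkdir -p "$1"
  cat > "$1/email.tmpl" << 'EOF'
{{ define "email.html" }}
<table border="1">
<tr><th>Status</th><th>Alert</th><th>Instance</th><th>Severity</th><th>Summary</th><th>Description</th><th>Started at</th></tr>
{{ range .Alerts }}
<tr>
<td>{{ .Status }}</td>
<td>{{ .Labels.alertname }}</td>
<td>{{ .Labels.instance }}</td>
<td>{{ .Labels.severity }}</td>
<td>{{ .Annotations.summary }}</td>
<td>{{ .Annotations.description }}</td>
<td>{{ .StartsAt }}</td>
</tr>
{{ end }}
</table>
{{ end }}
EOF
}
```

With the directory used in this article, the call would be: write_email_template /opt/software/prometheus/alertmanager/template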

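3.3.4 WeCom (Enterprise WeChat) notification settings

3.3.4.1 Getting the WeCom robot webhook

1. First create a personal organization in WeCom, then create a robot in a group chat.
[Screenshot]
2. After the organization is created, open the all-staff group (the group that was just created):
[Screenshot]
3. Add a robot to the all-staff group:
[Screenshot]

Once the robot is added, it has a webhook address that contains a KEY. This KEY is important: it is needed when configuring WeCom alert notifications in alertmanager.yml. See section 3.3.2 above for the configuration.

[Screenshot]

3.3.4.2 Setting up the WeCom robot alert notification service

Download: https://download.csdn.net/download/jianghuchuang/90415760

After downloading and unpacking, the files look like this: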

[root@prometheus alertmanager-wechatrobot-webhook]# pwd
/opt/software/prometheus/alertmanager-wechatrobot-webhook
[root@prometheus alertmanager-wechatrobot-webhook]# ll
total 10700
-rwxr-xr-x 1 root root 10924527 Feb 16 18:50 alertmanager-wechatrobot-webhook
-rw-r--r-- 1 root root      265 Feb 16 18:51 alertmanager-wechatrobot-webhook.service
-rw-r--r-- 1 root root      554 Feb 16 18:50 Dockerfile
-rw-r--r-- 1 root root    11558 Feb 16 18:50 LICENSE
-rw-r--r-- 1 root root      684 Feb 16 18:50 README.md
-rw-r--r-- 1 root root      164 Feb 16 18:50 start.sh

Custom systemd service:

cat > /etc/systemd/system/alertmanager-wechatrobot-webhook.service << 'EOF'
[Unit]
Description=alertmanager-wechatrobot-webhook

[Service]
User=root
Group=root
Restart=on-failure
ExecStart=/opt/software/prometheus/alertmanager-wechatrobot-webhook/alertmanager-wechatrobot-webhook

[Install]
WantedBy=multi-user.target
EOF

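Start the service:

# Reload the systemd configuration
systemctl daemon-reload
# Start the service
systemctl start alertmanager-wechatrobot-webhook
# Start the service automatically on boot
systemctl enable alertmanager-wechatrobot-webhook

Note: once started, the service exposes the endpoint http://192.168.25.41:8999/webhook?key=xxxxxx. This address is the one configured in alertmanager.yml (see section 3.3.2 above). The key is the key from the WeCom robot webhook.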

3.3.5 Custom service notification settings (Spring Boot example)

A custom service receives the alert messages. Spring Boot 2.6.2 is used as the example.

3.3.5.1 Editing pom.xml

<!-- spring-boot-actuator dependency -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
    <version>2.6.2</version>
</dependency>
<!-- Prometheus dependency. Do NOT set a version here, otherwise the /actuator/prometheus endpoint cannot be reached -->
<!-- https://mvnrepository.com/artifact/io.micrometer/micrometer-registry-prometheus -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

3.3.5.2 Editing application.yml

The complete application.yml is as follows:

server:
  port: 8080
spring:
  application:
    name: my-application-name
#management:
#  endpoints:
#    web:
#      exposure:
#        include: "*"
#  metrics:
#    tags:
#      myApplication: ${spring.application.name}
#      myTag:  my-application-tag
management:
  metrics:
    tags:
      application: ${spring.application.name}
    export:
      prometheus:
        enabled: true
  # Expose the web endpoint
  endpoints:
    web:
      exposure:
        include: prometheus

3.3.5.3 Adding the webhook endpoint

package com.jxf.controller;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

@Slf4j
@RestController
@RequestMapping("")
public class WebController {

    // Formats alert timestamps in the server's local time zone (yyyy is the calendar year; YYYY would be the week-based year)
    private static final DateTimeFormatter TIME_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneId.systemDefault());

    /**
     * Handles the alert messages sent by Alertmanager.
     *
     * @param payload alert content, for example:
     * {"receiver":"web\\.hook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"http状态码检测失败","company":"外部公司","env":"dev","instance":"https://www.baidu.com","job":"blackbox_http","project":"外部项目","severity":"critical"},"annotations":{"description":"HTTP状态码非 200-399,当前状态码为:0","summary":"http状态码检测失败https://www.baidu.com"},"startsAt":"2025-02-16T13:44:39.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=probe_http_status_code+%3C%3D+199+or+probe_http_status_code+%3E%3D+400\u0026g0.tab=1","fingerprint":"43644ea5fd1763ae"},{"status":"firing","labels":{"alertname":"请求慢告警","company":"外部公司","env":"dev","instance":"https://www.baidu.com","job":"blackbox_http","project":"外部项目","severity":"warning"},"annotations":{"description":"请求时间超过1秒,值为:8.00394340575","summary":"请求慢https://www.baidu.com"},"startsAt":"2025-02-16T13:46:24.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=avg_over_time%28probe_duration_seconds%5B1m%5D%29+%3E+1\u0026g0.tab=1","fingerprint":"8e7993450dd4f17a"},{"status":"firing","labels":{"alertname":"黑盒子探测失败告警","company":"外部公司","env":"dev","instance":"https://www.baidu.com","job":"blackbox_http","project":"外部项目","severity":"critical"},"annotations":{"description":"黑盒子检测失败,当前值:0","summary":"黑盒子探测失败https://www.baidu.com"},"startsAt":"2025-02-16T13:44:39.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=probe_success+%3D%3D+0\u0026g0.tab=1","fingerprint":"62e21997d2d79e55"},{"status":"firing","labels":{"alertname":"http状态码检测失败","company":"外部公司","env":"dev","instance":"https://www.jd.com","job":"blackbox_http","project":"外部项目","severity":"critical"},"annotations":{"description":"HTTP状态码非 200-399,当前状态码为:0","summary":"http状态码检测失败https://www.jd.com"},"startsAt":"2025-02-16T13:44:54.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=probe_http_status_code+%3C%3D+199+or+probe_http_status_code+%3E%3D+400\u0026g0.tab=1","fingerprint":"9760f8b330d656e7"},{"status":"firing","labels":{"alertname":"请求慢告警","company":"外部公司","env":"dev","instance":"https://www.jd.com","job":"blackbox_http","project":"外部项目","severity":"warning"},"annotations":{"description":"请求时间超过1秒,值为:8.003534726249999","summary":"请求慢https://www.jd.com"},"startsAt":"2025-02-16T13:44:39.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=avg_over_time%28probe_duration_seconds%5B1m%5D%29+%3E+1\u0026g0.tab=1","fingerprint":"a5440fd5910568ef"},{"status":"firing","labels":{"alertname":"黑盒子探测失败告警","company":"外部公司","env":"dev","instance":"https://www.jd.com","job":"blackbox_http","project":"外部项目","severity":"critical"},"annotations":{"description":"黑盒子检测失败,当前值:0","summary":"黑盒子探测失败https://www.jd.com"},"startsAt":"2025-02-16T13:44:54.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=probe_success+%3D%3D+0\u0026g0.tab=1","fingerprint":"d18163c9f43b4cba"},{"status":"firing","labels":{"alertname":"请求慢告警","company":"内部公司","instance":"192.168.25.200","job":"blackbox_icmp","project":"测试项目","severity":"warning"},"annotations":{"description":"请求时间超过1秒,值为:9.50104220425","summary":"请求慢192.168.25.200"},"startsAt":"2025-02-16T13:44:39.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=avg_over_time%28probe_duration_seconds%5B1m%5D%29+%3E+1\u0026g0.tab=1","fingerprint":"f952edd5054e1767"},{"status":"firing","labels":{"alertname":"黑盒子探测失败告警","company":"内部公司","instance":"192.168.25.200","job":"blackbox_icmp","project":"测试项目","severity":"critical"},"annotations":{"description":"黑盒子检测失败,当前值:0","summary":"黑盒子探测失败192.168.25.200"},"startsAt":"2025-02-16T13:44:54.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=probe_success+%3D%3D+0\u0026g0.tab=1","fingerprint":"e79230366cea2dc2"}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"http://prometheus:9093","version":"4","groupKey":"{}:{}","truncatedAlerts":0}
     * @return a 200 response confirming that the alerts were received
     */
    @PostMapping("/webhook")
    public ResponseEntity<String> receiveAlert(@RequestBody String payload) {
        JSONObject data = JSON.parseObject(payload);
        JSONArray alerts = data.getJSONArray("alerts");
        log.error("====================== received {} alert(s) ========================", alerts.size());
        for (int i = 0; i < alerts.size(); i++) {
            JSONObject alert = alerts.getJSONObject(i);
            String status = alert.getString("status");
            // startsAt and endsAt arrive as ISO-8601 UTC timestamps
            String startsAt = TIME_FORMAT.format(Instant.parse(alert.getString("startsAt")));
            String endsAt = TIME_FORMAT.format(Instant.parse(alert.getString("endsAt")));
            JSONObject annotations = alert.getJSONObject("annotations");
            String summary = annotations.getString("summary");
            String description = annotations.getString("description");
            log.error("Alert received: status {}, start time: {}, end time: {}, summary: {}, description: {}",
                    status, startsAt, endsAt, summary, description);
        }
        log.error("====================================================");
        // Return a response indicating that the alerts were received successfully
        return new ResponseEntity<>("Alerts received", HttpStatus.OK);
    }
}
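To try the endpoint without waiting for a real alert, a hand-written request can be sent to it. The sketch below uses a shortened payload with only the fields the controller reads; the alert name, texts and function names are made up for this example, while the URL is the web.hook receiver address from alertmanager.yml.

```shell
# sample_webhook_payload
# Prints a minimal Alertmanager-style webhook body with one firing alert.
sample_webhook_payload() {
  cat << 'EOF'
{"receiver":"web.hook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"test summary","description":"test description"},"startsAt":"2025-02-16T13:44:39.052Z","endsAt":"0001-01-01T00:00:00Z"}],"version":"4"}
EOF
}

# post_sample_alert WEBHOOK_URL
# Sends the sample payload to the webhook endpoint (body is read from stdin).
post_sample_alert() {
  sample_webhook_payload | curl -fsS -X POST "$1" \
    -H 'Content-Type: application/json' \
    --data-binary @-
}
```

With the address used in this article, the call would be: post_sample_alert http://192.168.25.2:8080/webhook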

3.3.6 Custom systemd service

cat > /etc/systemd/system/alertmanager.service << 'EOF'
[Unit]
Description=Alert Manager
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=root
Group=root
ExecStart=/opt/software/prometheus/alertmanager/alertmanager \
  --config.file=/opt/software/prometheus/alertmanager/alertmanager.yml \
  --storage.path=/opt/software/prometheus/alertmanager/data
Restart=always

[Install]
WantedBy=multi-user.target
EOF

Reload the systemd configuration so the new unit is picked up:

systemctl daemon-reload

3.3.7 Starting AlertManager

# Start alertmanager
systemctl start alertmanager
# Start alertmanager automatically on boot
systemctl enable alertmanager
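Once the service is up, the notification path can be exercised by pushing a hand-made alert into Alertmanager's v2 API (POST /api/v2/alerts). This is only a sketch: the function names, the alert name and the annotation text are made up for this example.

```shell
# test_alert_json ALERTNAME SEVERITY INSTANCE
# Prints a JSON array with a single alert for the Alertmanager v2 API.
test_alert_json() {
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"summary":"manual test alert"}}]\n' "$1" "$2" "$3"
}

# send_test_alert ALERTMANAGER_URL ALERTNAME SEVERITY INSTANCE
# Posts the test alert; the configured default receiver should then be notified.
send_test_alert() {
  curl -fsS -X POST "$1/api/v2/alerts" \
    -H 'Content-Type: application/json' \
    -d "$(test_alert_json "$2" "$3" "$4")"
}
```

With the addresses used in this article, the call would be: send_test_alert http://192.168.25.41:9093 ManualTest warning 192.168.25.42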

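3.3.8 Alert results

3.3.8.1 Email alerts

Note: first make sure your service…

With the default email alert template, the result looks like this:
[Screenshots]

With the custom email alert template, the result looks like this:
[Screenshot]

3.3.8.2 WeCom alerts

[Screenshot]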
