
Prometheus + Grafana + AlertManager: Complete Installation Walkthrough

news 2025/7/6 8:00:43

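Contents

1. Overview
2. Installing software on the monitored server
- 2.1 Installing Docker and Docker-Compose
- 2.2 Containerized deployment of base software and exporters
3. Installing the core components on the monitoring server
- 3.1 Installing Prometheus
  - 3.1.1 Installation steps
  - 3.1.2 Editing prometheus.yml
  - 3.1.3 Adding alert rule files
  - 3.1.4 Creating a systemd unit
  - 3.1.5 Starting Prometheus
  - 3.1.6 Opening the Prometheus web UI
  - 3.1.7 Checking that targets are loaded
  - 3.1.8 Checking that rules are loaded
  - 3.1.9 Notes
- 3.2 Installing Grafana
  - 3.2.1 Installation steps
  - 3.2.2 Creating a systemd unit
  - 3.2.3 Starting Grafana
  - 3.2.4 Opening the Grafana web UI
  - 3.2.5 Adding the Prometheus data source
  - 3.2.6 Adding a dashboard (server monitoring)
  - 3.2.7 Adding a dashboard (container monitoring)
  - 3.2.8 Adding Java monitoring (JVM dashboard)
  - 3.2.9 Adding MySQL monitoring (MySQL dashboard)
  - 3.2.10 Adding Nginx monitoring (Nginx dashboard)
  - 3.2.11 Adding Redis monitoring (Redis dashboard)
  - 3.2.12 Adding black-box monitoring (Blackbox dashboard)
- 3.3 Installing AlertManager
  - 3.3.1 Installation steps
  - 3.3.2 Editing alertmanager.yml
  - 3.3.3 Email notification settings
    - 3.3.3.1 Getting a 163 Mail authorization code
    - 3.3.3.2 Customizing the email template (optional)
  - 3.3.4 WeCom (Enterprise WeChat) notification settings
    - 3.3.4.1 Getting the WeCom bot webhook
    - 3.3.4.2 Configuring the WeCom bot alert notification service
  - 3.3.5 Custom notification service settings (Spring Boot example)
    - 3.3.5.1 Editing pom.xml
    - 3.3.5.2 Editing application.yml
    - 3.3.5.3 Adding the webhook endpoint
  - 3.3.6 Creating a systemd unit
  - 3.3.7 Starting AlertManager
  - 3.3.8 Alert results
    - 3.3.8.1 Email alerts
    - 3.3.8.2 WeCom alerts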

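1. Overview

Servers:

Server name	IP	CPU	Memory	Description
prometheus	192.168.25.41	1	1.9G	Runs Prometheus, Grafana, and AlertManager. Acts as the monitoring server
prometheus-monitor-node	192.168.25.42	1	1.9G	Runs some base software and application services. Acts as the monitored server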

Software installed on the monitoring server (192.168.25.41):

IP	Port	Software	Version	Install method	Package file	Notes
192.168.25.41	9090	prometheus	3.1.0	binary	prometheus-3.1.0.linux-amd64.tar.gz	Core monitoring component
192.168.25.41	3000	grafana	11.5.1	binary	grafana-enterprise-11.5.1.linux-amd64.tar.gz	Graphical display of monitoring data
192.168.25.41	9093	alertmanager	0.28.0	binary	alertmanager-0.28.0.linux-amd64.tar.gz	Alert notifications
192.168.25.41	9100	node_exporter	1.8.2	binary	node_exporter-1.8.2.linux-amd64.tar.gz	(Optional) Collects server resource metrics and exposes them to Prometheus

Software installed on the monitored server (192.168.25.42):

IP	Port	Software	Version	Install method	Notes
192.168.25.42		docker	24.0.2	binary	Runs the other software and services
192.168.25.42		docker-compose	v2.5.0	binary
192.168.25.42	80	nginx	1.27.4	container	Nginx service
192.168.25.42	6379	redis	6.2.17	container	Redis service
192.168.25.42	3306	mysql	8.0.37	container	MySQL service
192.168.25.42	8081	java-web-demo	0.0.4	container	Java application service (a private image of mine)
192.168.25.42	8080	cadvisor	v0.33.0	container	Collects Docker monitoring data
192.168.25.42	9091	pushgateway	v1.11.0	container	Clients push collected data to the Pushgateway, and Prometheus then scrapes it from there
192.168.25.42	9100	node_exporter	v1.5.0	container	Collects server monitoring data
192.168.25.42	9113	nginx_exporter	0.11	container	Collects Nginx monitoring data
192.168.25.42	9121	redis_exporter	untagged (latest)	container	Collects Redis monitoring data
192.168.25.42	9104	mysqld-exporter	untagged (latest)	container	Collects MySQL monitoring data (also works with MariaDB)
192.168.25.42	9115	blackbox_exporter	v0.25.0	container	Probes external addresses over http, tcp, icmp, and so on

Note: software downloads: https://prometheus.io/download/

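2. Installing software on the monitored server

2.1 Installing Docker and Docker-Compose

Installation guide: "Centos7.9离线安装Docker24（无坑版）" (offline installation of Docker 24 on CentOS 7.9)

Or use the "docker-v1.24 一键部署脚本" (one-click deployment script) directly.

2.2 Containerized deployment of base software and exporters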

Create a docker-compose.yaml file with the following content:

version: '3.3'

volumes:
  prometheus_data: {}
  grafana_data: {}

networks:
  monitoring:
    driver: bridge

services:
  nginx:
    image: nginx:1.27.4
    container_name: nginx
    restart: always
    ports:
      - 80:80
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /root/default.conf:/etc/nginx/conf.d/default.conf
    environment:
      - TZ=Asia/Shanghai

  redis:
    image: redis:6.2.17
    container_name: redis
    command: redis-server --requirepass 123456 --maxmemory 512mb
    restart: always
    #volumes:
    #  - /data/redis/data:/data
    ports:
      - 6379:6379

  db:
    image: mysql:8.0.37
    restart: always
    container_name: mysql
    environment:
      TZ: Asia/Shanghai
      LANG: en_US.UTF-8
      MYSQL_ROOT_PASSWORD: 123456
    command:
      --default-authentication-plugin=mysql_native_password
      --character-set-server=utf8mb4
      --collation-server=utf8mb4_general_ci
      --lower_case_table_names=1
      --performance_schema=1
      --sql-mode=""
      --skip-log-bin
    volumes:
      - /data/mysql/conf:/etc/mysql/conf.d  # configuration files
      - /data/mysql/data:/var/lib/mysql     # data files
    ports:
      - 3306:3306

  java-web-demo:
    image: java-web-demo:0.0.4
    container_name: springboot-web-demo
    restart: always
    ports:
      - 8081:8080

  pushgateway:
    image: prom/pushgateway:v1.11.0
    container_name: pushgateway
    restart: always
    ports:
      - "9091:9091"

  cadvisor:
    image: google/cadvisor:v0.33.0
    container_name: cadvisor
    restart: always
    privileged: true
    ports:
      - 8080:8080
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    environment:
      - TZ=Asia/Shanghai
    networks:
      - monitoring

  node_exporter:
    image: prom/node-exporter:v1.5.0
    container_name: node-exporter
    restart: always
    ports:
      - 9100:9100  # has no effect with network_mode: host; the port is exposed on the host directly
    network_mode: "host"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    environment:
      TZ: Asia/Shanghai
    command:
      - '--web.listen-address=:9100'
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - "--path.rootfs=/rootfs"
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker)($$|/)'

  nginx_exporter:
    image: nginx/nginx-prometheus-exporter:0.11
    container_name: nginx_exporter
    hostname: nginx_exporter
    command:
      - '-nginx.scrape-uri=http://192.168.25.42/stub_status'
    restart: always
    ports:
      - "9113:9113"

  redis_exporter:
    image: oliver006/redis_exporter
    container_name: redis_exporter
    restart: always
    environment:
      REDIS_ADDR: "192.168.25.42:6379"
      REDIS_PASSWORD: 123456
    ports:
      - "9121:9121"

  mysqld-exporter:
    image: prom/mysqld-exporter
    container_name: mysqld-exporter
    restart: always
    command:
      - '--collect.info_schema.processlist'
      - '--collect.info_schema.innodb_metrics'
      - '--collect.info_schema.tablestats'
      - '--collect.info_schema.tables'
      - '--collect.info_schema.userstats'
      - '--collect.engine_innodb_status'
      - '--config.my-cnf=/my.cnf'
    volumes:
      - /root/my.cnf:/my.cnf
    ports:
      - 9104:9104

  blackbox_exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox_exporter
    restart: always
    volumes:
      - /data/blackbox_exporter:/etc/blackbox_exporter
    ports:
      - 9115:9115

The volume mounts above map files from the host into the containers, so the following files and directories must exist on the host.

Full content of the Nginx default.conf file:

[root@prometheus-monitor-node ~]# cat default.conf
server {
    listen       80;
    listen  [::]:80;
    server_name  localhost;

    #access_log  /var/log/nginx/host.access.log  main;

    location / {
        root   /usr/share/nginx/html;
        index  index.html index.htm;
    }

    location /stub_status {
        stub_status on;
        access_log off;
        #allow <IP of nginx_exporter>;
        allow 0.0.0.0/0;
        deny all;
    }

    #error_page  404              /404.html;

    # redirect server error pages to the static page /50x.html
    #
    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   /usr/share/nginx/html;
    }
}

Full content of the my.cnf file used by mysqld-exporter to connect to MySQL:

[root@prometheus-monitor-node ~]# cat my.cnf 
[client]
host=192.168.25.42
user=exporter
password=123456
port=3306

Full content of the blackbox_exporter config.yml file:

Note: /data/blackbox_exporter contains only this config.yml file and nothing else.

mkdir -p /data/blackbox_exporter

[root@prometheus-monitor-node ~]# cat /data/blackbox_exporter/config.yml 
modules:
  http_2xx:
    prober: http
    http:
      method: GET
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5

Start the services:

docker-compose -f docker-compose.yaml up -d

The result looks like this:

[root@prometheus-monitor-node ~]# docker ps
CONTAINER ID   IMAGE                                  COMMAND                   CREATED          STATUS                    PORTS                                                  NAMES
e45f0d5ece57   google/cadvisor:v0.33.0                "/usr/bin/cadvisor -…"   32 minutes ago   Up 32 minutes (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp              cadvisor
394ca6fc8915   prom/pushgateway:v1.11.0               "/bin/pushgateway"        3 days ago       Up 55 minutes             0.0.0.0:9091->9091/tcp, :::9091->9091/tcp              pushgateway
f5e38277ceca   prom/blackbox-exporter:v0.25.0         "/bin/blackbox_expor…"   7 days ago       Up 55 minutes             0.0.0.0:9115->9115/tcp, :::9115->9115/tcp              blackbox_exporter
1e38cd21cf0e   nginx:1.27.4                           "/docker-entrypoint.…"   7 days ago       Up 55 minutes             0.0.0.0:80->80/tcp, :::80->80/tcp                      nginx
bdc5b8b4b4f4   java-web-demo:0.0.4                    "java -jar app.jar"       9 days ago       Up 55 minutes             0.0.0.0:8081->8080/tcp, :::8081->8080/tcp              springboot-web-demo
e0b0e18151ea   prom/mysqld-exporter                   "/bin/mysqld_exporte…"   10 days ago      Up 55 minutes             0.0.0.0:9104->9104/tcp, :::9104->9104/tcp              mysqld-exporter
1528db64726b   redis:6.2.17                           "docker-entrypoint.s…"   10 days ago      Up 55 minutes             0.0.0.0:6379->6379/tcp, :::6379->6379/tcp              redis
82eb7094818c   mysql:8.0.37                           "docker-entrypoint.s…"   10 days ago      Up 55 minutes             0.0.0.0:3306->3306/tcp, :::3306->3306/tcp, 33060/tcp   mysql
452c7fb8c129   oliver006/redis_exporter               "/redis_exporter"         10 days ago      Up 55 minutes             0.0.0.0:9121->9121/tcp, :::9121->9121/tcp              redis_exporter
c63f2666c72b   nginx/nginx-prometheus-exporter:0.11   "/usr/bin/nginx-prom…"   10 days ago      Up 55 minutes             0.0.0.0:9113->9113/tcp, :::9113->9113/tcp              nginx_exporter
50ca88f777b8   prom/node-exporter:v1.5.0              "/bin/node_exporter …"   10 days ago      Up 55 minutes                                                                    node-exporter

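Note 1: list containers with docker: docker ps
Note 2: restart a container with docker: docker restart <container name>
Note 3: follow a container's logs with docker: docker logs -f <container name>
Note 4: list containers with docker-compose: docker-compose ps
Note 5: restart or recreate a single service with docker-compose: docker-compose -f docker-compose.yaml up -d <service name>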

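3. Installing the core components on the monitoring server

3.1 Installing Prometheus

3.1.1 Installation steps

# Download the package
wget https://github.com/prometheus/prometheus/releases/download/v3.1.0/prometheus-3.1.0.linux-amd64.tar.gz
# Create the installation directory
mkdir -p /opt/software/prometheus/
# Extract into that directory
tar -zxvf prometheus-3.1.0.linux-amd64.tar.gz -C /opt/software/prometheus/
# Rename the extracted directory
mv /opt/software/prometheus/prometheus-3.1.0.linux-amd64 /opt/software/prometheus/prometheus

This is my directory layout after all configuration was finished (refer back to it if the later steps get confusing):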

[root@prometheus prometheus]# tree /opt/software/prometheus/prometheus
├── data
├── EOF
├── LICENSE
├── NOTICE
├── prometheus
├── prometheus.yml
├── prometheus.yml_bak
├── promtool
├── rules
│   ├── alert.yml
│   ├── blackbox_exporter.yml
│   ├── docker.yml
│   ├── mysqld.yml
│   ├── nginx.yml
│   ├── node-exporter.yml
│   ├── redis.yml
│   └── springboot.yml
└── targets
    ├── blackbox_http.yml
    ├── blackbox_icmp.yml
    ├── blackbox_tcp.yml
    ├── pushgateway.yml
    ├── springboot.yml
    └── targets.yml

3.1.2 Editing prometheus.yml

Note: scrape targets can be configured in two ways: statically, inside prometheus.yml itself, or dynamically, through file-based service discovery. Try both if you like; the second is recommended.

Method 1 (static configuration; Prometheus must be reloaded manually after every change):

Edit prometheus.yml so that its full content is:

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.25.41:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "alertmanager"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.41:9093"]

  - job_name: "node_exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.41:9100"]
        labels:
          instance: Prometheus server
          myCusLabel: myCusVal
      - targets: ["192.168.25.42:9100"]
        labels:
          instance: monitored server 42
          myCusLabel: myCusVal

  - job_name: "cadvisor"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:8080"]
        labels:
          instance: monitored server 42
          myCusLabel: myCusVal

  - job_name: "nginx-exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:9113"]
        labels:
          instance: monitored server 42
          myCusLabel: myCusVal

  - job_name: "redis-exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:9121"]
        labels:
          instance: monitored server 42
          myCusLabel: myCusVal

  - job_name: "mysql-exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:9104"]
        labels:
          instance: monitored server 42
          myCusLabel: myCusVal

  - job_name: "springboot-demo"
    scrape_interval: 15s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ["192.168.25.42:8081"]
        labels:
          instance: monitored server 42
          myCusLabel: myCusVal
          mytype: springboot

  - job_name: "pushgateway"
    scrape_interval: 15s
    honor_labels: true  # keep the labels carried by the pushed metrics instead of overwriting them with the Pushgateway target's labels of the same name
    static_configs:
      - targets: ["192.168.25.42:9091"]
        labels:
          instance: monitored server 42
          myCusLabel: myCusVal

  - job_name: "blackbox_http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.baidu.com
          - https://www.jd.com
        labels:
          company: "external company"
          project: "external project"
          env: "dev"
      - targets:
          - http://192.168.25.42:8081/
        labels:
          describe: springboot-web-demo application
          company: "internal company"
          project: "internal project"
          env: "test"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115

  - job_name: "blackbox_tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - 192.168.25.42:22
          - 192.168.25.41:9090
        labels:
          company: "internal company"
          project: "test project"
          env: "test"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115

  # ICMP (ping) checks
  - job_name: "blackbox_icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 192.168.25.42
          - 192.168.25.200  # an IP that does not exist
        labels:
          company: "internal company"
          project: "test project"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115
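
The three relabel_configs steps in the blackbox jobs are easier to follow when traced by hand. The following is a minimal Python sketch, not Prometheus's own implementation: it mimics only the default `replace` action, and the function names `apply_blackbox_relabel` and `scrape_url` are hypothetical. The order of query parameters in a real scrape may differ.

```python
from urllib.parse import urlencode


def apply_blackbox_relabel(target: str, module: str, exporter: str) -> dict:
    """Mimic the three relabel steps used by the blackbox jobs (simplified)."""
    # `params: module: [...]` shows up as the __param_module label.
    labels = {"__address__": target, "__param_module": module}
    # Step 1: copy the original target into the `target` URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: expose the probed address as the `instance` label.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual scrape at the blackbox exporter.
    labels["__address__"] = exporter
    return labels


def scrape_url(labels: dict, metrics_path: str = "/probe") -> str:
    """Build the URL that would be requested after relabeling."""
    query = urlencode({
        "module": labels["__param_module"],
        "target": labels["__param_target"],
    })
    return f"http://{labels['__address__']}{metrics_path}?{query}"


labels = apply_blackbox_relabel(
    "https://www.baidu.com", "http_2xx", "192.168.25.42:9115"
)
print(labels["instance"])
print(scrape_url(labels))
```

The point of the design is that Prometheus never contacts the probed site itself: it always scrapes the exporter on port 9115 and passes the real target as a query parameter, while `instance` keeps the probed address so that alerts stay readable.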

Method 2 (file-based service discovery; changes to the target files are picked up automatically):
Edit prometheus.yml so that its full content is:

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.25.41:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "file-target"
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/targets.yml

  - job_name: "file-sd-pushgateway"
    scrape_interval: 15s
    honor_labels: true  # when true, labels carried by the pushed metrics are kept instead of being overwritten by the Pushgateway target's labels of the same name
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/pushgateway.yml

  - job_name: "file-application"
    scrape_interval: 15s
    metrics_path: '/actuator/prometheus'
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/springboot.yml

  - job_name: "file-blackbox_http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/blackbox_http.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - source_labels: [__param_target]
        target_label: __param_myparam
        replacement: myparamVal
      - target_label: __address__
        replacement: 192.168.25.42:9115
      # Labels of the form __xx__ are hidden by default. The two lines below copy
      # each __xx__ label to an xx label so that it becomes visible.
      #- regex: "__(.*)__"
      #  action: labelmap

  - job_name: "file-blackbox_tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/blackbox_tcp.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115

  # ICMP (ping) checks
  - job_name: "file-blackbox_icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/blackbox_icmp.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115

Create the targets directory, which holds the service discovery files.

mkdir -p /opt/software/prometheus/prometheus/targets/

Add /opt/software/prometheus/prometheus/targets/targets.yml with the following content:

- targets: ["192.168.25.41:9090"]
  labels:
    job: prometheus
- targets: ["192.168.25.41:9093"]
  labels:
    job: alertmanager
- targets: ["192.168.25.41:9100"]
  labels:
    job: node_exporter
    instance: Prometheus server
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9100"]
  labels:
    job: node_exporter
    instance: monitored server 42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:8080"]
  labels:
    job: cadvisor
    instance: monitored server 42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9113"]
  labels:
    job: nginx-exporter
    instance: monitored server 42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9121"]
  labels:
    job: redis-exporter
    instance: monitored server 42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9104"]
  labels:
    job: mysql-exporter
    instance: monitored server 42
    myCusLabel: myCusVal

Add /opt/software/prometheus/prometheus/targets/springboot.yml with the following content:

- targets: ["192.168.25.42:8081"]
  labels:
    job: java-application
    instance: monitored server 42
    myCusLabel: myCusVal
    mytype: springboot

Add /opt/software/prometheus/prometheus/targets/pushgateway.yml with the following content:

- targets: ["192.168.25.42:9091"]
  labels:
    job: pushgateway
    instance: monitored server 42
    myCusLabel: myCusVal

Add /opt/software/prometheus/prometheus/targets/blackbox_http.yml with the following content:

- targets:
    - https://www.baidu.com
    - https://www.jd.com
  labels:
    job: blackbox_http
    company: "external company"
    project: "external project"
    env: "dev"
- targets:
    - http://192.168.25.42:8081/
  labels:
    job: blackbox_http
    describe: springboot-web-demo application
    company: "internal company"
    project: "internal project"
    env: "dev"

Add /opt/software/prometheus/prometheus/targets/blackbox_tcp.yml with the following content (the same host:port targets as the blackbox_tcp job in Method 1):

- targets:
    - 192.168.25.42:22    # monitored endpoint
    - 192.168.25.41:9090  # monitored endpoint
  labels:
    job: blackbox_tcp
    company: "internal company"
    project: "test project"
    env: "test"

Add /opt/software/prometheus/prometheus/targets/blackbox_icmp.yml with the following content:

- targets:
    - 192.168.25.42   # monitored IP
    - 192.168.25.200  # monitored IP
  labels:
    job: blackbox_icmp
    company: "internal company"
    project: "test project"
    env: "dev"
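
Because file-based service discovery only reads files, the target lists can also be generated by a script instead of being edited by hand. The sketch below is one possible approach, not part of the original setup: the helper `target_group` is hypothetical, and it emits JSON, which file_sd also accepts. A generated file such as targets/targets.json (an assumed name) would have to be listed under `files:` in the matching file_sd_configs entry.

```python
import json


def target_group(addresses, job, **extra_labels):
    """Return one file_sd target group: a list of targets plus shared labels."""
    labels = {"job": job, **extra_labels}
    return {"targets": list(addresses), "labels": labels}


groups = [
    target_group(["192.168.25.41:9100"], "node_exporter", myCusLabel="myCusVal"),
    target_group(["192.168.25.42:9100"], "node_exporter", myCusLabel="myCusVal"),
    target_group(["192.168.25.42:8080"], "cadvisor", myCusLabel="myCusVal"),
]

# Serialize the groups; writing the text to disk is left to the caller.
document = json.dumps(groups, indent=2, ensure_ascii=False)
print(document)
```

Each group mirrors one `- targets: [...] / labels: {...}` entry from the YAML files above, so the two formats can be mixed across different files.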

3.1.3 Adding alert rule files

Create the rules directory:

mkdir -p /opt/software/prometheus/prometheus/rules/

Add /opt/software/prometheus/prometheus/rules/alert.yml with the following content:

groups:
- name: Prometheus alert
  rules:
  - alert: ServiceDown
    expr: up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Service unavailable, instance {{ $labels.instance }}"
      description: "Service {{ $labels.job }} is down"
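
The `for: 30s` clause means the alert does not fire on the first failed evaluation: it stays pending until `up == 0` has held for 30 seconds. A minimal sketch of that state machine follows. It is a simplification under stated assumptions (a single time series, evaluations exactly `interval` seconds apart), and `alert_states` is a hypothetical helper, not a Prometheus API.

```python
def alert_states(samples, for_seconds, interval=15):
    """Return the alert state after each rule evaluation.

    `samples` holds the value of `up` seen at each evaluation, taken
    `interval` seconds apart. Real Prometheus tracks this per label set.
    """
    states = []
    active_since = None
    for step, up in enumerate(samples):
        now = step * interval
        if up != 0:
            active_since = None      # condition cleared, reset the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now       # first evaluation where up == 0
        if now - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


print(alert_states([1, 0, 0, 0, 0, 1], for_seconds=30))
```

With the 15-second evaluation interval used in this setup, a target has to be down for three consecutive evaluations before AlertManager is notified, and a single recovered scrape resets the timer.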

Add /opt/software/prometheus/prometheus/rules/node-exporter.yml with the following content:

groups:
- name: node-exporter
  rules:
  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host out of memory, instance: {{ $labels.instance }}"
      description: "Available memory < 10%, current value: {{ $value }}"
  - alert: HostMemoryUnderMemoryPressure
    expr: rate(node_vmstat_pgmajfault[1m]) > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host under memory pressure, instance: {{ $labels.instance }}"
      description: "The node is under heavy memory pressure with a high rate of major page faults, current value: {{ $value }}"
  - alert: HostUnusualNetworkThroughputIn
    expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual inbound network throughput, instance: {{ $labels.instance }}"
      description: "Inbound network traffic > 100 MB/s, current value: {{ $value }}"
  - alert: HostUnusualNetworkThroughputOut
    expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual outbound network throughput, instance: {{ $labels.instance }}"
      description: "Outbound network traffic > 100 MB/s, current value: {{ $value }}"
  - alert: HostUnusualDiskReadRate
    expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read rate, instance: {{ $labels.instance }}"
      description: "Disk reads > 50 MB/s, current value: {{ $value }}"
  - alert: HostUnusualDiskWriteRate
    expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write rate, instance: {{ $labels.instance }}"
      description: "Disk writes > 50 MB/s, current value: {{ $value }}"
  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host out of disk space, instance: {{ $labels.instance }}"
      description: "Free disk space < 10%, current value: {{ $value }}"
  - alert: HostDiskWillFillIn24Hours
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Disk will fill within 24 hours, instance: {{ $labels.instance }}"
      description: "At the current write rate the filesystem is predicted to run out of space within 24 hours, current value: {{ $value }}"
  - alert: HostOutOfInodes
    expr: node_filesystem_files_free{mountpoint ="/"} / node_filesystem_files{mountpoint="/"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host out of inodes, instance: {{ $labels.instance }}"
      description: "Free inodes < 10%, current value: {{ $value }}"
  - alert: HostUnusualDiskReadLatency
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read latency, instance: {{ $labels.instance }}"
      description: "Disk read latency > 100ms, current value: {{ $value }}"
  - alert: HostUnusualDiskWriteLatency
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write latency, instance: {{ $labels.instance }}"
      description: "Disk write latency > 100ms, current value: {{ $value }}"
  - alert: high_load
    expr: node_load1 > 4
    for: 2m
    labels:
      severity: page
    annotations:
      summary: "1-minute load average too high, instance: {{ $labels.instance }}"
      description: "1-minute load average > 4 for 2 minutes, current value: {{ $value }}"
  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage, instance: {{ $labels.instance }}"
      description: "CPU usage > 80%, current value: {{ $value }}"
  - alert: HostCpuStealNoisyNeighbor
    expr: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Abnormal CPU steal, instance: {{ $labels.instance }}"
      description: "CPU steal > 10%. A noisy neighbor is hurting VM performance, or a spot instance may be out of credit, current value: {{ $value }}"
  - alert: HostSwapIsFillingUp
    expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Abnormal swap usage, instance: {{ $labels.instance }}"
      description: "Swap usage > 80%"
  - alert: HostNetworkReceiveErrors
    expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Network receive errors, instance: {{ $labels.instance }}"
      description: "Interface {{ $labels.device }} had a receive error ratio above 0.01 over the last 2 minutes, current value: {{ $value }}"
  - alert: HostNetworkTransmitErrors
    expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Network transmit errors, instance: {{ $labels.instance }}"
      description: "Interface {{ $labels.device }} had a transmit error ratio above 0.01 over the last 2 minutes, current value: {{ $value }}"
  - alert: HostNetworkInterfaceSaturated
    expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8 < 10000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Network interface saturated, instance: {{ $labels.instance }}"
      description: "Interface {{ $labels.device }} is overloaded, current value: {{ $value }}"
  - alert: HostConntrackLimit
    expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Connection tracking table nearly full, instance: {{ $labels.instance }}"
      description: "Too many tracked connections, current usage ratio: {{ $value }}"
  - alert: HostClockSkew
    expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Clock skew detected, instance: {{ $labels.instance }}"
      description: "Clock skew detected, the clock is out of sync. Value: {{ $value }}"
  - alert: HostClockNotSynchronising
    expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Clock not synchronising, instance: {{ $labels.instance }}"
      description: "The clock is not synchronising"
  - alert: NodeFileDescriptorLimit
    expr: node_filefd_allocated / node_filefd_maximum * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "The kernel is predicted to exhaust its file descriptor limit soon"
      description: "{{ $labels.instance }} has allocated more than 80% of its file descriptor limit, current value: {{ $value }}"
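
The HostDiskWillFillIn24Hours rule relies on `predict_linear`, which fits a straight line through the samples of the last hour and extrapolates it 24 hours ahead. The sketch below illustrates the arithmetic with an ordinary least-squares fit. It is a simplified stand-in rather than the Prometheus implementation, and the sample data is invented for illustration.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares fit over (timestamp, value) pairs, extrapolated forward.

    Simplified: the prediction is taken relative to the newest sample's
    timestamp, whereas Prometheus anchors it at the rule evaluation time.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Hypothetical data: free space drops by 1 GiB every 10 minutes for an hour.
GIB = 1024 ** 3
samples = [(i * 600, (20 - i) * GIB) for i in range(7)]
predicted = predict_linear(samples, 24 * 3600)
print(predicted / GIB)  # negative: the disk would be full before 24 h pass
```

A negative prediction is exactly what the rule tests for with `< 0`. The extra `< 10` percent condition in the rule keeps it quiet on large disks that are shrinking but still have plenty of room.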

Add /opt/software/prometheus/prometheus/rules/docker.yml with the following content:

groups:
- name: DockerContainers
  rules:
  - alert: ContainerKilled
    expr: time() - container_last_seen > 60
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Docker container killed, instance: {{ $labels.instance }}"
      description: "A container has disappeared, last seen {{ $value }} seconds ago"
  # This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.
  - alert: ContainerAbsent
    expr: absent(container_last_seen)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "No containers, instance: {{ $labels.instance }}"
      description: "No container has been seen for 5 minutes, value: {{ $value }}"
  - alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) > 300
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage alert, instance: {{ $labels.instance }}"
      description: "Container CPU usage is above 300%, current value: {{ $value }}"
  - alert: ContainerMemoryUsage
    expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container memory usage alert, instance: {{ $labels.instance }}"
      description: "Container memory usage is above 80%, current value: {{ $value }}"
  - alert: ContainerVolumeIoUsage
    expr: (sum(container_fs_io_current{name!=""}) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container storage I/O usage alert, instance: {{ $labels.instance }}"
      description: "Container storage I/O usage is above 80%, current value: {{ $value }}"
  - alert: ContainerHighThrottleRate
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container throttling alert, instance: {{ $labels.instance }}"
      description: "The container is being throttled, current value: {{ $value }}"

Add /opt/software/prometheus/prometheus/rules/mysqld.yml with the following content:

groups:
- name: MySQL
  rules:
  - alert: MysqlDown
    expr: mysql_up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "MySQL down, instance: {{ $labels.instance }}"
      description: "mysqld_exporter cannot connect to MySQL, current state: {{ $value }}"
  - alert: MysqlTooManyConnections
    expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Too many MySQL connections, instance: {{ $labels.instance }}"
      description: "MySQL connections > 80% of max_connections, current value: {{ $value }}"
  - alert: MysqlHighThreadsRunning
    expr: max_over_time(mysql_global_status_threads_running[1m]) > 20
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Too many running MySQL threads, instance: {{ $labels.instance }}"
      description: "Running MySQL threads > 20, currently running: {{ $value }}"
  - alert: MysqlSlowQueries
    expr: increase(mysql_global_status_slow_queries[2m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "MySQL slow query alert, instance: {{ $labels.instance }}"
      description: "MySQL logged {{ $value }} new slow queries in the last 2 minutes"
  # MySQL InnoDB log writes are stalling
  - alert: MysqlInnodbLogWaits
    expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "MySQL InnoDB log waits, instance: {{ $labels.instance }}"
      description: "MySQL InnoDB log writes are stalling, current value: {{ $value }}"
  - alert: MysqlRestarted
    expr: mysql_global_status_uptime < 60
    for: 0m
    labels:
      severity: info
    annotations:
      summary: "MySQL restarted, instance: {{ $labels.instance }}"
      description: "MySQL was restarted less than one minute ago"
  - alert: RowLockCurrentWaits
    expr: mysql_global_status_innodb_row_lock_current_waits > 0
    for: 1m
    labels:
      severity: info
    annotations:
      summary: "MySQL row lock waits, instance: {{ $labels.instance }}"
      description: "There are currently {{ $value }} row lock waits"

Add /opt/software/prometheus/prometheus/rules/nginx.yml with the following content:

groups:
- name: nginx
  rules:
  # Alert on any instance that has been unreachable for more than 30 seconds
  - alert: NginxDown
    expr: nginx_up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Nginx failure, instance: {{ $labels.instance }}"
      description: "{{ $labels.job }}: nginx is down"

Add /opt/software/prometheus/prometheus/rules/redis.yml with the following content:

groups:
- name: redis
  rules:
  - alert: RedisDown
    expr: redis_up == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: 'Redis down, instance: {{ $labels.instance }}'
      description: "The Redis instance is down"
  - alert: RedisMissingBackup
    expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Redis backup missing, instance: {{ $labels.instance }}"
      description: "Redis has not been backed up for 24 hours"
  - alert: RedisOutOfConfiguredMaxmemory
    expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis is running out of configured maxmemory, instance: {{ $labels.instance }}"
      description: "Redis memory usage is above 90% of the configured maxmemory"
  - alert: RedisTooManyConnections
    expr: redis_connected_clients > 100
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Too many Redis connections, instance: {{ $labels.instance }}"
      description: "Current number of Redis connections: {{ $value }}"
  - alert: RedisNotEnoughConnections
    expr: redis_connected_clients < 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Not enough Redis connections, instance: {{ $labels.instance }}"
      description: "Current number of Redis connections: {{ $value }}"
  - alert: RedisRejectedConnections
    expr: increase(redis_rejected_connections_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Redis rejected connections, instance: {{ $labels.instance }}"
      description: "Some connections to Redis were rejected: {{ $value }}"

Add /opt/software/prometheus/prometheus/rules/springboot.yml with the following content:

groups:
- name: SpringBoot
  rules:
  - alert: SpringBootErrorEvents
    expr: increase(logback_events_total{level="error"}[3m]) > 0
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Spring Boot error events, instance: {{ $labels.instance }}"
      description: "{{ $value }} new error events in the last 3 minutes"

Add /opt/software/prometheus/prometheus/rules/blackbox_exporter.yml with the following content:

groups:
- name: Blackbox
  rules:
  - alert: BlackboxProbeFailed
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Blackbox probe failed {{ $labels.instance }}"
      description: "The blackbox probe failed, current value: {{ $value }}"
  - alert: BlackboxSlowProbe
    expr: avg_over_time(probe_duration_seconds[1m]) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Slow request {{ $labels.instance }}"
      description: "The request took longer than 1 second, value: {{ $value }}"
  - alert: BlackboxHttpStatusCodeFailure
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "HTTP status code check failed {{ $labels.instance }}"
      description: "The HTTP status code is outside 200-399, current status code: {{ $value }}"
  - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Certificate expiring soon {{ $labels.instance }}"
      description: "The SSL certificate expires within 30 days, value: {{ $value }}"
  - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Certificate expiring soon {{ $labels.instance }}"
      description: "The SSL certificate expires within 3 days, value: {{ $value }}"
  - alert: BlackboxSslCertificateExpired
    expr: probe_ssl_earliest_cert_expiry - time() <= 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Certificate expired {{ $labels.instance }}"
      description: "The SSL certificate has already expired, check whether it is still in use"
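
The three certificate rules all compare `probe_ssl_earliest_cert_expiry - time()` against a threshold in seconds. The sketch below shows the thresholds side by side. It is only an illustration: `cert_alert_severity` is a hypothetical helper that returns the most severe match, whereas in Prometheus every rule whose expression matches fires independently, so an expired certificate triggers all three.

```python
DAY = 86400  # seconds per day, as in the rule expressions


def cert_alert_severity(expiry_ts, now_ts):
    """Return the most severe matching level, or None if nothing matches."""
    remaining = expiry_ts - now_ts
    if remaining <= 0:
        return "critical (expired)"
    if remaining < DAY * 3:
        return "critical"
    if remaining < DAY * 30:
        return "warning"
    return None


now = 1_700_000_000  # arbitrary example timestamp
for days in (40, 10, 2, -1):
    print(days, cert_alert_severity(now + days * DAY, now))
```

Note that the comparison is strict, so a certificate with exactly 30 days left does not yet trigger the warning.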

3.1.4 Creating a systemd unit

cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
Type=simple
User=root
Group=root
Restart=on-failure
ExecStart=/opt/software/prometheus/prometheus/prometheus \
  --config.file=/opt/software/prometheus/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/software/prometheus/prometheus/data \
  --storage.tsdb.retention.time=60d \
  --web.enable-lifecycle

[Install]
WantedBy=multi-user.target
EOF

Reload systemd so that it picks up the new unit:

systemctl daemon-reload

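3.1.5 Starting Prometheus

# Start Prometheus
systemctl start prometheus
# Start Prometheus automatically at boot
systemctl enable prometheus

3.1.6 Opening the Prometheus web UI

Open http://192.168.25.41:9090/ in a browser.

[Screenshot: Prometheus web UI]

3.1.7 Checking that targets are loaded

[Screenshot: targets listed in the Prometheus web UI]

3.1.8 Checking that rules are loaded

[Screenshot: rules listed in the Prometheus web UI]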

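3.1.9 Notes

Note 1: curl -X POST http://192.168.25.41:9090/-/reload reloads the Prometheus configuration at runtime. This works because the unit above starts Prometheus with --web.enable-lifecycle.
Note 2: view the Prometheus logs: journalctl -u prometheus.service
Note 3: Prometheus admin UI: http://192.168.25.41:9090/
Note 4: validate the configuration file syntax: ./promtool check config prometheus.yml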
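
The reload in Note 1 can also be triggered from a script, for example after regenerating configuration files. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes Prometheus runs with --web.enable-lifecycle as configured above. The example at the bottom only builds the request and sends nothing.

```python
import urllib.request


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST that asks Prometheus to reload."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url: str, timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for non-2xx responses, for
    example when the lifecycle API is disabled or the config is invalid.
    """
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


request = build_reload_request("http://192.168.25.41:9090/")
print(request.get_method(), request.full_url)
```

Running `./promtool check config prometheus.yml` before calling `reload_prometheus` is a sensible precaution, since a failed reload leaves the previous configuration in effect.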

3.2 Installing Grafana

3.2.1 Installation steps

Grafana download page: https://grafana.com/grafana/download/11.5.1?platform=linux

# Download the package
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-11.5.1.linux-amd64.tar.gz
# Extract into the installation directory
tar -zxvf grafana-enterprise-11.5.1.linux-amd64.tar.gz -C /opt/software/prometheus/
# Rename the extracted directory
mv /opt/software/prometheus/grafana-v11.5.1 /opt/software/prometheus/grafana
3.2.2 Creating a systemd unit

cat > /etc/systemd/system/grafana-server.service << 'EOF'
[Unit]
Description=Grafana server
Documentation=http://docs.grafana.org

[Service]
Type=simple
User=root
Group=root
Restart=on-failure
ExecStart=/opt/software/prometheus/grafana/bin/grafana-server \
  --config=/opt/software/prometheus/grafana/conf/defaults.ini \
  --homepath=/opt/software/prometheus/grafana

[Install]
WantedBy=multi-user.target
EOF

Reload systemd so that it picks up the new unit:

systemctl daemon-reload

3.2.3 Starting Grafana

# Start Grafana
systemctl start grafana-server
# Start Grafana automatically at boot
systemctl enable grafana-server

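3.2.4 Opening the Grafana web UI

Open http://192.168.25.41:3000/ in a browser.
Default username/password: admin/admin
The default credentials are defined in /opt/software/prometheus/grafana/conf/defaults.ini

[Screenshot: Grafana login page]
Enter the username and password to log in.

3.2.5 Adding the Prometheus data source

[Screenshot: adding the Prometheus data source in Grafana]

Then keep clicking through the remaining steps to create the data source.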
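
The same data source could also be created without the UI through Grafana's HTTP API (POST /api/datasources). The sketch below only builds the JSON body for such a request. Sending it, along with authentication, is left out, and the helper name and the data source name "Prometheus" are assumptions made for illustration.

```python
import json


def prometheus_datasource(name: str, url: str, is_default: bool = True) -> dict:
    """Build the request body for creating a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",  # the Grafana backend queries Prometheus itself
        "isDefault": is_default,
    }


payload = prometheus_datasource("Prometheus", "http://192.168.25.41:9090")
print(json.dumps(payload, indent=2))
```

With `access` set to `proxy`, the browser never talks to Prometheus directly, so only the Grafana server needs network access to port 9090.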

3.2.6 Adding a dashboard (server monitoring)

The Node Exporter Full dashboard on grafana.com: https://grafana.com/grafana/dashboards/1860-node-exporter-full/

Direct download link: https://grafana.com/api/dashboards/1860/revisions/33/download
[Screenshot: importing the dashboard in Grafana]

Final result: