Prometheus 05-01: Alerting Rules and Alertmanager Configuration
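Related Documentation Links
Alerting rule design, Alertmanager configuration, and notification integration
Official documentation
- Alerting Overview - overview of the Prometheus alerting system
- Alerting Rules - alerting rule configuration documentation
- Alertmanager Configuration - complete Alertmanager configuration reference
- Notification Template Reference - notification template reference
GitHub projects
- Awesome Prometheus Alerts - collection of commonly used alerting rules
- Alertmanager - Alertmanager project home page
- Alert Rules Examples - official alerting rule examples
- Prometheus Operator Alerts - alerting rules for Kubernetes environments
Chinese-language resources and tutorials
- Prometheus Alerting System - alerting tutorial from the cloud native community
- Alertmanager Configuration Explained - detailed configuration guide in Chinese
- Alerting Rule Best Practices - best practices for alert configuration
- WeChat and DingTalk Alert Integration - notification integration for Chinese messaging platforms
Online tools and resources
- Alertmanager WebUI - documentation for the Alertmanager web interface
- PrometheusRule Validator - alerting rule validation tool
- Alert Rule Testing - unit testing for rules
- Grafana Alerting - Grafana's integrated alerting features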
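1. Alerting System Architecture Overview
1.1 Prometheus Alerting System Components
The Prometheus alerting system consists of the following core components: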
    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
    │                 │    │                 │    │                 │
    │   Prometheus    │───▶│  Alertmanager   │───▶│    Receivers    │
    │     Server      │    │                 │    │  (Email/Slack/  │
    │                 │    │                 │    │  WebHook etc.)  │
    └─────────────────┘    └─────────────────┘    └─────────────────┘
             │                      │                      │
             │                      │                      │
        ┌────▼────┐             ┌───▼───┐             ┌────▼────┐
        │  Alert  │             │ Route │             │ Silence │
        │  Rules  │             │ Tree  │             │ Manager │
        └─────────┘             └───────┘             └─────────┘
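Component responsibilities
- Prometheus Server:
  - Evaluates alerting rules
  - Generates alert instances
  - Sends alerts to Alertmanager
- Alertmanager:
  - Receives alerts from Prometheus
  - Groups, inhibits, and silences alerts
  - Routes alerts to different receivers
  - Manages the alert lifecycle
- Alert Rules:
  - Define the alert conditions
  - Configure alert labels and annotations
  - Set how long a condition must hold before the alert fires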
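The responsibilities above are wired together in the Prometheus configuration file. The fragment below is a minimal sketch, assuming the rules live in a file named alert_rules.yml next to prometheus.yml and that Alertmanager is reachable at localhost:9093 (9093 is its default port; the host and the evaluation interval are assumptions).

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 15s       # how often alerting rules are evaluated (assumed value)

rule_files:
  - "alert_rules.yml"            # alerting rules loaded by the Prometheus server

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"   # Alertmanager address (assumption)
```

Running promtool check config prometheus.yml validates this file together with the rule files it references.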
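1.2 Alerting Flow in Detail
Alert state transitions
    Inactive ──condition_met──▶ Pending ──for_duration──▶ Firing
        ▲                          │                         │
        │                          │                         │
        └────condition_not_met─────┴────condition_not_met────┘
- Inactive: the alert condition is not met
- Pending: the condition is met but has not yet held for the configured "for" duration
- Firing: the condition has held for at least the "for" duration; Prometheus sends the alert to Alertmanager, which starts sending notifications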
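The Pending to Firing transition can be checked with promtool's rule unit tests. The sketch below is illustrative only: the file names service_down.rules.yml and service_down.test.yml, the job name api, and the instance host1:9100 are all assumptions, and the rule is a cut-down up == 0 rule with for: 1m.

```yaml
# service_down.rules.yml (hypothetical file name)
groups:
  - name: state_transition_demo
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
```

```yaml
# service_down.test.yml (hypothetical file name)
rule_files:
  - service_down.rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The target is down from t=0 onwards.
      - series: 'up{job="api", instance="host1:9100"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      # t=0: the condition has only just become true, so the alert is Pending.
      # Only firing alerts are compared, so no exp_alerts are listed here.
      - eval_time: 0m
        alertname: ServiceDown
      # t=2m: the condition has held longer than "for: 1m", so the alert is Firing.
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
              instance: host1:9100
            exp_annotations:
              summary: "Service api is down"
```

Running promtool test rules service_down.test.yml evaluates the rule against the synthetic series and reports any mismatch between expected and actual firing alerts.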
2. Alerting Rule Configuration
2.1 Basic Alerting Rule Syntax
Rule file structure
    # alert_rules.yml
    groups:
      - name: "basic_system_alerts"
        rules:
          - alert: "HighCPUUsage"
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 2m
            labels:
              severity: warning
              team: infrastructure
            annotations:
              summary: "High CPU usage on instance {{ $labels.instance }}"
              description: "CPU usage on {{ $labels.instance }} is {{ $value }}%, above the 80% threshold"
      - name: "application_service_alerts"
        rules:
          - alert: "ServiceDown"
            expr: up == 0
            for: 1m
            labels:
              severity: critical
              team: sre
            annotations:
              summary: "Service {{ $labels.job }} is unavailable"
              description: "Service {{ $labels.job }} has stopped running on instance {{ $labels.instance }}"
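Rule configuration elements
    # Complete alerting rule example
    - alert: "AlertName"                 # Alert name (required)
      expr: prometheus_query_expression  # PromQL expression (required)
      for: 5m                            # Hold duration (optional, default 0)
      labels:                            # Alert labels (optional)
        severity: warning
        team: platform
        env: production
      annotations:                       # Alert annotations (optional)
        summary: "Short description"
        description: "Detailed description; supports template variables"
        runbook_url: "https://runbook.example.com/alerts/alert-name"
        dashboard_url: "https://grafana.example.com/dashboard"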
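Labels such as severity and team are what Alertmanager matches on when it groups, routes, and inhibits alerts. The fragment below is a minimal alertmanager.yml sketch: the receiver names and the webhook URL are placeholders, the timing values are illustrative rather than recommendations, and the matchers syntax assumes Alertmanager 0.22 or later.

```yaml
# alertmanager.yml (fragment; receiver names and URL are hypothetical)
route:
  receiver: default                  # fallback receiver
  group_by: ["alertname", "instance"]
  group_wait: 30s                    # wait before the first notification for a new group
  group_interval: 5m                 # wait before notifying about alerts added to a group
  repeat_interval: 4h                # wait before re-sending a still-firing notification
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall

receivers:
  - name: default
  - name: oncall
    webhook_configs:
      - url: "http://alert-hook.example.internal/notify"

inhibit_rules:
  # While a critical alert fires for an instance, mute warning alerts for that instance.
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: ["instance"]
```

Keeping the set of routing labels small and consistent across rule files makes a route tree like this easier to maintain.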
2.2 Common System Alerting Rules
CPU alerts
    groups:
      - name: "cpu_alerts"
        rules:
          # CPU usage alert
          - alert: "HighCPUUsage"
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)) > 80
            for: 2m
            labels:
              severity: warning
              category: system
            annotations:
              summary: "High CPU usage warning"
              description: "CPU usage on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"
          # CPU load alert
          # on(instance) is needed because node_load5 carries a job label
          # that the count by(instance) result does not have.
          - alert: "HighCPULoad"
            expr: node_load5 / on(instance) count by(instance) (node_cpu_seconds_total{mode="idle"}) > 1.5
            for: 5m
            labels:
              severity: warning
              category: system
            annotations:
              summary: "High CPU load warning"
              description: "5-minute load per CPU core on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}, above 1.5 times the core count"
          # Critical CPU usage alert
          - alert: "CriticalCPUUsage"
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)) > 95
            for: 1m
            labels:
              severity: critical
              category: system
            annotations:
              summary: "Critical CPU usage alert"
              description: "CPU usage on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 95%; immediate action required"
Memory alerts
    groups:
      - name: "memory_alerts"
        rules:
          # Memory usage alert
          - alert: "HighMemoryUsage"
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
            for: 2m
            labels:
              severity: warning
              category: system
            annotations:
              summary: "High memory usage warning"
              description: "Memory usage on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"
          # Low available memory alert
          - alert: "LowMemoryAvailable"
            expr: node_memory_MemAvailable_bytes / 1024 / 1024 / 1024 < 1
            for: 1m
            labels:
              severity: critical
              category: system
            annotations:
              summary: "Memory critically low"
              description: "Instance {{ $labels.instance }} has only {{ printf \"%.2f\" $value }} GiB of available memory"
          # Swap usage alert
          - alert: "HighSwapUsage"
            expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 50
            for: 5m
            labels:
              severity: warning
              category: system
            annotations:
              summary: "High swap usage warning"
              description: "Swap usage on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 50%"
Disk alerts
    groups:
      - name: "disk_alerts"
        rules:
          # Disk space usage alert
          - alert: "HighDiskUsage"
            expr: (1 - (node_filesystem_free_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 80
            for: 2m
            labels:
              severity: warning
              category: system
            annotations:
              summary: "High disk usage warning"
              description: "Disk usage of mount point {{ $labels.mountpoint }} on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"
          # Disk space nearly exhausted
          - alert: "DiskSpaceLow"
            expr: node_filesystem_free_bytes{fstype!="tmpfs"} / 1024 / 1024 / 1024 < 5
            for: 1m
            labels:
              severity: critical
              category: system
            annotations:
              summary: "Disk space critically low"
              description: "Mount point {{ $labels.mountpoint }} on instance {{ $labels.instance }} has only {{ printf \"%.2f\" $value }} GiB free"
          # Disk I/O alert
          # node_disk_io_time_seconds_total counts time spent doing I/O, so this
          # expression measures how busy the device is rather than request latency.
          - alert: "HighDiskIOLatency"
            expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
            for: 3m
            labels:
              severity: warning
              category: system
            annotations:
              summary: "High disk I/O utilization"
              description: "Disk {{ $labels.device }} on instance {{ $labels.instance }} was busy with I/O {{ printf \"%.2f\" $value }}% of the time, above 80%"
          # Disk fill prediction alert
          - alert: "DiskWillFillSoon"
            expr: predict_linear(node_filesystem_free_bytes{fstype!="tmpfs"}[1h], 4 * 3600) <= 0
            for: 5m
            labels:
              severity: warning
              category: system
            annotations:
              summary: "Disk space forecast warning"
              description: "At the current trend, mount point {{ $labels.mountpoint }} on instance {{ $labels.instance }} will run out of space within 4 hours"
Network alerts
    groups:
      - name: "network_alerts"
        rules:
          # Network interface status alert
          - alert: "NetworkInterfaceDown"
            expr: node_network_up{device!="lo"} == 0
            for: 1m
            labels:
              severity: warning
              category: network
            annotations:
              summary: "Network interface down"
              description: "Network interface {{ $labels.device }} on instance {{ $labels.instance }} is down"
          # High network error rate
          - alert: "HighNetworkErrors"
            expr: rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m]) > 10
            for: 2m
            labels:
              severity: warning
              category: network
            annotations:
              summary: "High network error rate"
              description: "Interface {{ $labels.device }} on instance {{ $labels.instance }} has {{ printf \"%.2f\" $value }} errors/sec"
          # High bandwidth usage
          # The threshold is an absolute 0.8 Gbit/s, which equals 80% only on a 1 Gbit/s link.
          - alert: "HighBandwidthUsage"
            expr: rate(node_network_transmit_bytes_total{device!="lo"}[5m]) * 8 / 1000 / 1000 / 1000 > 0.8
            for: 3m
            labels:
              severity: warning
              category: network
            annotations:
              summary: "High bandwidth usage"
              description: "Egress throughput of interface {{ $labels.device }} on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }} Gbps"
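The HighBandwidthUsage rule compares throughput with an absolute 0.8 Gbit/s value. Where node_exporter exposes the link speed as node_network_speed_bytes, utilization can instead be expressed relative to that speed. This is a sketch under assumptions: the metric is often missing or non-positive for virtual interfaces, and the group name and the alert name HighBandwidthUtilization are illustrative.

```yaml
groups:
  - name: "network_utilization_alerts"
    rules:
      - alert: "HighBandwidthUtilization"
        # Both sides are in bytes per second, so the ratio is a fraction of link capacity.
        # The "> 0" filter drops interfaces that report no usable speed.
        expr: rate(node_network_transmit_bytes_total{device!="lo"}[5m]) / (node_network_speed_bytes > 0) * 100 > 80
        for: 3m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "High bandwidth utilization"
          description: "Egress utilization of interface {{ $labels.device }} on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}% of link speed"
```

A relative threshold like this carries over between hosts with different link speeds, whereas the absolute form needs a separate threshold per link class.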
2.3 Application Service Alerting Rules
HTTP service alerts
    groups:
      - name: "http_service_alerts"
        rules:
          # Service unavailable
          - alert: "ServiceDown"
            expr: up == 0
            for: 1m
            labels:
              severity: critical
              category: service
            annotations:
              summary: "Service unavailable"
              description: "Instance {{ $labels.instance }} of service {{ $labels.job }} has stopped responding"
          # High error rate
          - alert: "HighErrorRate"
            expr: (sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)) * 100 > 5
            for: 2m
            labels:
              severity: warning
              category: application
            annotations:
              summary: "High HTTP error rate"
              description: "HTTP 5xx error rate of service {{ $labels.job }} is {{ printf \"%.2f\" $value }}%, above 5%"
          # High response time
          - alert: "HighResponseTime"
            expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)) > 1
            for: 3m
            labels:
              severity: warning
              category: application
            annotations:
              summary: "High response time"
              description: "P95 response time of service {{ $labels.job }} is {{ printf \"%.2f\" $value }}s, above 1 second"
          # Low QPS alert
          - alert: "LowRequestRate"
            expr: sum(rate(http_requests_total[5m])) by (job) < 10
            for: 5m
            labels:
              severity: info
              category: application
            annotations:
              summary: "Low request rate"
              description: "QPS of service {{ $labels.job }} is {{ printf \"%.2f\" $value }}, below the normal level"
          # Abnormal traffic spike
          - alert: "TrafficSpike"
            expr: sum by (job) (rate(http_requests_total[5m])) / avg_over_time(sum by (job) (rate(http_requests_total[5m]))[1h:5m]) > 3
            for: 2m
            labels:
              severity: warning
              category: application
            annotations:
              summary: "Abnormal traffic spike"
              description: "Current QPS of service {{ $labels.job }} is {{ printf \"%.2f\" $value }} times its average over the past hour"
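The TrafficSpike expression computes the same per-job request rate twice, once inside a subquery. One common way to make this cheaper and easier to read is to precompute the rate with a recording rule and alert on the recorded series. The sketch below assumes the recorded name job:http_requests:rate5m and the group name http_traffic_recording; both are chosen here for illustration and are not part of the original rules.

```yaml
groups:
  - name: "http_traffic_recording"
    rules:
      # Precompute the per-job request rate once per evaluation.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Compare the current rate with its own 1-hour average.
      - alert: "TrafficSpike"
        expr: job:http_requests:rate5m / avg_over_time(job:http_requests:rate5m[1h]) > 3
        for: 2m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "Abnormal traffic spike"
          description: "Current QPS of service {{ $labels.job }} is {{ printf \"%.2f\" $value }} times its 1-hour average"
```

Rules within a group are evaluated in order, so the alert sees the freshly recorded value. The recorded series only exists from the moment the rule is loaded, so the 1-hour average covers less history during the first hour.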
Database alerts
    groups:
      - name: "database_alerts"
        rules:
          # MySQL connection count alert
          - alert: "MySQLHighConnections"
            expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
            for: 2m
            labels:
              severity: warning
              category: database
            annotations:
              summary: "MySQL connection count too high"
              description: "Connection usage on MySQL instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"
          # MySQL slow query alert
          - alert: "MySQLSlowQueries"
            expr: rate(mysql_global_status_slow_queries[5m]) > 5
            for: 2m
            labels:
              severity: warning
              category: database
            annotations:
              summary: "Too many MySQL slow queries"
              description: "Slow query rate on MySQL instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }} queries/sec"
          # Redis memory usage alert
          # The "and" clause skips instances with no maxmemory limit (max bytes of 0),
          # where the division would otherwise yield +Inf and always fire.
          - alert: "RedisHighMemoryUsage"
            expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 80 and on(instance) redis_memory_max_bytes > 0
            for: 2m
            labels:
              severity: warning
              category: database
            annotations:
              summary: "Redis memory usage too high"
              description: "Memory usage on Redis instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"
          # Redis connection count alert
          - alert: "RedisHighConnections"
            expr: redis_connected_clients > 100
            for: 2m
            labels:
              severity: warning
              category: database
            annotations:
              summary: "Redis connection count too high"
              description: "Redis instance {{ $labels.instance }} currently has {{ $value }} connections, above 100"
2.4 Business Alerting Rules
Business metric alerts
    groups:
      - name: "business_alerts"
        rules:
          # Abnormal order volume
          # rate() returns a per-second value, so the threshold is 5 orders per second.
          - alert: "LowOrderRate"
            expr: rate(orders_total[10m]) < 5
            for: 5m
            labels:
              severity: warning
              category: business
            annotations:
              summary: "Order volume abnormally low"
              description: "Current order rate is {{ printf \"%.2f\" $value }} orders/sec, below the normal level"
          # Payment failure rate too high
          - alert: "HighPaymentFailureRate"
            expr: (sum(rate(payments_total{status="failed"}[5m])) / sum(rate(payments_total[5m]))) * 100 > 2
            for: 2m
            labels:
              severity: critical
              category: business
            annotations:
              summary: "Payment failure rate too high"
              description: "Payment failure rate is {{ printf \"%.2f\" $value }}%, above the 2% threshold"
          # User registration anomaly
          - alert: "UserRegistrationAnomaly"
            expr: rate(user_registrations_total[5m]) / avg_over_time(rate(user_registrations_total[5m])[1h:5m]) > 5 or rate(user_registrations_total[5m]) / avg_over_time(rate(user_registrations_total[5m])[1h:5m]) < 0.2
            for: 3m
            labels:
              severity: warning
              category: business
            annotations:
              summary: "User registration anomaly"
              description: "User registration rate is abnormal; the ratio of the current value to the 1-hour average is {{ printf \"%.2f\" $value }}"