Prometheus 05-01: Alerting Rules and Alertmanager Configuration

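Related Documentation Links

Alerting rule design, Alertmanager configuration, and notification integration

Official documentation

  • Alerting Overview - overview of the Prometheus alerting system
  • Alerting Rules - alerting rule configuration documentation
  • Alertmanager Configuration - complete Alertmanager configuration reference
  • Notification Template Reference - reference for notification templates

GitHub projects

  • Awesome Prometheus Alerts - collection of commonly used alerting rules
  • Alertmanager - Alertmanager project home page
  • Alert Rules Examples - official alerting rule examples
  • Prometheus Operator Alerts - alerting rules for Kubernetes environments

Chinese-language resources and tutorials

  • Prometheus Alerting System - alerting tutorial from the cloud native community
  • Alertmanager Configuration Explained - detailed configuration guide in Chinese
  • Alerting Rule Best Practices - best practices for alert configuration
  • WeChat and DingTalk Alert Integration - notification integration for Chinese messaging platforms

Online tools and resources

  • Alertmanager WebUI - documentation for the Alertmanager web interface
  • PrometheusRule Validator - alerting rule validation tool
  • Alert Rule Testing - unit testing for rules
  • Grafana Alerting - alerting integrated with Grafana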

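1. Alerting System Architecture Overview

1.1 Prometheus Alerting System Components

The Prometheus alerting system consists of the following core components: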

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│                 │    │                 │    │                 │
│   Prometheus    │───▶│  Alertmanager   │───▶│   Receivers     │
│     Server      │    │                 │    │ (Email/Slack/   │
│                 │    │                 │    │  WebHook etc.)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         │                       │                       │
    ┌────▼────┐              ┌───▼───┐              ┌────▼────┐
    │ Alert   │              │ Route │              │ Silence │
    │ Rules   │              │ Tree  │              │ Manager │
    └─────────┘              └───────┘              └─────────┘
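Component responsibilities
  1. Prometheus Server

    • Evaluates alerting rules
    • Generates alert instances
    • Sends alerts to Alertmanager
  2. Alertmanager

    • Receives alerts from Prometheus
    • Groups, inhibits, and silences alerts
    • Routes alerts to the appropriate receivers
    • Manages the alert lifecycle
  3. Alert Rules

    • Define the alert conditions
    • Configure alert labels and annotations
    • Set how long a condition must hold before the alert fires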

1.2 Alert Flow in Detail

Alert flow (flowchart):

Metrics Collection → Rule Evaluation → Alert Condition Met?
  No  → Continue Monitoring → Wait for Next Cycle
  Yes → Generate Alert → Send to Alertmanager → Grouping → Routing → Inhibition Check → Silence Check → Send Notification?
    Yes → Send to Receiver → Notification Sent
    No  → Suppress Alert
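
The inhibition and silence checks near the end of this flow can be sketched in a few lines of Python. This is a simplified model and not Alertmanager's implementation: it assumes equality-only matchers, and the names `matches` and `should_notify` as well as the dictionary layout of the rules are hypothetical.

```python
def matches(matchers, labels):
    """True if every matcher key/value pair is present in the alert's labels."""
    return all(labels.get(key) == value for key, value in matchers.items())


def should_notify(alert, silences, inhibit_rules, firing):
    """Decide whether a firing alert should reach a receiver.

    alert         -- label dict of the alert being checked
    silences      -- list of matcher dicts (hypothetical, equality only)
    inhibit_rules -- list of dicts with "source", "target" and "equal" keys
    firing        -- all currently firing alerts (label dicts)
    """
    # Inhibition: a firing source alert mutes matching target alerts
    # that share the same values for the labels listed in "equal".
    for rule in inhibit_rules:
        if not matches(rule["target"], alert):
            continue
        for other in firing:
            if other is alert:
                continue
            same = all(other.get(l) == alert.get(l) for l in rule["equal"])
            if matches(rule["source"], other) and same:
                return False
    # Silence: any silence whose matchers all match mutes the alert.
    if any(matches(silence, alert) for silence in silences):
        return False
    return True
```

With a rule whose source is `severity: critical`, whose target is `severity: warning`, and whose `equal` list is `["instance"]`, a warning on an instance that also has a critical alert is suppressed, while warnings on other instances still go out.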
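Alert state transitions
Inactive ──condition_met──▶ Pending ──for_duration──▶ Firing
▲                          │                        │
│                          │                        │
└──condition_not_met───────┴────condition_not_met──┘
  • Inactive: the alert condition is not met
  • Pending: the condition is met but has not yet held for the configured `for` duration
  • Firing: the condition has held for the full `for` duration, and notifications start being sent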
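
The three states can be modeled as a small state machine. The sketch below is a simplified illustration, not Prometheus code: the class `AlertState`, the helper `simulate`, and the fixed 60-second evaluation interval are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert instance across rule evaluations."""
    for_seconds: float                    # the rule's `for` duration
    active_since: Optional[float] = None  # when the condition first held

    def evaluate(self, now: float, condition_met: bool) -> str:
        """Return the state after one evaluation at time `now` (seconds)."""
        if not condition_met:
            self.active_since = None      # any miss resets to inactive
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


def simulate(for_seconds, samples, interval=60.0):
    """Evaluate a sequence of condition results at a fixed interval."""
    state = AlertState(for_seconds)
    return [state.evaluate(i * interval, met) for i, met in enumerate(samples)]


print(simulate(120, [False, True, True, True, False, True]))
# ['inactive', 'pending', 'pending', 'firing', 'inactive', 'pending']
```

Note that a single evaluation where the condition is not met sends the alert back to Inactive, so the `for` timer starts over the next time the condition holds.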

2. Alerting Rule Configuration

2.1 Basic Alerting Rule Syntax

Rule file structure
# alert_rules.yml
groups:
  - name: "Basic system alerts"
    rules:
      - alert: "High CPU usage"
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "CPU usage on instance {{ $labels.instance }} is too high"
          description: "CPU usage on {{ $labels.instance }} is {{ $value }}%, above the 80% threshold"
  - name: "Application service alerts"
    rules:
      - alert: "Service unavailable"
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Service {{ $labels.job }} is unavailable"
          description: "Service {{ $labels.job }} has stopped running on instance {{ $labels.instance }}"
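Rule configuration elements
# Complete alerting rule example
- alert: "AlertName"                    # alert name (required)
  expr: prometheus_query_expression     # PromQL expression (required)
  for: 5m                               # duration (optional, default 0)
  labels:                               # alert labels (optional)
    severity: warning
    team: platform
    env: production
  annotations:                          # alert annotations (optional)
    summary: "Short description"
    description: "Detailed description; template variables are supported"
    runbook_url: "https://runbook.example.com/alerts/alert-name"
    dashboard_url: "https://grafana.example.com/dashboard"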
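
The required and optional fields above can be checked before a rule file is deployed. The following is a minimal sketch that operates on an already parsed rule (a plain dict); the function name `check_rule` and the error messages are hypothetical, and the duration pattern is a simplification. For real validation, `promtool check rules` is the authoritative tool.

```python
import re

# Simplified duration pattern: "0", or one or more number+unit pairs such as 5m or 1h30m.
DURATION = re.compile(r"^(0|(\d+(ms|s|m|h|d|w|y))+)$")


def check_rule(rule):
    """Return a list of problems found in one alerting rule (empty if none)."""
    problems = []
    # `alert` and `expr` are the two required fields.
    for field in ("alert", "expr"):
        if not rule.get(field):
            problems.append(f"missing required field: {field}")
    # `for` is optional and defaults to 0.
    if not DURATION.match(str(rule.get("for", "0"))):
        problems.append(f"invalid duration for 'for': {rule.get('for')!r}")
    # `labels` and `annotations` are optional key/value mappings.
    for section in ("labels", "annotations"):
        if not isinstance(rule.get(section, {}), dict):
            problems.append(f"{section} must be a mapping")
    return problems


print(check_rule({"alert": "ServiceDown", "expr": "up == 0", "for": "1m"}))
# []
```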

2.2 Common System Alerting Rules

CPU alerts
groups:
  - name: "cpu_alerts"
    rules:
      # CPU usage alert
      - alert: "HighCPUUsage"
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)) > 80
        for: 2m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High CPU usage warning"
          description: "Instance {{ $labels.instance }} CPU usage is {{ printf \"%.2f\" $value }}%, above 80%"
      # CPU load alert
      # on(instance) is needed because node_load5 carries more labels (e.g. job)
      # than the per-instance core count on the right-hand side.
      - alert: "HighCPULoad"
        expr: node_load5 / on(instance) count by(instance) (node_cpu_seconds_total{mode="idle"}) > 1.5
        for: 5m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High CPU load warning"
          description: "Instance {{ $labels.instance }} 5-minute load is {{ printf \"%.2f\" $value }} times the CPU core count, above the 1.5 threshold"
      # Critical CPU usage alert
      - alert: "CriticalCPUUsage"
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)) > 95
        for: 1m
        labels:
          severity: critical
          category: system
        annotations:
          summary: "Critical CPU usage alert"
          description: "Instance {{ $labels.instance }} CPU usage is {{ printf \"%.2f\" $value }}%, above 95%; immediate action is required"
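
The arithmetic behind the CPU usage expression may be easier to follow with concrete numbers. The sketch below reproduces it for one instance from two samples of each core's idle counter, which is roughly what `irate` does with the last two samples in the window. The function name and the sample values are made up for illustration.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Approximate 100 - avg(irate(idle)) * 100 for one instance.

    idle_start / idle_end -- per-core idle counter values (seconds)
                             at the two most recent scrapes
    window_seconds        -- time between those two scrapes
    """
    # Per-core idle rate: fraction of each second the core was idle.
    idle_rates = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    # avg by(instance): average the idle fraction over all cores.
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Two cores scraped 15 s apart: idle time grew by 3 s and 1.5 s.
usage = cpu_usage_percent([100.0, 200.0], [103.0, 201.5], 15.0)
print(round(usage, 2), usage > 80)
```

In this example the cores were idle 20% and 10% of the time, so average usage is about 85% and the `> 80` condition holds.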
Memory alerts
groups:
  - name: "memory_alerts"
    rules:
      # Memory usage alert
      - alert: "HighMemoryUsage"
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 2m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High memory usage warning"
          description: "Instance {{ $labels.instance }} memory usage is {{ printf \"%.2f\" $value }}%, above 80%"
      # Low available memory alert
      - alert: "LowMemoryAvailable"
        expr: node_memory_MemAvailable_bytes / 1024 / 1024 / 1024 < 1
        for: 1m
        labels:
          severity: critical
          category: system
        annotations:
          summary: "Available memory critically low"
          description: "Instance {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}GB of available memory left"
      # Swap usage alert
      - alert: "HighSwapUsage"
        expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 50
        for: 5m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High swap usage warning"
          description: "Instance {{ $labels.instance }} swap usage is {{ printf \"%.2f\" $value }}%, above 50%"
Disk alerts
groups:
  - name: "disk_alerts"
    rules:
      # Disk space usage alert
      - alert: "HighDiskUsage"
        expr: (1 - (node_filesystem_free_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 80
        for: 2m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High disk usage warning"
          description: "Instance {{ $labels.instance }} mount point {{ $labels.mountpoint }} disk usage is {{ printf \"%.2f\" $value }}%, above 80%"
      # Disk space about to run out
      - alert: "DiskSpaceLow"
        expr: node_filesystem_free_bytes{fstype!="tmpfs"} / 1024 / 1024 / 1024 < 5
        for: 1m
        labels:
          severity: critical
          category: system
        annotations:
          summary: "Disk space critically low"
          description: "Instance {{ $labels.instance }} mount point {{ $labels.mountpoint }} has only {{ printf \"%.2f\" $value }}GB of free space left"
      # Disk I/O alert
      # Note: node_disk_io_time_seconds_total counts the time the device spent
      # doing I/O, so this expression measures utilization (busy time), not
      # per-request latency.
      - alert: "HighDiskIOLatency"
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 3m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High disk I/O utilization"
          description: "Instance {{ $labels.instance }} disk {{ $labels.device }} was busy with I/O {{ printf \"%.2f\" $value }}% of the time, above 80%"
      # Disk fill prediction alert
      - alert: "DiskWillFillSoon"
        expr: predict_linear(node_filesystem_free_bytes{fstype!="tmpfs"}[1h], 4 * 3600) <= 0
        for: 5m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "Disk space forecast warning"
          description: "At the current trend, instance {{ $labels.instance }} mount point {{ $labels.mountpoint }} will run out of space within 4 hours"
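
The `DiskWillFillSoon` rule relies on `predict_linear`, which fits a straight line to the samples in the range by least squares and extrapolates it. The sketch below illustrates that idea only. It assumes the evaluation time equals the timestamp of the last sample, and the sample data is invented for the example.

```python
def predict_linear(samples, seconds_ahead):
    """Extrapolate a least-squares line fitted to (timestamp, value) samples.

    Assumption for this sketch: the prediction starts from the timestamp
    of the last sample.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    t_mean = sum(t for t, _ in samples) / n
    v_mean = sum(v for _, v in samples) / n
    variance = sum((t - t_mean) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must have distinct timestamps")
    covariance = sum((t - t_mean) * (v - v_mean) for t, v in samples)
    slope = covariance / variance            # bytes per second
    intercept = v_mean - slope * t_mean
    t_eval = samples[-1][0]
    return slope * (t_eval + seconds_ahead) + intercept


# Free space drops by 1 GB every 10 minutes over the last 30 minutes.
free_bytes = [(0, 10e9), (600, 9e9), (1200, 8e9), (1800, 7e9)]
predicted = predict_linear(free_bytes, 4 * 3600)
print(predicted <= 0)  # True: the rule's condition would hold
```

With 7 GB free and a loss of 1 GB per 10 minutes, the line crosses zero after about 70 minutes, well inside the 4-hour horizon.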
Network alerts
groups:
  - name: "network_alerts"
    rules:
      # Network interface status alert
      - alert: "NetworkInterfaceDown"
        expr: node_network_up{device!="lo"} == 0
        for: 1m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "Network interface down"
          description: "Instance {{ $labels.instance }} network interface {{ $labels.device }} is down"
      # High network error rate
      - alert: "HighNetworkErrors"
        expr: rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "High network error rate"
          description: "Instance {{ $labels.instance }} interface {{ $labels.device }} network error rate is {{ printf \"%.2f\" $value }} errors/sec"
      # High bandwidth usage
      # The expression converts bytes/sec to Gbps; the 0.8 threshold corresponds
      # to 80% utilization only if the link capacity is 1 Gbps.
      - alert: "HighBandwidthUsage"
        expr: rate(node_network_transmit_bytes_total{device!="lo"}[5m]) * 8 / 1000 / 1000 / 1000 > 0.8
        for: 3m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "High bandwidth usage"
          description: "Instance {{ $labels.instance }} interface {{ $labels.device }} egress throughput is {{ printf \"%.2f\" $value }}Gbps"
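
The unit conversion in the bandwidth rule is worth spelling out, because the 0.8 threshold is an absolute throughput in Gbps rather than a percentage. The sketch below makes the link capacity explicit; the function names and the 1 Gbps default are assumptions for illustration.

```python
def egress_gbps(bytes_per_second):
    """Convert a byte rate to gigabits per second (decimal units)."""
    return bytes_per_second * 8 / 1000 / 1000 / 1000


def link_utilization(bytes_per_second, link_gbps=1.0):
    """Fraction of the link capacity in use (link_gbps is an assumption)."""
    return egress_gbps(bytes_per_second) / link_gbps


# 100 MB/s of egress traffic is 0.8 Gbps: 80% of a 1 Gbps link,
# but only 8% of a 10 Gbps link.
print(egress_gbps(100_000_000))
print(link_utilization(100_000_000, link_gbps=10.0))
```

On hosts with faster or slower interfaces, the threshold in the rule would need to be scaled to the actual link speed.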

2.3 Application Service Alerting Rules

HTTP service alerts
groups:
  - name: "http_service_alerts"
    rules:
      # Service unavailable
      - alert: "ServiceDown"
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          category: service
        annotations:
          summary: "Service unavailable"
          description: "Service {{ $labels.job }} instance {{ $labels.instance }} has stopped responding"
      # High error rate
      - alert: "HighErrorRate"
        expr: (sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)) * 100 > 5
        for: 2m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "High HTTP error rate"
          description: "Service {{ $labels.job }} HTTP 5xx error rate is {{ printf \"%.2f\" $value }}%, above 5%"
      # High response time
      - alert: "HighResponseTime"
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)) > 1
        for: 3m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "High response time"
          description: "Service {{ $labels.job }} P95 response time is {{ printf \"%.2f\" $value }}s, above 1 second"
      # Low QPS alert
      - alert: "LowRequestRate"
        expr: sum(rate(http_requests_total[5m])) by (job) < 10
        for: 5m
        labels:
          severity: info
          category: application
        annotations:
          summary: "Low request rate"
          description: "Service {{ $labels.job }} QPS is {{ printf \"%.2f\" $value }}, below the normal level"
      # Abnormal traffic spike
      # The baseline is a subquery: the per-job request rate sampled every
      # 5 minutes over the past hour, then averaged.
      - alert: "TrafficSpike"
        expr: sum by (job) (rate(http_requests_total[5m])) / avg_over_time((sum by (job) (rate(http_requests_total[5m])))[1h:5m]) > 3
        for: 2m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "Abnormal traffic spike"
          description: "Service {{ $labels.job }} current QPS is {{ printf \"%.2f\" $value }} times the average of the past hour"
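
The `HighResponseTime` rule depends on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below shows that interpolation for classic histograms under simplifying assumptions: the lowest bucket starts at 0, and the bucket values are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets -- list of (upper_bound, cumulative_count), sorted by
               upper_bound, ending with the +Inf bucket
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Request durations: 50 under 0.1 s, 90 under 0.5 s, 98 under 1 s, 100 total.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(round(p95, 4), p95 > 1)
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries surround the threshold being alerted on (here, 1 second).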
Database alerts
groups:
  - name: "database_alerts"
    rules:
      # MySQL connection count alert
      - alert: "MySQLHighConnections"
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "MySQL connection count too high"
          description: "MySQL instance {{ $labels.instance }} connection usage is {{ printf \"%.2f\" $value }}%, above 80%"
      # MySQL slow query alert
      - alert: "MySQLSlowQueries"
        expr: rate(mysql_global_status_slow_queries[5m]) > 5
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "Too many MySQL slow queries"
          description: "MySQL instance {{ $labels.instance }} slow query rate is {{ printf \"%.2f\" $value }} queries/sec"
      # Redis memory usage alert
      # If maxmemory is not configured, redis_memory_max_bytes is 0 and the
      # ratio becomes +Inf, so this rule only makes sense when a limit is set.
      - alert: "RedisHighMemoryUsage"
        expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 80
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "Redis memory usage too high"
          description: "Redis instance {{ $labels.instance }} memory usage is {{ printf \"%.2f\" $value }}%, above 80%"
      # Redis connection count alert
      - alert: "RedisHighConnections"
        expr: redis_connected_clients > 100
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "Redis connection count too high"
          description: "Redis instance {{ $labels.instance }} has {{ $value }} connections, above 100"

2.4 Business Alerting Rules

Business metric alerts
groups:
  - name: "business_alerts"
    rules:
      # Abnormal order volume
      # rate() returns a per-second value, so the threshold and the
      # description are both in orders per second.
      - alert: "LowOrderRate"
        expr: rate(orders_total[10m]) < 5
        for: 5m
        labels:
          severity: warning
          category: business
        annotations:
          summary: "Order volume abnormally low"
          description: "Current order rate is {{ printf \"%.2f\" $value }} orders/sec, below the normal level"
      # Payment failure rate too high
      - alert: "HighPaymentFailureRate"
        expr: (sum(rate(payments_total{status="failed"}[5m])) / sum(rate(payments_total[5m]))) * 100 > 2
        for: 2m
        labels:
          severity: critical
          category: business
        annotations:
          summary: "Payment failure rate too high"
          description: "Payment failure rate is {{ printf \"%.2f\" $value }}%, above the 2% threshold"
      # Abnormal user registrations
      - alert: "UserRegistrationAnomaly"
        expr: rate(user_registrations_total[5m]) / avg_over_time(rate(user_registrations_total[5m])[1h:5m]) > 5 or rate(user_registrations_total[5m]) / avg_over_time(rate(user_registrations_total[5m])[1h:5m]) < 0.2
        for: 3m
        labels:
          severity: warning
          category: business
        annotations:
          summary: "Abnormal user registrations"
          description: "User registration rate is abnormal; the ratio of the current value to the historical average is {{ printf \"%.2f\" $value }}"
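
The `UserRegistrationAnomaly` rule compares the current rate with an hourly baseline and fires when the ratio leaves the band between 0.2 and 5. A minimal sketch of that check follows; the function name and sample values are hypothetical, and the zero-baseline handling mirrors floating-point division (a positive rate over zero gives infinity, zero over zero gives NaN, which never satisfies a comparison).

```python
import math


def rate_ratio_anomaly(current_rate, baseline_rates, high=5.0, low=0.2):
    """Return (fired, ratio) for current rate versus the baseline average."""
    baseline = sum(baseline_rates) / len(baseline_rates)
    if baseline == 0:
        ratio = math.inf if current_rate > 0 else math.nan
    else:
        ratio = current_rate / baseline
    # NaN compares False on both sides, so no alert fires in that case.
    fired = ratio > high or ratio < low
    return fired, ratio


# Baseline of 2 registrations/sec over the past hour.
print(rate_ratio_anomaly(12.0, [2.0, 2.0, 2.0]))  # spike: ratio 6.0
print(rate_ratio_anomaly(0.3, [2.0, 2.0, 2.0]))   # drop: ratio 0.15
print(rate_ratio_anomaly(2.0, [2.0, 2.0, 2.0]))   # normal: ratio 1.0
```

A ratio-based rule like this adapts to the service's own traffic level, but it also fires on small absolute changes when the baseline is close to zero.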