Prometheus 05-02: Alerting Rules and Alertmanager Configuration
## 3. Alertmanager Configuration

### 3.1 Basic Alertmanager configuration

#### Main configuration file structure

    # alertmanager.yml
    global:
      # Global SMTP settings
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager@example.com'
      smtp_auth_password: 'password'
      smtp_hello: 'alertmanager.example.com'

    # The default receiver is the `receiver` of the root route;
    # `global` has no `default_receiver` key.

    # Notification templates (a top-level key, not part of `global`)
    templates:
      - '/etc/alertmanager/templates/*.tmpl'

    # Routing configuration
    route:
      # Root route
      receiver: 'default-receiver'
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      # Child routes
      routes:
        # Critical alerts
        - match:
            severity: critical
          receiver: 'critical-receiver'
          group_wait: 5s
          repeat_interval: 5m
        # Infrastructure alerts
        - match:
            team: infrastructure
          receiver: 'infra-team'
          group_by: ['instance']
        # Application alerts
        - match:
            category: application
          receiver: 'app-team'
          group_by: ['job']

    # Inhibition rules
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'instance']

    # Receivers
    receivers:
      - name: 'default-receiver'
        email_configs:
          - to: 'admin@example.com'
            # email_configs has no `subject` or `body` field: the subject is
            # set as a header, and the plain-text body goes in `text`.
            headers:
              Subject: '[Alert] {{ .GroupLabels.alertname }}'
            text: |
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Details: {{ .Annotations.description }}
              Time: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
              {{ end }}
      - name: 'critical-receiver'
        email_configs:
          - to: 'oncall@example.com'
            headers:
              Subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#alerts'
            title: 'Critical Alert'
            text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
      - name: 'infra-team'
        email_configs:
          - to: 'infra-team@example.com'
      - name: 'app-team'
        email_configs:
          - to: 'app-team@example.com'

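
The effect of `group_by` on the root route can be illustrated with a small Python sketch. It is a simplified model rather than Alertmanager's implementation: alerts are plain label dictionaries, and `group_alerts` is a hypothetical helper introduced only for this illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Partition alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # A missing label is treated as an empty value in the group key.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighCPU", "cluster": "c1", "service": "api", "instance": "a"},
    {"alertname": "HighCPU", "cluster": "c1", "service": "api", "instance": "b"},
    {"alertname": "HighCPU", "cluster": "c2", "service": "api", "instance": "c"},
]
groups = group_alerts(alerts, ["alertname", "cluster", "service"])
for key, members in groups.items():
    print(key, [alert["instance"] for alert in members])
```

Each group produces one notification, so the two `c1` alerts arrive together while the `c2` alert is notified separately. `group_wait`, `group_interval` and `repeat_interval` then control when each group is sent and re-sent.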
### 3.2 Advanced routing configuration

#### Complex routing rules

    route:
      receiver: 'default'
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 12h
      routes:
        # Route by environment
        - match:
            env: production
          receiver: 'prod-team'
          group_wait: 5s
          repeat_interval: 1h
          routes:
            # Critical alerts in production
            - match:
                severity: critical
              receiver: 'prod-oncall'
              group_wait: 0s
              repeat_interval: 5m
            # Database alerts in production
            - match:
                category: database
              receiver: 'dba-team'
        # Test environment alerts
        - match:
            env: test
          receiver: 'dev-team'
          repeat_interval: 4h
        # Route by service type
        - match_re:
            service: '(frontend|backend|api).*'
          receiver: 'app-team'
          group_by: ['service', 'severity']
        # Route by team
        - match:
            team: security
          receiver: 'security-team'
          group_by: ['alertname']
        # Business alerts
        - match:
            category: business
          receiver: 'business-team'
          group_wait: 1m
          group_interval: 10m
          repeat_interval: 30m
        # Non-critical alerts that should be muted on weekends
        - match:
            severity: warning
          receiver: 'weekend-receiver'
          # Time-based filtering is not configured here. Older Alertmanager
          # versions need an external tool for it; newer versions can attach
          # `mute_time_intervals` to a route.

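
How an alert travels through a nested routing tree is easier to see in code. The following Python sketch models the traversal for equality matchers only; `matches` and `resolve` are hypothetical helper names, and the tree is a shortened copy of the configuration in this section.

```python
def matches(route, labels):
    """A route matches when every label in its `match` block is equal."""
    return all(labels.get(k) == v for k, v in route.get("match", {}).items())


def resolve(route, labels):
    """Return the receivers that an alert with `labels` is delivered to."""
    if not matches(route, labels):
        return []
    receivers = []
    for child in route.get("routes", []):
        found = resolve(child, labels)
        receivers.extend(found)
        # Stop at the first matching child unless it sets continue: true.
        if found and not child.get("continue", False):
            break
    # No child matched: the alert is handled by this node itself.
    return receivers or [route["receiver"]]


tree = {
    "receiver": "default",
    "routes": [
        {"match": {"env": "production"}, "receiver": "prod-team", "routes": [
            {"match": {"severity": "critical"}, "receiver": "prod-oncall"},
            {"match": {"category": "database"}, "receiver": "dba-team"},
        ]},
        {"match": {"env": "test"}, "receiver": "dev-team"},
        {"match": {"severity": "warning"}, "receiver": "weekend-receiver"},
    ],
}

print(resolve(tree, {"env": "production", "severity": "critical", "category": "database"}))
print(resolve(tree, {"env": "production", "severity": "warning"}))
print(resolve(tree, {"env": "staging", "severity": "warning"}))
```

A critical database alert in production reaches only `prod-oncall`, because sibling routes are tried in order and the first match wins. The order of child routes is therefore part of the routing logic, and a production warning never reaches `weekend-receiver` because the `env: production` branch claims it first.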
#### Dynamic routing configuration

    # Dynamic routing with regular expressions
    route:
      routes:
        # Route by instance IP range
        - match_re:
            instance: '192\.168\.1\..*'
          receiver: 'datacenter-a'
        - match_re:
            instance: '192\.168\.2\..*'
          receiver: 'datacenter-b'
        # Route by alert name pattern
        - match_re:
            alertname: '.*Database.*'
          receiver: 'dba-team'
        - match_re:
            alertname: '.*Network.*'
          receiver: 'network-team'
        # Route by label combination
        - match:
            team: frontend
            severity: critical
          receiver: 'frontend-oncall'
          continue: true  # keep matching the following sibling routes
        - match:
            team: frontend
          receiver: 'frontend-team'

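
Two details of `match_re` are easy to get wrong: the pattern has to match the whole label value, and an unescaped dot matches any character. The Python sketch below models this with `re.fullmatch`. Alertmanager uses Go's regular expression engine, so this is an approximation that holds for simple patterns like the ones above; `match_re` here is a hypothetical helper.

```python
import re


def match_re(pattern, value):
    """Model of match_re: the pattern must match the entire label value."""
    return re.fullmatch(pattern, value) is not None


escaped = r"192\.168\.1\..*"
unescaped = "192.168.1..*"

print(match_re(escaped, "192.168.1.10:9100"))
print(match_re(escaped, "192.168.10.5:9100"))
print(match_re(unescaped, "192.168.10.5:9100"))
print(match_re("Database", "MySQLDatabaseDown"))
print(match_re(".*Database.*", "MySQLDatabaseDown"))
```

The escaped pattern accepts only the `192.168.1.x` range, while the unescaped variant also accepts `192.168.10.5`. The last two calls show why the alert name patterns need the surrounding `.*`: a bare `Database` does not match a longer alert name.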
### 3.3 Alert inhibition and silencing

#### Inhibition rule configuration

    inhibit_rules:
      # When an instance is down, inhibit the other alerts for that instance
      - source_match:
          alertname: 'InstanceDown'
        target_match_re:
          alertname: '(HighCPU|HighMemory|DiskSpaceLow).*'
        equal: ['instance']
      # A critical alert inhibits the warning alert of the same type
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'instance']
      # During a database failover, inhibit the related alerts
      - source_match:
          alertname: 'DatabaseFailover'
        target_match_re:
          alertname: '(DatabaseConnection|DatabaseSlow).*'
        equal: ['cluster']
      # During a network outage, inhibit application-level alerts
      - source_match:
          category: 'network'
          severity: 'critical'
        target_match:
          category: 'application'
        equal: ['instance']

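
The three parts of an inhibition rule work together: a target alert is muted only while some firing alert matches the source side and both alerts agree on every label listed in `equal`. A minimal Python sketch of that logic, assuming equality matchers only (`has_labels` and `is_inhibited` are hypothetical names, and regular expression matchers are left out):

```python
def has_labels(labels, wanted):
    """True when every wanted label is present with the same value."""
    return all(labels.get(k) == v for k, v in wanted.items())


def is_inhibited(target, firing, rule):
    """Return True if `target` is muted by `rule`, given the firing alerts."""
    if not has_labels(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not has_labels(source, rule["source_match"]):
            continue
        # Every label listed in `equal` must have the same value on both sides.
        if all(source.get(k, "") == target.get(k, "") for k in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}
critical = {"alertname": "DiskUsage", "instance": "web-1", "severity": "critical"}
warning_same = {"alertname": "DiskUsage", "instance": "web-1", "severity": "warning"}
warning_other = {"alertname": "DiskUsage", "instance": "web-2", "severity": "warning"}
firing = [critical, warning_same, warning_other]

for alert in firing:
    print(alert["instance"], alert["severity"], is_inhibited(alert, firing, rule))
```

Only the warning on `web-1` is muted. The warning on `web-2` keeps notifying because no critical alert shares its `instance` value.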
#### Silence management

    # Create a silence through the API.
    # These calls use API v2; the v1 endpoints were removed in Alertmanager 0.27.
    curl -XPOST http://alertmanager:9093/api/v2/silences \
      -H "Content-Type: application/json" \
      -d '{
        "matchers": [
          {
            "name": "alertname",
            "value": "HighCPUUsage",
            "isRegex": false
          },
          {
            "name": "instance",
            "value": "server-01",
            "isRegex": false
          }
        ],
        "startsAt": "2023-01-01T00:00:00.000Z",
        "endsAt": "2023-01-01T02:00:00.000Z",
        "createdBy": "admin",
        "comment": "Silence CPU alerts during maintenance"
      }'

    # List all silences
    curl http://alertmanager:9093/api/v2/silences

    # Delete a silence
    curl -XDELETE http://alertmanager:9093/api/v2/silence/SILENCE_ID

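
Writing the timestamps by hand is error-prone, so the request body is often generated. The sketch below builds the same JSON document in Python; `build_silence` is a hypothetical helper, it only handles equality matchers, and sending the request is left to whichever HTTP client is in use.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(matchers, start, duration, created_by, comment):
    """Build the JSON body for a silence covering [start, start + duration]."""
    def rfc3339(moment):
        return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")

    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": rfc3339(start),
        "endsAt": rfc3339(start + duration),
        "createdBy": created_by,
        "comment": comment,
    }


silence = build_silence(
    {"alertname": "HighCPUUsage", "instance": "server-01"},
    start=datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc),
    duration=timedelta(hours=2),
    created_by="admin",
    comment="Silence CPU alerts during maintenance",
)
print(json.dumps(silence, indent=2))
```

Passing a timezone-aware `start` matters: the helper converts it to UTC before formatting, so a maintenance window given in local time still produces the correct `startsAt` and `endsAt`.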
### 3.4 Notification receiver configuration

#### Email notifications

    receivers:
      - name: 'email-receiver'
        email_configs:
          - to: 'alerts@example.com'
            from: 'alertmanager@example.com'
            headers:
              Subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
            html: |
              <h2>Alert Notification</h2>
              {{ range .Alerts }}
              <h3>{{ .Annotations.summary }}</h3>
              <p>Description: {{ .Annotations.description }}</p>
              <p>Severity: {{ .Labels.severity }}</p>
              <p>Instance: {{ .Labels.instance }}</p>
              <p>Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
              {{ if .Annotations.runbook_url }}<p><a href="{{ .Annotations.runbook_url }}">Runbook</a></p>{{ end }}
              {{ end }}
      - name: 'simple-email'
        email_configs:
          - to: 'team@example.com'
            headers:
              Subject: 'Alert: {{ .GroupLabels.alertname }}'
              From: 'alertmanager@example.com'
              To: 'team@example.com'
            text: |
              Alert group: {{ .GroupLabels.alertname }}
              Status: {{ .Status }}
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              Instance: {{ .Labels.instance }}
              Severity: {{ .Labels.severity }}
              Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
              {{ if ne .Status "resolved" }}Current value: {{ .Annotations.value }}{{ end }}
              {{ end }}

#### Slack notifications

    receivers:
      - name: 'slack-receiver'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#alerts'
            username: 'Alertmanager'
            icon_emoji: ':bell:'
            title: '{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} {{ .GroupLabels.alertname }}'
            title_link: 'http://alertmanager.example.com'
            text: |
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              *Severity:* {{ .Labels.severity }}
              *Instance:* {{ .Labels.instance }}
              *Time:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
              {{ if .Annotations.runbook_url }}*Runbook:* <{{ .Annotations.runbook_url }}|View>{{ end }}
              {{ end }}
            color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
            # Alternative: choose the color by severity
            # color: |
            #   {{ if eq .Status "firing" }}
            #   {{ if eq .CommonLabels.severity "critical" }}danger
            #   {{ else if eq .CommonLabels.severity "warning" }}warning
            #   {{ else }}good{{ end }}
            #   {{ else }}good{{ end }}
      - name: 'slack-critical'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/CRITICAL/WEBHOOK'
            channel: '#critical-alerts'
            username: 'Critical Alert'
            icon_emoji: ':rotating_light:'
            title: '🚨 CRITICAL ALERT 🚨'
            text: |
              <!channel> Critical alert, immediate action required!
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              *Instance:* {{ .Labels.instance }}
              *Started at:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
              {{ end }}
            color: 'danger'
            actions:
              - type: button
                text: 'View details'
                url: 'http://alertmanager.example.com'
              - type: button
                text: 'Silence alert'
                url: 'http://alertmanager.example.com/#/silences/new'

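#### DingTalk notifications

    receivers:
      - name: 'dingtalk-receiver'
        webhook_configs:
          - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_ACCESS_TOKEN'
            send_resolved: true
            http_config:
              proxy_url: 'http://proxyserver:port'  # only if a proxy is required
            max_alerts: 5  # include at most 5 alerts per notification

`webhook_configs` accepts no `title` or `message` field, and Alertmanager always posts its own JSON payload. The DingTalk robot message therefore has to be rendered by a service that translates that payload, and in practice the `url` points at that service rather than directly at DingTalk. The message format below is written as a template for such a translator; it assumes the translator supplies `add` and `sub` helpers, which Alertmanager's own template engine does not provide.

    # DingTalk robot message format
    title: '{{ if eq .Status "firing" }}Alert{{ else }}Resolved{{ end }}: {{ .GroupLabels.alertname }}'
    message: |
      {{ if eq .Status "firing" }}
      🔥 Alert notification
      {{ else }}
      ✅ Alert resolved
      {{ end }}
      Alert group: {{ .GroupLabels.alertname }}
      Status: {{ .Status }}
      Alert count: {{ len .Alerts }}
      {{ range $index, $alert := .Alerts }}
      {{ if lt $index 5 }}
      Alert {{ add $index 1 }}:
      - Summary: {{ $alert.Annotations.summary }}
      - Description: {{ $alert.Annotations.description }}
      - Severity: {{ $alert.Labels.severity }}
      - Instance: {{ $alert.Labels.instance }}
      - Time: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
      {{ if $alert.Annotations.runbook_url }}- Runbook: [View runbook]({{ $alert.Annotations.runbook_url }})
      {{ end }}
      {{ end }}
      {{ end }}
      {{ if gt (len .Alerts) 5 }}
      {{ sub (len .Alerts) 5 }} more alerts, see Alertmanager for details
      {{ end }}

    # Custom DingTalk webhook handler (requires additional development);
    # this entry continues the receivers list above.
      - name: 'dingtalk-custom'
        webhook_configs:
          - url: 'http://your-webhook-handler.com/dingtalk'
            send_resolved: true
            http_config:
              basic_auth:
                username: 'webhook_user'
                password: 'webhook_pass'
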
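
The custom handler mentioned above mostly consists of one conversion step: turning Alertmanager's webhook payload into a DingTalk robot message. The Python sketch below shows that step as a pure function. It is one possible shape for such a handler, not an official integration; `to_dingtalk` is a hypothetical name, and the HTTP server and the call to the DingTalk API are left out.

```python
def to_dingtalk(payload, max_alerts=5):
    """Convert an Alertmanager webhook payload into a DingTalk markdown message."""
    firing = payload["status"] == "firing"
    name = payload.get("groupLabels", {}).get("alertname", "")
    title = f"{'Alert' if firing else 'Resolved'}: {name}"
    alerts = payload.get("alerts", [])
    lines = [f"### {title}", f"Alert count: {len(alerts)}"]
    # List only the first max_alerts alerts in detail.
    for number, alert in enumerate(alerts[:max_alerts], start=1):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            f"- Alert {number}: {annotations.get('summary', '')} "
            f"({labels.get('severity', '')}, {labels.get('instance', '')})"
        )
    hidden = len(alerts) - max_alerts
    if hidden > 0:
        lines.append(f"{hidden} more alert(s), see Alertmanager for details")
    return {"msgtype": "markdown", "markdown": {"title": title, "text": "\n".join(lines)}}


payload = {
    "status": "firing",
    "groupLabels": {"alertname": "HighCPUUsage"},
    "alerts": [
        {"labels": {"severity": "warning", "instance": f"server-{i:02d}"},
         "annotations": {"summary": "High CPU usage"}}
        for i in range(1, 7)
    ],
}
message = to_dingtalk(payload)
print(message["markdown"]["text"])
```

Because the numbering and the "more alerts" line are computed in Python, the handler does not depend on arithmetic helpers in the template engine.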
#### WeChat notifications

```yaml
receivers:
  - name: 'wechat-receiver'
    wechat_configs:
      - corp_id: 'YOUR_CORP_ID'
        agent_id: 1000001
        api_secret: 'YOUR_API_SECRET'
        to_user: '@all'  # or specific user IDs
        to_party: '1'    # department ID
        to_tag: '1'      # tag ID
        message: |
          {{ if eq .Status "firing" }}Alert{{ else }}Resolved{{ end }}: {{ .GroupLabels.alertname }}
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Severity: {{ .Labels.severity }}
          Instance: {{ .Labels.instance }}
          Time: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
          {{ end }}
  # WeCom (enterprise WeChat) group robot
  - name: 'wechat-robot'
    webhook_configs:
      - url: 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_BOT_KEY'
        send_resolved: true
        # webhook_configs accepts no `title` or `message` field. The group robot
        # expects a body of the following form, which a translating service has
        # to build from Alertmanager's JSON payload:
        #   {
        #     "msgtype": "markdown",
        #     "markdown": {
        #       "content": "## 🔔 Alert notification\n**Alert:** <summary>\n**Description:** <description>\n**Severity:** <font color=\"red\">critical</font>\n**Instance:** <instance>\n**Time:** <start time>\n---\n"
        #     }
        #   }
        # Heading: "🔔 Alert notification" while firing, "✅ Alert resolved" otherwise.
        # Severity color: red for critical, orange for warning, blue otherwise.
```
#### Webhook notifications

    receivers:
      - name: 'webhook-receiver'
        webhook_configs:
          - url: 'http://webhook-handler.example.com/alerts'
            send_resolved: true
            http_config:
              bearer_token: 'YOUR_BEARER_TOKEN'
              # Or use basic authentication
              # basic_auth:
              #   username: 'webhook_user'
              #   password: 'webhook_pass'
              # TLS settings
              tls_config:
                insecure_skip_verify: false
                ca_file: '/etc/ssl/certs/ca.pem'
                cert_file: '/etc/ssl/certs/client.pem'
                key_file: '/etc/ssl/private/client.key'
            max_alerts: 0  # 0 means no limit
      - name: 'custom-webhook'
        webhook_configs:
          - url: 'http://custom-handler.example.com/prometheus-alerts'
            send_resolved: true

`webhook_configs` has no `title` or `message` field, so the request body cannot be templated in `alertmanager.yml`. Alertmanager always posts a JSON document that includes the following fields:

    {
      "receiver": "<receiver name>",
      "status": "<firing|resolved>",
      "alerts": [
        {
          "status": "<firing|resolved>",
          "labels": {"<name>": "<value>"},
          "annotations": {"<name>": "<value>"},
          "startsAt": "<RFC 3339 timestamp>",
          "endsAt": "<RFC 3339 timestamp>",
          "generatorURL": "<URL>"
        }
      ],
      "groupLabels": {"<name>": "<value>"},
      "commonLabels": {"<name>": "<value>"},
      "commonAnnotations": {"<name>": "<value>"},
      "externalURL": "<Alertmanager URL>"
    }

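
On the receiving side, a webhook handler starts by parsing that JSON document. The Python sketch below shows the parsing step only, assuming the fields listed above; `summarize` is a hypothetical helper that a POST handler in any web framework could call with the raw request body.

```python
import json
from collections import Counter


def summarize(body):
    """Parse a webhook body and return (receiver, alert counts by status)."""
    payload = json.loads(body)
    counts = Counter(alert["status"] for alert in payload.get("alerts", []))
    return payload["receiver"], dict(counts)


body = json.dumps({
    "receiver": "webhook-receiver",
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"instance": "server-01"}},
        {"status": "firing", "labels": {"instance": "server-02"}},
        {"status": "resolved", "labels": {"instance": "server-03"}},
    ],
}).encode("utf-8")

receiver, counts = summarize(body)
print(receiver, counts)
```

With `send_resolved: true`, a single notification can mix firing and resolved alerts, which is why the sketch counts the per-alert `status` instead of relying on the group status alone.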
## 4. Alert Template Configuration

### 4.1 Custom alert templates

#### Creating the template file

    {{/* /etc/alertmanager/templates/custom.tmpl */}}
    {{ define "alert.title" }}[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}{{ end }}

    {{ define "alert.content" }}
    {{ if eq .Status "firing" }}
    🔥 **Alert firing**
    {{ else }}
    ✅ **Alert resolved**
    {{ end }}
    **Alert group:** {{ .GroupLabels.alertname }}
    **Status:** {{ .Status | toUpper }}
    **Alert count:** {{ len .Alerts }}
    {{ range $index, $alert := .Alerts }}
    ---
    {{/* $index is zero-based; Alertmanager templates have no arithmetic helper such as add. */}}
    **Alert {{ $index }}:**
    📋 **Summary:** {{ $alert.Annotations.summary }}
    📝 **Description:** {{ $alert.Annotations.description }}
    🏷️ **Severity:** {{ $alert.Labels.severity | toUpper }}
    🖥️ **Instance:** {{ $alert.Labels.instance }}
    ⏰ **Time:** {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
    {{ if $alert.Annotations.runbook_url }}
    📖 **Runbook:** {{ $alert.Annotations.runbook_url }}
    {{ end }}
    {{ if $alert.Annotations.dashboard_url }}
    📊 **Dashboard:** {{ $alert.Annotations.dashboard_url }}
    {{ end }}
    {{ end }}
    {{ if .ExternalURL }}
    🔗 **Alertmanager:** {{ .ExternalURL }}
    {{ end }}
    {{ end }}

    {{ define "alert.html" }}
    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="UTF-8">
      <title>Alert Notification</title>
      <style>
        body { font-family: Arial, sans-serif; line-height: 1.6; color: #333; max-width: 800px; margin: 0 auto; padding: 20px; }
        .header { background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 8px; margin-bottom: 20px; }
        .alert-item { border: 1px solid #ddd; border-radius: 8px; margin: 10px 0; padding: 15px; }
        .critical { border-left: 5px solid #d32f2f; background-color: #ffebee; }
        .warning { border-left: 5px solid #f57c00; background-color: #fff3e0; }
        .info { border-left: 5px solid #1976d2; background-color: #e3f2fd; }
        .resolved { border-left: 5px solid #388e3c; background-color: #e8f5e8; }
        .label { font-weight: bold; color: #555; }
        .value { margin-left: 10px; }
        .footer { margin-top: 30px; padding: 15px; background-color: #f5f5f5; border-radius: 8px; text-align: center; }
        a { color: #1976d2; text-decoration: none; }
        a:hover { text-decoration: underline; }
      </style>
    </head>
    <body>
      <div class="header">
        <h1>{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} Alert notification: {{ .GroupLabels.alertname }}</h1>
        <p>Status: {{ .Status | toUpper }} | Alert count: {{ len .Alerts }}</p>
      </div>
      {{ range .Alerts }}
      <div class="alert-item {{ .Labels.severity }}{{ if eq .Status "resolved" }} resolved{{ end }}">
        <h3>{{ .Annotations.summary }}</h3>
        <p><span class="label">Description:</span><span class="value">{{ .Annotations.description }}</span></p>
        <p><span class="label">Severity:</span><span class="value">{{ .Labels.severity | toUpper }}</span></p>
        <p><span class="label">Instance:</span><span class="value">{{ .Labels.instance }}</span></p>
        <p><span class="label">Started at:</span><span class="value">{{ .StartsAt.Format "2006-01-02 15:04:05" }}</span></p>
        {{ if ne .Status "resolved" }}
        <p><span class="label">Duration:</span><span class="value">{{ .StartsAt | since }}</span></p>
        {{ else }}
        <p><span class="label">Ended at:</span><span class="value">{{ .EndsAt.Format "2006-01-02 15:04:05" }}</span></p>
        {{ end }}
        <div style="margin-top: 15px;">
          {{ if .Annotations.runbook_url }}<a href="{{ .Annotations.runbook_url }}" target="_blank">📖 Runbook</a>{{ end }}
          {{ if .Annotations.dashboard_url }}<a href="{{ .Annotations.dashboard_url }}" target="_blank">📊 Dashboard</a>{{ end }}
          {{ if .GeneratorURL }}<a href="{{ .GeneratorURL }}" target="_blank">🔍 View metrics</a>{{ end }}
        </div>
      </div>
      {{ end }}
      <div class="footer">
        <p>This alert was sent automatically by <a href="{{ .ExternalURL }}">Alertmanager</a></p>
        {{/* No send time is rendered: Alertmanager templates have no "now" function. */}}
      </div>
    </body>
    </html>
    {{ end }}

#### Using the custom templates

    # Reference the templates in alertmanager.yml
    templates:
      - '/etc/alertmanager/templates/*.tmpl'

    receivers:
      - name: 'custom-email'
        email_configs:
          - to: 'team@example.com'
            headers:
              Subject: '{{ template "alert.title" . }}'
            html: '{{ template "alert.html" . }}'
      - name: 'custom-slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#alerts'
            title: '{{ template "alert.title" . }}'
            text: '{{ template "alert.content" . }}'

### 4.2 Template functions and variables

#### Common template functions

```go
{{/* Time formatting */}}
{{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ .StartsAt | date "2006-01-02" }}

{{/* String handling */}}
{{ .Status | toUpper }}  {{/* upper case */}}
{{ .Status | toLower }}  {{/* lower case */}}
{{ .Status | title }}    {{/* capitalize the first letter */}}

{{/* Counting (Alertmanager provides no arithmetic helpers such as add or sub) */}}
{{ len .Alerts }}        {{/* number of alerts */}}

{{/* Conditionals */}}
{{ if eq .Status "firing" }}Firing{{ else }}Resolved{{ end }}
{{ if ne .Status "resolved" }}Not resolved{{ end }}
{{ if gt (len .Alerts) 5 }}More than 5 alerts{{ end }}

{{/* Loops and indexes ($index starts at 0) */}}
{{ range $index, $alert := .Alerts }}
Alert {{ $index }}: {{ $alert.Annotations.summary }}
{{ end }}

{{/* URL encoding */}}
{{ .Annotations.summary | urlquery }}

{{/* HTML escaping */}}
{{ .Annotations.description | html }}

{{/* Regular expressions: match tests a value, reReplaceAll rewrites it (pattern, replacement, text) */}}
{{ if match ".*error.*" .Annotations.description }}
{{/* custom handling */}}
{{ end }}
{{ reReplaceAll "error" "ERROR" .Annotations.description }}
```
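#### Available variables

    // Top-level variables
    .Receiver           // receiver name
    .Status             // group status: firing/resolved
    .Alerts             // list of alerts
    .GroupLabels        // labels the alerts are grouped by
    .CommonLabels       // labels shared by all alerts
    .CommonAnnotations  // annotations shared by all alerts
    .ExternalURL        // external URL of Alertmanager

    // Per-alert variables (inside a range over .Alerts)
    .Status             // alert status
    .Labels             // alert labels
    .Annotations        // alert annotations
    .StartsAt           // start time
    .EndsAt             // end time
    .GeneratorURL       // URL of the expression that generated the alert
    .Fingerprint        // alert fingerprint

    // Time helper (there is no "now" function)
    since               // time elapsed since the given time
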
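
The layout string `"2006-01-02 15:04:05"` used throughout the templates is not a placeholder date. Go formats time by example, using the fixed reference time `Mon Jan 2 15:04:05 2006`, so each number identifies a field. The Python sketch below makes the correspondence explicit by translating the numeric tokens to `strftime` directives; `go_layout_to_strftime` is a hypothetical helper that covers only the zero-padded tokens used in this article.

```python
from datetime import datetime

# Each token of the Go reference time stands for one date or time field.
GO_TO_STRFTIME = [
    ("2006", "%Y"), ("01", "%m"), ("02", "%d"),
    ("15", "%H"), ("04", "%M"), ("05", "%S"),
]


def go_layout_to_strftime(layout):
    """Translate the zero-padded numeric parts of a Go layout to strftime."""
    for go_token, directive in GO_TO_STRFTIME:
        layout = layout.replace(go_token, directive)
    return layout


fmt = go_layout_to_strftime("2006-01-02 15:04:05")
print(fmt)
print(datetime(2023, 1, 1, 8, 30, 0).strftime(fmt))
```

A layout such as `"2023-05-06 10:00:00"` would not work in a template, because those digits are not the reference values and would be printed mostly as literal text.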
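## 5. Rule Testing and Validation

### 5.1 Validating rules with promtool

#### Syntax checks

```bash
# Check the syntax of the alerting rules
promtool check rules alert_rules.yml

# Check the Alertmanager configuration. amtool ships with Alertmanager;
# `promtool check config` validates prometheus.yml instead.
amtool check-config alertmanager.yml

# Evaluate a PromQL expression
promtool query instant http://prometheus:9090 'up'

# Check the rules with all lint checks enabled
promtool check rules --lint=all alert_rules.yml
```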
#### Unit tests

    # alert_test.yml - unit tests for the alerting rules
    rule_files:
      - alert_rules.yml
    evaluation_interval: 1m
    tests:
      - interval: 1m
        input_series:
          - series: 'node_cpu_seconds_total{instance="localhost:9100", mode="idle"}'
            values: '100 95 90 85 80'
          - series: 'node_cpu_seconds_total{instance="localhost:9100", mode="system"}'
            values: '10 15 20 25 30'
        alert_rule_test:
          - eval_time: 0m
            alertname: HighCPUUsage
            exp_alerts: []
          - eval_time: 2m
            alertname: HighCPUUsage
            exp_alerts:
              - exp_labels:
                  severity: warning
                  category: system
                  instance: "localhost:9100"
                # Expected annotations must match the rule's annotations exactly.
                exp_annotations:
                  summary: "High CPU usage warning"
                  description: "Instance localhost:9100 CPU usage 85.00% exceeds 80%"
      - interval: 1m
        input_series:
          - series: 'up{instance="localhost:9100", job="node-exporter"}'
            values: '1 1 0 0 0'
        alert_rule_test:
          - eval_time: 0m
            alertname: ServiceDown
            exp_alerts: []
          - eval_time: 3m
            alertname: ServiceDown
            exp_alerts:
              - exp_labels:
                  severity: critical
                  category: service
                  instance: "localhost:9100"
                  job: "node-exporter"

    # Run the unit tests
    promtool test rules alert_test.yml

### 5.2 Debugging alerting rules

#### Debugging techniques

    # List the currently active alerts
    curl http://prometheus:9090/api/v1/alerts

    # Show the state of the alerting rules
    curl http://prometheus:9090/api/v1/rules

    # Query the evaluation result of a specific rule
    curl -G http://prometheus:9090/api/v1/query \
      --data-urlencode 'query=ALERTS{alertname="HighCPUUsage"}'

    # Test a PromQL expression by hand
    curl -G http://prometheus:9090/api/v1/query \
      --data-urlencode 'query=100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100))'

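#### Troubleshooting common problems

    # Problem 1: an alerting rule never fires
    # Check:
    # 1. Is the PromQL expression correct?
    # 2. Is the time window appropriate?
    # 3. Does the data exist?
    # 4. Do the label matchers match?

    # Debugging example
    - alert: "DebugHighCPU"
      expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100))
      for: 0s  # fire immediately, which is convenient for debugging
      labels:
        debug: "true"
      annotations:
        summary: "CPU usage: {{ $value }}%"
        debug_info: "Labels: {{ $labels }}"

    # Problem 2: an alert keeps flapping between firing and resolved
    # Fix: adjust the `for` duration and the threshold
    - alert: "StableCPUAlert"
      expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)) > 80
      for: 5m  # require the condition to hold for longer
      annotations:
        summary: "Sustained high CPU usage"

    # Problem 3: notifications are not sent
    # Check:
    # 1. Is the Alertmanager configuration valid?
    # 2. Does a route match the alert?
    # 3. Is a silence affecting the alert?
    # 4. Is the receiver configured correctly?
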
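
Why a longer `for` duration stops flapping (problem 2) can be shown with a small simulation. The Python sketch below is a simplified model of rule evaluation, not Prometheus code: `condition` holds one boolean per evaluation, `for_evals` is the `for` duration counted in evaluation intervals, and `alert_states` is a hypothetical helper.

```python
def alert_states(condition, for_evals):
    """State after each evaluation, given whether the expression returned a result."""
    states, elapsed = [], 0
    for true_now in condition:
        if not true_now:
            # The condition stopped holding: the alert resets completely.
            elapsed = 0
            states.append("inactive")
            continue
        # Pending until the condition has held for `for_evals` intervals.
        states.append("firing" if elapsed >= for_evals else "pending")
        elapsed += 1
    return states


flapping = [True, True, False, True, True, True, True]
print(alert_states(flapping, for_evals=0))
print(alert_states(flapping, for_evals=2))
```

With `for: 0s` every short spike fires and resolves immediately. With a hold time of two intervals the first spike never leaves the pending state, and only the sustained condition at the end fires.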
## 6. Best Practices

### 6.1 Alerting rule design principles

#### Tiered alerting strategy

    # Three-tier alerting scheme
    groups:
      - name: "Three-tier alert example"
        rules:
          # Info: informational
          - alert: "DiskUsageInfo"
            expr: (1 - (node_filesystem_free_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 70
            for: 5m
            labels:
              severity: info
              category: system
            annotations:
              summary: "Disk usage notice"
              description: "Disk usage {{ printf \"%.2f\" $value }}% exceeds 70%"
          # Warning: needs attention
          - alert: "DiskUsageWarning"
            expr: (1 - (node_filesystem_free_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 80
            for: 3m
            labels:
              severity: warning
              category: system
            annotations:
              summary: "Disk usage warning"
              description: "Disk usage {{ printf \"%.2f\" $value }}% exceeds 80% and needs attention"
          # Critical: needs immediate action
          - alert: "DiskUsageCritical"
            expr: (1 - (node_filesystem_free_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 90
            for: 1m
            labels:
              severity: critical
              category: system
            annotations:
              summary: "Disk usage critical"
              description: "Disk usage {{ printf \"%.2f\" $value }}% exceeds 90% and needs immediate action"

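
The three rules are evaluated independently, so a disk at 95% satisfies all of them at once. The Python sketch below mirrors the thresholds to make that overlap visible; it ignores the `for` durations, and `firing_severities` and `effective_severity` are hypothetical helper names.

```python
# Thresholds mirror the three disk usage rules (percent used).
TIERS = [("info", 70.0), ("warning", 80.0), ("critical", 90.0)]


def firing_severities(usage_percent):
    """All tiers whose threshold is exceeded; the rules are independent."""
    return [name for name, limit in TIERS if usage_percent > limit]


def effective_severity(usage_percent):
    """Highest exceeded tier, or None when no rule matches."""
    firing = firing_severities(usage_percent)
    return firing[-1] if firing else None


for usage in (65.0, 75.0, 85.0, 95.0):
    print(usage, firing_severities(usage), effective_severity(usage))
```

To notify only on the highest tier, the lower tiers have to be muted by inhibition rules. Since the three rules use different alert names, such a rule would need to compare labels like `instance` rather than `alertname`.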
#### Label standardization

    # Standardized label design
    groups:
      - name: "Standardized alerts"
        rules:
          - alert: "StandardAlert"
            expr: up == 0
            for: 1m
            labels:
              # Basic labels
              severity: critical       # severity: info/warning/critical
              category: service        # category: system/application/network/database/business
              team: sre                # owning team
              env: production          # environment: dev/test/staging/production
              # Business labels
              service: web-app         # service name
              component: frontend      # component name
              region: us-west-2        # region
              # Extended labels
              runbook: "service-down"  # runbook identifier
              impact: high             # impact level: low/medium/high
            annotations:
              # Standard annotations
              summary: "Service unavailable"
              description: "Service {{ $labels.job }} is unavailable on instance {{ $labels.instance }}"
              # Extended annotations
runbook
-