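Prometheus + Alertmanager + DingTalk Alerting
This walkthrough covers a complete Prometheus + Alertmanager + DingTalk alerting setup on a single local server. It consolidates the working configuration for every component and corrects the problems commonly hit along the way.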
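I. Overview
- Deploy the base components (Prometheus, Alertmanager, Node Exporter)
- Deploy the DingTalk alerting middleware (prometheus-webhook-dingtalk)
- Configure the DingTalk robot
- Configure the alert rules and the forwarding path
- Test alerting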
II. Full Steps and Code
Step 1: Deploy the base components
1.1 Install Prometheus
# Download and unpack
cd /opt
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xf prometheus-2.45.0.linux-amd64.tar.gz
mv prometheus-2.45.0.linux-amd64 /usr/local/prometheus

# Create the systemd service
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
User=root
WorkingDirectory=/usr/local/prometheus
ExecStart=/usr/local/prometheus/prometheus --config.file=prometheus.yml --storage.tsdb.path=data/
# SIGHUP makes Prometheus reload its configuration (systemd has no inline comments, so this stays on its own line)
ExecReload=/bin/kill -HUP $MAINPID
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Start the service
systemctl daemon-reload
systemctl enable --now prometheus
1.2 Install Alertmanager
# Download and unpack
cd /opt
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar xf alertmanager-0.25.0.linux-amd64.tar.gz
mv alertmanager-0.25.0.linux-amd64 /usr/local/alertmanager

# Create the systemd service
cat > /etc/systemd/system/alertmanager.service << 'EOF'
[Unit]
Description=Alertmanager
After=network.target

[Service]
User=root
WorkingDirectory=/usr/local/alertmanager
ExecStart=/usr/local/alertmanager/alertmanager --config.file=alertmanager.yml
# SIGHUP makes Alertmanager reload its configuration (systemd has no inline comments, so this stays on its own line)
ExecReload=/bin/kill -HUP $MAINPID
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Start the service
systemctl daemon-reload
systemctl enable --now alertmanager
1.3 Install Node Exporter
# Download and unpack
cd /opt
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xf node_exporter-1.6.1.linux-amd64.tar.gz
mv node_exporter-1.6.1.linux-amd64 /usr/local/node_exporter

# Create the systemd service
# (no ExecReload here: Node Exporter does not reload on SIGHUP, the signal would just stop it)
cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=root
ExecStart=/usr/local/node_exporter/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Start the service
systemctl daemon-reload
systemctl enable --now node_exporter
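Before moving on, it can help to confirm that each component answers over HTTP. The sketch below assumes the default ports on localhost and that curl is installed; the function names are illustrative only. Prometheus and Alertmanager expose a /-/healthy endpoint, while Node Exporter is probed through /metrics.

```shell
#!/usr/bin/env bash
# Sketch: probe each base component over HTTP after installation.
# Assumptions: default ports on localhost, curl available; names are illustrative.

# Map a component name to the URL used as its health probe.
health_url() {
  case "$1" in
    prometheus)    echo "http://localhost:9090/-/healthy" ;;
    alertmanager)  echo "http://localhost:9093/-/healthy" ;;
    node_exporter) echo "http://localhost:9100/metrics" ;;
    *) return 1 ;;
  esac
}

# Probe all three components; returns non-zero if any probe fails.
check_components() {
  local name url failed=0
  for name in prometheus alertmanager node_exporter; do
    url=$(health_url "$name")
    if curl -fsS --max-time 3 -o /dev/null "$url"; then
      echo "OK   ${name} (${url})"
    else
      echo "FAIL ${name} (${url})"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling check_components on the server prints one line per component. A FAIL line is a cue to look at journalctl -u for that service.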
Step 2: Deploy the DingTalk alerting middleware
# Download and unpack
cd /opt
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
tar xf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
mv prometheus-webhook-dingtalk-2.1.0.linux-amd64 /usr/local/dingtalk

# Create the systemd service
cat > /etc/systemd/system/dingtalk-webhook.service << 'EOF'
[Unit]
Description=Prometheus Webhook for DingTalk
After=network.target

[Service]
User=root
WorkingDirectory=/usr/local/dingtalk
ExecStart=/usr/local/dingtalk/prometheus-webhook-dingtalk --config.file=config.yml
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Register the unit and enable it at boot; it is started in step 5, once config.yml exists
systemctl daemon-reload
systemctl enable dingtalk-webhook
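Step 3: Configure the DingTalk robot
- In the DingTalk group, go to Group Settings → Smart Group Assistant → Add Robot → Custom Robot
- Enable "message push" and copy the Webhook URL (for example https://oapi.dingtalk.com/robot/send?access_token=xxx)
- Under the security settings, tick "Sign" and copy the Secret (for example SECxxx)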
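The middleware computes the signature on its own, but the robot can also be tested directly before anything else is wired up. With signing enabled, DingTalk expects a timestamp (in milliseconds) and a sign parameter: the HMAC-SHA256 of "timestamp\nsecret" keyed with the secret, Base64-encoded and then URL-encoded. The following is a minimal sketch assuming openssl, base64 and sed are available; the function name dingtalk_sign is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compute the "sign" query parameter for a DingTalk robot with signing enabled.
# Assumptions: openssl, base64 and sed are installed; dingtalk_sign is an illustrative name.

dingtalk_sign() {
  local timestamp="$1" secret="$2"
  # The string to sign is "<timestamp>\n<secret>", and the HMAC key is the secret itself.
  printf '%s\n%s' "$timestamp" "$secret" \
    | openssl dgst -sha256 -hmac "$secret" -binary \
    | base64 \
    | sed -e 's/+/%2B/g' -e 's/\//%2F/g' -e 's/=/%3D/g'   # URL-encode the Base64 specials
}

# Example usage (replace xxx and SECxxx with the values copied from the robot settings):
#   ts=$(date +%s%3N)   # milliseconds since the epoch (GNU date)
#   curl -s -H 'Content-Type: application/json' \
#     -d '{"msgtype":"text","text":{"content":"robot test"}}' \
#     "https://oapi.dingtalk.com/robot/send?access_token=xxx&timestamp=${ts}&sign=$(dingtalk_sign "$ts" "SECxxx")"
```

If the test message reaches the group, the Webhook URL and Secret are known to be good, which narrows down later troubleshooting to the middleware or Alertmanager.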
Step 4: Wire the components together
4.1 Configure Prometheus (Alertmanager endpoint and alert rules)
# Create the alert rules directory
mkdir -p /usr/local/prometheus/rules

# Write the main Prometheus configuration
cat > /usr/local/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager endpoint
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]  # Alertmanager address

# Alert rule files
rule_files:
  - "rules/node_rules.yml"

# Scrape targets
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
      - targets: ["localhost:9100"]  # Node Exporter address
  - job_name: "alertmanager"
    static_configs:
      - targets: ["localhost:9093"]
EOF

# Write the alert rules
cat > /usr/local/prometheus/rules/node_rules.yml << 'EOF'
groups:
  - name: node_alerts
    rules:
      # Rule 1: Node Exporter is down
      - alert: NodeExporterDown
        expr: up{job="node_exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node Exporter is down"
          description: "Node Exporter on {{ $labels.instance }} has been down for more than 1 minute"
      # Rule 2: high CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          # $value is already on a 0-100 scale; humanizePercentage expects a 0-1 ratio and would print e.g. 8500%
          description: 'CPU usage on {{ $labels.instance }} is above 80%, current value: {{ printf "%.1f" $value }}%'
EOF
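Since a single indentation slip in either YAML file will keep Prometheus from starting, it is worth validating both files with promtool, which ships in the same tarball as Prometheus. The sketch below assumes the install paths used in this guide; the function name and the PROMTOOL / PROM_DIR override variables are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: validate the Prometheus configuration and the rule file before (re)starting.
# Assumptions: install paths from this guide; PROMTOOL and PROM_DIR are illustrative overrides.

validate_prometheus() {
  local promtool="${PROMTOOL:-/usr/local/prometheus/promtool}"
  local base="${PROM_DIR:-/usr/local/prometheus}"
  # "check config" validates prometheus.yml; "check rules" validates the rule file on its own.
  "$promtool" check config "${base}/prometheus.yml" || return 1
  "$promtool" check rules "${base}/rules/node_rules.yml" || return 1
  echo "configuration and rules look valid"
}
```

A typical use would be `validate_prometheus && systemctl reload prometheus`, so that a broken file never reaches the running server.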
4.2 Configure the DingTalk middleware (connect it to the DingTalk robot)
# Write the configuration file (substitute your real access token and Secret)
cat > /usr/local/dingtalk/config.yml << 'EOF'
timeout: 5s
templates:
  - contrib/templates/legacy/template.tmpl
default_message:
  title: '{{ template "legacy.title" . }}'
  text: '{{ template "legacy.content" . }}'
targets:
  webhook1:  # must match the path segment used in the Alertmanager webhook URL
    url: "https://oapi.dingtalk.com/robot/send?access_token=<your-access-token>"
    secret: "<your-signing-secret>"
EOF
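The target name in config.yml (webhook1 here) becomes part of the URL that Alertmanager posts to, in the form /dingtalk/<target>/send, and a mismatch between the two files is an easy mistake to make. The following is a rough sketch of a consistency check; it assumes two-space YAML indentation as used in this guide, and the function names are made up for illustration. A target that no receiver references is not necessarily an error, only something to double-check.

```shell
#!/usr/bin/env bash
# Sketch: check that every DingTalk target name is referenced by an Alertmanager webhook URL.
# Assumptions: two-space indentation in config.yml; function names are illustrative.

# Print the keys found directly under the top-level "targets:" section.
list_targets() {
  awk '
    /^targets:/ { in_targets = 1; next }
    /^[^ #]/    { in_targets = 0 }
    in_targets && /^  [A-Za-z0-9_-]+:/ {
      sub(/^  /, ""); sub(/:.*/, ""); print
    }
  ' "$1"
}

# Report, per target, whether alertmanager.yml contains /dingtalk/<target>/send.
check_wiring() {
  local dingtalk_cfg="$1" am_cfg="$2" name status=0
  for name in $(list_targets "$dingtalk_cfg"); do
    if grep -q "/dingtalk/${name}/send" "$am_cfg"; then
      echo "ok: ${name}"
    else
      echo "missing: ${name}"
      status=1
    fi
  done
  return "$status"
}
```

With the paths from this guide the call would be `check_wiring /usr/local/dingtalk/config.yml /usr/local/alertmanager/alertmanager.yml`.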
4.3 Configure Alertmanager (forward alerts to DingTalk)
cat > /usr/local/alertmanager/alertmanager.yml << 'EOF'
global:
  resolve_timeout: 5m
route:
  group_by: [alertname]
  group_wait: 10s
  group_interval: 15s
  repeat_interval: 20m
  receiver: 'dingtalk-webhook1'  # must match the receiver name below
receivers:
  - name: 'dingtalk-webhook1'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/webhook1/send'  # DingTalk middleware address
        send_resolved: true  # also send a notification when the alert resolves
EOF
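To exercise the Alertmanager → middleware → DingTalk path without waiting for a real rule to fire, an alert can be posted by hand to Alertmanager's v2 API. The sketch below assumes curl is installed and Alertmanager listens on localhost:9093; the function names, the ALERTMANAGER_URL override, and the alert's label values are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: push a hand-made alert into Alertmanager to test the DingTalk route.
# Assumptions: curl available, Alertmanager on localhost:9093; names and labels are illustrative.

# Print the JSON payload: an array with a single alert.
test_alert_json() {
  local name="${1:-ManualTestAlert}" severity="${2:-warning}"
  cat <<JSON
[{"labels":{"alertname":"${name}","severity":"${severity}","instance":"manual-test"},
  "annotations":{"summary":"Manual test alert","description":"Sent by hand to verify the DingTalk route"}}]
JSON
}

# POST the payload to the v2 alerts endpoint.
send_test_alert() {
  test_alert_json "$@" | curl -fsS -X POST -H 'Content-Type: application/json' \
    --data-binary @- "${ALERTMANAGER_URL:-http://localhost:9093}/api/v2/alerts"
}
```

Running `send_test_alert` should produce a DingTalk message after roughly the group_wait delay. Because the payload carries no end time, the alert resolves on its own once resolve_timeout elapses.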
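Step 5: Start all services and test
# Restart all services so the configuration takes effect
systemctl restart prometheus alertmanager dingtalk-webhook node_exporter

# Check service status (every unit should be active (running))
systemctl status prometheus alertmanager dingtalk-webhook node_exporter

# Test alerting: stop Node Exporter to trigger the down alert
systemctl stop node_exporter

# Wait a little over 1 minute (the rule's "for" duration), then check the DingTalk group for the alert
# Restore the service and check for the resolved notification
systemctl start node_exporter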
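III. Verification and Troubleshooting
- Prometheus alert state: open http://<server-IP>:9090/alerts and check whether the rule shows as FIRING
- Alertmanager forwarding state: open http://<server-IP>:9093 and check whether the alert is listed
- DingTalk middleware logs: run journalctl -u dingtalk-webhook -f and look for forwarding records
- Common problems:
  - Port conflicts: netstat -tlnp | grep 8060 (DingTalk middleware), 9090 (Prometheus), and so on
  - Configuration errors: validate the Prometheus configuration with promtool check config /usr/local/prometheus/prometheus.yml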
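When netstat is not installed, a quick alternative is to try a TCP connection to each port from the server itself. The sketch below relies on bash's /dev/tcp pseudo-device and on the default ports used in this guide; the function names are illustrative. It only tells whether something accepts connections on a port, not which process owns it.

```shell
#!/usr/bin/env bash
# Sketch: check which of the stack's ports accept TCP connections on this host.
# Assumptions: bash with /dev/tcp support, default ports from this guide; names are illustrative.

# Succeeds only if something accepts a connection on host $1, port $2.
port_open() {
  (exec 3<>"/dev/tcp/${1}/${2}") 2>/dev/null
}

# Print one status line per component.
report_ports() {
  local entry name port
  for entry in prometheus:9090 alertmanager:9093 node_exporter:9100 dingtalk-webhook:8060; do
    name="${entry%%:*}"
    port="${entry##*:}"
    if port_open 127.0.0.1 "$port"; then
      echo "${name}: port ${port} is listening"
    else
      echo "${name}: nothing listening on port ${port}"
    fi
  done
}
```

If a port is listening while its service is stopped, another process holds it, and that is the conflict to resolve before the service can start.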
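With these steps in place, the server sends DingTalk alerts automatically when its metrics go out of bounds, covering scenarios such as a service going down or CPU usage running high.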
