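Building a Redis Monitoring and Alerting System: From Zero to Enterprise-Grade Practice
- Chapter 1: Architecture Design
- Chapter 2: Environment Preparation and Deployment
  - 2.1 Redis Exporter Deployment
    - 2.1.1 Docker Deployment
    - 2.1.2 Binary Deployment
    - 2.1.3 Kubernetes Deployment
  - 2.2 Prometheus Configuration
  - 2.3 Alertmanager Configuration
- Chapter 3: Redis Monitoring Metrics
  - 3.1 Key Performance Metric Categories
    - 3.1.1 Memory Metrics
    - 3.1.2 Connection Metrics
    - 3.1.3 Performance Metrics
    - 3.1.4 Persistence Metrics
  - 3.2 Custom Business Metrics
    - 3.2.1 Big Key Monitoring
    - 3.2.2 Hot Key Monitoring
- Chapter 4: Alert Rule Configuration
  - 4.1 Memory Alert Rules
  - 4.2 Connection Alert Rules
  - 4.3 Performance Alert Rules
  - 4.4 Cluster State Alerts
- Chapter 5: Grafana Dashboard Development
- Chapter 6: Advanced Features and Optimization
  - 6.1 Downsampling Monitoring Data
  - 6.2 Dynamic Label Management
  - 6.3 Performance Tuning Configuration
- Chapter 7: Practical Cases and Troubleshooting
- Chapter 8: Summary and Best Practices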
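Chapter 1: Architecture Design
1.1 Overall Architecture Diagram
An architecture diagram gives an overall view of the components and data flow of the monitoring system: Redis Exporter exposes Redis metrics, Prometheus scrapes and stores them and evaluates alert rules, Grafana visualizes the data, and Alertmanager processes the resulting alerts.
1.2 Component Responsibilities
Component | Responsibility | Key technical points |
---|---|---|
Redis Exporter | Collects Redis metrics and exposes an HTTP endpoint for Prometheus to scrape | Written in Go, supports cluster mode |
Prometheus | Scrapes metrics on a schedule, stores time series data, evaluates alert rules | Time series database, PromQL query language |
Grafana | Visualizes data and hosts the monitoring dashboards | Data source plugins, panel editor |
Alertmanager | Deduplicates, groups, routes, and silences alerts | Grouping strategy, inhibition rules |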
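Chapter 2: Environment Preparation and Deployment
2.1 Redis Exporter Deployment
2.1.1 Docker Deployment
# docker-compose.yml
version: '3.8'
services:
  redis-exporter:
    image: oliver006/redis_exporter:latest
    container_name: redis-exporter
    ports:
      - "9121:9121"
    environment:
      # Inside the container, localhost is the exporter container itself;
      # replace it with the address of your Redis host unless they share a network namespace.
      - REDIS_ADDR=redis://localhost:6379
      - REDIS_PASSWORD=your_redis_password
      - REDIS_ALIAS=production-redis
    restart: unless-stopped
    healthcheck:
      # Requires wget inside the image; use an image variant that ships it, or drop this check.
      test: ["CMD", "wget", "--quiet", "--spider", "http://localhost:9121/metrics"]
      interval: 30s
      timeout: 10s
      retries: 3
# Start the exporter
docker-compose up -d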
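Once the container is running, the exporter's output can be checked programmatically. The following is a minimal sketch, assuming the exporter is reachable at a URL such as http://localhost:9121/metrics and that samples carry no trailing timestamps; `parse_metric`, `is_redis_up`, and `fetch_metrics` are illustrative helper names, not part of any library.

```python
import urllib.request


def parse_metric(text, name):
    """Return (labels, value) pairs for every sample of `name` in Prometheus text format."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        base, brace, labels = metric.partition("{")
        if base == name:
            samples.append((labels.rstrip("}") if brace else "", float(value)))
    return samples


def is_redis_up(text):
    """True when at least one redis_up sample exists and all of them equal 1."""
    samples = parse_metric(text, "redis_up")
    return bool(samples) and all(value == 1.0 for _, value in samples)


def fetch_metrics(url, timeout=5):
    """Download the raw metrics page (URL is an assumption about your deployment)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

A typical call would be `is_redis_up(fetch_metrics("http://localhost:9121/metrics"))`, which returns False when the exporter is running but cannot reach Redis.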
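2.1.2 Binary Deployment
# Download and unpack the release
wget https://github.com/oliver006/redis_exporter/releases/download/v1.45.0/redis_exporter-v1.45.0.linux-amd64.tar.gz
tar -xzf redis_exporter-v1.45.0.linux-amd64.tar.gz
cd redis_exporter-v1.45.0.linux-amd64
# Start the exporter. In v1.x, -redis.addr takes a single address; further instances
# (for example redis2:6379) are scraped through the exporter's /scrape endpoint.
# The password file is read with -redis.password-file; check the exporter README for the expected file format.
./redis_exporter \
  -redis.addr redis://redis1:6379 \
  -redis.password-file /etc/redis/password.txt \
  -web.listen-address :9121 \
  -web.telemetry-path /metrics \
  -log-format json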
2.1.3 Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-exporter
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: redis-exporter
  template:
    metadata:
      labels:
        app: redis-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9121"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: redis-exporter
          image: oliver006/redis_exporter:v1.45.0
          ports:
            - containerPort: 9121
          env:
            - name: REDIS_ADDR
              value: "redis-service:6379"
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-secret
                  key: password
          resources:
            requests:
              memory: "64Mi"
              cpu: "50m"
            limits:
              memory: "128Mi"
              cpu: "100m"
          livenessProbe:
            httpGet:
              path: /metrics
              port: 9121
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: redis-exporter-service
  namespace: monitoring
  labels:
    app: redis-exporter
spec:
  selector:
    app: redis-exporter
  ports:
    - name: metrics
      port: 9121
      targetPort: 9121
2.2 Prometheus Configuration
2.2.1 Main Configuration File
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: production
    cluster: redis-cluster-01
rule_files:
  - "alerts/redis_alerts.yml"
  - "alerts/system_alerts.yml"
scrape_configs:
  # Multi-target pattern: each target is a Redis address that the exporter scrapes on demand
  # through its /scrape endpoint. redis1 and redis2 are placeholder hostnames.
  - job_name: 'redis'
    static_configs:
      - targets:
          - 'redis://redis1:6379'
          - 'redis://redis2:6379'
    metrics_path: /scrape
    relabel_configs:
      # Pass the Redis address as ?target=..., keep it as the instance label,
      # then send the actual HTTP request to the exporter.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: redis-exporter-service:9121
    scrape_interval: 30s
    scrape_timeout: 10s
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 30s
2.2.2 Recording Rules
groups:
- name: redis_recording_rules
  interval: 30s
  rules:
  - record: redis:memory_usage_percent
    expr: redis_memory_used_bytes / redis_memory_max_bytes * 100
  - record: redis:connected_clients_percent
    expr: redis_connected_clients / redis_config_maxclients * 100
  - record: redis:instantaneous_ops_per_second
    expr: rate(redis_commands_processed_total[2m])
  - record: redis:keyspace_hits_rate
    expr: rate(redis_keyspace_hits_total[5m])
  - record: redis:keyspace_misses_rate
    expr: rate(redis_keyspace_misses_total[5m])
  - record: redis:hit_ratio
    expr: redis:keyspace_hits_rate / (redis:keyspace_hits_rate + redis:keyspace_misses_rate) * 100
  - record: redis:network_input_rate
    expr: rate(redis_net_input_bytes_total[2m])
  - record: redis:network_output_rate
    expr: rate(redis_net_output_bytes_total[2m])
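To make the arithmetic behind the hit ratio rule concrete, here is a small sketch that derives per-second rates from two counter samples and combines them the same way. It is a simplification: Prometheus's rate() also handles counter resets and extrapolates to the window edges, which this sketch does not. The function names are illustrative.

```python
def counter_rate(first, last, seconds):
    """Per-second increase of a counter between two samples (no reset handling)."""
    return (last - first) / seconds


def hit_ratio(hits_rate, misses_rate):
    """Cache hit ratio in percent; NaN when there was no traffic, as with 0/0 in PromQL."""
    total = hits_rate + misses_rate
    if total == 0:
        return float("nan")
    return hits_rate / total * 100


# Example: over a 5-minute window, hits grew from 1000 to 4000 and misses from 100 to 400.
hits = counter_rate(1000, 4000, 300)   # 10 hits per second
misses = counter_rate(100, 400, 300)   # 1 miss per second
print(round(hit_ratio(hits, misses), 2))
```

With these numbers the ratio is 10 / 11, or roughly 90.9%.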
2.3 Alertmanager Configuration
2.3.1 Alert Routing Configuration
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_from: 'monitoring@company.com'
  smtp_auth_username: 'monitoring@company.com'
  smtp_auth_password: 'your-smtp-password'
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_interval: 1m
      repeat_interval: 5m
    - match:
        service: redis
      receiver: 'redis-team'
      routes:
        - match:
            severity: warning
          receiver: 'redis-warning'
        - match:
            severity: critical
          receiver: 'redis-critical'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'instance']
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'devops-team@company.com'
        headers:
          Subject: '[Monitoring] Alert: {{ .GroupLabels.alertname }}'
  - name: 'critical-alerts'
    email_configs:
      - to: 'sre-team@company.com'
    webhook_configs:
      - url: 'http://alert-hook:8080/alerts'
        send_resolved: true
  - name: 'redis-team'
    email_configs:
      - to: 'redis-dba@company.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-redis'
        title: 'Redis Alert: {{ .CommonLabels.alertname }}'
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
  # The nested Redis routes reference these receivers, so they must be defined;
  # add notification settings to them before use.
  - name: 'redis-warning'
  - name: 'redis-critical'
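The order of sibling routes matters, because an alert stops at the first matching sibling unless that route sets `continue: true`. The sketch below is a simplified model of that traversal (exact label matching only, no regex matchers or grouping), written to show where alerts from the routing tree above would land. The `Route` class and `resolve` function are illustrative, not Alertmanager code.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)
    routes: list = field(default_factory=list)
    cont: bool = False  # models the `continue` flag of a route


def resolve(route, labels):
    """Return the receivers an alert with `labels` reaches below `route`."""
    if any(labels.get(k) != v for k, v in route.match.items()):
        return []
    receivers = []
    for child in route.routes:
        found = resolve(child, labels)
        receivers.extend(found)
        if found and not child.cont:
            break  # first matching sibling wins unless it continues
    return receivers or [route.receiver]


TREE = Route("default-receiver", routes=[
    Route("critical-alerts", {"severity": "critical"}),
    Route("redis-team", {"service": "redis"}, routes=[
        Route("redis-warning", {"severity": "warning"}),
        Route("redis-critical", {"severity": "critical"}),
    ]),
])

print(resolve(TREE, {"service": "redis", "severity": "critical"}))
print(resolve(TREE, {"service": "redis", "severity": "warning"}))
```

In this model a critical Redis alert is delivered to `critical-alerts` and never reaches `redis-critical`. If the Redis team should receive it as well, one option is to list the `service: redis` route first, another is to set `continue: true` on the `severity: critical` route.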
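2.3.2 DingTalk Alert Configuration
# Note: DingTalk robots expect their own JSON message format, so the Alertmanager webhook payload
# usually has to pass through an adapter service before it reaches this URL.
- name: 'dingtalk-redis'
  webhook_configs:
    - url: 'https://oapi.dingtalk.com/robot/send?access_token=your-token'
      send_resolved: true
      http_config:
        bearer_token: 'your-bearer-token'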
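An adapter between Alertmanager and DingTalk mostly reshapes JSON. The following sketch shows one possible conversion from an Alertmanager webhook payload to a DingTalk text message. It assumes the DingTalk custom robot accepts a body of the form `{"msgtype": "text", "text": {"content": ...}}`; the function name and the message layout are illustrative choices.

```python
def to_dingtalk(payload):
    """Convert an Alertmanager webhook payload (dict) into a DingTalk text message (dict)."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} on {instance}: {summary}".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "?"),
                instance=labels.get("instance", "?"),
                summary=annotations.get("summary", ""),
            )
        )
    return {"msgtype": "text", "text": {"content": "\n".join(lines)}}


example = {
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "RedisMemoryUsageCritical", "instance": "redis1:6379"},
        "annotations": {"summary": "Redis memory usage above 90%"},
    }],
}
print(to_dingtalk(example)["text"]["content"])
```

A small HTTP service would receive the Alertmanager POST, call `to_dingtalk`, and forward the result to the robot URL.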
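Chapter 3: Redis Monitoring Metrics
3.1 Key Performance Metric Categories
3.1.1 Memory Metrics
# Memory usage (%); redis_memory_max_bytes is 0 when maxmemory is not set, which makes this ratio +Inf
redis_memory_used_bytes / redis_memory_max_bytes * 100
# Memory fragmentation ratio
redis_mem_fragmentation_ratio
# Predicted memory usage 24 hours ahead, extrapolated from the last 6 hours
predict_linear(redis_memory_used_bytes[6h], 86400)
# Key eviction rate (keys evicted per second once maxmemory is reached)
rate(redis_evicted_keys_total[5m])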
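The prediction query fits a straight line through the samples in the range and extrapolates it. A rough sketch of that idea with ordinary least squares follows; it extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the evaluation time, so treat it as an illustration of the technique rather than a reimplementation.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) samples, evaluated seconds_ahead past the last one."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var                      # change in value per second
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Memory grows by 1 byte per second: 100 -> 160 -> 220 over two minutes.
history = [(0, 100.0), (60, 160.0), (120, 220.0)]
print(predict_linear(history, 3600))
```

With this history the projected value one hour after the last sample is 3820 bytes.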
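3.1.2 Connection Metrics
# Connection usage (%)
redis_connected_clients / redis_config_maxclients * 100
# Rejected connections per second
rate(redis_rejected_connections_total[5m])
# Connection count trend (a gauge, so deriv() rather than rate())
deriv(redis_connected_clients[10m])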
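3.1.3 Performance Metrics
# Operations per second
rate(redis_commands_processed_total[2m])
# Command latency distribution (P95); assumes the exporter exposes a duration histogram under this name
histogram_quantile(0.95, rate(redis_commands_duration_seconds_bucket[5m]))
# Cache hit ratio (%)
rate(redis_keyspace_hits_total[5m]) /
(rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) * 100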
3.1.4 Persistence Metrics
# RDB persistence status
redis_rdb_last_save_timestamp_seconds
redis_rdb_changes_since_last_save
# AOF persistence status
redis_aof_enabled
redis_aof_last_rewrite_duration_sec
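3.2 Custom Business Metrics
3.2.1 Big Key Monitoring
# Sample the keyspace for the largest keys, pausing 0.1 s between scan batches
redis-cli --bigkeys -i 0.1
-- Lua script: return the keys matching ARGV[1] whose memory usage exceeds ARGV[2] bytes.
-- KEYS walks the whole keyspace and blocks the server, so run this only on small datasets or a replica.
local keys = redis.call('keys', ARGV[1])
local result = {}
for i, key in ipairs(keys) do
  local size = redis.call('memory', 'usage', key)
  -- size is false when the key expired between the two calls
  if size and size > tonumber(ARGV[2]) then
    table.insert(result, {key, size})
  end
end
return result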
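On a large production keyspace, an incremental SCAN is usually gentler than collecting every key in one call. The sketch below assumes a client object with `scan_iter(match=..., count=...)` and `memory_usage(key)` methods, as offered by the redis-py client; `find_big_keys` and the in-memory `FakeRedis` stand-in are illustrative only, so the example runs without a server.

```python
def find_big_keys(client, pattern="*", threshold_bytes=1024 * 1024, batch=100):
    """Return (key, size) pairs above the threshold, largest first, using incremental SCAN."""
    big = []
    for key in client.scan_iter(match=pattern, count=batch):
        size = client.memory_usage(key)
        # memory_usage returns None when the key disappeared in the meantime
        if size is not None and size > threshold_bytes:
            big.append((key, size))
    return sorted(big, key=lambda item: item[1], reverse=True)


class FakeRedis:
    """Tiny stand-in exposing the two methods used above (for demonstration only)."""

    def __init__(self, sizes):
        self.sizes = sizes

    def scan_iter(self, match="*", count=None):
        return iter(list(self.sizes))

    def memory_usage(self, key):
        return self.sizes.get(key)


fake = FakeRedis({"session:1": 512, "cache:report": 5_000_000, "queue:jobs": 2_000_000})
print(find_big_keys(fake))
```

With a real deployment, `client` would be a connected redis-py client, and the result could be exported as a custom metric or logged on a schedule.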
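3.2.2 Hot Key Monitoring
# Command-level call rates show which commands are hot, not which keys
rate(redis_commands_total{cmd="get"}[2m])
rate(redis_commands_total{cmd="set"}[2m])
# Per-key hot spots need custom metrics: instrument hot key accesses in the application and expose them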
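Application-side instrumentation for hot keys can be as simple as counting accesses per key and reporting the top entries. The following is a minimal in-process sketch; `HotKeyTracker` is a hypothetical helper, and a real service would also need to bound the number of tracked keys and reset counts periodically.

```python
from collections import Counter


class HotKeyTracker:
    """Counts key accesses so the most frequently used keys can be reported."""

    def __init__(self):
        self.counts = Counter()

    def record(self, key):
        """Call this next to every Redis read or write issued by the application."""
        self.counts[key] += 1

    def top(self, n=10):
        """Return the n most accessed keys as (key, count) pairs, hottest first."""
        return self.counts.most_common(n)


tracker = HotKeyTracker()
for key in ["user:1", "user:2", "user:1", "user:1", "product:9", "user:2"]:
    tracker.record(key)
print(tracker.top(2))
```

The top entries could then be exposed as a labeled gauge from the application's own metrics endpoint, keeping the label set small to avoid high cardinality.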
Chapter 4: Alert Rule Configuration
4.1 Memory Alert Rules
groups:
- name: redis_memory_alerts
  rules:
  - alert: RedisMemoryUsageCritical
    expr: redis:memory_usage_percent > 90
    for: 5m
    labels:
      severity: critical
      service: redis
    annotations:
      summary: "Redis memory usage above 90%"
      description: "Instance {{ $labels.instance }} memory usage is currently {{ $value }}%, act immediately"
      runbook: "https://wiki/redis-memory-optimization"
  - alert: RedisMemoryFragmentationHigh
    expr: redis_mem_fragmentation_ratio > 1.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Redis memory fragmentation ratio is too high"
      description: "Instance {{ $labels.instance }} fragmentation ratio is {{ $value }}, consider defragmenting"
  - alert: RedisMemoryOOMWarning
    # Scaled to percent so that the value matches the description
    expr: predict_linear(redis_memory_used_bytes[1h], 3600) / redis_memory_max_bytes * 100 > 100
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Redis is predicted to run out of memory"
      description: "Instance {{ $labels.instance }} is predicted to exhaust memory within 1 hour, projected usage {{ $value }}%"
4.2 Connection Alert Rules
- name: redis_connection_alerts
  rules:
  - alert: RedisConnectionsHigh
    expr: redis:connected_clients_percent > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis connection usage is too high"
      description: "Instance {{ $labels.instance }} connection usage is {{ $value }}%"
  - alert: RedisConnectionsRejected
    # increase() counts rejections over the window, matching the wording of the description
    expr: increase(redis_rejected_connections_total[5m]) > 10
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis is rejecting connections"
      description: "Instance {{ $labels.instance }} rejected {{ $value }} connections in the last 5 minutes"
4.3 Performance Alert Rules
- name: redis_performance_alerts
  rules:
  - alert: RedisHighLatency
    # Assumes the exporter exposes a command duration histogram under this name
    expr: histogram_quantile(0.95, rate(redis_commands_duration_seconds_bucket[2m])) > 0.1
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Redis command latency is too high"
      description: "Instance {{ $labels.instance }} P95 latency is {{ $value }}s"
  - alert: RedisLowHitRatio
    expr: redis:hit_ratio < 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Redis cache hit ratio is too low"
      description: "Instance {{ $labels.instance }} hit ratio is only {{ $value }}%"
  - alert: RedisHighCPUUsage
    # CPU seconds per second, scaled to percent of one core
    expr: (rate(redis_cpu_sys_seconds_total[2m]) + rate(redis_cpu_user_seconds_total[2m])) * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis CPU usage is too high"
      description: "Instance {{ $labels.instance }} CPU usage is {{ $value }}% of one core"
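4.4 Cluster State Alerts
- name: redis_cluster_alerts
  rules:
  - alert: RedisClusterDown
    expr: redis_cluster_state != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis cluster state is abnormal"
      # The cluster name comes from Prometheus external_labels, which templates read via $externalLabels
      description: "Cluster {{ $externalLabels.cluster }} is in an abnormal state, current state code: {{ $value }}"
  - alert: RedisMasterLinkDown
    expr: redis_master_link_up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Redis replication link is down"
      description: "Replica {{ $labels.instance }} has lost its replication connection to the master"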
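The `for:` clause in these rules delays firing until the condition has held for the whole duration; until then the alert is pending. A small simulation of that state machine, assuming one sample per rule evaluation and expressing the duration as a number of evaluation intervals (`alert_states` is an illustrative name):

```python
def alert_states(values, threshold, for_steps):
    """Return 'inactive', 'pending', or 'firing' for each evaluation of `value > threshold`."""
    states = []
    streak = 0  # consecutive evaluations where the condition was true
    for value in values:
        streak = streak + 1 if value > threshold else 0
        if streak == 0:
            states.append("inactive")
        elif streak > for_steps:
            states.append("firing")   # condition has held for at least for_steps intervals
        else:
            states.append("pending")
    return states


# Memory usage percent at six consecutive evaluations, threshold 90, for = 2 intervals.
print(alert_states([85, 92, 93, 95, 91, 70], threshold=90, for_steps=2))
```

A single dip below the threshold resets the streak, which is why short `for:` values make alerts more sensitive and long ones filter out brief spikes.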
Chapter 5: Grafana Dashboard Development
5.1 Overall Monitoring Dashboard
5.1.1 Global Overview Panels
{
  "dashboard": {
    "title": "Redis Cluster Monitoring Overview",
    "tags": ["redis", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Memory Usage",
        "type": "stat",
        "targets": [
          {"expr": "redis:memory_usage_percent", "legendFormat": "{{instance}}"}
        ],
        "thresholds": [
          {"value": 80, "color": "yellow"},
          {"value": 90, "color": "red"}
        ]
      },
      {
        "title": "Connected Clients Trend",
        "type": "timeseries",
        "targets": [
          {"expr": "redis_connected_clients", "legendFormat": "{{instance}}"}
        ]
      }
    ]
  }
}
5.1.2 Performance Monitoring Panel
{
  "title": "Redis Performance",
  "type": "timeseries",
  "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
  "targets": [
    {"expr": "rate(redis_commands_processed_total[2m])", "legendFormat": "{{instance}} OPS"}
  ],
  "fieldConfig": {
    "defaults": {
      "color": {"mode": "palette-classic"},
      "thresholds": {
        "steps": [
          {"value": null, "color": "green"},
          {"value": 1000, "color": "yellow"},
          {"value": 5000, "color": "red"}
        ]
      }
    }
  }
}
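5.2 Advanced Monitoring Features
5.2.1 Predictive Monitoring
# Memory growth forecast (24 hours ahead, based on the last 6 hours)
predict_linear(redis_memory_used_bytes[6h], 86400)
# Capacity planning alert (7 days ahead, based on the last 24 hours; scaled to percent)
- alert: RedisCapacityPlanning
  expr: predict_linear(redis_memory_used_bytes[24h], 604800) / redis_memory_max_bytes * 100 > 80
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Redis capacity planning warning"
    description: "Instance {{ $labels.instance }} memory usage is projected to reach {{ $value }}% in 7 days"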
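5.2.2 Multi-Cluster Monitoring
# External labels identify each Prometheus. They are attached on remote write, federation, and alerts,
# so per-cluster aggregation applies where series from several clusters are stored together.
global:
  external_labels:
    region: us-east-1
    environment: production
    cluster: redis-cluster-01
# Recording rules aggregated per cluster
- record: cluster:memory_usage_avg
  expr: avg by (cluster) (redis:memory_usage_percent)
- record: cluster:qps_sum
  expr: sum by (cluster) (redis:instantaneous_ops_per_second)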
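The `avg by (cluster)` and `sum by (cluster)` aggregations group series by one label and reduce each group. A short sketch of the same grouping on plain data, with illustrative names and made-up sample values:

```python
from collections import defaultdict


def aggregate_by(samples, label, op):
    """Group (labels, value) samples by one label and reduce each group with 'avg' or 'sum'."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[labels.get(label, "")].append(value)
    if op == "avg":
        return {name: sum(vals) / len(vals) for name, vals in groups.items()}
    if op == "sum":
        return {name: sum(vals) for name, vals in groups.items()}
    raise ValueError(f"unsupported operation: {op}")


usage = [
    ({"cluster": "redis-cluster-01", "instance": "redis1:6379"}, 60.0),
    ({"cluster": "redis-cluster-01", "instance": "redis2:6379"}, 80.0),
    ({"cluster": "redis-cluster-02", "instance": "redis3:6379"}, 50.0),
]
print(aggregate_by(usage, "cluster", "avg"))
```

Every label other than the grouping label is dropped from the result, which is also what happens to `instance` in the recording rules.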
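Chapter 6: Advanced Features and Optimization
6.1 Downsampling Monitoring Data
# Forward only selected Redis series to long-term storage.
# The [_:] class also keeps recorded series such as redis:memory_usage_percent:1h.
remote_write:
  - url: http://victoriametrics:8428/api/v1/write
    write_relabel_configs:
      - action: keep
        regex: redis[_:](memory|connected|commands).*
        source_labels: [__name__]
# Recording rule that stores an hourly average
groups:
- name: redis_downsample
  interval: 1h
  rules:
  - record: redis:memory_usage_percent:1h
    expr: avg_over_time(redis:memory_usage_percent[1h])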
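Downsampling with an hourly average trades resolution for storage. The sketch below averages raw samples into aligned one-hour buckets to show the effect. Note the difference from the rule above: avg_over_time uses a lookback window ending at each evaluation time, while this sketch uses fixed buckets keyed by their start time.

```python
from collections import defaultdict


def downsample_avg(samples, window_seconds=3600):
    """Average (timestamp, value) samples into fixed windows keyed by window start."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % window_seconds].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


raw = [(0, 10.0), (1800, 20.0), (3600, 40.0), (5400, 60.0)]
print(downsample_avg(raw))
```

Four raw samples collapse into two hourly points, and short spikes inside an hour are smoothed out, which is acceptable for trends but not for incident analysis.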
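6.2 Dynamic Label Management
relabel_configs:
  # Copy Kubernetes discovery metadata into regular labels
  - source_labels: [__meta_kubernetes_pod_label_app]
    target_label: application
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  # Attach a fixed environment label to every target
  - regex: "(.*)"
    target_label: environment
    replacement: "production"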
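To see what these three rules do to a discovered target, here is a simplified model of the default `replace` relabel action: join the source labels with `;`, match the fully anchored regex, expand `$1`-style references in the replacement, and write the result to the target label. It ignores the other actions (keep, drop, labelmap, and so on), and `apply_relabel` is an illustrative name.

```python
import re


def apply_relabel(labels, rules):
    """Apply 'replace' relabel rules to a label dict and return the new dict."""
    labels = dict(labels)
    for rule in rules:
        joined = rule.get("separator", ";").join(
            labels.get(name, "") for name in rule.get("source_labels", []))
        match = re.fullmatch(rule.get("regex", "(.*)"), joined)
        if match is None:
            continue  # rule does not apply to this target
        value = re.sub(r"\$\{?(\d+)\}?",
                       lambda ref: match.group(int(ref.group(1))) or "",
                       rule.get("replacement", "$1"))
        if value:
            labels[rule["target_label"]] = value
        else:
            labels.pop(rule["target_label"], None)  # empty result removes the label
    return labels


RULES = [
    {"source_labels": ["__meta_kubernetes_pod_label_app"], "target_label": "application"},
    {"source_labels": ["__meta_kubernetes_namespace"], "target_label": "namespace"},
    {"regex": "(.*)", "target_label": "environment", "replacement": "production"},
]

discovered = {
    "__meta_kubernetes_pod_label_app": "redis-exporter",
    "__meta_kubernetes_namespace": "monitoring",
}
print(apply_relabel(discovered, RULES))
```

The third rule has no source labels, so the regex matches the empty string and the fixed value `production` is always written.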
6.3 Performance Tuning Configuration
global:
  scrape_interval: 30s
  scrape_timeout: 10s
scrape_configs:
  - job_name: 'redis'
    # A scrape that exceeds any of these limits is treated as failed
    sample_limit: 5000
    label_limit: 50
    label_name_length_limit: 100
    label_value_length_limit: 100
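Chapter 7: Practical Cases and Troubleshooting
7.1 Common Failure Scenarios
7.1.1 Investigating Memory Leaks
# Memory growth trend in bytes per second (a gauge, so deriv() rather than rate())
deriv(redis_memory_used_bytes[1h])
# Key count
redis_db_keys{db="db0"}
# Big key identification: size of individually checked keys
# (assumes the exporter is started with -check-keys for the keys of interest)
redis_key_size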
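7.1.2 Performance Bottleneck Analysis
# Slow log growth over the last 5 minutes (a gauge, so delta() rather than rate())
delta(redis_slowlog_length[5m])
# Network bandwidth
rate(redis_net_input_bytes_total[2m])
rate(redis_net_output_bytes_total[2m])
# Command latency distribution (P99)
histogram_quantile(0.99, rate(redis_commands_duration_seconds_bucket[5m]))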
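7.2 Validating the Monitoring System
7.2.1 End-to-End Tests
# 1. The exporter reports the Redis instance as up
curl -s http://redis-exporter:9121/metrics | grep redis_up
# 2. Prometheus has scraped the metric (the URL is quoted because of the ? character)
curl -s 'http://prometheus:9090/api/v1/query?query=redis_up'
# 3. Push a test alert to Alertmanager (API v2; the v1 alerts API has been removed from recent releases)
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestAlert", "instance": "test-redis", "severity": "warning"}, "annotations": {"summary": "Test alert", "description": "This is a test alert"}}]'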
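The same test alert can be sent from a script, which makes it easier to run the check on a schedule. This is a sketch under the assumption that Alertmanager is reachable at a base URL such as http://alertmanager:9093 and accepts alerts on its v2 API; `build_test_alert` and `post_alert` are illustrative helper names.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(instance="test-redis", minutes=5):
    """Build a one-element alert list that expires after `minutes`."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": "TestAlert", "instance": instance, "severity": "warning"},
        "annotations": {"summary": "Test alert", "description": "This is a test alert"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def post_alert(base_url, alerts, timeout=10):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `post_alert("http://alertmanager:9093", build_test_alert())` should result in a notification on the channel configured for warning alerts; the explicit `endsAt` lets the test alert resolve on its own.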
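Chapter 8: Summary and Best Practices
8.1 Monitoring System Checklist
Check item | Standard | Verification method |
---|---|---|
Data collection completeness | Every Redis instance is monitored | redis_up == 1 |
Alert rule effectiveness | Every key metric has a corresponding alert | Simulated trigger tests |
Notification channel availability | Alerts are delivered correctly | End-to-end tests |
Dashboard availability | Main metrics are visualized | Grafana panel review |
Performance impact assessment | The monitoring system's resource usage is reasonable | Resource monitoring |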
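8.2 Recommendations for Continuous Improvement
- Review alert rules regularly: adjust thresholds based on false positives and missed alerts
- Optimize the data retention policy: adjust retention time according to storage cost
- Capacity planning: forecast the monitoring system's capacity needs from business growth
- Documentation maintenance: keep runbooks and incident handling procedures up to date
By implementing this guide in full, you will have an enterprise-grade monitoring system that covers Redis comprehensively, detects and handles Redis problems promptly, and keeps the business running stably.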