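Deploying Prometheus with Docker Compose
Contents
- 1. Environment preparation
- 2. Preparing the configuration files
- 3. Writing the Docker Compose file
- 4. Starting the services
- 5. Verifying the deployment
- 6. Common operations
- 7. Hardening suggestions for production
- 8. Adding more monitoring targets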
1. Environment preparation
- Install Docker (version ≥ 20.10) and Docker Compose (version ≥ 1.29).
- Create the project directory:
    mkdir prometheus && cd prometheus
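The version minimums above can be checked from the shell before going further. The following is a minimal sketch, not an official check: the helper names `version_ge` and `check_prereqs` are made up for this example, and it assumes `sort -V` (GNU coreutils or BusyBox) is available.

```shell
# version_ge A B: succeeds when version A is at least version B.
# Relies on `sort -V` (version sort); the smaller version sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_prereqs: compare the installed versions with the minimums listed above.
check_prereqs() {
  docker_ver="$(docker version --format '{{.Server.Version}}')" || return 1
  compose_ver="$(docker-compose version --short)" || return 1
  version_ge "$docker_ver" 20.10 || { echo "Docker $docker_ver is older than 20.10" >&2; return 1; }
  version_ge "$compose_ver" 1.29 || { echo "Docker Compose $compose_ver is older than 1.29" >&2; return 1; }
  echo "Docker $docker_ver, Docker Compose $compose_ver: OK"
}
```

Run `check_prereqs` once; a non-zero exit status means one of the tools is missing or too old.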
2. Preparing the configuration files
- Create the Prometheus configuration file prometheus.yml (basic configuration):
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]  # monitor Prometheus itself

      # Example: add Node Exporter (must be deployed separately)
      # - job_name: "node"
      #   static_configs:
      #     - targets: ["node-exporter:9100"]
- Create the alert rule file alerts.yml (optional):
    groups:
      - name: example
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} down"
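Prometheus only evaluates rule files that are listed under `rule_files` in prometheus.yml, and the basic configuration above has no such key. A minimal sketch of adding it from the shell follows; the helper name `enable_rule_file` is hypothetical, and the path `/etc/prometheus/alerts.yml` assumes the container mount used by the Compose file in section 3.

```shell
# enable_rule_file CONFIG RULE_PATH: append a rule_files entry to CONFIG
# unless the key already exists. RULE_PATH is the path inside the container.
enable_rule_file() {
  if grep -q '^rule_files:' "$1"; then
    echo "rule_files already present in $1; add $2 to it by hand" >&2
    return 1
  fi
  # Leading newline guards against a config file that lacks a trailing newline.
  printf '\nrule_files:\n  - "%s"\n' "$2" >> "$1"
}
```

For example, `enable_rule_file prometheus.yml /etc/prometheus/alerts.yml`. Any further rule file has to be listed the same way and mounted into the container as well.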
linux_rules.yml (alert rules for Linux hosts, based on Node Exporter metrics):
    groups:
      - name: linux-system-rules
        rules:
          # CPU rules
          - alert: HighCpuLoad
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU load on {{ $labels.instance }}"
              description: "CPU usage is {{ $value }}% for last 10 minutes"

          # Memory rules
          - alert: HighMemoryUsage
            # The threshold of 5 is deliberately low so the alert fires during testing; raise it for real use.
            expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 5
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is {{ $value }}% for last 10 minutes"

          # Swap rules
          - alert: HighSwapUsage
            expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "High swap usage on {{ $labels.instance }}"
              description: "Swap usage is {{ $value }}% for last 15 minutes"

          # Disk space rules
          - alert: LowDiskSpace
            expr: (node_filesystem_avail_bytes{mountpoint!~"^(/run|/var/lib/docker).*",fstype!="tmpfs"} / node_filesystem_size_bytes * 100) < 15
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on {{ $labels.instance }} ({{ $labels.mountpoint }})"
              description: "Only {{ $value }}% free space left on {{ $labels.mountpoint }}"

          # Disk I/O rules
          - alert: HighDiskIoLoad
            expr: rate(node_disk_io_time_seconds_total[1m]) * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High disk I/O load on {{ $labels.instance }} ({{ $labels.device }})"
              description: "Disk I/O load is {{ $value }}% for last 10 minutes"

          # Network rules
          - alert: HighNetworkErrors
            expr: increase(node_network_receive_errs_total[5m]) > 10 or increase(node_network_transmit_errs_total[5m]) > 10
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High network errors on {{ $labels.instance }} ({{ $labels.device }})"
              description: "Network errors detected on interface {{ $labels.device }}"

          # System load rules
          - alert: HighSystemLoad
            # on(instance) is needed: node_load5 also carries a job label, so the two sides would not match otherwise.
            expr: node_load5 / on(instance) count by(instance)(node_cpu_seconds_total{mode="system"}) > 1.5
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "High system load on {{ $labels.instance }}"
              description: "5-minute load average is {{ $value }} (relative to CPU count)"

          # Node down rules
          - alert: InstanceDown
            expr: up{job="node"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} has been down for more than 5 minutes"

          # File descriptor rules
          - alert: HighFileDescriptorUsage
            expr: node_filefd_allocated / node_filefd_maximum * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High file descriptor usage on {{ $labels.instance }}"
              description: "File descriptor usage is {{ $value }}% of maximum"
windows_rules.yml (alert rules for Windows hosts, based on windows_exporter metrics):
    groups:
      - name: windows-system-rules
        rules:
          # CPU rules
          - alert: HighCpuUsageWindows
            expr: 100 - (avg by(instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) * 100) > 85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is {{ $value }}% for last 10 minutes"

          # Memory rules
          - alert: HighMemoryUsageWindows
            expr: (windows_os_physical_memory_total_bytes - windows_os_physical_memory_free_bytes) / windows_os_physical_memory_total_bytes * 100 > 90
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is {{ $value }}% for last 10 minutes"

          # Disk space rules
          - alert: LowDiskSpaceWindows
            # The threshold of 95 is deliberately high so the alert fires during testing; lower it for real use.
            expr: (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes * 100) < 95
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on {{ $labels.instance }} ({{ $labels.volume }})"
              description: "Only {{ $value }}% free space left on {{ $labels.volume }}"

          # Disk I/O rules
          - alert: HighDiskIoWindows
            expr: rate(windows_logical_disk_read_seconds_total[5m]) * 100 > 80 or rate(windows_logical_disk_write_seconds_total[5m]) * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High disk I/O on {{ $labels.instance }} ({{ $labels.volume }})"
              description: "Disk I/O utilization is {{ $value }}% for last 10 minutes"

          # Service state rules
          - alert: CriticalServiceDown
            expr: windows_service_status{status!="running"} == 1
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Critical service down on {{ $labels.instance }}"
              description: "Service {{ $labels.service }} is not running"

          # System boot time rules
          - alert: SystemRebooted
            # The metric holds the boot timestamp, so the difference is the uptime in seconds.
            # Fire while the uptime is still under 5 minutes (the original "> 300" would fire on every long-running host).
            expr: time() - windows_system_system_up_time < 300
            for: 0m
            labels:
              severity: info
            annotations:
              summary: "System rebooted on {{ $labels.instance }}"
              description: "System was rebooted, uptime is {{ $value }} seconds"

          # Network rules
          - alert: HighNetworkUtilizationWindows
            expr: rate(windows_net_bytes_total[5m]) / windows_net_speed_bits * 8 * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High network utilization on {{ $labels.instance }} ({{ $labels.interface }})"
              description: "Network utilization is {{ $value }}% for last 10 minutes"

          # Process memory leak detection
          - alert: ProcessMemoryLeakWindows
            expr: predict_linear(windows_process_private_bytes[1h], 3600) / 1024 / 1024 / 1024 > 2
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "Possible memory leak in {{ $labels.process }} on {{ $labels.instance }}"
              description: "Process {{ $labels.process }} is predicted to exceed 2GB memory in 1 hour"

          # System log error rules
          - alert: SystemLogErrorsWindows
            expr: rate(windows_event_log_errors_total[5m]) > 5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High system log errors on {{ $labels.instance }}"
              description: "{{ $value }} errors per second in system logs"
linux_recording_rules.yml (recording rules for Linux hosts):
    groups:
      - name: linux-recording-rules
        interval: 1m
        rules:
          # CPU usage (compatible with multiple Node Exporter versions)
          - record: instance:node_cpu_usage:rate5m
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle",job=~".*"}[5m])) * 100)

          # Memory usage (excluding cache and buffers)
          - record: instance:node_memory_usage:ratio
            expr: >
              (node_memory_MemTotal_bytes - node_memory_MemFree_bytes
              - node_memory_Buffers_bytes - node_memory_Cached_bytes)
              / node_memory_MemTotal_bytes * 100

          # Disk space usage (filtering out irrelevant mount points)
          - record: instance:node_filesystem_usage:ratio
            expr: >
              (node_filesystem_size_bytes{fstype!~"tmpfs|squashfs",mountpoint!~"/run|/snap"}
              - node_filesystem_avail_bytes{fstype!~"tmpfs|squashfs",mountpoint!~"/run|/snap"})
              / node_filesystem_size_bytes{fstype!~"tmpfs|squashfs",mountpoint!~"/run|/snap"} * 100

          # Network traffic (filtering out virtual interfaces)
          - record: instance:node_network_receive_mbps:rate5m
            expr: sum by(instance)(rate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m])) * 8 / 1048576

          # System load (normalized by CPU count)
          # on(instance) is needed: node_load5 also carries a job label, so the two sides would not match otherwise.
          - record: instance:node_load_ratio:rate5m
            expr: node_load5 / on(instance) count by(instance)(node_cpu_seconds_total{mode="system"})
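Hand-written rule files like the ones above are easy to break with a stray indent, so it may help to lint them before Prometheus loads them. The sketch below runs `promtool check rules`, using the promtool binary shipped in the `prom/prometheus` image; the helper name `check_rules` and the `/rules` mount point are assumptions of this example.

```shell
# check_rules FILE...: run `promtool check rules` on each rule file.
# File names are taken relative to the current directory.
check_rules() {
  for f in "$@"; do
    docker run --rm \
      -v "$PWD/$f:/rules/$f:ro" \
      --entrypoint promtool \
      prom/prometheus:latest \
      check rules "/rules/$f" || return 1
  done
}
```

A typical call would be `check_rules alerts.yml linux_rules.yml windows_rules.yml linux_recording_rules.yml`. Note that the Compose file in section 3 mounts only alerts.yml; the other rule files take effect only after they are mounted too and listed under `rule_files`.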
3. Writing the Docker Compose file
docker-compose.yml:
    version: '3.8'

    services:
      prometheus:
        image: prom/prometheus:latest
        container_name: prometheus
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - ./alerts.yml:/etc/prometheus/alerts.yml  # mount the alert rules
          - prometheus-data:/prometheus  # data persistence
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--web.enable-lifecycle'  # allow hot-reloading the configuration
        ports:
          - "9090:9090"
        restart: unless-stopped
        networks:
          - monitor-net

      # Optional: add Grafana for visualization
      grafana:
        image: grafana/grafana:latest
        container_name: grafana
        volumes:
          - grafana-data:/var/lib/grafana
        ports:
          - "3000:3000"
        restart: unless-stopped
        networks:
          - monitor-net

      # Optional: add Node Exporter to monitor the host
      # node-exporter:
      #   image: prom/node-exporter:latest
      #   container_name: node-exporter
      #   restart: unless-stopped
      #   network_mode: host  # host mode required; scrape it via the host's address, since it is not on monitor-net
      #   pid: host
      #   volumes:
      #     - /:/host:ro,rslave
      #   command:
      #     - '--path.rootfs=/host'

    volumes:
      prometheus-data:
      grafana-data:

    networks:
      monitor-net:
        driver: bridge
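Because indentation carries meaning in the Compose file, it is worth parsing it once before starting anything. A minimal sketch using `docker-compose config -q`, which validates the file and prints nothing on success; the helper name `validate_compose` is made up for this example.

```shell
# validate_compose [FILE]: parse the Compose file without creating containers.
# Defaults to docker-compose.yml in the current directory.
validate_compose() {
  docker-compose -f "${1:-docker-compose.yml}" config -q \
    && echo "${1:-docker-compose.yml}: valid"
}
```

On a syntax or schema error, `docker-compose` prints the problem and the function returns a non-zero status.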
4. Starting the services
    docker-compose up -d  # start in the background
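`up -d` returns as soon as the containers are created, which can be slightly before Prometheus accepts requests. One way to wait is to poll the `/-/ready` management endpoint. This is a sketch under the assumption that Prometheus is published on localhost:9090 as in the Compose file above; the helper name `wait_ready` is hypothetical.

```shell
# wait_ready [URL] [TRIES]: poll the readiness endpoint once per second
# until it answers successfully or the number of tries is used up.
wait_ready() {
  url="${1:-http://localhost:9090}"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    if curl -fsS "$url/-/ready" >/dev/null 2>&1; then
      echo "Prometheus is ready at $url"
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  echo "Prometheus did not become ready at $url" >&2
  return 1
}
```

Used after startup, for instance as `docker-compose up -d && wait_ready`.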
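5. Verifying the deployment
- Prometheus UI: open http://<server-ip>:9090
  - Check the targets: Status → Targets
  - Query a metric: Graph → enter the expression up to see each target's state
- Grafana UI (if deployed): http://<server-ip>:3000 (default account admin/admin)
  - Add a Prometheus data source with the URL http://prometheus:9090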
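The same checks can be done without a browser through the Prometheus HTTP API. The sketch below wraps the instant-query and targets endpoints; the function names and the `PROM_URL` variable are assumptions of this example, and the output is raw JSON.

```shell
# Base URL of the Prometheus server; override it by exporting PROM_URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_query EXPR: run an instant query and print the JSON reply.
# --get with --data-urlencode takes care of escaping the PromQL expression.
prom_query() {
  curl -fsS --get --data-urlencode "query=$1" "$PROM_URL/api/v1/query"
}

# list_targets: print the JSON description of the active scrape targets.
list_targets() {
  curl -fsS "$PROM_URL/api/v1/targets?state=active"
}
```

`prom_query up` corresponds to typing up in the Graph page, and `list_targets` to the Status → Targets page.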
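6. Common operations
- Reload the configuration (without restarting):
    curl -X POST http://localhost:9090/-/reload
- View the logs:
    docker-compose logs -f prometheus
- Stop the services:
    docker-compose down
- Back up the data: back up the prometheus-data volume (default location: /var/lib/docker/volumes/...)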
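Instead of copying files out of the Docker data directory, a named volume can be archived through a throwaway container. This is a sketch with several assumptions: the helper name `backup_volume` is made up, the `alpine` image is used only for its `tar`, and the volume name is expected to be `prometheus_prometheus-data` because Compose normally prefixes volumes with the project (directory) name; confirm it with `docker volume ls`.

```shell
# backup_volume VOLUME DEST_DIR: archive a named volume into DEST_DIR
# as VOLUME-<timestamp>.tar.gz and print the path of the archive.
backup_volume() {
  vol="$1"
  dest="$(cd "$2" && pwd)" || return 1   # bind mounts need an absolute path
  name="$vol-$(date +%Y%m%d-%H%M%S).tar.gz"
  docker run --rm \
    -v "$vol:/data:ro" \
    -v "$dest:/backup" \
    alpine tar czf "/backup/$name" -C /data . \
    && echo "$dest/$name"
}
```

For example, `backup_volume prometheus_prometheus-data ./backups`. Stopping Prometheus first (`docker-compose stop prometheus`) is the safer choice, since a copy taken while the TSDB is being written may be inconsistent.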
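7. Hardening suggestions for production
- Security hardening:
  - Set the Prometheus --web.config.file flag to enable basic authentication
  - Restrict the Grafana login policy
- Persistence tuning (the Prometheus storage documentation lists NFS as unsupported for its local TSDB, so verify the backend before relying on this):
    volumes:
      prometheus-data:
        driver_opts:
          type: nfs
          o: addr=<nfs_server>,rw
          device: ":/path/to/nfs"
- Resource limits:
    prometheus:
      deploy:
        resources:
          limits:
            cpus: '2'
            memory: 4G
- High availability:
  - Deploy multiple Prometheus instances + Thanos
  - Use an Alertmanager cluster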
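For the basic-authentication item, the file given to `--web.config.file` is a small YAML document that maps user names to bcrypt password hashes under `basic_auth_users`. A minimal sketch of generating it follows; the helper name `write_web_config` and the file name `web.yml` are hypothetical, and the hash itself has to come from a bcrypt tool of your choice (for example `htpasswd -nB <user>`, if that utility is installed).

```shell
# write_web_config USER BCRYPT_HASH [FILE]: write a web config file that
# enables basic authentication for a single user. FILE defaults to web.yml.
write_web_config() {
  # printf with %s keeps the "$" characters of the bcrypt hash literal.
  printf 'basic_auth_users:\n  %s: %s\n' "$1" "$2" > "${3:-web.yml}"
}
```

The resulting file would then be mounted into the container (for instance at /etc/prometheus/web.yml) and referenced with `--web.config.file=/etc/prometheus/web.yml` in the `command` list. Once authentication is on, calls such as the reload request need credentials too, for example via `curl -u`.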
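8. Adding more monitoring targets
Add the following under scrape_configs in prometheus.yml:
    # Monitor Docker containers
    - job_name: "docker"
      static_configs:
        - targets: ["docker-host:9323"]  # requires the Docker daemon to be configured to expose metrics

    # Monitor MySQL
    - job_name: "mysql"
      static_configs:
        - targets: ["mysql-exporter:9104"]  # requires mysqld-exporter to be deployed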
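Before adding a target to prometheus.yml, it can save a reload cycle to confirm that the endpoint actually serves metrics. A small sketch, with `probe_metrics` being a made-up helper name:

```shell
# probe_metrics HOST:PORT [PATH]: fetch a metrics endpoint and print its
# first five lines. PATH defaults to /metrics.
probe_metrics() {
  curl -fsS --max-time 5 "http://$1${2:-/metrics}" | head -n 5
}
```

For example, `probe_metrics docker-host:9323`. Host names such as mysql-exporter are Compose service names and resolve only inside the Compose network, so the probe has to run from a place where the name resolves, or use the published address instead.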
Note: for the complete set of configuration options, see the official Prometheus documentation.