Monitoring Spring Boot Applications in Docker: End-to-End Monitoring with Prometheus + Grafana

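    • Abstract
    • Chapter 1 Monitoring Architecture Design
      • 1.1 Overall Architecture of the Monitoring System
      • 1.2 Component Responsibilities
      • 1.3 Metric Categories
        • 1.3.1 Application-Level Metrics
        • 1.3.2 System-Level Metrics
        • 1.3.3 Business-Level Metrics
    • Chapter 2 Configuring Monitoring in the Spring Boot Application
      • 2.1 Adding the Monitoring Dependencies
      • 2.2 Application Configuration
      • 2.3 Custom Business Metrics
      • 2.4 Integrating Metrics into the Business Service
      • 2.5 Monitoring at the Controller Layer
      • 2.6 Data Model
    • Chapter 3 Docker Deployment Configuration
      • 3.1 Dockerfile for the Spring Boot Application
      • 3.2 Complete Docker Compose Configuration
      • 3.3 Prometheus Configuration
      • 3.4 Alerting Rules
      • 3.5 Alertmanager Configuration
    • Chapter 4 Grafana Dashboard Configuration
      • 4.1 Data Source Configuration
      • 4.2 Dashboard Provisioning
      • 4.3 Spring Boot Application Dashboard
      • 4.4 Complete Dashboard Configuration
    • Chapter 5 Advanced Monitoring Features
      • 5.1 Custom Metrics Endpoint
      • 5.2 Distributed Tracing Integration
      • 5.3 Performance Testing and Monitoring Verification
    • Chapter 6 Production Deployment and Tuning
      • 6.1 Production-Grade Docker Compose
      • 6.2 Persisting Monitoring Data
      • 6.3 Security Configuration
    • Chapter 7 Troubleshooting and Performance Tuning
      • 7.1 Troubleshooting Common Problems
        • 7.1.1 Metrics Are Not Being Collected
        • 7.1.2 Investigating Memory Leaks
      • 7.2 Performance Tuning Recommendations
    • Summary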

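Abstract

In a modern microservice architecture, monitoring Spring Boot applications is key to keeping the system stable and performant. This article shows how to build a complete Dockerized monitoring stack with Prometheus + Grafana, covering application metric exposure, container monitoring, and business metrics. Code samples, configuration files, and a worked example walk through the setup from scratch to a production-grade system.
Keywords: Spring Boot, Docker, Prometheus, Grafana, monitoring, microservices, containerization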

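Chapter 1 Monitoring Architecture Design

1.1 Overall Architecture of the Monitoring System

A monitoring setup for a modern Spring Boot application should include the following core components: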

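[Architecture diagram. Nodes: Spring Boot application, Micrometer metrics, Prometheus scraping, Prometheus storage, Grafana visualization; Docker containers, cAdvisor monitoring; host system, Node Exporter; alert notifications, Email/Slack/Webhook]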

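1.2 Component Responsibilities

Component            | Responsibility                          | Technology
Metrics collection   | Exposing application metrics            | Micrometer, Spring Boot Actuator
Metrics scraping     | Pulling metrics at regular intervals    | Prometheus
Container monitoring | Monitoring container resources          | cAdvisor
Host monitoring      | Monitoring system resources             | Node Exporter
Visualization        | Displaying and analyzing metrics        | Grafana
Alerting             | Detecting anomalies and notifying       | Alertmanager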

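1.3 Metric Categories

1.3.1 Application-Level Metrics
  • JVM memory, GC, thread pools
  • HTTP request metrics and response times
  • Custom business metrics
  • Database connection pool metrics
1.3.2 System-Level Metrics
  • CPU, memory, and disk utilization
  • Network I/O and disk I/O
  • Container resource usage
1.3.3 Business-Level Metrics
  • Order volume and user activity
  • Business exception counts
  • Metrics for key business flows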

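Chapter 2 Configuring Monitoring in the Spring Boot Application

2.1 Adding the Monitoring Dependencies

First, add the required monitoring dependencies to pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.7.0</version>
    </parent>

    <groupId>com.example</groupId>
    <artifactId>monitored-spring-boot-app</artifactId>
    <version>1.0.0</version>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <!-- Actuator exposes the /actuator/* management endpoints -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
        <!-- Micrometer registry that renders metrics in the Prometheus format -->
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-prometheus</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>com.h2database</groupId>
            <artifactId>h2</artifactId>
            <scope>runtime</scope>
        </dependency>
        <!-- Needed by the integration test in section 5.3 -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- Repackages the jar so that "java -jar" in the Dockerfile can start it -->
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>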

2.2 Application Configuration

Configure application.yml to enable the monitoring endpoints:

server:
  port: 8080

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus,env,beans
      base-path: /actuator
    enabled-by-default: true
  endpoint:
    health:
      show-details: always
      show-components: always
    prometheus:
      enabled: true
  metrics:
    export:
      prometheus:
        enabled: true
    distribution:
      # Publish histogram buckets so Prometheus can compute quantiles
      percentiles-histogram:
        http.server.requests: true
    web:
      server:
        request:
          autotime:
            enabled: true
    # Common tags attached to every metric
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active:default}

spring:
  application:
    name: order-service
  profiles:
    active: docker
  datasource:
    url: jdbc:h2:mem:testdb
    driver-class-name: org.h2.Driver
    username: sa
    password: ''
  jpa:
    database-platform: org.hibernate.dialect.H2Dialect
    hibernate:
      ddl-auto: create-drop
    show-sql: true

logging:
  level:
    org.springframework.web: DEBUG
    io.micrometer: DEBUG
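The percentiles-histogram setting above makes the application publish cumulative buckets, and the alert rule in section 3.4 later feeds them to histogram_quantile. The following is a minimal sketch of how a quantile can be estimated from such buckets by linear interpolation. The class and method names are hypothetical, and the sketch is simplified: it ignores the +Inf bucket and assumes the lowest bucket starts at 0.

```java
public class HistogramQuantileSketch {

    /**
     * Estimates quantile q (0..1) from cumulative histogram buckets.
     * upperBounds[i] is the finite upper bound ("le") of bucket i, in ascending order;
     * cumulativeCounts[i] is the number of observations less than or equal to that bound.
     */
    public static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        double total = cumulativeCounts[cumulativeCounts.length - 1];
        if (total == 0) {
            return Double.NaN; // no observations, nothing to estimate
        }
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                double lowerBound = (i == 0) ? 0.0 : upperBounds[i - 1];
                double countBelow = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
                double inBucket = cumulativeCounts[i] - countBelow;
                if (inBucket == 0) {
                    return lowerBound;
                }
                // Assume observations are spread evenly inside the bucket
                return lowerBound + (upperBounds[i] - lowerBound) * (rank - countBelow) / inBucket;
            }
        }
        return upperBounds[upperBounds.length - 1];
    }
}
```

Because the estimate is interpolated inside a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.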

2.3 Custom Business Metrics

Create custom metrics to monitor the business logic:

package com.example.monitoring.service;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

import java.util.concurrent.ConcurrentHashMap;

@Component
public class OrderMetrics {

    private final Counter orderCreatedCounter;
    private final Counter orderFailedCounter;
    private final Timer orderProcessingTimer;
    // One counter per failure reason, created lazily
    private final ConcurrentHashMap<String, Counter> statusCounters;
    private final MeterRegistry meterRegistry;

    public OrderMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.orderCreatedCounter = Counter.builder("order.created")
                .description("Number of orders created")
                .tag("application", "order-service")
                .register(meterRegistry);
        this.orderFailedCounter = Counter.builder("order.failed")
                .description("Number of failed orders")
                .tag("application", "order-service")
                .register(meterRegistry);
        this.orderProcessingTimer = Timer.builder("order.processing.time")
                .description("Time taken to process orders")
                .tag("application", "order-service")
                .register(meterRegistry);
        this.statusCounters = new ConcurrentHashMap<>();
    }

    public void incrementOrderCreated() {
        orderCreatedCounter.increment();
    }

    public void incrementOrderFailed(String reason) {
        orderFailedCounter.increment();
        // Also count failures per reason
        Counter reasonCounter = statusCounters.computeIfAbsent(reason, k ->
                Counter.builder("order.failed.by.reason")
                        .description("Orders failed by reason")
                        .tag("reason", k)
                        .register(meterRegistry));
        reasonCounter.increment();
    }

    public Timer.Sample startTimer() {
        return Timer.start(meterRegistry);
    }

    public void recordTimer(Timer.Sample sample) {
        sample.stop(orderProcessingTimer);
    }
}
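The meter names registered above use dots, while the Prometheus queries later in the article use names such as http_server_requests_seconds_count. The sketch below illustrates the renaming that takes place when Micrometer meters are scraped by Prometheus: dots become underscores, counters get a _total suffix, and timers get a _seconds base unit. It is a simplified illustration with hypothetical class and method names, not Micrometer's actual naming code.

```java
public class PrometheusNameSketch {

    /** Replaces the separators of a Micrometer meter name with underscores. */
    public static String snakeCase(String meterName) {
        return meterName.replace('.', '_').replace('-', '_');
    }

    /** Name under which a counter is exposed, e.g. order.created -> order_created_total. */
    public static String counterName(String meterName) {
        String base = snakeCase(meterName);
        return base.endsWith("_total") ? base : base + "_total";
    }

    /** Name of the count series of a timer, e.g. http.server.requests -> http_server_requests_seconds_count. */
    public static String timerCountName(String meterName) {
        String base = snakeCase(meterName);
        String withUnit = base.endsWith("_seconds") ? base : base + "_seconds";
        return withUnit + "_count";
    }
}
```

With this mapping in mind, the order.created counter can be queried as order_created_total, and the order.processing.time timer contributes series starting with order_processing_time_seconds.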

2.4 Integrating Metrics into the Business Service

Use the metrics in the business service:

package com.example.monitoring.service;

import com.example.monitoring.model.Order;
import com.example.monitoring.repository.OrderRepository;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import java.util.Optional;
import java.util.Random;

@Service
@Transactional
public class OrderService {

    private final OrderRepository orderRepository;
    private final OrderMetrics orderMetrics;
    private final Random random = new Random();

    public OrderService(OrderRepository orderRepository, OrderMetrics orderMetrics) {
        this.orderRepository = orderRepository;
        this.orderMetrics = orderMetrics;
    }

    public Order createOrder(Order order) {
        // Start timing
        Timer.Sample timerSample = orderMetrics.startTimer();
        try {
            // Simulate business processing
            simulateProcessing();
            // Randomly simulate a failure (10% failure rate)
            if (random.nextInt(10) == 0) {
                throw new RuntimeException("Payment processing failed");
            }
            Order savedOrder = orderRepository.save(order);
            orderMetrics.incrementOrderCreated();
            return savedOrder;
        } catch (RuntimeException e) {
            orderMetrics.incrementOrderFailed(e.getMessage());
            throw e;
        } finally {
            // Record the processing time for successes and failures alike
            orderMetrics.recordTimer(timerSample);
        }
    }

    public Optional<Order> getOrder(Long id) {
        return orderRepository.findById(id);
    }

    private void simulateProcessing() {
        try {
            // Simulate a processing time of 100-500 ms
            Thread.sleep(100 + random.nextInt(400));
        } catch (InterruptedException e) {
            // Restore the interrupt flag and surface the problem as an unchecked exception,
            // so createOrder does not have to declare a checked exception
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Order processing interrupted", e);
        }
    }
}

2.5 Monitoring at the Controller Layer

package com.example.monitoring.controller;

import com.example.monitoring.model.Order;
import com.example.monitoring.service.OrderService;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

import java.util.Optional;

@RestController
@RequestMapping("/api/orders")
public class OrderController {

    private final OrderService orderService;

    public OrderController(OrderService orderService) {
        this.orderService = orderService;
    }

    @PostMapping
    public ResponseEntity<Order> createOrder(@RequestBody Order order) {
        try {
            Order createdOrder = orderService.createOrder(order);
            return ResponseEntity.ok(createdOrder);
        } catch (Exception e) {
            // A processing failure is a server-side fault. Answering with 5xx also lets it
            // show up as outcome="SERVER_ERROR" in http_server_requests_seconds_count.
            return ResponseEntity.internalServerError().build();
        }
    }

    @GetMapping("/{id}")
    public ResponseEntity<Order> getOrder(@PathVariable Long id) {
        Optional<Order> order = orderService.getOrder(id);
        return order.map(ResponseEntity::ok)
                .orElse(ResponseEntity.notFound().build());
    }

    @GetMapping("/health")
    public ResponseEntity<String> health() {
        return ResponseEntity.ok("Service is healthy");
    }
}

2.6 Data Model

package com.example.monitoring.model;

import javax.persistence.*;
import java.time.LocalDateTime;

@Entity
@Table(name = "orders")
public class Order {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String orderNumber;
    private Double amount;
    private String customerEmail;

    @Enumerated(EnumType.STRING)
    private OrderStatus status;

    private LocalDateTime createdAt;

    // Constructor: new orders start as PENDING, stamped with the current time
    public Order() {
        this.createdAt = LocalDateTime.now();
        this.status = OrderStatus.PENDING;
    }

    public enum OrderStatus {
        PENDING, PROCESSING, COMPLETED, FAILED
    }

    // Getters and setters
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }

    public String getOrderNumber() { return orderNumber; }
    public void setOrderNumber(String orderNumber) { this.orderNumber = orderNumber; }

    public Double getAmount() { return amount; }
    public void setAmount(Double amount) { this.amount = amount; }

    public String getCustomerEmail() { return customerEmail; }
    public void setCustomerEmail(String customerEmail) { this.customerEmail = customerEmail; }

    public OrderStatus getStatus() { return status; }
    public void setStatus(OrderStatus status) { this.status = status; }

    public LocalDateTime getCreatedAt() { return createdAt; }
    public void setCreatedAt(LocalDateTime createdAt) { this.createdAt = createdAt; }
}

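Chapter 3 Docker Deployment Configuration

3.1 Dockerfile for the Spring Boot Application

Create an optimized Dockerfile:

# Multi-stage build keeps the runtime image small
FROM maven:3.8.6-eclipse-temurin-17 AS builder
WORKDIR /app
COPY pom.xml .
COPY src ./src
RUN mvn clean package -DskipTests

# Eclipse Temurin is used for the runtime stage because the openjdk repository
# publishes no JRE-only tag for Java 17
FROM eclipse-temurin:17-jre

# Install curl for the health check
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY --from=builder /app/target/*.jar app.jar

# Create a non-root user
RUN groupadd -r spring && useradd -r -g spring spring
USER spring

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8080/actuator/health || exit 1

EXPOSE 8080

# WORKDIR is /app, so the jar copied above lives at /app/app.jar
ENTRYPOINT ["java", "-jar", "/app/app.jar"]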

3.2 Complete Docker Compose Configuration

Create docker-compose.yml to define the complete monitoring stack:

version: '3.8'

services:
  # Spring Boot application
  order-service:
    build: .
    ports:
      - "8080:8080"
    environment:
      - SPRING_PROFILES_ACTIVE=docker
      - MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE=health,info,metrics,prometheus
    networks:
      - monitoring-network
    labels:
      - "prometheus.scrape=true"
      - "prometheus.port=8080"
      - "prometheus.path=/actuator/prometheus"
    depends_on:
      - prometheus

  # Prometheus
  prometheus:
    image: prom/prometheus:v2.40.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      # rule_files in prometheus.yml is resolved relative to /etc/prometheus,
      # so the rules file from section 3.4 has to be mounted as well
      - ./prometheus/alerting_rules.yml:/etc/prometheus/alerting_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    networks:
      - monitoring-network
    restart: unless-stopped

  # Grafana visualization
  grafana:
    image: grafana/grafana:9.3.2
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring-network
    depends_on:
      - prometheus
    restart: unless-stopped

  # cAdvisor container monitoring
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0
    ports:
      - "8081:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
    networks:
      - monitoring-network
    privileged: true
    restart: unless-stopped

  # Node Exporter host monitoring
  node-exporter:
    image: prom/node-exporter:v1.5.0
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring-network
    restart: unless-stopped

  # Alertmanager
  alertmanager:
    image: prom/alertmanager:v0.25.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    networks:
      - monitoring-network
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring-network:
    driver: bridge

3.3 Prometheus Configuration

Create the prometheus/prometheus.yml configuration file:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: 'docker-monitoring'

# Alerting rule files
rule_files:
  - "alerting_rules.yml"

# Scrape configuration
scrape_configs:
  # Prometheus monitoring itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 10s

  # Spring Boot applications
  - job_name: 'spring-boot-apps'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['order-service:8080']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __scheme__
        regex: '(.*)'
        replacement: 'http'
      # Use the host name without the port as the instance label
      - source_labels: [__address__]
        target_label: instance
        regex: '(.*):(.*)'
        replacement: '${1}'

  # Host monitoring
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 20s

  # Container monitoring
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 20s

  # Alertmanager monitoring
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
    scrape_interval: 30s

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

3.4 Alerting Rules

Create prometheus/alerting_rules.yml:

groups:
  - name: spring-boot-alerts
    rules:
      # JVM memory alert. jvm_memory_used_bytes comes from Micrometer; the Kubernetes pod
      # label used with container_memory_usage_bytes does not exist under Docker Compose.
      - alert: HighJVMMemoryUsage
        expr: sum(jvm_memory_used_bytes{application="order-service"}) / (1024 * 1024) > 512
        for: 2m
        labels:
          severity: warning
          service: order-service
        annotations:
          summary: "High JVM Memory Usage"
          description: "JVM memory usage is above 512MB for more than 2 minutes"

      # Application down alert
      - alert: ApplicationDown
        expr: up{job="spring-boot-apps"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Application is down"
          description: "The application has been down for more than 1 minute"

      # High error rate alert. Both sides are aggregated with sum() first; otherwise the
      # division only pairs series with identical labels and always yields 1.
      - alert: HighErrorRate
        expr: sum(rate(http_server_requests_seconds_count{outcome="SERVER_ERROR"}[5m])) / sum(rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for more than 3 minutes"

      # High response time alert, aggregated over all endpoints by bucket boundary (le)
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, sum by (le) (rate(http_server_requests_seconds_bucket[5m]))) > 2
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is above 2 seconds"

  - name: system-alerts
    rules:
      # High CPU usage alert
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is above 80% for more than 5 minutes"

      # High memory usage alert
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is above 85% for more than 5 minutes"

      # Low disk space alert
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Disk space is below 15%"
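To make the error-rate rule above more concrete, here is a minimal sketch of the arithmetic behind it: a per-second rate derived from two counter samples, and the ratio of failed to total requests compared against the 5% threshold. The class and method names are hypothetical, and the rate calculation is deliberately simplified; the real rate() function in Prometheus also extrapolates to the edges of the time window.

```java
public class ErrorRateSketch {

    /**
     * Per-second increase between two samples of a counter taken windowSeconds apart.
     * A later value smaller than the earlier one is treated as a counter reset.
     */
    public static double perSecondRate(double earlier, double later, double windowSeconds) {
        double increase = (later >= earlier) ? later - earlier : later;
        return increase / windowSeconds;
    }

    /** Share of failed requests among all requests; 0 when there is no traffic. */
    public static double errorRatio(double errorRate, double totalRate) {
        return totalRate == 0 ? 0.0 : errorRate / totalRate;
    }

    /** True when the ratio exceeds the alert threshold (0.05 in the rule above). */
    public static boolean exceeds(double ratio, double threshold) {
        return ratio > threshold;
    }
}
```

For example, 600 requests and 45 server errors within a 5-minute window give rates of 2.0 and 0.15 per second, a ratio of 0.075, which is above the 0.05 threshold.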

3.5 Alertmanager Configuration

Create alertmanager/alertmanager.yml:

global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourcompany.com'
  smtp_auth_username: 'alerts@yourcompany.com'
  smtp_auth_password: 'your-app-password'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: 'critical'
      receiver: 'critical-alerts'
    - match:
        severity: 'warning'
      receiver: 'warning-alerts'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://localhost:5001/'

  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@yourcompany.com'
        # email_configs has no subject or body field: the subject is set as a header
        # and the plain-text body goes into text
        headers:
          Subject: '{{ .GroupLabels.alertname }} - CRITICAL'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
          {{ end }}
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/your/slack/webhook'
        channel: '#critical-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'warning-alerts'
    email_configs:
      - to: 'dev-team@yourcompany.com'
        headers:
          Subject: '{{ .GroupLabels.alertname }} - WARNING'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/your/slack/webhook'
        channel: '#warning-alerts'

# A firing critical alert suppresses the matching warning alert
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
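The route section above sends an alert to the first child route whose matcher fits its labels and falls back to the top-level receiver otherwise. The sketch below mimics that selection for the three receivers configured here. It is a hypothetical illustration of the matching order, not Alertmanager code, and it leaves out grouping, the continue option, and nested routes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RouteSketch {

    // Child routes in configuration order: severity value -> receiver name
    private static final Map<String, String> CHILD_ROUTES = new LinkedHashMap<>();
    static {
        CHILD_ROUTES.put("critical", "critical-alerts");
        CHILD_ROUTES.put("warning", "warning-alerts");
    }

    // Receiver of the top-level route, used when no child route matches
    private static final String DEFAULT_RECEIVER = "web.hook";

    /** Returns the receiver for an alert with the given labels. */
    public static String receiverFor(Map<String, String> labels) {
        String severity = labels.get("severity");
        for (Map.Entry<String, String> route : CHILD_ROUTES.entrySet()) {
            if (route.getKey().equals(severity)) {
                return route.getValue(); // first match wins
            }
        }
        return DEFAULT_RECEIVER;
    }
}
```

An alert without a severity label, or with an unlisted value such as info, therefore ends up at the web.hook receiver.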

Chapter 4 Grafana Dashboard Configuration

4.1 Data Source Configuration

Create grafana/provisioning/datasources/datasource.yml:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: 15s
      httpMethod: POST

4.2 Dashboard Provisioning

Create grafana/provisioning/dashboards/dashboard.yml:

apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards

4.3 Spring Boot Application Dashboard

Create grafana/dashboards/spring-boot-dashboard.json. A dashboard file loaded through provisioning contains the dashboard model itself; the outer "dashboard" wrapper is only used by the HTTP API:

{
  "id": null,
  "title": "Spring Boot Application Metrics",
  "tags": ["spring-boot", "prometheus"],
  "timezone": "browser",
  "panels": [
    {
      "id": 1,
      "title": "JVM Memory Usage",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(jvm_memory_used_bytes{application=\"order-service\"}) / (1024 * 1024)",
          "legendFormat": "Memory Usage",
          "refId": "A"
        }
      ],
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
      "fieldConfig": {
        "defaults": {
          "unit": "mbytes",
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "red", "value": 512}
            ]
          }
        }
      }
    },
    {
      "id": 2,
      "title": "HTTP Requests Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "rate(http_server_requests_seconds_count[5m])",
          "legendFormat": "Requests/sec",
          "refId": "A"
        }
      ],
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
    }
  ],
  "time": {"from": "now-6h", "to": "now"},
  "timepicker": {
    "refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m", "30m", "1h", "2h", "1d"]
  }
}

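4.4 Complete Dashboard Configuration

The complete JSON configuration is long, so only the outline of the key panels is given here:

// A complete dashboard should contain the following panels:
{
  "panels": [
    // 1. Application overview panel
    // 2. JVM memory panel
    // 3. GC statistics panel
    // 4. HTTP requests panel
    // 5. Business metrics panel
    // 6. System resources panel
    // 7. Container resources panel
  ]
}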

Chapter 5 Advanced Monitoring Features

5.1 Custom Metrics Endpoint

Create a custom Actuator endpoint that exposes the business metrics. To reach it over HTTP at /actuator/businessmetrics, add businessmetrics to management.endpoints.web.exposure.include:

package com.example.monitoring.config;

import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.actuate.endpoint.annotation.Endpoint;
import org.springframework.boot.actuate.endpoint.annotation.ReadOperation;
import org.springframework.stereotype.Component;

import java.util.HashMap;
import java.util.Map;

@Component
@Endpoint(id = "businessmetrics")
public class BusinessMetricsEndpoint {

    private final MeterRegistry meterRegistry;

    public BusinessMetricsEndpoint(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @ReadOperation
    public Map<String, Object> businessMetrics() {
        Map<String, Object> metrics = new HashMap<>();

        // Read the order counters registered by OrderMetrics
        double created = meterRegistry.get("order.created").counter().count();
        double failed = meterRegistry.get("order.failed").counter().count();
        // order.created only counts successful orders, so the total is the sum of both
        double total = created + failed;

        metrics.put("orders.created.total", created);
        metrics.put("orders.failed.total", failed);
        metrics.put("orders.success.rate", total > 0 ? created / total * 100 : 100);
        return metrics;
    }
}

5.2 Distributed Tracing Integration

Add distributed tracing support:

<!-- Add to pom.xml -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-sleuth</artifactId>
    <version>3.1.0</version>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-sleuth-zipkin</artifactId>
    <version>3.1.0</version>
</dependency>

Configure tracing. The base URL assumes a Zipkin container reachable as zipkin on the same network; the Compose file in section 3.2 does not define one:

spring:
  sleuth:
    sampler:
      probability: 1.0
  zipkin:
    base-url: http://zipkin:9411

5.3 Performance Testing and Monitoring Verification

Create a test that verifies the monitoring endpoints:

package com.example.monitoring.test;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.web.client.TestRestTemplate;
import org.springframework.http.ResponseEntity;
import org.springframework.test.context.ActiveProfiles;

import static org.assertj.core.api.Assertions.assertThat;

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@ActiveProfiles("test")
public class MonitoringIntegrationTest {

    @Autowired
    private TestRestTemplate restTemplate;

    @Test
    public void testActuatorEndpoints() {
        // Health check endpoint
        ResponseEntity<String> healthResponse =
                restTemplate.getForEntity("/actuator/health", String.class);
        assertThat(healthResponse.getStatusCodeValue()).isEqualTo(200);

        // Metrics endpoint
        ResponseEntity<String> metricsResponse =
                restTemplate.getForEntity("/actuator/metrics", String.class);
        assertThat(metricsResponse.getStatusCodeValue()).isEqualTo(200);

        // Prometheus endpoint
        ResponseEntity<String> prometheusResponse =
                restTemplate.getForEntity("/actuator/prometheus", String.class);
        assertThat(prometheusResponse.getStatusCodeValue()).isEqualTo(200);
    }
}
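The test above only checks status codes. A stricter check could also look for a specific metric in the body returned by /actuator/prometheus. The following is a minimal sketch of reading one sample from the Prometheus text format using only the JDK. The class and method names are hypothetical, and the parser is deliberately simple: it ignores label values and timestamps and expects a finite numeric value.

```java
import java.util.OptionalDouble;

public class ExpositionSketch {

    /** Returns the value of the first sample with exactly this metric name, if present. */
    public static OptionalDouble firstSample(String body, String metricName) {
        for (String line : body.split("\n")) {
            String trimmed = line.trim();
            // Skip blank lines and "# HELP" / "# TYPE" comment lines
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            int nameEnd = nameEnd(trimmed);
            if (nameEnd >= trimmed.length() || !trimmed.substring(0, nameEnd).equals(metricName)) {
                continue;
            }
            // The value follows the label block, or the name when there are no labels
            int valueStart = (trimmed.charAt(nameEnd) == '{')
                    ? trimmed.lastIndexOf('}') + 1
                    : nameEnd;
            String[] parts = trimmed.substring(valueStart).trim().split("\\s+");
            return OptionalDouble.of(Double.parseDouble(parts[0]));
        }
        return OptionalDouble.empty();
    }

    // Index of the first '{' or space, which ends the metric name
    private static int nameEnd(String line) {
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '{' || c == ' ') {
                return i;
            }
        }
        return line.length();
    }
}
```

In the integration test, the response body could be passed to firstSample together with a name such as order_created_total to assert that the business counter is actually exported.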

Chapter 6 Production Deployment and Tuning

6.1 Production-Grade Docker Compose

Create the production configuration docker-compose.prod.yml:

version: '3.8'

services:
  order-service:
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 1G
          cpus: '0.5'
        reservations:
          memory: 512M
          cpus: '0.25'
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    configs:
      - source: app-config
        target: /app/config/application.yml
    secrets:
      - db-password

  prometheus:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '1.0'
    volumes:
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'

configs:
  app-config:
    file: ./config/application-prod.yml

secrets:
  db-password:
    file: ./secrets/db_password.txt

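6.2 Persisting Monitoring Data

Back up and restore the persisted data:

# The commands assume the container is named "prometheus"; with Docker Compose the
# actual name usually carries a project prefix (check with docker ps)

# Back up the Prometheus data
docker exec prometheus tar czf - /prometheus > prometheus_backup.tar.gz

# Restore the data
cat prometheus_backup.tar.gz | docker exec -i prometheus tar xzf - -C /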

6.3 Security Configuration

Add authentication:

# Grafana authentication settings
grafana:
  environment:
    - GF_AUTH_ANONYMOUS_ENABLED=false
    - GF_AUTH_BASIC_ENABLED=true
    - GF_SECURITY_SECRET_KEY=your-secret-key

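Chapter 7 Troubleshooting and Performance Tuning

7.1 Troubleshooting Common Problems

7.1.1 Metrics Are Not Being Collected
# Check the application endpoint
curl http://localhost:8080/actuator/prometheus

# Check the status of the Prometheus targets
curl http://localhost:9090/api/v1/targets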
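The targets endpoint queried above returns JSON in which every active target carries a health field such as "up" or "down". As a quick check, the number of unhealthy targets can be counted from that response. The sketch below does so with a regular expression instead of a JSON library to stay dependency-free; the class and method names are hypothetical, and a real tool would parse the JSON properly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TargetHealthSketch {

    // Matches entries such as "health":"up" in the targets response
    private static final Pattern HEALTH = Pattern.compile("\"health\"\\s*:\\s*\"(\\w+)\"");

    /** Counts the targets whose health field equals the given state, e.g. "down". */
    public static int countWithHealth(String targetsJson, String state) {
        Matcher matcher = HEALTH.matcher(targetsJson);
        int count = 0;
        while (matcher.find()) {
            if (matcher.group(1).equals(state)) {
                count++;
            }
        }
        return count;
    }
}
```

A count above zero for "down" points at a target that Prometheus cannot scrape, which narrows the search to that target's address, path, or network.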
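7.1.2 Investigating Memory Leaks
// Add common tags so that memory metrics can be told apart per application and region
@Bean
public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
    // A tag value must not be null, so fall back to a default when REGION is not set
    String env = System.getenv("REGION");
    String region = (env != null) ? env : "unknown";
    return registry -> registry.config().commonTags(
            "application", "order-service",
            "region", region);
}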

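7.2 Performance Tuning Recommendations

  1. Tune the scrape interval: adjust it to the application load
  2. Optimize query performance: precompute expensive expressions with recording rules
  3. Data retention policy: adjust it to the available storage capacity
  4. Resource limits: configure sensible resource limits for the containers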
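The scrape interval and the retention period together determine how much disk space Prometheus needs. A rough estimate multiplies the retention time in seconds by the ingested samples per second and the bytes per sample. The sketch below applies that formula; the class and method names are hypothetical, and the figure of about 2 bytes per sample used in the example is an assumption that should be checked against the actual installation.

```java
public class RetentionSizingSketch {

    /** Samples ingested per second when every active series is scraped once per interval. */
    public static double samplesPerSecond(long activeSeries, long scrapeIntervalSeconds) {
        return (double) activeSeries / scrapeIntervalSeconds;
    }

    /** Estimated disk usage in bytes: retention time x ingestion rate x bytes per sample. */
    public static long neededDiskBytes(long retentionSeconds, double samplesPerSecond,
                                       double bytesPerSample) {
        return Math.round(retentionSeconds * samplesPerSecond * bytesPerSample);
    }
}
```

For 15,000 active series scraped every 15 seconds and kept for 30 days, the estimate at 2 bytes per sample is about 5.2 GB, which fits under the 10GB retention size configured in section 6.1.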

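Summary

With the configuration in this article, we have built a production-grade monitoring system for Spring Boot applications with the following characteristics:

  • ✅ Full coverage: monitoring at the application, system, and container levels
  • ✅ Real-time alerting: multiple severity levels and notification channels
  • ✅ Visualization: Grafana dashboards for the collected metrics
  • ✅ Production readiness: resource limits, persistence, and authentication settings
  • ✅ Business integration: monitoring of custom business metrics

This monitoring stack helps you detect and resolve problems early and keeps Spring Boot applications running reliably in a Docker environment.

Directions for further improvement:
  1. Integrate log monitoring (ELK/Loki)
  2. Implement automated failure recovery
  3. Add machine-learning-based anomaly detection
  4. Build analysis and forecasting capabilities on top of the monitoring data

By continuously improving the monitoring stack, you can build more stable and reliable cloud-native applications.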