
Best Practices for livenessProbe and readinessProbe

news 2025/7/13 17:09:03

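In Kubernetes, livenessProbe and readinessProbe are key mechanisms for keeping an application highly available, but a poor configuration can cause frequent restarts or interrupted traffic. The following are best practices for configuring both probes: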

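1. Core differences and roles

Probe type	Purpose	Consequence of failure
livenessProbe	Detects whether the application is still alive (for example, deadlocks or memory leaks).	The kubelet kills the container and restarts it according to the Pod's restartPolicy
readinessProbe	Detects whether the application is ready to receive traffic (for example, a dependency is not ready or a cache is not loaded).	The Pod is removed from the Service's Endpoints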

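2. Best practice guide

(1) Clearly separate the purposes of the two probes

livenessProbe:
- Check conservatively: only detect whether the application is in an unrecoverable failure state (for example, the main thread has crashed). As a rule, if the service has not died, treat it as healthy.
- Avoid sensitive conditions: for example, do not let business-level failures (such as a database connection timeout) trigger a restart unless that is explicitly what you want.
readinessProbe:
- Check strictly: accept traffic only after the dependencies the Pod needs (such as the database, external components called through feign, and middleware) are ready, subject to the caveat in section (6).
- Adjust dynamically: if a dependency becomes unavailable at runtime (for example, Redis goes down), use readinessProbe to actively refuse traffic.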

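(2) Use sensible check endpoints

Use different HTTP paths for livenessProbe and readinessProbe:

livenessProbe:
  httpGet:
    path: /health/liveness   # lightweight liveness check (process is alive only)
    port: 8080
readinessProbe:
  httpGet:
    path: /health/readiness  # includes dependency checks (e.g. database connection)
    port: 8080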
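For a Spring Boot service, one way to get separate endpoints like these is Actuator's health groups. The following application.yaml is a minimal sketch, assuming Spring Boot 2.3 or later with spring-boot-starter-actuator on the classpath. Actuator then serves the groups at /actuator/health/liveness and /actuator/health/readiness rather than under /health/. Including the `db` indicator is an assumption for illustration and only applies when a DataSource is configured.

```yaml
# application.yaml (sketch; adjust the indicators to your own service)
management:
  endpoint:
    health:
      probes:
        enabled: true            # expose the liveness and readiness health groups
      group:
        liveness:
          include: livenessState # internal application state only, no dependencies
        readiness:
          # readinessState is the built-in indicator; db is an optional,
          # illustrative dependency check (see section (6) before adding more)
          include: readinessState,db
```

Keeping the liveness group free of dependency indicators means a slow database can make the Pod unready without ever causing a restart.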

(3) Configure sensible parameters

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 10   # wait 10 seconds after the container starts before probing
  periodSeconds: 5          # probe every 5 seconds
  timeoutSeconds: 3         # each probe times out after 3 seconds
  failureThreshold: 3       # treat as failed after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 5    # start checking earlier than livenessProbe
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 1       # a single failure marks the Pod as not ready

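Key parameters:

initialDelaySeconds: set this explicitly (the default is 0) so that the application is not judged failed before it finishes initializing (for example, a slow JVM startup).
failureThreshold:
- For livenessProbe, a higher value (such as 3) keeps occasional glitches from triggering a restart.
- For readinessProbe, a lower value (such as 1) removes an unhealthy Pod from load balancing quickly.
periodSeconds and timeoutSeconds: tune these to the application's response time so that slow but healthy responses are not misjudged as timeouts.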
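To see how these parameters interact, here is a rough worked estimate of how long a fault can go unnoticed, using the values from the configuration in this section. It is an approximation that assumes failures show up as probe timeouts and ignores kubelet scheduling jitter. It is not an exact guarantee.

```latex
\[
T_{\text{detect}} \;\lesssim\; \text{failureThreshold} \times \text{periodSeconds} + \text{timeoutSeconds}
\]
\[
\text{livenessProbe: } 3 \times 5\,\text{s} + 3\,\text{s} = 18\,\text{s}
\qquad
\text{readinessProbe: } 1 \times 5\,\text{s} + 3\,\text{s} = 8\,\text{s}
\]
```

Under these assumptions, a hung container is restarted within about 18 seconds, while a Pod that stops being ready is taken out of rotation within about 8 seconds.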

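(4) Use `startupProbe` for slow-starting applications

For applications that take a long time to start (such as Java services), use startupProbe to hold back livenessProbe and readinessProbe until startup has succeeded:

startupProbe:
  httpGet:
    # assumes a custom "startup" health group; Spring Boot Actuator
    # provides only the liveness and readiness groups by default
    path: /actuator/health/startup
    port: 8080
  failureThreshold: 30  # allow up to 30 failed checks
  periodSeconds: 5      # probe every 5 seconds
  # maximum startup wait = failureThreshold * periodSeconds = 150 seconds

If you want to understand why Kubernetes introduced startupProbe specifically for slow-starting services, rather than simply lengthening the initial delay on livenessProbe, see the following article: "Why Is a Startup Probe (startupProbe) Needed?".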

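(5) Choose an appropriate probe type

HTTP GET: suited to web services. Success is judged by the status code (2xx/3xx means success).

Exec: runs a command. An exit code of 0 means success (suited to non-HTTP services):

readinessProbe:
  exec:
    command:
      - /app/check-dependency.sh  # custom script that checks dependencies

TCP Socket: only checks whether the port is open (suited to non-HTTP protocols).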
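For completeness, a TCP Socket probe can be sketched as follows. The port number and timing values are hypothetical and chosen only for illustration. The probe succeeds as soon as the kubelet can open a TCP connection to the port, so it says nothing about whether the service behind the port is actually responding correctly.

```yaml
livenessProbe:
  tcpSocket:
    port: 6379              # hypothetical port of a non-HTTP service
  initialDelaySeconds: 10   # illustrative values, tune for your workload
  periodSeconds: 10
```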

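(6) Avoid depending on downstream services

Do not have readinessProbe deeply check external dependencies (such as databases or APIs) that are shared by all Pods:
- If the external service goes down, every Pod is marked as not ready at the same time, which turns a partial problem into a complete outage.
- Instead, implement a circuit breaker inside the application (such as Hystrix) and degrade the affected features.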

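(7) Logging and monitoring

Record probe check results: return detailed information (such as version and dependency status) from the /health endpoint.
Monitor probe failures: track the kubelet's probe metrics in Prometheus (for example, prober_probe_total filtered by result="failed") and alert promptly.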
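As one possible starting point, probe failures can be turned into a Prometheus alerting rule like the sketch below. It assumes Prometheus scrapes the kubelet's probe metrics endpoint and that the metric is `prober_probe_total` with `result`, `probe_type`, `namespace`, `pod`, and `container` labels. Label names can differ depending on the Kubernetes version and on relabeling in your scrape configuration. The group name, alert name, threshold, and durations are hypothetical.

```yaml
groups:
  - name: probe-alerts                      # hypothetical group name
    rules:
      - alert: ContainerProbeFailing        # hypothetical alert name
        # more than 3 failed probes for the same container within 5 minutes
        expr: |
          sum by (namespace, pod, container, probe_type) (
            increase(prober_probe_total{result="failed"}[5m])
          ) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.probe_type }} probe failing for {{ $labels.namespace }}/{{ $labels.pod }}"
```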

3. Example configuration

Spring Boot application

# Fragment: metadata, selector, and the container image are omitted
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 1
          startupProbe:
            httpGet:
              # assumes a custom "startup" health group is configured in Actuator
              path: /actuator/health/startup
              port: 8080
            failureThreshold: 30
            periodSeconds: 5