Ceph集群故障处理 - PG不一致修复
# ceph -v
ceph version 14.2.22
目录
- 故障现象
- 故障分析
- 故障定位
- 修复过程
- 磁盘状态检查
- OSD存储结构检查
- 修复分析
- 故障总结
- 经验教训
- 最佳实践
- 参考资料
故障现象
通过 ceph -s 命令发现集群处于 HEALTH_ERR 状态,显示存在scrub错误和数据不一致问题:
# ceph -s
  cluster:
    id:     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent
从输出可以看出:
- 集群有1个scrub错误
- 有可能的数据损坏:1个PG处于不一致状态
- 另有1个PG正在进行深度清洗(deep scrubbing)
故障分析
使用 ceph health detail 命令获取更详细的错误信息:
# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 2.8d8 is active+clean+inconsistent, acting [159,79,609,286,11,431,355,398,490,210]
从上述输出可以看出:
- 有1个PG处于不一致状态:2.8d8
- 该PG的acting OSD列表包括10个OSD:[159,79,609,286,11,431,355,398,490,210](acting列表有10个成员,结合后文修复日志中的分片编号 2.8d8s0,可以判断该池为纠删码池,每个OSD各保存一个分片)
- 虽然PG状态为 active+clean+inconsistent,表示它可以提供读写服务,但存在数据一致性问题
注意:inconsistent 标志意味着在scrub或deep-scrub过程中,Ceph发现了数据不一致问题,这通常由磁盘故障、I/O错误或其他硬件问题导致。
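在执行修复之前,如果想进一步确认具体是哪个对象、哪个分片出了问题,可以使用 rados 自带的不一致性查询命令。下面是一个示例草稿,其中池名 mypool 为占位符,需替换为 PG 2.8d8 实际所属的存储池:

```bash
# 列出指定存储池中所有处于不一致状态的PG(mypool为占位池名)
rados list-inconsistent-pg mypool

# 查看PG 2.8d8中具体哪些对象、哪个分片不一致,以及错误类型(如read_error)
rados list-inconsistent-obj 2.8d8 --format=json-pretty
```

需要注意,这些不一致信息来自最近一次scrub的结果,如果之后该PG又被重新scrub过,输出内容可能会变化。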
故障定位
找到主OSD(159)的位置信息:
# ceph osd find 159
{"osd": 159,"addrs": {"addrvec": [{"type": "v2","addr": "10.x.x.x:6950","nonce": xxxxx},{"type": "v1","addr": "10.x.x.x:6951","nonce": xxxxx}]},"osd_fsid": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","host": "osd-host03","crush_location": {"host": "osd-host03","root": "default"}
}
OSD.159位于主机osd-host03上,这是PG 2.8d8的主OSD。在Ceph的PG架构中,主OSD负责协调该PG上的所有操作,包括读取、写入和scrub操作。
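如果想直接确认某个PG的主OSD和acting列表,也可以用 ceph pg map 查询映射关系,示例如下(acting列表中的第一个成员即主OSD):

```bash
# 查看PG到OSD的up/acting映射,acting列表的第一个OSD即为主OSD
ceph pg map 2.8d8

# 如需更详细的PG内部状态(peering信息、scrub时间戳等),可查询PG详情
ceph pg 2.8d8 query | less
```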
修复过程
确认问题后,执行PG修复命令:
# ceph pg repair 2.8d8
instructing pg 2.8d8s0 on osd.159 to repair
ceph pg repair 命令会指示主OSD尝试修复PG中的数据不一致问题。修复过程中,Ceph会从健康的副本中恢复数据。
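repair 是异步下发的操作,命令返回并不代表修复已经完成,可以用类似下面的方式(示例)跟踪修复进展:

```bash
# 实时跟踪集群日志,等待出现 "... repair ... fixed" 字样
ceph -w

# 或周期性查看健康详情,确认scrub错误计数归零、PG不再带有inconsistent标志
ceph health detail
```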
查看OSD日志以确认修复进展:
# tail -n 50 /var/log/ceph/ceph-osd.159.log
2025-05-10 21:33:43.258 xxxxxxxxx 0 log_channel(cluster) log [DBG] : 2.8d8 repair starts
2025-05-10 21:35:03.066 xxxxxxxxx -1 log_channel(cluster) log [ERR] : 2.8d8 shard 398(7) soid 2:xxxxxxxx:::xxxxxxxxxx.xxxxxxxx:head : candidate had a read error
2025-05-10 21:35:43.607 xxxxxxxxx -1 log_channel(cluster) log [ERR] : 2.8d8s0 repair 0 missing, 1 inconsistent objects
2025-05-10 21:35:43.608 xxxxxxxxx -1 log_channel(cluster) log [ERR] : 2.8d8 repair 1 errors, 1 fixed
从日志可以清楚地看到:
- 修复从21:33:43开始
- 在21:35:03,发现了问题所在:OSD.398(即shard 398(7),对应acting列表中下标为7、也就是第8个位置的OSD)上的对象出现读取错误
- 在21:35:43,修复完成,报告修复了1个不一致对象
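日志中的 "candidate had a read error" 通常意味着底层磁盘读取失败。在进入SMART检查之前,也可以先在OSD.398所在主机上查看内核日志,确认是否有对应的I/O错误记录(示例命令,设备名 sdah 以实际环境为准):

```bash
# 检查内核日志中与该盘相关的I/O或介质错误
dmesg -T | grep -iE 'sdah|I/O error|medium error' | tail -n 20

# CentOS 7环境下,/var/log/messages中通常也会留下相应记录
grep -iE 'sdah.*(error|fail)' /var/log/messages | tail -n 20
```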
磁盘状态检查
通过日志发现分片398(7)上存在读取错误,需要进一步检查OSD.398使用的磁盘:
# smartctl -a /dev/sdah
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: WUH721816ALE6L4
Serial Number: XXXXXXXXX
LU WWN Device Id: 5 000cca XXXXXXXXX
Add. Product Id: XXXXXX
Firmware Version: PCGAW270
User Capacity: 16,000,900,661,248 bytes [16.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-4 (unknown minor revision code: 0x009c)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat May 10 22:19:33 2025 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 101) seconds.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 001 Pre-fail Always - 0
2 Throughput_Performance 0x0005 136 136 054 Pre-fail Offline - 96
3 Spin_Up_Time 0x0007 083 083 001 Pre-fail Always - 336 (Average 339)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 36
5 Reallocated_Sector_Ct 0x0033 100 100 001 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 001 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 140 140 020 Pre-fail Offline - 15
9 Power_On_Hours 0x0012 096 096 000 Old_age Always - 34069
10 Spin_Retry_Count 0x0013 100 100 001 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 36
22 Unknown_Attribute 0x0023 100 100 025 Pre-fail Always - 100
184 End-to-End_Error 0x0033 100 100 097 Pre-fail Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 65536
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 1455
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 1455
194 Temperature_Celsius 0x0002 063 063 000 Old_age Always - 32 (Min/Max 17/49)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 16
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 100 100 000 Old_age Always - 0
241 Total_LBAs_Written 0x0012 100 100 000 Old_age Always - 68388904427
242 Total_LBAs_Read 0x0012 100 100 000 Old_age Always - 4664543043

SMART Error Log Version: 1
ATA Error Count: 2

Error 2 occurred at disk power-on lifetime: 34068 hours (1419 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ---------------  --------------------
60 00 00 80 3f 91 40 00  2d+01:31:42.070  READ FPDMA QUEUED
61 00 08 80 34 70 40 00  2d+01:31:39.514  WRITE FPDMA QUEUED
60 00 00 80 36 97 40 00  2d+01:31:39.463  READ FPDMA QUEUED
60 00 00 00 8a f9 40 00  2d+01:31:39.455  READ FPDMA QUEUED
60 00 00 80 c7 41 40 00  2d+01:31:39.450  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 34068 hours (1419 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
磁盘检查结果分析:
| 关键指标 | 值 | 解释 |
|---|---|---|
| 整体健康评估 | PASSED | 磁盘自我评估仍为通过状态 |
| Current_Pending_Sector | 16 | 有16个扇区存在读取问题 |
| UNC错误 | 2个 | 不可纠正的读取错误 |
| Power_On_Hours | 34069 | 运行时间约3.9年 |
| Command_Timeout | 65536 | 出现过命令超时情况 |
虽然磁盘整体评估为PASSED,但出现的Current_Pending_Sector和UNC错误是严重的预警信号,表明该磁盘已经开始出现物理故障,应当计划更换。
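针对Current_Pending_Sector这类预警指标,可以用一个简单的脚本做周期性检查。下面是一个最小化的示例,其中设备列表、阈值与告警方式均为假设,需按实际环境调整:

```bash
#!/usr/bin/env bash
# 检查指定磁盘的Current_Pending_Sector原始值,超过阈值则输出告警
# 示例脚本:设备列表与阈值为假设值,可结合crontab与监控系统使用
THRESHOLD=0
for dev in /dev/sdah; do
    pending=$(smartctl -A "$dev" | awk '$2 == "Current_Pending_Sector" {print $10}')
    if [ -n "$pending" ] && [ "$pending" -gt "$THRESHOLD" ]; then
        echo "WARNING: $dev Current_Pending_Sector=$pending (threshold=$THRESHOLD)"
    fi
done
```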
OSD存储结构检查
使用 ceph-volume lvm list 命令检查OSD.398的存储配置:
===== osd.398 ======

  [block]       /dev/ceph-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/osd-block-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

      block device              /dev/ceph-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/osd-block-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      block uuid                XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX
      cephx lockbox secret
      cluster fsid              xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      cluster name              ceph
      crush device class        None
      db device                 /dev/vg_nvme1n1/lv_sdah
      db uuid                   XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX
      encrypted                 0
      osd fsid                  xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      osd id                    398
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdah

  [db]          /dev/vg_nvme1n1/lv_sdah

      block device              /dev/ceph-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/osd-block-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      block uuid                XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX
      cephx lockbox secret
      cluster fsid              xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      cluster name              ceph
      crush device class        None
      db device                 /dev/vg_nvme1n1/lv_sdah
      db uuid                   XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX
      encrypted                 0
      osd fsid                  xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      osd id                    398
      osdspec affinity
      type                      db
      vdo                       0
      devices                   /dev/nvme1n1
存储配置分析:
- OSD.398使用/dev/sdah作为存储设备
- 使用NVMe设备(/dev/nvme1n1)上的卷组作为DB设备
- 采用了BlueStore存储引擎
- 该OSD的数据位于LVM逻辑卷上
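如果不方便登录OSD所在主机,也可以直接从集群侧查询OSD与主机、物理盘的对应关系。以下命令在Nautilus(14.x)中可用,仅作示例:

```bash
# 查看OSD.398的元数据,hostname/devices等字段给出了对应的主机与物理盘
ceph osd metadata 398 | grep -E '"hostname"|"devices"'

# Nautilus开始还可以按守护进程列出其使用的设备(含厂商与序列号),便于与SMART结果对应
ceph device ls-by-daemon osd.398
```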
修复分析
通过日志分析修复过程:
- 2025-05-10 21:33:43 - 修复操作启动(PG 2.8d8)
- 2025-05-10 21:35:03 - 发现分片398(7)上的对象出现读取错误
- 2025-05-10 21:35:43 - 修复结果统计:0个缺失对象,1个不一致对象
- 2025-05-10 21:35:43 - 修复完成:发现1个错误,修复了1个错误
修复是成功的:当scrub检测到数据不一致时,Ceph能够通过repair操作,自动从其他健康的副本(对纠删码池而言是其余分片)重建正确的数据,这体现了Ceph的自愈能力。
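修复完成后,还可以再核对一次该PG的状态和scrub时间戳,确认inconsistent标志已被清除、repair之后有新的校验记录(示例):

```bash
# 确认PG状态已回到active+clean,并查看最近一次scrub/deep-scrub的时间戳
ceph pg 2.8d8 query | grep -E '"state"|last_scrub_stamp|last_deep_scrub_stamp'
```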
故障总结
问题原因
- 磁盘故障:PG 2.8d8中存在一个不一致的对象,读取错误发生在OSD.398(即shard 398(7),acting列表中下标为7的分片)上
- 磁盘健康状况:SMART检查显示/dev/sdah磁盘存在16个待处理扇区和UNC读取错误
- 磁盘年龄:磁盘已运行约3.9年,开始出现读取错误
- 故障模式:这是典型的磁盘扇区读取错误导致的数据不一致问题
修复方法
- 使用Ceph自愈功能:通过 ceph pg repair 2.8d8 命令成功修复了不一致的对象
- 数据恢复:Ceph自动从其他健康副本恢复了正确的数据
- 修复过程:由主OSD.159协调,确认并修复了OSD.398上的问题对象
后续建议
- 磁盘监控:
  - 密切监控OSD.398所在磁盘(/dev/sdah)的健康状态
  - 每周运行一次smartctl检查
  - 在监控系统中对Current_Pending_Sector值的变化配置报警
- 磁盘更换计划:
  - 考虑尽快更换该磁盘,扇区错误持续存在可能导致更多数据问题
  - 准备好替换磁盘,并安排合适的维护窗口进行更换(更换的大致流程见本节末尾的示例)
- 验证步骤:
  - 运行完整的SMART自检:smartctl -t long /dev/sdah
  - 再次执行deep-scrub验证PG健康:ceph pg deep-scrub 2.8d8
  - 验证集群状态:ceph -s
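关于磁盘更换,具体步骤与部署方式(ceph-volume手工部署、ceph-deploy或编排工具)有关。下面给出一个基于本环境(ceph-volume + 独立NVMe DB逻辑卷)的大致操作示意,仅供参考,执行前务必确认集群冗余充足并评估数据迁移的影响:

```bash
# 1. 将待换盘的OSD标记为out,等待数据迁移完成(ceph -s中PG全部回到active+clean)
ceph osd out 398

# 2. 确认销毁该OSD不会造成数据丢失,然后停止进程并从集群中清除
ceph osd safe-to-destroy osd.398
systemctl stop ceph-osd@398
ceph osd purge 398 --yes-i-really-mean-it

# 3. 清理旧数据盘及其在NVMe上的DB逻辑卷,然后更换物理磁盘(新盘的盘符可能变化)
ceph-volume lvm zap /dev/sdah --destroy
ceph-volume lvm zap /dev/vg_nvme1n1/lv_sdah

# 4. 新盘到位后按原布局重建OSD:数据放在新盘上,DB仍放在NVMe逻辑卷上
ceph-volume lvm create --bluestore --data /dev/sdah --block.db /dev/vg_nvme1n1/lv_sdah
```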
经验教训
- 主动监测的重要性:定期执行deep scrub可以及时发现数据不一致问题
- 修复流程:对于不一致的PG,使用 ceph pg repair 命令可以高效修复
- 后续验证:修复后应检查集群状态,确认问题已解决
- 硬件监控:磁盘的Current_Pending_Sector和SMART错误日志是预测硬盘故障的重要指标
- 冗余设计的价值:正是因为Ceph的多副本/纠删码冗余设计,才能在单个OSD出现问题时保证数据的完整性
最佳实践
- 清洗策略优化:
  - 设置适当的scrub和deep-scrub策略(相关配置示例见本节末尾)
  - 建议在工作日的低负载时段执行scrub,在周末执行deep-scrub
  - 对于大型集群,错开不同OSD的scrub时间
- 数据保护:
  - 对于重要数据,考虑增加副本数量或使用纠删码
  - 定期验证备份策略和灾难恢复流程
- 硬件管理:
  - 保持硬件设备健康,避免磁盘读写错误
  - 建立硬盘SMART监控系统,及时发现潜在问题
  - 对运行3年以上的硬盘进行更严格的监控
  - 实施预防性替换策略,而不是等到磁盘完全故障
- 监控系统增强:
  - 配置自动化监控工具监控SMART属性
  - 设置关键指标的阈值报警
  - 建立磁盘健康评分系统,综合评估磁盘状况
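下面是一组与scrub调度相关的常用配置项示例(Nautilus中可通过 ceph config 在线调整),其中的取值为假设值,需结合业务负载与集群规模评估:

```bash
# 将常规scrub限制在业务低峰时段(例如22:00至次日06:00)
ceph config set osd osd_scrub_begin_hour 22
ceph config set osd osd_scrub_end_hour 6

# 控制scrub并发与节奏,降低对前端IO的影响
ceph config set osd osd_max_scrubs 1
ceph config set osd osd_scrub_sleep 0.1

# 适当拉长deep-scrub周期(示例为14天,单位为秒)
ceph config set osd osd_deep_scrub_interval 1209600
```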
参考资料
- Ceph PG 状态说明
- Ceph 数据一致性
- Ceph 故障排除
- SMART属性解读
- Ceph数据修复命令
- 存储硬盘预测性故障分析