Handling an Abnormal Local Disk Status on an ODA Server Compute Node
Recently, during routine system inspection, a local-disk alert appeared on a customer's ODA server (ODA X8/X9 class hardware). The local disks on these models are 240 GB M.2 SSD modules (card form factor), which have no external indicator visible from the front panel. Opening the chassis cover does expose a green LED on each M.2 module, but that is not something a routine inspection normally checks.
The problem is therefore quite easy to miss. The two M.2 SSDs form a software RAID 1, and an OS-level inspection will generally not notice the fault either; it has to be checked with specific commands such as cat /proc/mdstat and odaadmcli show localdisk. Meanwhile, the STORAGE menu in ILOM still showed the disk status as normal, although the system log may contain disk INSERT/REMOVE entries that can serve as a reference.
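On a live system, one quick way to look for such disk flapping is to search the OS log for hotplug traces. A minimal sketch, assuming standard kernel message wording (the exact text varies by kernel version, so the pattern may need adjusting):
# Sketch: look for SCSI disk attach/detach traces around the alert time
grep -iE 'attached scsi disk|synchronizing scsi cache' /var/log/messages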
In this case, the alert cleared after reseating the disk and rebooting the host.
The handling process was as follows:
Status check:
[root@hisdb2 ~]# odaadmcli show localdisk
NAME PATH TYPE STATUS STATE_IN_ILOM
lpd_0 sda SSD GOOD GOOD
lpd_1 N/A SSD MISSING GOOD ==== this disk has failed
[root@hisdb2 ~]# cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 sda[0]
234425344 blocks super external:/md127/0 [2/1] [U_] == normally [UU]; here it is [U_]
md127 : inactive sda[0](S)
5201 blocks super external:imsm
unused devices: <none>
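For more detail on the degraded array, mdadm can be queried directly. The commands below are a sketch of standard mdadm usage against the device names seen above, not output captured from this incident:
# Show member disks, their state, and whether the array is degraded
mdadm --detail /dev/md126
# Examine the IMSM metadata container as well
mdadm --detail /dev/md127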
In the ILOM event log, disk INSERT/REMOVED entries could be seen, suggesting an unstable disk connection.
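The ILOM event log can also be pulled from the service processor CLI. A sketch assuming a typical Oracle ILOM firmware layout (the exact paths can differ between ILOM versions):
-> show /SP/logs/event/list    # list recent service processor events
-> show faulty                 # list any open fault records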
After reseating the disk and rebooting, the system recovered:
Note that ODA services depend on the clusterware, so CRS autostart must not be disabled; a quick way to verify this is sketched after the log excerpt below:
May 8 17:41:46 hisdb2 su: (to grid) root on none
May 8 17:41:46 hisdb2 su: (to root) root on none
May 8 17:41:46 hisdb2 su: (to root) root on none
May 8 17:43:14 hisdb2 init.oak: 2025-05-08 17:43:14.460969204:[init.oak]:[Waiting for Cluster Ready Services. Diagnostics in /tmp/crsctl.4142]
May 8 17:45:45 hisdb2 init.oak: 2025-05-08 17:45:45.619750299:[init.oak]:[Waiting for Cluster Ready Services. Diagnostics in /tmp/crsctl.4142]
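To confirm that CRS autostart is still enabled and that the stack came up, the standard crsctl checks can be used. A sketch, run as root and assuming crsctl is on the PATH (on ODA it lives under the Grid Infrastructure home):
# Confirm autostart is enabled; expect a message similar to
# "CRS-4622: Oracle High Availability Services autostart is enabled."
crsctl config crs
# Check that the clusterware stack is up
crsctl check crs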
After the reboot, both M.2 disks were recognized and the system automatically resynchronized the RAID 1 array; a way to monitor the progress continuously is sketched after the outputs below:
[root@hisdb2 ~]# odaadmcli show localdisk
NAME PATH TYPE STATUS STATE_IN_ILOM
lpd_0 sda SSD GOOD GOOD
lpd_1 sdb SSD GOOD GOOD
[root@hisdb2 ~]# cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 sdb[1] sda[0]
234425344 blocks super external:/md127/0 [2/1] [U_]
[=====>...............] recovery = 25.8% (60691392/234425344) finish=14.3min speed=202388K/sec
md127 : inactive sda[1](S) sdb[0](S)
10402 blocks super external:imsm
unused devices: <none>
[root@hisdb2 ~]# cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 sdb[1] sda[0]
234425344 blocks super external:/md127/0 [2/1] [U_]
[======>..............] recovery = 32.2% (75616576/234425344) finish=13.2min speed=199081K/sec
md127 : inactive sda[1](S) sdb[0](S)
10402 blocks super external:imsm
unused devices: <none>
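Rather than re-running cat /proc/mdstat by hand, the rebuild can be watched continuously. A small sketch using standard tools (the sysfs path is the stock md interface):
# Refresh the rebuild progress every 10 seconds
watch -n 10 cat /proc/mdstat
# Or read the current sync state straight from sysfs
cat /sys/block/md126/md/sync_action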
Final state:
[root@hisdb2 ~]# cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 sdb[1] sda[0]
234425344 blocks super external:/md127/0 [2/2] [UU]
md127 : inactive sda[1](S) sdb[0](S)
10402 blocks super external:imsm
unused devices: <none>
[root@hisdb2 ~]# odaadmcli show localdisk
NAME PATH TYPE STATUS STATE_IN_ILOM
lpd_0 sda SSD GOOD GOOD
lpd_1 sdb SSD GOOD GOOD
Relevant entries from /var/log/messages (a grep sketch for pulling them follows the excerpt):
May 8 18:16:20 hisdb2 kernel: md/raid1:md126: active with 1 out of 2 mirrors
May 8 18:16:20 hisdb2 kernel: md126: detected capacity change from 0 to 240051552256
May 8 18:16:20 hisdb2 kernel: md126: p1 p2 p3
May 8 18:16:20 hisdb2 systemd: Starting MD Metadata Monitor on /dev/md127...
May 8 18:16:20 hisdb2 systemd: Started MD Metadata Monitor on /dev/md127.
May 8 18:16:20 hisdb2 kernel: md: recovery of RAID array md126
May 8 18:16:21 hisdb2 kernel: EXT4-fs (md126p2): mounted filesystem with ordered data mode. Opts: (null)
May 8 18:16:21 hisdb2 kernel: md: md126: recovery interrupted.
May 8 18:16:21 hisdb2 kernel: md: md126 still in use.
May 8 18:16:21 hisdb2 kernel: md: recovery of RAID array md126
May 8 18:16:21 hisdb2 kernel: md: md126: recovery interrupted.
May 8 18:16:21 hisdb2 kernel: md: md126 still in use.
May 8 18:16:21 hisdb2 kernel: md: recovery of RAID array md126
May 8 18:16:21 hisdb2 kernel: md: md126: recovery interrupted.
May 8 18:16:21 hisdb2 kernel: md: md126 still in use.
May 8 18:16:21 hisdb2 kernel: md: recovery of RAID array md126
May 8 18:16:21 hisdb2 kernel: md: md126: recovery interrupted.
May 8 18:16:21 hisdb2 kernel: md: md126 still in use.
May 8 18:16:21 hisdb2 kernel: md: recovery of RAID array md126
May 8 18:16:21 hisdb2 kernel: md: md126: recovery interrupted.
May 8 18:16:21 hisdb2 kernel: md: md126 still in use.
May 8 18:16:21 hisdb2 kernel: md: recovery of RAID array md126
May 8 18:16:22 hisdb2 systemd: Stopped MD Metadata Monitor on /dev/md127.
May 8 18:16:24 hisdb2 systemd: Starting MD Metadata Monitor on /dev/md127...
May 8 18:16:24 hisdb2 systemd: Started MD Metadata Monitor on /dev/md127.
May 8 18:16:24 hisdb2 systemd-fsck: /dev/md126p2: clean, 67/128016 files, 148390/512000 blocks
May 8 18:16:24 hisdb2 kernel: EXT4-fs (md126p2): mounted filesystem with ordered data mode. Opts: (null)
May 8 18:16:24 hisdb2 kernel: FAT-fs (md126p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
May 8 18:17:21 hisdb2 systemd: rc-local.service: control process exited, code=exited status=127
May 8 18:31:12 hisdb2 systemd: Starting Cleanup of Temporary Directories...
May 8 18:31:12 hisdb2 systemd: Started Cleanup of Temporary Directories.
May 8 18:36:11 hisdb2 kernel: md: md126: recovery done.
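A minimal sketch of the kind of filter used to pull the lines above out of the system log:
# Pull md126/md127-related entries from the system log
grep -E 'md12[67]' /var/log/messages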
The Linux lsblk command also shows that both disks back the system partitions:
[root@hisdb2 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 223.6G 0 disk
└─md126 9:126 0 223.6G 0 raid1
├─md126p2 259:1 0 500M 0 md /boot
├─md126p3 259:2 0 222.6G 0 md
│ ├─VolGroupSys-LogVolOpt 252:20 0 30G 0 lvm /opt
│ ├─VolGroupSys-LogVolSwap 252:1 0 24G 0 lvm [SWAP]
│ ├─VolGroupSys-LogVolU01 252:21 0 40G 0 lvm /u01
│ └─VolGroupSys-LogVolRoot 252:0 0 30G 0 lvm /
└─md126p1 259:0 0 500M 0 md /boot/efi
sda 8:0 0 223.6G 0 disk
└─md126 9:126 0 223.6G 0 raid1
├─md126p2 259:1 0 500M 0 md /boot
├─md126p3 259:2 0 222.6G 0 md
│ ├─VolGroupSys-LogVolOpt 252:20 0 30G 0 lvm /opt
│ ├─VolGroupSys-LogVolSwap 252:1 0 24G 0 lvm [SWAP]
│ ├─VolGroupSys-LogVolU01 252:21 0 40G 0 lvm /u01
│ └─VolGroupSys-LogVolRoot 252:0 0 30G 0 lvm /
└─md126p1 259:0 0 500M 0 md /boot/efi
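As a final sanity check after the rebuild, it may be worth confirming that the SSDs themselves report healthy. A sketch assuming the smartmontools package is available on the node (it is not guaranteed to be present on every ODA image):
# Query the SMART overall health of both M.2 devices
smartctl -H /dev/sda
smartctl -H /dev/sdb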