AIX环境ORACLE RAC节点无法加入集群问题分析
- 服务概述
故障现象:客户反馈7/13日发现RAC节点1主机内存使用异常,遂对主机进行重启。重启后发现RAC数据库节点1无法加入集群,节点2正常运行。
主机与数据库信息:
OS:AIX 6100-07-04-1216
数据库:11.2.0.2 两节点RAC。
工程师接到客户服务需求后,及时了解情况并到达客户现场进行故障排查、分析;通过对数据库RAC组件日志的分析、测试及主机层面的分析,最终定位到问题是因为主机层面安全加固产品配置变动,导致RAC集群重启后在加入集群时的节点通信方面出现异常,引起RAC节点无法加入集群的故障。
目前主机层面已经关闭了安全加固产品,RAC数据库已经恢复正常运行;本次故障的详情情况,请参阅本文档。
- RAC GRID启动异常现象分析
2.1 集群件启动状态
$ crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
1 ONLINE ONLINE ipcrsdb1 Started
ora.cluster_interconnect.haip
1 ONLINE ONLINE ipcrsdb1
ora.crsd
1 ONLINE INTERMEDIATE ipcrsdb1
ora.cssd
1 ONLINE ONLINE ipcrsdb1
ora.cssdmonitor
1 ONLINE ONLINE ipcrsdb1
ora.ctssd
1 ONLINE ONLINE ipcrsdb1 OBSERVER
ora.diskmon
1 ONLINE ONLINE ipcrsdb1
ora.drivers.acfs
1 ONLINE ONLINE ipcrsdb1
ora.evmd
1 ONLINE INTERMEDIATE ipcrsdb1
ora.gipcd
1 ONLINE ONLINE ipcrsdb1
ora.gpnpd
1 ONLINE ONLINE ipcrsdb1
ora.mdnsd
1 ONLINE ONLINE ipcrsdb1
# /oracle/app/11.2/grid/bin/crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
# ps -ef|grep has
root 4915440 8323158 0 17:36:23 pts/0 0:00 grep has
root 6488144 1 11 15:08:08 - 1:41 /oracle/app/11.2/grid/bin/ohasd.bin reboot
root 3080572 1 0 14:08:53 - 0:00 /bin/sh /etc/init.ohasd run
# ps -ef|grep crs
root 7209074 1 13 15:14:19 - 1:18 /oracle/app/11.2/grid/bin/crsd.bin reboot
# ps -ef|grep css
grid 4587534 6291598 23 15:08:26 - 1:58 /oracle/app/11.2/grid/bin/ocssd.bin
root 4915446 1 0 17:37:44 - 0:00 /oracle/app/11.2/grid/bin/cssdmonitor
root 6291598 1 0 15:08:26 - 0:00 /bin/sh /oracle/app/11.2/grid/bin/ocssd
root 5177602 1 0 17:37:44 - 0:00 /oracle/app/11.2/grid/bin/cssdagent
# ps -ef|grep css
grid 4587534 6291598 15 15:08:26 - 1:58 /oracle/app/11.2/grid/bin/ocssd.bin
root 4915446 1 0 17:37:44 - 0:00 /oracle/app/11.2/grid/bin/cssdmonitor
root 6291598 1 0 15:08:26 - 0:00 /bin/sh /oracle/app/11.2/grid/bin/ocssd
root 4194590 8323158 1 17:38:51 pts/0 0:00 grep css
root 5177602 1 0 17:37:44 - 0:00 /oracle/app/11.2/grid/bin/cssdagent
$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCH.dg
ONLINE ONLINE ipcrsdb2
ora.DATA.dg
ONLINE ONLINE ipcrsdb2
ora.LISTENER.lsnr
ONLINE ONLINE ipcrsdb2
ora.LISTENER1.lsnr
ONLINE ONLINE ipcrsdb2
ora.LISTENER2.lsnr
ONLINE ONLINE ipcrsdb2
ora.LISTENER_IPCSMGR1.lsnr
ONLINE ONLINE ipcrsdb2
ora.SYSDG.dg
ONLINE ONLINE ipcrsdb2
ora.asm
ONLINE ONLINE ipcrsdb2
ora.gsd
OFFLINE OFFLINE ipcrsdb2
ora.net1.network
ONLINE ONLINE ipcrsdb2
ora.ons
ONLINE ONLINE ipcrsdb2
ora.registry.acfs
ONLINE ONLINE ipcrsdb2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE ipcrsdb2
ora.cvu
1 ONLINE ONLINE ipcrsdb2
ora.ipcrsdb1.vip
1 ONLINE INTERMEDIATE ipcrsdb2 FAILED OVER
ora.ipcrsdb2.vip
1 ONLINE ONLINE ipcrsdb2
ora.ipcscat.db
1 ONLINE OFFLINE
ora.ipcsmgr.db
1 ONLINE OFFLINE
2 ONLINE ONLINE ipcrsdb2 Open
ora.ipcsqy.db
1 ONLINE OFFLINE
2 ONLINE ONLINE ipcrsdb2 Open
ora.oc4j
1 ONLINE ONLINE ipcrsdb2
ora.scan1.vip
1 ONLINE ONLINE ipcrsdb2
从这部分信息对比可以发现,节点1数据库集群组件启动异常,HAS可以启动,CRS无法启动;
2.2 集群件异常日志分析
结合集群启动的流程,对集群件相关日志进行分析,如下:
集群ALERT日志中的信息:
2025-04-13 15:08:09.040
[ohasd(6488144)]CRS-2112:The OLR service started on node ipcrsdb1.
2025-04-13 15:08:09.081
[ohasd(6488144)]CRS-1301:Oracle High Availability Service started on node ipcrsdb1.
2025-04-13 15:08:09.093
[ohasd(6488144)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2025-04-13 15:08:10.748
[/oracle/app/11.2/grid/bin/oraagent.bin(5570804)]CRS-5815:Agent '/oracle/app/11.2/grid/bin/oraagent_grid' could not find any base type entry points for type 'ora.daemon.type'. Details at (:CRSAGF00108:) {0:2:2} in /oracle/app/11.2/grid/log/ipcrsdb1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2025-04-13 15:08:11.170
[/oracle/app/11.2/grid/bin/orarootagent.bin(5898386)]CRS-5016:Process "/oracle/app/11.2/grid/bin/acfsload" spawned by agent "/oracle/app/11.2/grid/bin/orarootagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2/grid/log/ipcrsdb1/agent/ohasd/orarootagent_root/orarootagent_root.log"
2025-04-13 15:08:21.930
[/oracle/app/11.2/grid/bin/oraagent.bin(4129174)]CRS-5815:Agent '/oracle/app/11.2/grid/bin/oraagent_grid' could not find any base type entry points for type 'ora.daemon.type'. Details at (:CRSAGF00108:) {0:5:2} in /oracle/app/11.2/grid/log/ipcrsdb1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2025-04-13 15:08:24.446
[ohasd(6488144)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2025-04-13 15:08:24.467
[gpnpd(5701712)]CRS-2328:GPNPD started on node ipcrsdb1.
2025-04-13 15:08:27.848
[cssd(4587534)]CRS-1713:CSSD daemon is started in clustered mode
2025-04-13 15:10:00.121
[cssd(4587534)]CRS-1707:Lease acquisition for node ipcrsdb1 number 1 completed
2025-04-13 15:10:01.706
[cssd(4587534)]CRS-1605:CSSD voting file is online: /dev/rhdisk160; details in /oracle/app/11.2/grid/log/ipcrsdb1/cssd/ocssd.log.
2025-04-13 15:10:01.973
[cssd(4587534)]CRS-1605:CSSD voting file is online: /dev/rhdisk161; details in /oracle/app/11.2/grid/log/ipcrsdb1/cssd/ocssd.log.
2025-04-13 15:10:02.245
[cssd(4587534)]CRS-1605:CSSD voting file is online: /dev/rhdisk162; details in /oracle/app/11.2/grid/log/ipcrsdb1/cssd/ocssd.log.
2025-04-13 15:10:04.209
[cssd(4587534)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ipcrsdb1 ipcrsdb2 .
此处日志可以看到CSSD部分启动是正常的。
CRSD启动日志:
2025-04-13 15:14:20.505: [ default][1] CRS Daemon Starting
2025-04-13 15:14:20.510: [ default][515] Policy Engine is not initialized yet!
2025-04-13 15:14:20.515: [ default][1] ENV Logging level for Module: AGENT 1
2025-04-13 15:14:20.515: [ default][1] ENV Logging level for Module: AGFW 0
2025-04-13 15:14:20.515: [ default][1] ENV Logging level for Module: CLSFRAME 0
2025-04-13 15:14:20.515: [ default][1] ENV Logging level for Module: CLSVER 0
2025-04-13 15:14:20.515: [ default][1] ENV Logging level for Module: CLUCLS 0
2025-04-13 15:14:20.515: [ default][1] ENV Logging level for Module: CRSAPP 0
2025-04-13 15:14:20.516: [ default][1] ENV Logging level for Module: CRSCCL 0
2025-04-13 15:14:20.516: [ default][1] ENV Logging level for Module: CRSCEVT 0
2025-04-13 15:14:20.516: [ default][1] ENV Logging level for Module: CRSCOMM 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: CRSD 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: CRSEVT 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: CRSMAIN 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: CRSOCR 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: CRSPE 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: CRSPLACE 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: CRSRES 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: CRSRPT 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: CRSRTI 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: CRSSE 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: CRSSEC 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: CRSTIMER 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: CRSUI 0
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: OCRAPI 1
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: OCRASM 1
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: OCRCAC 1
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: OCRCLI 1
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: OCRMAS 1
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: OCRMSG 1
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: OCROSD 1
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: OCRRAW 1
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: OCRSRV 1
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: OCRUTL 1
2025-04-13 15:14:20.516: [ CRSD][1] ENV Logging level for Module: SuiteTes 1
2025-04-13 15:14:20.517: [ CRSD][1] ENV Logging level for Module: UiServer 0
2025-04-13 15:14:20.517: [ CRSMAIN][1] Checking the OCR device
2025-04-13 15:14:20.517: [ CRSMAIN][1] Sync-up with OCR
2025-04-13 15:14:20.517: [ CRSMAIN][1] Connecting to the CSS Daemon
2025-04-13 15:14:20.517: [ CRSMAIN][1] Getting local node number
2025-04-13 15:14:20.518: [ CRSMAIN][1] Initializing OCR
[ CLWAL][1]clsw_Initialize: OLR initlevel [70000]
2025-04-13 15:14:21.295: [ OCRRAW][1]proprioo: for disk 0 (+DATA), id match (1), total id sets, (2) need recover (0), my votes (1), total votes (2), commit_lsn (80617), lsn (80617)
2025-04-13 15:14:21.295: [ OCRRAW][1]proprioo: my id set: (1862408427, 2024014047, 0, 0, 0)
2025-04-13 15:14:21.295: [ OCRRAW][1]proprioo: 1st set: (1862408427, 1028247821, 0, 0, 0)
2025-04-13 15:14:21.295: [ OCRRAW][1]proprioo: 2nd set: (1862408427, 2024014047, 0, 0, 0)
2025-04-13 15:14:21.308: [ OCRRAW][1]proprioo: for disk 1 (+sysdg), id match (1), total id sets, (2) need recover (0), my votes (1), total votes (2), commit_lsn (80617), lsn (80617)
2025-04-13 15:14:21.308: [ OCRRAW][1]proprioo: my id set: (1862408427, 2024014047, 0, 0, 0)
2025-04-13 15:14:21.308: [ OCRRAW][1]proprioo: 1st set: (1862408427, 1028247821, 0, 0, 0)
2025-04-13 15:14:21.308: [ OCRRAW][1]proprioo: 2nd set: (1862408427, 2024014047, 0, 0, 0)
2025-04-13 15:14:21.332: [ OCRSRV][1]th_init: Successfully retrieved CSS misscount [31].
2025-04-13 15:14:21.332: [ OCRSRV][1]th_init: Successfully query CLSS mode [3].
2025-04-13 15:14:21.336: [ OCRSRV][1]th_init:1: FROM PUBDATA Node num [2]Remote Listening Port [0] Cache invalidation port [0]
2025-04-13 15:14:21.336: [ OCRSRV][1]th_init:1.1: FROM PUBDATA Node num [2]CLSC Private IP or GIPC connect string [gipcha<ipcrsdb2><b055-1450-1a83-3acf><9894-3b98-d74c-4465>]
[ OCRMAS][1]th_calc_av:5': Rturn persisted AV [186647040] [11.2.0.2.0]
2025-04-13 15:14:21.338: [ OCRSRV][1]th_init: Decided to use GIPC
2025-04-13 15:14:21.339: [GIPCHGEN][2057] gipchaInternalRegister: Initializing HA GIPC
2025-04-13 15:14:21.339: [GIPCHGEN][2057] gipchaNodeCreate: adding new node 111d02db0 { host '', haName '18b5-ba0a-c21b-d9e8', srcLuid 4aafeee2-00000000, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 4200507492, sentRegister 0, localMonitor 0, flags 0x1 }
2025-04-13 15:14:21.339: [GIPCHTHR][2314] gipchaWorkerThread: starting worker thread hctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 0, usrFlags 0x0, flags 0x0 }
2025-04-13 15:14:21.339: [GIPCHDEM][2571] gipchaDaemonThread: starting daemon thread hctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 0, usrFlags 0x0, flags 0x0 }
2025-04-13 15:14:21.425: [GIPCHGEN][2571] gipchaNodeAddInterface: adding interface information for inf 112063730 { host '', haName '18b5-ba0a-c21b-d9e8', local 0, ip '1.1.1.1', subnet '1.1.1.0', mask '255.255.255.0', mac '5c-f3-fc-ec-7e-6c', ifname 'en6', numRef 0, numFail 0, idxBoot 0, flags 0x1 }
2025-04-13 15:14:21.430: [GIPCHTHR][2314] gipchaWorkerUpdateInterface: created local interface for node 'ipcrsdb1', haName '18b5-ba0a-c21b-d9e8', inf '1.1.1.1:9930'
2025-04-13 15:14:21.628: [GIPCXCPT][2057] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'ipcrsdb1', port '6589-7b37-1f59-419d', hctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x1 }, ret gipcretKeyNotFound (36)
2025-04-13 15:14:21.629: [GIPCHGEN][2057] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 795]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x1 }, host 'ipcrsdb1', port '6589-7b37-1f59-419d', flags 0x0
2025-04-13 15:14:21.630: [ OCRMSG][2057]prom_listen: Port str [6589-7b37-1f59-419d]
2025-04-13 15:14:21.630: [ OCRSRV][2057]proath_listen: listening to remote requests at portstr [6589-7b37-1f59-419d]
2025-04-13 15:14:21.632: [GIPCXCPT][3085] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'ipcrsdb1', port '9a1e-23b3-a0c6-dd0a', hctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x1 }, ret gipcretKeyNotFound (36)
2025-04-13 15:14:21.632: [GIPCHGEN][3085] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 795]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x1 }, host 'ipcrsdb1', port '9a1e-23b3-a0c6-dd0a', flags 0x0
2025-04-13 15:14:21.633: [ OCRMSG][3085]prom_listen: Port str [9a1e-23b3-a0c6-dd0a]
2025-04-13 15:14:21.633: [ OCRSRV][3085]th_invalidate_cache: listening to cache_invalidation requests at portstr [9a1e-23b3-a0c6-dd0a]
[ OCRMAS][3342]th_calc_av:5': Rturn persisted AV [186647040] [11.2.0.2.0]
2025-04-13 15:14:21.643: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_ocrid=1066219116
2025-04-13 15:14:21.643: [ OCRMAS][3342]prom_dump_pubdata: begin dumping pubdatactx->prom_pubdata_prom_con
2025-04-13 15:14:21.643: [ OCRMAS][3342]prom_dump_con: promcon->cache_invalidation_port=0
2025-04-13 15:14:21.643: [ OCRMAS][3342]prom_dump_con: promcon->remote_listening_port=0
2025-04-13 15:14:21.643: [ OCRMAS][3342]prom_dump_con: promcon->local_listening_port=0
2025-04-13 15:14:21.643: [ OCRMAS][3342]prom_dump_pubdata: end dumping pubdatactx->prom_pubdata_prom_con
2025-04-13 15:14:21.643: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_software_version=186647040
2025-04-13 15:14:21.643: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_active_version=186647040
2025-04-13 15:14:21.643: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_priv_nodename=gipcha<ipcrsdb1><6589-7b37-1f59-419d><9a1e-23b3-a0c6-dd0a>
2025-04-13 15:14:21.655: [ OCRMAS][3342]rcfg_con:1: Member [1] joined. Inc [8].
2025-04-13 15:14:21.655: [ OCRMAS][3342]proath_master: Master changing. old master [-1] new master [2] num_rcfg [1].
2025-04-13 15:14:21.657: [ OCRSRV][3342]th_not_master_change: Master change callback not registered
2025-04-13 15:14:21.657: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_ocrid=1066219116
2025-04-13 15:14:21.657: [ OCRMAS][3342]prom_dump_pubdata: begin dumping pubdatactx->prom_pubdata_prom_con
2025-04-13 15:14:21.657: [ OCRMAS][3342]prom_dump_con: promcon->cache_invalidation_port=0
2025-04-13 15:14:21.657: [ OCRMAS][3342]prom_dump_con: promcon->remote_listening_port=0
2025-04-13 15:14:21.657: [ OCRMAS][3342]prom_dump_con: promcon->local_listening_port=0
2025-04-13 15:14:21.657: [ OCRMAS][3342]prom_dump_pubdata: end dumping pubdatactx->prom_pubdata_prom_con
2025-04-13 15:14:21.657: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_software_version=186647040
2025-04-13 15:14:21.657: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_active_version=186647040
2025-04-13 15:14:21.657: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_priv_nodename=gipcha<ipcrsdb2><b055-1450-1a83-3acf><9894-3b98-d74c-4465>
2025-04-13 15:14:21.657: [ OCRMAS][3342]th_connect_master: Using GIPC type to connect
2025-04-13 15:14:21.657: [ OCRMAS][3342]th_connect_master:10: Master host name [ipcrsdb2]
2025-04-13 15:14:21.657: [ OCRMAS][3342]proath_connect_master: Attempting to connect to master at address [ipcrsdb2:b055-1450-1a83-3acf]
2025-04-13 15:14:21.658: [GIPCXCPT][3342] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'ipcrsdb1', port '5144-3d10-a8ef-7ad5', hctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x1 }, ret gipcretKeyNotFound (36)
2025-04-13 15:14:21.658: [GIPCHGEN][3342] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 795]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x1 }, host 'ipcrsdb1', port '5144-3d10-a8ef-7ad5', flags 0x0
2025-04-13 15:14:21.663: [GIPCHGEN][2571] gipchaNodeCreate: adding new node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 4200507816, sentRegister 0, localMonitor 0, flags 0x0 }
2025-04-13 15:14:21.663: [GIPCHALO][2314] gipchaLowerSend: deffering startup of hdr 111f3e078 { len 232, seq 0, type gipchaHdrTypeSend (1), lastSeq 0, lastAck 0, minAck 0, flags 0x0, srcLuid 00000000-00000000, dstLuid 00000000-00000000, msgId 0 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x0 }
2025-04-13 15:14:21.663: [GIPCHALO][2314] gipchaLowerProcessNode: no valid interfaces found to node for 4200507816 ms, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x4 }
2025-04-13 15:14:21.674: [GIPCHGEN][2571] gipchaNodeAddInterface: adding interface information for inf 112063dd0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', local 0, ip '1.1.1.2:38535', subnet '1.1.1.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x2 }
2025-04-13 15:14:21.683: [GIPCHTHR][2314] gipchaWorkerUpdateInterface: created remote interface for node 'ipcrsdb2', haName '44ed-6357-b445-b29c', inf 'udp://1.1.1.2:38535'
2025-04-13 15:14:21.683: [GIPCHGEN][2314] gipchaWorkerAttachInterface: Interface attached inf 112063dd0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', local 112063730, ip '1.1.1.2:38535', subnet '1.1.1.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x6 }
2025-04-13 15:14:24.667: [GIPCHALO][2314] gipchaLowerProcessAcks: ESTABLISH finished for node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 1, lastAck 4, lastValidAck 0, sendSeq [3 : 3], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x20c }
2025-04-13 15:14:24.667: [GIPCHALO][2314] gipchaLowerProcessWaitQ: triggering deffered startup of msg 111f3e078 { len 232, seq 0, type gipchaHdrTypeSend (1), lastSeq 0, lastAck 0, minAck 0, flags 0x0, srcLuid 00000000-00000000, dstLuid 00000000-00000000, msgId 0 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 1, lastAck 4, lastValidAck 0, sendSeq [4 : 4], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x208 }
2025-04-13 15:14:24.690: [GIPCXCPT][3342] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'ipcrsdb1', port '1ecb-14df-9bc8-a079', hctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x1 }, ret gipcretKeyNotFound (36)
2025-04-13 15:14:24.690: [GIPCHGEN][3342] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 795]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x1 }, host 'ipcrsdb1', port '1ecb-14df-9bc8-a079', flags 0x0
2025-04-13 15:14:24.714: [ OCRMAS][3342]proath_master: SUCCESSFULLY CONNECTED TO THE MASTER
2025-04-13 15:14:24.714: [ OCRMAS][3342]th_master: NEW OCR MASTER IS 2
2025-04-13 15:14:24.716: [ OCRSRV][1]th_snap_local_spawn: Inside snap local spawn. host is [ipcrsdb1]
2025-04-13 15:14:24.722: [ CRSMAIN][1] Running mode check...
2025-04-13 15:14:24.722: [ CRSMAIN][1] Running as user: root
2025-04-13 15:14:24.722: [ CRSMAIN][1] CRSD running as the Privileged user
2025-04-13 15:14:24.723: [ CRSMAIN][1] Checking compatibility...
2025-04-13 15:14:24.723: [ CLSVER][1] Static Version 11.2.0.2.0
2025-04-13 15:14:50.516: [ CRSMAIN][515] Policy Engine is not initialized yet!
2025-04-13 15:14:54.739: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30005, msg 11208e6f8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 9 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:14:55.737: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30003, msg 11240b438 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 12 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
*******省略输出
2025-04-13 15:15:16.862: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30001, msg 1130298d8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 54 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:15:17.876: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30006, msg 11302c178 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 56 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:15:18.875: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30004, msg 11302ea18 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 58 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:15:19.885: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30007, msg 1130312b8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 60 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:15:20.516: [ CRSMAIN][515] Policy Engine is not initialized yet!
2025-04-13 15:15:20.891: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30003, msg 113033b58 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 62 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
*******省略输出
2025-04-13 15:15:49.134: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30008, msg 11301fed8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 118 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:15:50.139: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30002, msg 113022778 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 120 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:15:50.515: [ CRSMAIN][515] Policy Engine is not initialized yet!
2025-04-13 15:15:51.152: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30006, msg 113025018 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 122 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:15:52.160: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30010, msg 1130278b8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 124 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:15:53.173: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30009, msg 11302a158 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 126 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
*******省略输出
016-07-13 15:16:19.353: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30011, msg 11240b378 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 178 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:16:20.347: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30002, msg 11304da18 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 180 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:16:20.518: [ CRSMAIN][515] Policy Engine is not initialized yet!
2025-04-13 15:16:21.353: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30002, msg 1130174f8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 182 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:16:22.368: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30008, msg 113019d98 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 184 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:16:23.373: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30006, msg 11301c638 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 186 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:16:24.380: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30009, msg 11301eed8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 188 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }
2025-04-13 15:16:50.519: [ CRSMAIN][515] Policy Engine is not initialized yet!
2025-04-13 15:17:20.519: [ CRSMAIN][515] Policy Engine is not initialized yet!
从CRSD.LOG中可以发现一些异常的信息,在下一部分进行详解。
2.3 故障分析
通过对集群状态及集群日志的检查分析,并参考ORACLE官方文档相关信息,可以初步定位到集群启动到CRSD时出现异常;对可能导致集群启动CRSD异常的各种原因(如OCR无法访问、GPNP PROFILE异常、PID文件异常、OCR文件异常、私网Jumbo Frame、多播、主机特定版本BUG等),进行逐一排查,初步定位到是集群组件启动时网络通信方面异常引起此问题。
通过对网络方面排查,并与客户工程师进行沟通,了解到近期对RAC节点AIX主机层面的IPSEC进行了安全加固;最终经过测试发现是因为此原因导致RAC只有一个节点能够启动,在另一节点加入集群时会因通信异常导致此节点集群件无法启动。
按照ORACLE官方建议,同时停止两个节点的集群件;对主机层面的安全加固策略进行清理后,再启动两个RAC节点后恢复正常。
参考文档:
Grid Infrastructure 启动问题 (文档 ID 1623340.1)
Grid Infrastructure 启动的五大问题 (文档 ID 1526147.1)
AIX: After Installing PowewSC: CRSD on the 2nd node is not able to Start with "proath_master:100b: Polling, connect to master not complete" (文档 ID 2056429.1)中。
- 总结与建议
3.1 后续建议
1.建议关闭AIX主机上的IPSEC安全加固配置;同时对于其它AIX+RAC架构的环境,对安全加固相关配置进行检查。
2.现场沟通了解到当前的系统架构是AIX+两节点RAC的架构,未配置有DATAGUARD或者其它方式的容灾;建议参考ORACLE官方对于MAA架构(ORACLE Maximum Availability Architecture)的建议,通过配置单节点或两节点RAC 做DATAGURD来实现数据库层面的最高可用(或者通过第三方产品(OGG等)来提高数据库系统的可用性)。