当前位置: 首页 > news >正文

AIX环境ORACLE RAC节点无法加入集群问题分析

  • 服务概述

故障现象:客户反馈7/13日发现RAC节点1主机内存使用异常,遂对主机进行重启。重启后发现RAC数据库节点1无法加入集群,节点2正常运行。

主机与数据库信息:

OS:AIX 6100-07-04-1216

数据库:11.2.0.2 两节点RAC。

工程师接到客户服务需求后,及时了解情况并到达客户现场进行故障排查、分析;通过对数据库RAC组件日志的分析、测试及主机层面的分析,最终定位到问题是因为主机层面安全加固产品配置变动,导致RAC集群重启后在加入集群时的节点通信方面出现异常,引起RAC节点无法加入集群的故障。

目前主机层面已经关闭了安全加固产品,RAC数据库已经恢复正常运行;本次故障的详情情况,请参阅本文档。

  • RAC GRID启动异常现象分析

2.1 集群件启动状态

$ crsctl stat res -t -init

--------------------------------------------------------------------------------

NAME           TARGET  STATE        SERVER                   STATE_DETAILS       

--------------------------------------------------------------------------------

Cluster Resources

--------------------------------------------------------------------------------

ora.asm

      1        ONLINE  ONLINE       ipcrsdb1                 Started             

ora.cluster_interconnect.haip

      1        ONLINE  ONLINE       ipcrsdb1                                     

ora.crsd

      1        ONLINE  INTERMEDIATE ipcrsdb1                                     

ora.cssd

      1        ONLINE  ONLINE       ipcrsdb1                                     

ora.cssdmonitor

      1        ONLINE  ONLINE       ipcrsdb1                                     

ora.ctssd

      1        ONLINE  ONLINE       ipcrsdb1                 OBSERVER            

ora.diskmon

      1        ONLINE  ONLINE       ipcrsdb1                                     

ora.drivers.acfs

      1        ONLINE  ONLINE       ipcrsdb1                                     

ora.evmd

      1        ONLINE  INTERMEDIATE ipcrsdb1                                     

ora.gipcd

      1        ONLINE  ONLINE       ipcrsdb1                                     

ora.gpnpd

      1        ONLINE  ONLINE       ipcrsdb1                                     

ora.mdnsd

      1        ONLINE  ONLINE       ipcrsdb1

# /oracle/app/11.2/grid/bin/crsctl stat res -t

CRS-4535: Cannot communicate with Cluster Ready Services

CRS-4000: Command Status failed, or completed with errors.

# ps -ef|grep has

    root 4915440 8323158   0 17:36:23  pts/0  0:00 grep has

    root 6488144       1  11 15:08:08      -  1:41 /oracle/app/11.2/grid/bin/ohasd.bin reboot

root 3080572       1   0 14:08:53      -  0:00 /bin/sh /etc/init.ohasd run

# ps -ef|grep crs

    root 7209074       1  13 15:14:19      -  1:18 /oracle/app/11.2/grid/bin/crsd.bin reboot

# ps -ef|grep css

    grid 4587534 6291598  23 15:08:26      -  1:58 /oracle/app/11.2/grid/bin/ocssd.bin

    root 4915446       1   0 17:37:44      -  0:00 /oracle/app/11.2/grid/bin/cssdmonitor

    root 6291598       1   0 15:08:26      -  0:00 /bin/sh /oracle/app/11.2/grid/bin/ocssd

    root 5177602       1   0 17:37:44      -  0:00 /oracle/app/11.2/grid/bin/cssdagent

# ps -ef|grep css

    grid 4587534 6291598  15 15:08:26      -  1:58 /oracle/app/11.2/grid/bin/ocssd.bin

    root 4915446       1   0 17:37:44      -  0:00 /oracle/app/11.2/grid/bin/cssdmonitor

    root 6291598       1   0 15:08:26      -  0:00 /bin/sh /oracle/app/11.2/grid/bin/ocssd

    root 4194590 8323158   1 17:38:51  pts/0  0:00 grep css

    root 5177602       1   0 17:37:44      -  0:00 /oracle/app/11.2/grid/bin/cssdagent

$ crsctl stat res -t

--------------------------------------------------------------------------------

NAME           TARGET  STATE        SERVER                   STATE_DETAILS       

--------------------------------------------------------------------------------

Local Resources

--------------------------------------------------------------------------------

ora.ARCH.dg

               ONLINE  ONLINE       ipcrsdb2                                     

ora.DATA.dg

               ONLINE  ONLINE       ipcrsdb2                                     

ora.LISTENER.lsnr

               ONLINE  ONLINE       ipcrsdb2                                     

ora.LISTENER1.lsnr

               ONLINE  ONLINE       ipcrsdb2                                     

ora.LISTENER2.lsnr

               ONLINE  ONLINE       ipcrsdb2                                     

ora.LISTENER_IPCSMGR1.lsnr

               ONLINE  ONLINE       ipcrsdb2                                     

ora.SYSDG.dg

               ONLINE  ONLINE       ipcrsdb2                                     

ora.asm

               ONLINE  ONLINE       ipcrsdb2                                     

ora.gsd

               OFFLINE OFFLINE      ipcrsdb2                                     

ora.net1.network

               ONLINE  ONLINE       ipcrsdb2                                     

ora.ons

               ONLINE  ONLINE       ipcrsdb2                                     

ora.registry.acfs

               ONLINE  ONLINE       ipcrsdb2                                     

--------------------------------------------------------------------------------

Cluster Resources

--------------------------------------------------------------------------------

ora.LISTENER_SCAN1.lsnr

      1        ONLINE  ONLINE       ipcrsdb2                                     

ora.cvu

      1        ONLINE  ONLINE       ipcrsdb2                                     

ora.ipcrsdb1.vip

      1        ONLINE  INTERMEDIATE ipcrsdb2                 FAILED OVER         

ora.ipcrsdb2.vip

      1        ONLINE  ONLINE       ipcrsdb2                                     

ora.ipcscat.db

      1        ONLINE  OFFLINE                                                   

ora.ipcsmgr.db

      1        ONLINE  OFFLINE                                                   

      2        ONLINE  ONLINE       ipcrsdb2                 Open                

ora.ipcsqy.db

      1        ONLINE  OFFLINE                                                   

      2        ONLINE  ONLINE       ipcrsdb2                 Open                

ora.oc4j

      1        ONLINE  ONLINE       ipcrsdb2                                     

ora.scan1.vip

      1        ONLINE  ONLINE       ipcrsdb2   

从这部分信息对比可以发现,节点1数据库集群组件启动异常,HAS可以启动,CRS无法启动;

2.2 集群件异常日志分析

结合集群启动的流程,对集群件相关日志进行分析,如下:

集群ALERT日志中的信息:

2025-04-13 15:08:09.040

[ohasd(6488144)]CRS-2112:The OLR service started on node ipcrsdb1.

2025-04-13 15:08:09.081

[ohasd(6488144)]CRS-1301:Oracle High Availability Service started on node ipcrsdb1.

2025-04-13 15:08:09.093

[ohasd(6488144)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred

2025-04-13 15:08:10.748

[/oracle/app/11.2/grid/bin/oraagent.bin(5570804)]CRS-5815:Agent '/oracle/app/11.2/grid/bin/oraagent_grid' could not find any base type entry points for type 'ora.daemon.type'. Details at (:CRSAGF00108:) {0:2:2} in /oracle/app/11.2/grid/log/ipcrsdb1/agent/ohasd/oraagent_grid/oraagent_grid.log.

2025-04-13 15:08:11.170

[/oracle/app/11.2/grid/bin/orarootagent.bin(5898386)]CRS-5016:Process "/oracle/app/11.2/grid/bin/acfsload" spawned by agent "/oracle/app/11.2/grid/bin/orarootagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2/grid/log/ipcrsdb1/agent/ohasd/orarootagent_root/orarootagent_root.log"

2025-04-13 15:08:21.930

[/oracle/app/11.2/grid/bin/oraagent.bin(4129174)]CRS-5815:Agent '/oracle/app/11.2/grid/bin/oraagent_grid' could not find any base type entry points for type 'ora.daemon.type'. Details at (:CRSAGF00108:) {0:5:2} in /oracle/app/11.2/grid/log/ipcrsdb1/agent/ohasd/oraagent_grid/oraagent_grid.log.

2025-04-13 15:08:24.446

[ohasd(6488144)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).

2025-04-13 15:08:24.467

[gpnpd(5701712)]CRS-2328:GPNPD started on node ipcrsdb1.

2025-04-13 15:08:27.848

[cssd(4587534)]CRS-1713:CSSD daemon is started in clustered mode

2025-04-13 15:10:00.121

[cssd(4587534)]CRS-1707:Lease acquisition for node ipcrsdb1 number 1 completed

2025-04-13 15:10:01.706

[cssd(4587534)]CRS-1605:CSSD voting file is online: /dev/rhdisk160; details in /oracle/app/11.2/grid/log/ipcrsdb1/cssd/ocssd.log.

2025-04-13 15:10:01.973

[cssd(4587534)]CRS-1605:CSSD voting file is online: /dev/rhdisk161; details in /oracle/app/11.2/grid/log/ipcrsdb1/cssd/ocssd.log.

2025-04-13 15:10:02.245

[cssd(4587534)]CRS-1605:CSSD voting file is online: /dev/rhdisk162; details in /oracle/app/11.2/grid/log/ipcrsdb1/cssd/ocssd.log.

2025-04-13 15:10:04.209

[cssd(4587534)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ipcrsdb1 ipcrsdb2 .

此处日志可以看到CSSD部分启动是正常的。

CRSD启动日志:

2025-04-13 15:14:20.505: [ default][1] CRS Daemon Starting

2025-04-13 15:14:20.510: [ default][515] Policy Engine is not initialized yet!

2025-04-13 15:14:20.515: [ default][1] ENV Logging level for Module: AGENT  1

2025-04-13 15:14:20.515: [ default][1] ENV Logging level for Module: AGFW  0

2025-04-13 15:14:20.515: [ default][1] ENV Logging level for Module: CLSFRAME  0

2025-04-13 15:14:20.515: [ default][1] ENV Logging level for Module: CLSVER  0

2025-04-13 15:14:20.515: [ default][1] ENV Logging level for Module: CLUCLS  0

2025-04-13 15:14:20.515: [ default][1] ENV Logging level for Module: CRSAPP  0

2025-04-13 15:14:20.516: [ default][1] ENV Logging level for Module: CRSCCL  0

2025-04-13 15:14:20.516: [ default][1] ENV Logging level for Module: CRSCEVT  0

2025-04-13 15:14:20.516: [ default][1] ENV Logging level for Module: CRSCOMM  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: CRSD  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: CRSEVT  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: CRSMAIN  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: CRSOCR  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: CRSPE  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: CRSPLACE  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: CRSRES  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: CRSRPT  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: CRSRTI  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: CRSSE  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: CRSSEC  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: CRSTIMER  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: CRSUI  0

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: OCRAPI  1

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: OCRASM  1

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: OCRCAC  1

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: OCRCLI  1

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: OCRMAS  1

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: OCRMSG  1

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: OCROSD  1

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: OCRRAW  1

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: OCRSRV  1

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: OCRUTL  1

2025-04-13 15:14:20.516: [    CRSD][1] ENV Logging level for Module: SuiteTes  1

2025-04-13 15:14:20.517: [    CRSD][1] ENV Logging level for Module: UiServer  0

2025-04-13 15:14:20.517: [ CRSMAIN][1] Checking the OCR device

2025-04-13 15:14:20.517: [ CRSMAIN][1] Sync-up with OCR

2025-04-13 15:14:20.517: [ CRSMAIN][1] Connecting to the CSS Daemon

2025-04-13 15:14:20.517: [ CRSMAIN][1] Getting local node number

2025-04-13 15:14:20.518: [ CRSMAIN][1] Initializing OCR

[   CLWAL][1]clsw_Initialize: OLR initlevel [70000]

2025-04-13 15:14:21.295: [  OCRRAW][1]proprioo: for disk 0 (+DATA), id match (1), total id sets, (2) need recover (0), my votes (1), total votes (2), commit_lsn (80617), lsn (80617)

2025-04-13 15:14:21.295: [  OCRRAW][1]proprioo: my id set: (1862408427, 2024014047, 0, 0, 0)

2025-04-13 15:14:21.295: [  OCRRAW][1]proprioo: 1st set: (1862408427, 1028247821, 0, 0, 0)

2025-04-13 15:14:21.295: [  OCRRAW][1]proprioo: 2nd set: (1862408427, 2024014047, 0, 0, 0)

2025-04-13 15:14:21.308: [  OCRRAW][1]proprioo: for disk 1 (+sysdg), id match (1), total id sets, (2) need recover (0), my votes (1), total votes (2), commit_lsn (80617), lsn (80617)

2025-04-13 15:14:21.308: [  OCRRAW][1]proprioo: my id set: (1862408427, 2024014047, 0, 0, 0)

2025-04-13 15:14:21.308: [  OCRRAW][1]proprioo: 1st set: (1862408427, 1028247821, 0, 0, 0)

2025-04-13 15:14:21.308: [  OCRRAW][1]proprioo: 2nd set: (1862408427, 2024014047, 0, 0, 0)

2025-04-13 15:14:21.332: [  OCRSRV][1]th_init: Successfully retrieved CSS misscount [31].

2025-04-13 15:14:21.332: [  OCRSRV][1]th_init: Successfully query CLSS mode [3].

2025-04-13 15:14:21.336: [  OCRSRV][1]th_init:1: FROM PUBDATA Node num [2]Remote Listening Port [0] Cache invalidation port [0]

2025-04-13 15:14:21.336: [  OCRSRV][1]th_init:1.1: FROM PUBDATA Node num [2]CLSC Private IP or GIPC connect string [gipcha<ipcrsdb2><b055-1450-1a83-3acf><9894-3b98-d74c-4465>]

[  OCRMAS][1]th_calc_av:5': Rturn persisted AV [186647040] [11.2.0.2.0]

2025-04-13 15:14:21.338: [  OCRSRV][1]th_init: Decided to use GIPC

2025-04-13 15:14:21.339: [GIPCHGEN][2057] gipchaInternalRegister: Initializing HA GIPC

2025-04-13 15:14:21.339: [GIPCHGEN][2057] gipchaNodeCreate: adding new node 111d02db0 { host '', haName '18b5-ba0a-c21b-d9e8', srcLuid 4aafeee2-00000000, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 4200507492, sentRegister 0, localMonitor 0, flags 0x1 }

2025-04-13 15:14:21.339: [GIPCHTHR][2314] gipchaWorkerThread: starting worker thread hctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 0, usrFlags 0x0, flags 0x0 }

2025-04-13 15:14:21.339: [GIPCHDEM][2571] gipchaDaemonThread: starting daemon thread hctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 0, usrFlags 0x0, flags 0x0 }

2025-04-13 15:14:21.425: [GIPCHGEN][2571] gipchaNodeAddInterface: adding interface information for inf 112063730 { host '', haName '18b5-ba0a-c21b-d9e8', local 0, ip '1.1.1.1', subnet '1.1.1.0', mask '255.255.255.0', mac '5c-f3-fc-ec-7e-6c', ifname 'en6', numRef 0, numFail 0, idxBoot 0, flags 0x1 }

2025-04-13 15:14:21.430: [GIPCHTHR][2314] gipchaWorkerUpdateInterface: created local interface for node 'ipcrsdb1', haName '18b5-ba0a-c21b-d9e8', inf '1.1.1.1:9930'

2025-04-13 15:14:21.628: [GIPCXCPT][2057] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'ipcrsdb1', port '6589-7b37-1f59-419d', hctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x1 }, ret gipcretKeyNotFound (36)

2025-04-13 15:14:21.629: [GIPCHGEN][2057] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 795]: EXCEPTION[ ret gipcretKeyNotFound (36) ]  failed to resolve ctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x1 }, host 'ipcrsdb1', port '6589-7b37-1f59-419d', flags 0x0

2025-04-13 15:14:21.630: [  OCRMSG][2057]prom_listen: Port str [6589-7b37-1f59-419d]

2025-04-13 15:14:21.630: [  OCRSRV][2057]proath_listen: listening to remote requests at portstr [6589-7b37-1f59-419d]

2025-04-13 15:14:21.632: [GIPCXCPT][3085] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'ipcrsdb1', port '9a1e-23b3-a0c6-dd0a', hctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x1 }, ret gipcretKeyNotFound (36)

2025-04-13 15:14:21.632: [GIPCHGEN][3085] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 795]: EXCEPTION[ ret gipcretKeyNotFound (36) ]  failed to resolve ctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x1 }, host 'ipcrsdb1', port '9a1e-23b3-a0c6-dd0a', flags 0x0

2025-04-13 15:14:21.633: [  OCRMSG][3085]prom_listen: Port str [9a1e-23b3-a0c6-dd0a]

2025-04-13 15:14:21.633: [  OCRSRV][3085]th_invalidate_cache: listening to cache_invalidation requests at portstr [9a1e-23b3-a0c6-dd0a]

[  OCRMAS][3342]th_calc_av:5': Rturn persisted AV [186647040] [11.2.0.2.0]

2025-04-13 15:14:21.643: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_ocrid=1066219116

2025-04-13 15:14:21.643: [  OCRMAS][3342]prom_dump_pubdata: begin dumping pubdatactx->prom_pubdata_prom_con

2025-04-13 15:14:21.643: [  OCRMAS][3342]prom_dump_con: promcon->cache_invalidation_port=0

2025-04-13 15:14:21.643: [  OCRMAS][3342]prom_dump_con: promcon->remote_listening_port=0

2025-04-13 15:14:21.643: [  OCRMAS][3342]prom_dump_con: promcon->local_listening_port=0

2025-04-13 15:14:21.643: [  OCRMAS][3342]prom_dump_pubdata: end dumping pubdatactx->prom_pubdata_prom_con

2025-04-13 15:14:21.643: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_software_version=186647040

2025-04-13 15:14:21.643: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_active_version=186647040

2025-04-13 15:14:21.643: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_priv_nodename=gipcha<ipcrsdb1><6589-7b37-1f59-419d><9a1e-23b3-a0c6-dd0a>

2025-04-13 15:14:21.655: [  OCRMAS][3342]rcfg_con:1: Member [1] joined. Inc [8].

2025-04-13 15:14:21.655: [  OCRMAS][3342]proath_master: Master changing. old master [-1] new master [2] num_rcfg [1].

2025-04-13 15:14:21.657: [  OCRSRV][3342]th_not_master_change: Master change callback not registered

2025-04-13 15:14:21.657: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_ocrid=1066219116

2025-04-13 15:14:21.657: [  OCRMAS][3342]prom_dump_pubdata: begin dumping pubdatactx->prom_pubdata_prom_con

2025-04-13 15:14:21.657: [  OCRMAS][3342]prom_dump_con: promcon->cache_invalidation_port=0

2025-04-13 15:14:21.657: [  OCRMAS][3342]prom_dump_con: promcon->remote_listening_port=0

2025-04-13 15:14:21.657: [  OCRMAS][3342]prom_dump_con: promcon->local_listening_port=0

2025-04-13 15:14:21.657: [  OCRMAS][3342]prom_dump_pubdata: end dumping pubdatactx->prom_pubdata_prom_con

2025-04-13 15:14:21.657: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_software_version=186647040

2025-04-13 15:14:21.657: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_active_version=186647040

2025-04-13 15:14:21.657: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_priv_nodename=gipcha<ipcrsdb2><b055-1450-1a83-3acf><9894-3b98-d74c-4465>

2025-04-13 15:14:21.657: [  OCRMAS][3342]th_connect_master: Using GIPC type to connect

2025-04-13 15:14:21.657: [  OCRMAS][3342]th_connect_master:10: Master host name [ipcrsdb2]

2025-04-13 15:14:21.657: [  OCRMAS][3342]proath_connect_master: Attempting to connect to master at address [ipcrsdb2:b055-1450-1a83-3acf]

2025-04-13 15:14:21.658: [GIPCXCPT][3342] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'ipcrsdb1', port '5144-3d10-a8ef-7ad5', hctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x1 }, ret gipcretKeyNotFound (36)

2025-04-13 15:14:21.658: [GIPCHGEN][3342] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 795]: EXCEPTION[ ret gipcretKeyNotFound (36) ]  failed to resolve ctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x1 }, host 'ipcrsdb1', port '5144-3d10-a8ef-7ad5', flags 0x0

2025-04-13 15:14:21.663: [GIPCHGEN][2571] gipchaNodeCreate: adding new node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 4200507816, sentRegister 0, localMonitor 0, flags 0x0 }

2025-04-13 15:14:21.663: [GIPCHALO][2314] gipchaLowerSend: deffering startup of hdr 111f3e078 { len 232, seq 0, type gipchaHdrTypeSend (1), lastSeq 0, lastAck 0, minAck 0, flags 0x0, srcLuid 00000000-00000000, dstLuid 00000000-00000000, msgId 0 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x0 }

2025-04-13 15:14:21.663: [GIPCHALO][2314] gipchaLowerProcessNode: no valid interfaces found to node for 4200507816 ms, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x4 }

2025-04-13 15:14:21.674: [GIPCHGEN][2571] gipchaNodeAddInterface: adding interface information for inf 112063dd0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', local 0, ip '1.1.1.2:38535', subnet '1.1.1.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x2 }

2025-04-13 15:14:21.683: [GIPCHTHR][2314] gipchaWorkerUpdateInterface: created remote interface for node 'ipcrsdb2', haName '44ed-6357-b445-b29c', inf 'udp://1.1.1.2:38535'

2025-04-13 15:14:21.683: [GIPCHGEN][2314] gipchaWorkerAttachInterface: Interface attached inf 112063dd0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', local 112063730, ip '1.1.1.2:38535', subnet '1.1.1.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x6 }

2025-04-13 15:14:24.667: [GIPCHALO][2314] gipchaLowerProcessAcks: ESTABLISH finished for node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 1, lastAck 4, lastValidAck 0, sendSeq [3 : 3], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x20c }

2025-04-13 15:14:24.667: [GIPCHALO][2314] gipchaLowerProcessWaitQ: triggering deffered startup of msg 111f3e078 { len 232, seq 0, type gipchaHdrTypeSend (1), lastSeq 0, lastAck 0, minAck 0, flags 0x0, srcLuid 00000000-00000000, dstLuid 00000000-00000000, msgId 0 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 1, lastAck 4, lastValidAck 0, sendSeq [4 : 4], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x208 }

2025-04-13 15:14:24.690: [GIPCXCPT][3342] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'ipcrsdb1', port '1ecb-14df-9bc8-a079', hctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x1 }, ret gipcretKeyNotFound (36)

2025-04-13 15:14:24.690: [GIPCHGEN][3342] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 795]: EXCEPTION[ ret gipcretKeyNotFound (36) ]  failed to resolve ctx 110eb3e30 [0000000000000010] { gipchaContext : host 'ipcrsdb1', name '18b5-ba0a-c21b-d9e8', luid '4aafeee2-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x1 }, host 'ipcrsdb1', port '1ecb-14df-9bc8-a079', flags 0x0

2025-04-13 15:14:24.714: [  OCRMAS][3342]proath_master: SUCCESSFULLY CONNECTED TO THE MASTER

2025-04-13 15:14:24.714: [  OCRMAS][3342]th_master: NEW OCR MASTER IS 2

2025-04-13 15:14:24.716: [  OCRSRV][1]th_snap_local_spawn: Inside snap local spawn. host is [ipcrsdb1]

2025-04-13 15:14:24.722: [ CRSMAIN][1] Running mode check...

2025-04-13 15:14:24.722: [ CRSMAIN][1] Running as user: root

2025-04-13 15:14:24.722: [ CRSMAIN][1] CRSD running as the Privileged user

2025-04-13 15:14:24.723: [ CRSMAIN][1] Checking compatibility...

2025-04-13 15:14:24.723: [  CLSVER][1] Static Version 11.2.0.2.0

2025-04-13 15:14:50.516: [ CRSMAIN][515] Policy Engine is not initialized yet!

2025-04-13 15:14:54.739: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30005, msg 11208e6f8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 9 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:14:55.737: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30003, msg 11240b438 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 12 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

*******省略输出

2025-04-13 15:15:16.862: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30001, msg 1130298d8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 54 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:15:17.876: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30006, msg 11302c178 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 56 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:15:18.875: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30004, msg 11302ea18 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 58 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:15:19.885: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30007, msg 1130312b8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 60 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:15:20.516: [ CRSMAIN][515] Policy Engine is not initialized yet!

2025-04-13 15:15:20.891: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30003, msg 113033b58 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 62 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

*******省略输出

2025-04-13 15:15:49.134: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30008, msg 11301fed8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 118 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:15:50.139: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30002, msg 113022778 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 120 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:15:50.515: [ CRSMAIN][515] Policy Engine is not initialized yet!

2025-04-13 15:15:51.152: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30006, msg 113025018 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 122 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:15:52.160: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30010, msg 1130278b8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 124 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:15:53.173: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30009, msg 11302a158 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 126 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

*******省略输出

016-07-13 15:16:19.353: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30011, msg 11240b378 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 178 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:16:20.347: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30002, msg 11304da18 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 180 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:16:20.518: [ CRSMAIN][515] Policy Engine is not initialized yet!

2025-04-13 15:16:21.353: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30002, msg 1130174f8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 182 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:16:22.368: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30008, msg 113019d98 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 184 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:16:23.373: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30006, msg 11301c638 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 186 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:16:24.380: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30009, msg 11301eed8 { len 1576, seq 6, type gipchaHdrTypeRecv (2), lastSeq 7, lastAck 4, minAck 5, flags 0x5, srcLuid 7abb2ced-203d3ac3, dstLuid 4aafeee2-08f74232, msgId 188 }, node 1120639f0 { host 'ipcrsdb2', haName '44ed-6357-b445-b29c', srcLuid 4aafeee2-08f74232, dstLuid 7abb2ced-203d3ac3 numInf 1, contigSeq 4, lastAck 7, lastValidAck 4, sendSeq [7 : 7], createTime 4200507816, sentRegister 1, localMonitor 0, flags 0x8 }

2025-04-13 15:16:50.519: [ CRSMAIN][515] Policy Engine is not initialized yet!

2025-04-13 15:17:20.519: [ CRSMAIN][515] Policy Engine is not initialized yet!

从CRSD.LOG中可以发现一些异常的信息,在下一部分进行详解。

2.3 故障分析

通过对集群状态及集群日志的检查分析,并参考ORACLE官方文档相关信息,可以初步定位到集群启动到CRSD时出现异常;对可能导致集群启动CRSD异常的各种原因(如OCR无法访问、GPNP PROFILE异常、PID文件异常、OCR文件异常、私网Jumbo Frame、多播、主机特定版本BUG等),进行逐一排查,初步定位到是集群组件启动时网络通信方面异常引起此问题。

通过对网络方面排查,并与客户工程师进行沟通,了解到近期对RAC节点AIX主机层面的IPSEC进行了安全加固;最终经过测试发现是因为此原因导致RAC只有一个节点能够启动,在另一节点加入集群时会因通信异常导致此节点集群件无法启动。

按照ORACLE官方建议,同时停止两个节点的集群件;对主机层面的安全加固策略进行清理后,再启动两个RAC节点后恢复正常。

参考文档:

Grid Infrastructure 启动问题 (文档 ID 1623340.1)

Grid Infrastructure 启动的五大问题 (文档 ID 1526147.1)

AIX: After Installing PowewSC: CRSD on the 2nd node is not able to Start with "proath_master:100b: Polling, connect to master not complete" (文档 ID 2056429.1)中。

  • 总结与建议

3.1 后续建议

1.建议关闭AIX主机上的IPSEC安全加固配置;同时对于其它AIX+RAC架构的环境,对安全加固相关配置进行检查。

2.现场沟通了解到当前的系统架构是AIX+两节点RAC的架构,未配置有DATAGUARD或者其它方式的容灾;建议参考ORACLE官方对于MAA架构(ORACLE Maximum Availability Architecture)的建议,通过配置单节点或两节点RAC 做DATAGURD来实现数据库层面的最高可用(或者通过第三方产品(OGG等)来提高数据库系统的可用性)。

相关文章:

  • Windows更新暂停七天关键注册表
  • centos中postfix的作用
  • 基于多目标进化算法的神经网络架构搜索及其高级可视化技术
  • Spring的Validation,这是一套基于注解的权限校验框架
  • 视网膜屏幕:重新定义数字显示的革命性技术
  • Windows系统Anaconda/Miniconda的安装、配置、基础使用、清理缓存空间和Pycharm/VSCode配置指南
  • 打破边界,智评未来:AI如何重塑学科交叉融合的评价体系?
  • QT-1.信号与槽
  • 入门OpenTelemetry——可观测性与链路追踪介绍
  • 【周输入】510周阅读推荐-2
  • [python] Python单例模式:__new__与线程安全解析
  • 通过迁移学习改进深度学习模型
  • 动态规划问题 -- 多状态模型(粉刷房子)
  • Milvus 全面解析
  • python 练习 五
  • 深入理解Java HotSpot中的即时编译
  • PADS入门笔记【一】
  • 【Python】对象生命周期全解析
  • Vue 3 打开 el-dialog 时使 el-input 获取焦点
  • 使用大语言模型从零构建知识图谱(上)
  • 由我国牵头制定,适老化数字经济国际标准发布
  • “80后”北京市东城区副区长王智勇获公示拟任区委常委
  • 牛市早报|中方调整对美加征关税措施,五部门约谈外卖平台企业
  • 独行侠以1.8%概率获得状元签,NBA原来真的有剧本?
  • 老镇老宅楼:破旧,没产证,要不要更新?
  • 李公明 | 一周画记:印巴交火会否升级为第四次印巴战争?