当前位置: 首页 > news >正文

ORACLE数据库实例报错ORA-00470: LGWR process terminated with error宕机问题分析报告

  • 服务概述

10月21号03:22分,BOSS数据库实例发生异常宕机;工程师及时响应此问题并对此故障原因进行分析及相关建议,详细的故障情况及相关日志、TRACE文件的分析及总结、建议,请参阅本文档。

  • hzboss数据库实例宕机分析

4.1 数据库层面日志的分析

4.1.1故障数据库实例alert日志

数据库实例的Alert日志

对故障时间点附近的ALERT日志分析如下:

Wed Oct 21 03:06:22 2024
Thread 1 advanced to log sequence 254206 (LGWR switch)
  Current log# 9 seq# 254206 mem# 0: /boss_sysdata/oradata/dtvboss/redo09.log
Wed Oct 21 03:17:10 2024
WARNING: aiowait timed out 1 times --该时间段系统出现aio等待的告警,该告警由_iowait_timeouts隐含参数控制,超过该参数设置的值后会导致系统宕机,这里并未超过系统设置值。但是提示我们当时系统在io调度上出现了问题。
Wed Oct 21 03:22:11 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 367' --22分17秒系统抛错,进程367持有controlfile的enqueue lock时间被其他进程(arc0)等待超时,该enqueue lock持有时间由隐含参数_controlfile_enqueue_timeout控制,默认为900秒
Wed Oct 21 03:22:17 2024
System State dumped to trace file /oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc
Wed Oct 21 03:22:50 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_lgwr_365.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 367' --22分50秒,进程367持有的controlfile的enqueue lock时间被其他进程(lgwr)等待超时抛错
Wed Oct 21 03:22:55 2024
Killing enqueue blocker (pid=367) on resource CF-00000000-00000000 --LGWR进程发现异常查杀持有controlfile equeue lock的进程367
 by killing session 1648.1
Wed Oct 21 03:22:55 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_lgwr_365.trc:
ORA-07445: exception encountered: core dump [ksuklms()+672] [SIGSEGV] [Address not mapped to object] [0x000000062] [] []
Wed Oct 21 03:23:18 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_pmon_343.trc: --后续一系列报错宕机开始
ORA-00470: LGWR process terminated with error
Wed Oct 21 03:23:18 2024
PMON: terminating instance due to error 470
Wed Oct 21 03:23:18 2024
ORA-470 encountered when generating server alert SMG-3503
Wed Oct 21 03:23:18 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_j006_22335.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00470: LGWR process terminated with error
Wed Oct 21 03:23:22 2024
Termination issued to instance processes. Waiting for the processes to exit
Wed Oct 21 03:23:28 2024
Instance termination failed to kill one or more processes

4.1.2 故障时trace文件分析

故障时,节点数据库实例的alert日志中提到了如下TRACE文件:dtvboss_arc0_404.trc

分析如下:

/oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc

Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production

With the Partitioning, OLAP, Data Mining and Real Application Testing options

ORACLE_HOME = /oracle/product/10.2.0/db_1

System name: SunOS

Node name: hzboss-ora1

Release: 5.10

Version: Generic_144488-12

Machine: sun4u

Instance name: dtvboss

Redo thread mounted by this instance: 1

Oracle process number: 23

Unix process pid: 404, image: oracle@hzboss-ora1 (ARC0)

*** SERVICE NAME:(SYS$BACKGROUND) 2024-10-21 03:22:11.264

*** SESSION ID:(1646.3) 2024-10-21 03:22:11.264

*** 2024-10-21 03:22:11.264

Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=3 held=6)

Possible local blocker ospid=367 sid=1648 sser=1 time_held=37 secs (ges mode req=6 held=4)

DUMP LOCAL BLOCKER: initiate state dump for KILL BLOCKER

  possible owner[17.367] on resource CF-00000000-00000000

Dumping process info of pid[17.367] requested by pid[23.404]

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 367'

-------------------------------------------------------------------------------

ENQUEUE [CF] HELD FOR TOO LONG

 

enqueue holder: 'inst 1, osid 367'

 

Process 'inst 1, osid 367' is holding an enqueue for maximum allowed time.

The process will be terminated.

 

Oracle Support Services triaging information: to find the root-cause, look

at the call stack of process 'inst 1, osid 367' located below. Ask the

developer that owns the first NON-service layer in the stack to investigate.

Common service layers are enqueues (ksq), latches (ksl), library cache

pins and locks (kgl), and row cache locks (kqr).

 

Dumping process 17.367 info:

*** 2024-10-21 03:22:11.300

---------------------------------------------------------------

从这一段信息可以看到,此TRACE文件是ARC0进程的TRACE,ARC0进程的ORACLE SID是23,OS PID是404;

ARC0进程无法获取资源:CF-00000000-00000000 (controlfile enqueue 队列锁),此资源的持有者是pid[17.367]

ARC0进程的TRACE分析

PROCESS 23:

  ----------------------------------------

  SO: 8a9004f88, type: 2, owner: 0, flag: INIT/-/-/0x00

  (process) Oracle pid=23, calls cur/top: 8a95c6650/8a95c6650, flag: (2) SYSTEM

            int error: 0, call error: 0, sess error: 0, txn error 0

  (post info) last post received: 0 0 24

              last post received-location: ksasnd

              last process to post me: b66002018 1 6

              last post sent: 0 0 24

              last post sent-location: ksasnd

              last process posted by me: b66005770 1 2

    (latch info) wait_event=0 bits=0

    Process Group: DEFAULT, pseudo proc: 5eb3388f0

    O/S info: user: oracle, term: UNKNOWN, ospid: 404

    OSD pid info: Unix process pid: 404, image: oracle@hzboss-ora1 (ARC0)

Dump of memory from 0x00000005F033B1F0 to 0x00000005F033B3F8

5F033B1F0 0000000C 00000000 00000008 AF739C70  [.............s.p]

………………………………

5F033B3E0 00000005 000313A7 00000008 AF739C70  [.............s.p]

5F033B3F0 00000006 000313A7                    [........]        

    (FOB) flags=2 fib=8af713780 incno=0 pending i/o cnt=0

     fname=/boss_sysdata/oradata/dtvboss/control03.ctl

     fno=2 lblksz=16384 fsiz=1308

    (FOB) flags=2 fib=8af7133e0 incno=0 pending i/o cnt=0

     fname=/boss_sysdata/oradata/dtvboss/control02.ctl

     fno=1 lblksz=16384 fsiz=1308

    (FOB) flags=2 fib=8af713040 incno=0 pending i/o cnt=0

     fname=/boss_sysdata/oradata/dtvboss/control01.ctl

     fno=0 lblksz=16384 fsiz=1308

    ----------------------------------------

    SO: 8a9778500, type: 11, owner: 8a9004f88, flag: INIT/-/-/0x00

    (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 8a9004f88,

                       event: 21, last message event: 21,

                       last message waited event: 21, messages read: 0

                       channel: (5eb52cc20) scumnt mount lock

                                scope: 1, event: 22, last mesage event: 20,

                                publishers/subscribers: 0/20,

                                messages published: 1

    ----------------------------------------

    SO: 8a95b7328, type: 4, owner: 8a9004f88, flag: INIT/-/-/0x00

    (session) sid: 1646 trans: 0, creator: 8a9004f88, flag: (51) USR/- BSY/-/-/-/-/-

              DID: 0001-0017-0000000D, short-term DID: 0000-0000-00000000

              txn branch: 0

              oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS

    service name: SYS$BACKGROUND

    last wait for 'enq: CF - contention' blocking sess=0x0 seq=43322 wait_time=2929723 seconds since wait started=36

                name|mode=43460004, 0=0, operation=0

    Dumping Session Wait History

     for 'enq: CF - contention' count=1 wait_time=2929723

                name|mode=43460004, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929725

                name|mode=43460004, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929726

                name|mode=43460004, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929726

                name|mode=43460004, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929725

                name|mode=43460004, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929723

                name|mode=43460004, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929725

                name|mode=43460004, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929726

                name|mode=43460004, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929724

                name|mode=43460004, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929750

                name|mode=43460004, 0=0, operation=0

    temporary object counter: 0

      ----------------------------------------

      UOL used : 0 locks(used=0, free=0)

      KGX Atomic Operation Log 5eaffc848

       Mutex 0(0, 0) idn 0 oper NONE

       Library Cache uid 1646 efd 0 whr 0 slp 0

      KGX Atomic Operation Log 5eaffc890

       Mutex 0(0, 0) idn 0 oper NONE

       Library Cache uid 1646 efd 0 whr 0 slp 0

      KGX Atomic Operation Log 5eaffc8d8

       Mutex 0(0, 0) idn 0 oper NONE

       Library Cache uid 1646 efd 0 whr 0 slp 0

      ----------------------------------------

      SO: 8afaa4330, type: 41, owner: 8a95b7328, flag: INIT/-/-/0x00

      (dummy) nxc=0, nlb=0   

    ----------------------------------------

    SO: 8a97783d8, type: 11, owner: 8a9004f88, flag: INIT/-/-/0x00

    (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 8a9004f88,

                       event: 24, last message event: 27,

                       last message waited event: 27, messages read: 1

                       channel: (5eb528a48) system events broadcast channel

                                scope: 2, event: 2923059, last mesage event: 27,

                                publishers/subscribers: 0/421,

                                messages published: 1

    ----------------------------------------

    SO: 8a95c6650, type: 3, owner: 8a9004f88, flag: INIT/-/-/0x00

    (call) sess: cur 8a95b7328, rec 8a9591bc8, usr 8a95b7328; depth: 0

      ----------------------------------------

      SO: 8a95d1fb8, type: 5, owner: 8a95c6650, flag: INIT/-/-/0x00

      (enqueue) CF-00000000-00000000 DID: 0001-0017-0000000D

      lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2

      res: 0x5eb40f608, lock_flag: 0x0

      own: 0x8a95b7328, sess: 0x8a95b7328, prv: 0x8a95d1fe8

      ----------------------------------------

      SO: 8a9591bc8, type: 4, owner: 8a95c6650, flag: INIT/-/-/0x00

      (session) sid: 1618 trans: 0, creator: 0, flag: (2) -/REC -/-/-/-/-/-

                DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000

                txn branch: 0

                oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS

      temporary object counter: 0

    ----------------------------------------

    SO: 8a97ad460, type: 16, owner: 8a9004f88, flag: INIT/-/-/0x00

    (osp req holder)

此段信息可以看到,ARC0进程等待controlfile enqueue 队列锁。等待事件为enq: CF - contention

 

SO: 8a95d1fb8, type: 5, owner: 8a95c6650, flag: INIT/-/-/0x00

      (enqueue) CF-00000000-00000000 DID: 0001-0017-0000000D

      lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2

      res: 0x5eb40f608, lock_flag: 0x0

      own: 0x8a95b7328, sess: 0x8a95b7328, prv: 0x8a95d1fe8

这里说明arc0在申请的资源是:res:0x5eb40f608

而在process 17号的ckpt中可以看到0x5eb40f608资源被ckptmode: SSX方式持有持有

CKPT进程的TRACE分析

PROCESS 17:

  ----------------------------------------

  SO: 8a90037d0, type: 2, owner: 0, flag: INIT/-/-/0x00

  (process) Oracle pid=17, calls cur/top: 8a95c5df8/8a95c5df8, flag: (6) SYSTEM

            int error: 0, call error: 0, sess error: 0, txn error 0

  (post info) last post received: 0 0 114

              last post received-location: kcbbza

              last process to post me: 8a9001830 1 6

              last post sent: 0 0 9

              last post sent-location: ksqrcl

              last process posted by me: b66011d18 91 0

    (latch info) wait_event=0 bits=0

    Process Group: DEFAULT, pseudo proc: 5eb3388f0

    O/S info: user: oracle, term: UNKNOWN, ospid: 367

    OSD pid info: Unix process pid: 367, image: oracle@hzboss-ora1 (CKPT)

Dump of memory from 0x00000005EF308A18 to 0x00000005EF308C20

5EF308A10                   0000000C 00000000          [........]

…………………………………………

    ----------------------------------------

    SO: 8a95b9df8, type: 4, owner: 8a90037d0, flag: INIT/-/-/0x00

    (session) sid: 1648 trans: 0, creator: 8a90037d0, flag: (51) USR/- BSY/-/-/-/-/-

              DID: 0001-0011-00000005, short-term DID: 0000-0000-00000000

              txn branch: 0

              oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS

    service name: SYS$BACKGROUND

    waiting for 'direct path write' blocking sess=0x0 seq=39369 wait_time=0 seconds since wait started=37

                file number=72, first dba=1, block cnt=1

……………………………………

    ----------------------------------------

    SO: 8a9777a30, type: 11, owner: 8a90037d0, flag: INIT/-/-/0x00

    (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 8a90037d0,

                       event: 14, last message event: 27,

                       last message waited event: 27, messages read: 1

                       channel: (5eb528a48) system events broadcast channel

                                scope: 2, event: 2923059, last mesage event: 27,

                                publishers/subscribers: 0/421,

                                messages published: 1

    ----------------------------------------

    SO: 8a95c5df8, type: 3, owner: 8a90037d0, flag: INIT/-/-/0x00

    (call) sess: cur 8a95b9df8, rec 8a95b5dc0, usr 8a95b9df8; depth: 0

      ----------------------------------------

      SO: 8a95d1f20, type: 5, owner: 8a95c5df8, flag: INIT/-/-/0x00

      (enqueue) CF-00000000-00000004 DID: 0001-0011-00000005

      lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2

      res: 0x5eb507858, mode: S, lock_flag: 0x0

      own: 0x8a95b9df8, sess: 0x8a95b9df8, proc: 0x8a90037d0, prv: 0x5eb507868

      ----------------------------------------

      SO: 8a95d1cc0, type: 5, owner: 8a95c5df8, flag: INIT/-/-/0x00

      (enqueue) CF-00000000-00000000 DID: 0001-0011-00000005

      lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2

      res: 0x5eb40f608, mode: SSX, lock_flag: 0x0

      own: 0x8a95b9df8, sess: 0x8a95b9df8, proc: 0x8a90037d0, prv: 0x5eb40f618

      ----------------------------------------

      SO: 8a95b5dc0, type: 4, owner: 8a95c5df8, flag: INIT/-/-/0x00

      (session) sid: 1645 trans: 0, creator: 0, flag: (2) -/REC -/-/-/-/-/-

                DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000

                txn branch: 0

                oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS

      temporary object counter: 0

      ----------------------------------------

      SO: 8a95cb468, type: 5, owner: 8a95c5df8, flag: INIT/-/-/0x00

      (enqueue) RS-00000019-00000001 DID: 0001-0011-00000005

      lv: 20 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2

      res: 0x5eb40f8c8, mode: SS, lock_flag: 0x0

      own: 0x8a95b9df8, sess: 0x8a95b9df8, proc: 0x8a90037d0, prv: 0x5eb40f8d8

      ----------------------------------------

      SO: 8a95cb338, type: 5, owner: 8a95c5df8, flag: INIT/-/-/0x00

      (enqueue) CF-00000000-00000000 DID: 0000-0011-00000004

      lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2

      res: 0x5eb40f608, mode: SS, lock_flag: 0x0

      own: 0x8a95b9df8, sess: 0x8a95b9df8, proc: 0x8a90037d0, prv: 0x8a95d1cf0

    ----------------------------------------

    SO: 8a97ad2c8, type: 16, owner: 8a90037d0, flag: INIT/-/-/0x00

    (osp req holder)

此处可以看到CKPT进程持有(enqueue) CF-00000000-00000000,持有模式是:res: 0x5eb40f608, mode: SSX, lock_flag: 0x0,可以对应到ARC0进程请求的资源。

LGWR进程的TRACE分析

PROCESS 10:

  ----------------------------------------

  SO: b66002018, type: 2, owner: 0, flag: INIT/-/-/0x00

  (process) Oracle pid=10, calls cur/top: b665c6de0/b665c6de0, flag: (6) SYSTEM

            int error: 0, call error: 0, sess error: 0, txn error 0

  (post info) last post received: 0 0 24

              last post received-location: ksasnd

              last process to post me: 8a9000860 1 6

              last post sent: 109 0 4

              last post sent-location: kslpsr

              last process posted by me: 8a902ef90 105 0

    (latch info) wait_event=0 bits=0

    Process Group: DEFAULT, pseudo proc: 5eb3388f0

    O/S info: user: oracle, term: UNKNOWN, ospid: 365

    OSD pid info: Unix process pid: 365, image: oracle@hzboss-ora1 (LGWR)

Dump of memory from 0x00000005EF308810 to 0x00000005EF308A18

5EF308810 00000015 00000000 0000000B 6C1C1158  [............l..X]

  ………………

    ----------------------------------------

    SO: b665cc938, type: 5, owner: b66002018, flag: INIT/-/-/0x00

    (enqueue) RT-00000001-00000000 DID: 0000-000A-00000005

    lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2

    res: 0x5eb40f828, mode: X, lock_flag: 0x0

    own: 0xb665bf398, sess: 0xb665bf398, proc: 0xb66002018, prv: 0x5eb40f838

    (FOB) flags=2 fib=8af713780 incno=0 pending i/o cnt=0

     fname=/boss_sysdata/oradata/dtvboss/control03.ctl

     fno=2 lblksz=16384 fsiz=1308

    (FOB) flags=2 fib=8af7133e0 incno=0 pending i/o cnt=0

     fname=/boss_sysdata/oradata/dtvboss/control02.ctl

     fno=1 lblksz=16384 fsiz=1308

    (FOB) flags=2 fib=8af713040 incno=0 pending i/o cnt=0

     fname=/boss_sysdata/oradata/dtvboss/control01.ctl

     fno=0 lblksz=16384 fsiz=1308

    ----------------------------------------

    SO: b66778b58, type: 11, owner: b66002018, flag: INIT/-/-/0x00

    (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: b66002018,

                       event: 12, last message event: 20,

                       last message waited event: 20, messages read: 1

                       channel: (5eb52cc20) scumnt mount lock

                                scope: 1, event: 22, last mesage event: 20,

                                publishers/subscribers: 0/20,

                                messages published: 1

    ----------------------------------------

    SO: b665bf398, type: 4, owner: b66002018, flag: INIT/-/-/0x00

    (session) sid: 824 trans: 0, creator: b66002018, flag: (51) USR/- BSY/-/-/-/-/-

              DID: 0001-000A-00000006, short-term DID: 0000-0000-00000000

              txn branch: 0

              oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS

    service name: SYS$BACKGROUND

    waiting for 'enq: CF - contention' blocking sess=0x8a95b9df8 seq=61266 wait_time=0 seconds since wait started=0

                name|mode=43460005, 0=0, operation=0

    Dumping Session Wait History

     for 'enq: CF - contention' count=1 wait_time=2929719

                name|mode=43460005, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929716

                name|mode=43460005, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929716

                name|mode=43460005, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929718

                name|mode=43460005, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929719

                name|mode=43460005, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929719

                name|mode=43460005, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929713

                name|mode=43460005, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929715

                name|mode=43460005, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929722

                name|mode=43460005, 0=0, operation=0

     for 'enq: CF - contention' count=1 wait_time=2929736

                name|mode=43460005, 0=0, operation=0

    temporary object counter: 0

      ----------------------------------------

      UOL used : 0 locks(used=0, free=0)

      KGX Atomic Operation Log 5ebff7cd0

       Mutex 0(0, 0) idn 0 oper NONE

       Library Cache uid 824 efd 0 whr 0 slp 0

      KGX Atomic Operation Log 5ebff7d18

       Mutex 0(0, 0) idn 0 oper NONE

       Library Cache uid 824 efd 0 whr 0 slp 0

      KGX Atomic Operation Log 5ebff7d60

       Mutex 0(0, 0) idn 0 oper NONE

       Library Cache uid 824 efd 0 whr 0 slp 0

      ----------------------------------------

      SO: b6caa57e0, type: 41, owner: b665bf398, flag: INIT/-/-/0x00

      (dummy) nxc=0, nlb=0   

    ----------------------------------------

    SO: b66778a48, type: 11, owner: b66002018, flag: INIT/-/-/0x00

    (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: b66002018,

                       event: 13, last message event: 27,

                       last message waited event: 27, messages read: 1

                       channel: (5eb528a48) system events broadcast channel

                                scope: 2, event: 2923059, last mesage event: 27,

                                publishers/subscribers: 0/421,

                                messages published: 1

    ----------------------------------------

    SO: b665c6de0, type: 3, owner: b66002018, flag: INIT/-/-/0x00

    (call) sess: cur b665bf398, rec 0, usr b665bf398; depth: 0

      ----------------------------------------

      SO: b665cce90, type: 5, owner: b665c6de0, flag: INIT/-/-/0x00

      (enqueue) CF-00000000-00000000 DID: 0001-000A-00000006

      lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2

      res: 0x5eb40f608, req: SSX, lock_flag: 0x0

      own: 0xb665bf398, sess: 0xb665bf398, proc: 0xb66002018, prv: 0x5eb40f628

    ----------------------------------------

    SO: b667ae720, type: 16, owner: b66002018, flag: INIT/-/-/0x00

    (osp req holder)

从LGWR进程的TRACE中可以看到:waiting for 'enq: CF - contention' blocking sess=0x8a95b9df8 seq=61266 wait_time=0 seconds since wait started=0

说明LGWR进程也在等待controlfile enqueue 队列锁,等待事件为enq: CF  contention。

SO: b665cce90, type: 5, owner: b665c6de0, flag: INIT/-/-/0x00

      (enqueue) CF-00000000-00000000 DID: 0001-000A-00000006

      lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2

      res: 0x5eb40f608, req: SSX, lock_flag: 0x0

      own: 0xb665bf398, sess: 0xb665bf398, proc: 0xb66002018, prv: 0x5eb40f628

从此段可以看到,LGWR进程请求的CF锁资源是0x5eb40f608,并且是以SSX模式请求;通过CKPT进程的TRACE可以看到其已经own: 0x8a95b9df8 持有此资源res: 0x5eb40f608, mode: SSX, lock_flag: 0x0 SSX模式持有此锁;因此LGWR进程也处于等待此资源的状态。

4.2 主机负载情况分析

从ALERT日志中提到的故障发生时的TRACE文件:

/oracle/admin/dtvboss/bdump/dtvboss_pmon_343.trc

/oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc

中可以找到故障发生时的主机负载情况:

*** 2024-10-21 03:22:11.300

Dumping diagnostic information for ospid 367:

OS pid = 367

loadavg : 0.44 0.62 1.82

swap info: free_mem = 1435.35M rsv = 59461.24M

           alloc = 55941.14M avail = 55124.91 swap_free = 58645.01M

Dumping diagnostic information for oracle@hzboss-ora1:

OS pid = 24887

loadavg : 1.91 1.01 1.84

swap info: free_mem = 7661.14M rsv = 53520.42M

           alloc = 50905.74M avail = 62991.89 swap_free = 65606.57M

 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD

 0 S   oracle 24887     1   0  39 20        ? 4115427        ?   Sep 24 ?          29:24 oracledtvboss (LOCAL=NO)

pstack: cannot examine 24887: no such process

*** 2024-10-21 03:23:53.644

ksuitm_check: OS PID=8690 is still alive

*** 2024-10-21 03:23:53.644

Dumping diagnostic information for oracle@hzboss-ora1 (J002):

OS pid = 8690

loadavg : 1.93 1.02 1.84

swap info: free_mem = 7660.87M rsv = 53524.68M

           alloc = 50905.95M avail = 62987.56 swap_free = 65606.30M

 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD

 0 O   oracle  8690     1   2  39 20        ? 4121237          00:57:24 ?          29:43 ora_j002_dtvboss

pstack: cannot examine 8690: no such process

*** 2024-10-21 03:23:54.751

ksuitm_check: OS PID=21430 is still alive

*** 2024-10-21 03:23:54.751

Dumping diagnostic information for oracle@hzboss-ora1 (J003):

OS pid = 21430

loadavg : 1.94 1.02 1.84

swap info: free_mem = 7660.27M rsv = 53521.18M

           alloc = 50906.49M avail = 62991.00 swap_free = 65605.69M

 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD

 0 O   oracle 21430     1   2  39 20        ? 4119652          02:59:44 ?           0:27 ora_j003_dtvboss

pstack: cannot examine 21430: unanticipated system error

*** 2024-10-21 03:23:57.839

ksuitm_check: OS PID=367 is still alive

*** 2024-10-21 03:23:57.840

Dumping diagnostic information for oracle@hzboss-ora1 (CKPT):

OS pid = 367

loadavg : 1.98 1.04 1.84

swap info: free_mem = 7660.46M rsv = 53520.98M

           alloc = 50906.02M avail = 62990.88 swap_free = 65605.85M

 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD

 0 S   oracle   367     1   0  39 20        ? 4116160        ?   Mar 30 ?         498:14 ora_ckpt_dtvboss

pstack: cannot examine 367: no such process

*** 2024-10-21 03:23:58.926

从这些负载信息可以看到,整个系统当时的空闲内存和swap还留有余地,足够当时系统的使用,所以内存不足的原因可以排除。

4.3 数据库版本情况

Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production

With the Partitioning, OLAP, Data Mining and Real Application Testing options

ORACLE_HOME = /oracle/product/10.2.0/db_1

System name: SunOS

Node name: hzboss-ora1

Release: 5.10

Version: Generic_144488-12

Machine: sun4u

Instance name: dtvboss

Redo thread mounted by this instance: 1

Oracle process number: 2

COMP_ID    COMP_NAME                           VERSION         STATUS   MODIFIED                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       

---------- ----------------------------------- --------------- -------- -----------------------                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        

SDO        Spatial                             10.2.0.4.0      VALID    18/FEB/2011:21:05:30                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

ORDIM      Oracle.interMedia                   10.2.0.4.0      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

AMD        OLAP.Catalog                        10.2.0.4.0      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

EM         Oracle.Enterprise.Manager           10.2.0.4.0      VALID    18/FEB/2011:21:04:33                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

XDB        Oracle.XML.Database                 10.2.0.4.0      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

CONTEXT    Oracle.Text                         10.2.0.4.0      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

EXF        Oracle.Expression.Filter            10.2.0.4.0      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

RUL        Oracle.Rule.Manager                 10.2.0.4.0      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

OWM        Oracle.Workspace.Manager            10.2.0.4.3      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

ODM        Oracle.Data.Mining                  10.2.0.4.0      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

CATALOG    Oracle.Database.Catalog.Views       10.2.0.4.0      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

CATPROC    Oracle.Database.Packages.and.Types  10.2.0.4.0      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

JAVAVM     JServer.JAVA.Virtual.Machine        10.2.0.4.0      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

XML        Oracle.XDK                          10.2.0.4.0      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

CATJAVA    Oracle.Database.Java.Packages       10.2.0.4.0      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

APS        OLAP.Analytic.Workspace             10.2.0.4.0      VALID    18/FEB/2011:21:05:29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

XOQ        Oracle.OLAP.API                     10.2.0.4.0      VALID    18/FEB/2011:21:05:29

  • 总结与建议

针对此次数据库实例宕机问题,总结及建议如下:

5.1 问题分析汇总

5.1.1 宕机问题总结

综上可以得出结论,由于arc0进程要写日志需要等待控制文件CF锁,lgwr估计要写日志或者切换也在等待控制文件CF锁,而ckpt进程持有controlfile enqueue lock时间过长,arch0等待超过900秒,导致oracle数据库系统判断异常。

此时oracle数据库系统会让lgwr进程kill持有controlfile enqueue lock的进程,相当于lgwr进程杀了ckpt进程,而ckpt进程又为oracle关键进程,连锁反应,引起宕机。

5.1.2 宕机的问题根源判断

在宕机前系统发出告警aiowait timed out 1 times,初步判断系统在当时的io存在一些问题。而当时系统的内存和swap资源都充足没有出现严重问题,trace中的loadavg : 0.44 0.62 1.82则说明了当时的cpu负载低,这里排除了内存和cpu的问题. 

初步建议检测当时的redo日志的切换和产生量,当时的数据库服务器的io调用频繁度,因此可以参考下数据库的2点到3点间的awr信息(如果有数据库监控系统那么则可以分析到故障发生时间点的io信息,但是当前环境未部署)。

controlfile equeue lock的超时管理机制中,受_kill_controlfile_enqueue_blocker=false,_controlfile_enqueue_timeout,_kill_enqueue_blocker这3个隐含参数的影响,当前环境中我不建议增加_controlfile_enqueue_timeout,而是建议在通过控制_kill_controlfile_enqueue_blocker,_kill_enqueue_blocker参数来实现该机制的管理,以避免宕机的风险。


控制队列锁管理机制的bug以及避免:

控制文件队列锁有一个相关的匹配Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4',这里可以通过以下2个参数规避。

_kill_controlfile_enqueue_blocker值分为true和false,默认为true,如果设置为false则不会去kill超时的进程,此时_kill_controlfile_enqueue_blocker=false参数失效。

_kill_enqueue_blocker值分为别0,1,2,3. 当设置为1的时候,保证后台进程不会被kill,而只会kill前台进程,这样可以避免宕机。

5.2 建议

5.2.1在数据库层上的调整建议

经过沟通当时正在跑批,所以当时的日志切换会比较频繁,因此我们建议对此时的redo log的大小匹配处理进行调整,适当增加日志组以及日志文件大小。控制日志切换保持在20-30分钟一次。

同时建议使用数据库监控工具,以便能在故障时候能及时获取到故障前的监控数据。

在参数调整角度上我们可以将_kill_enqueue_blocker设置为1或者将_kill_controlfile_enqueue_blocker设置为false。

5.2.2在系统层上的调整建议

同时检查系统io是否存在问题,

同时建议部署io层面的监控工具。

相关文章:

  • Java 原生网络编程(BIO | NIO | Reactor 模式)
  • 下集:一条打包到底的静态部署之路
  • hadoop.proxyuser.代理用户.授信域 用来干什么的
  • jvm安全点(二)openjdk17 c++源码垃圾回收安全点信号函数处理线程阻塞
  • 免费代理IP服务有哪些隐患?如何安全使用?
  • 中国各省供水量、用水量相关数据(2003-2022年)-社科数据
  • (3)python爬虫--Xpath
  • Python爬虫之路(14)--playwright浏览器自动化
  • uniapp -- uCharts 仪表盘刻度显示 0.9999999 这样的值问题处理。
  • 卸载和安装JDK
  • 电商项目-品牌管理微服务开发
  • Jackson使用详解
  • 代码随想录算法训练营第四十二四十三天
  • 提示词工程框架:CoT、ToT、GoT、PoT( 链式提示)
  • 磁盘I/O子系统
  • Scrapy进阶实践指南:从脚本运行到分布式爬取
  • PyQt5基本窗口控件(QSlider(滑动条))
  • 深入解析:如何基于开源OpENer开发EtherNet/IP从站服务
  • 高频面试题(含笔试高频算法整理)基本总结回顾110
  • 使用Spring Boot和Spring Security构建安全的RESTful API
  • 嫩黑线货物列车脱轨致1名路外人员死亡,3人被采取刑事强制措施
  • 上海国际碳中和博览会下月举办,首次打造民营经济专区
  • 关税互降后的外贸企业:之前暂停的订单加紧发货,后续订单考验沟通谈判能力
  • 中国首艘海洋级智能科考船“同济”号试航成功,可搭载水下遥控机器人
  • 普京调整俄陆军高层人事任命
  • 董军在第六届联合国维和部长级会议上作大会发言