【GaussDB】深度解析:创建存储过程卡死且无法Kill会话的疑难排查
一、现象
环境与问题描述
内核版本
集中式 gaussdb (GaussDB Kernel 506.0.0.SPC0100 build e324981f) compiled at 2025-04-27 14:27:52 last mr 23420 release
安装方式
使用 TPOPS 下发,参数均保持默认
问题描述
一个文本文件,里面是一个 create package 的代码,有一万多行,有使用很多 PL/SQL 语法特性,package里面的 SQL 也很多,有的 SQL 有几百行。通过 gsql 运行这个文件,持续两天没有执行完。
由于已经定位到了问题,为方便本地复现演示,本文将用例简化,并且手动解压内核包安装了实例,调整参数和 TPOPS 下发的一致,模拟当时排查问题的过程。
复现问题的存储过程
存储过程代码:
createprocedure test_p_union is
cur sys_refcursor;
begin
open cur for
select1from dual
unionall
selectnullfrom(select a from(select a from(selectnull a from dual )));
end;/
在 gsql 中执行,会话被卡住,按 Ctrl+C 也无法取消。
gsql ((GaussDB Kernel 506.0.0.SPC0100 build e324981f) compiled at 2025-04-27 14:27:52 last mr 23420 release)
Non-SSL connection (SSL connectionis recommended when requiring high-security)
Type"help"for help.gaussdb=# create procedure test_p_union is
gaussdb$# cur sys_refcursor;
gaussdb$# begin
gaussdb$# open cur for
gaussdb$# select 1 from dual
gaussdb$# union all
gaussdb$# select null from
gaussdb$# (select a from
gaussdb$# (select a from
gaussdb$# (select null a from dual
gaussdb$# )
gaussdb$# )
gaussdb$# );
gaussdb$# end;
gaussdb$# /
^CCancel request sent
^CCancel request sent
^CCancel request sent
^CCancel request sent
二、诊断过程
1. 查看会话状态
在另一个连接中查看会话状态,发现会话处于 active 状态,并且没有被任何会话阻塞。
gaussdb=# select pid,sessionid,state,query from pg_stat_activity where state='active' and application_name='gsql';pid | sessionid | state | query
-----------------+-----------+--------+----------------------------------------------------------------------------------------------------------140567079548672 | 7 | active | select pid,sessionid,state,queryfrom pg_stat_activity where state='active'and application_name='gsql';140567529387776 | 2 | active | createprocedure test_p_union is +| | | cur sys_refcursor; +| | | begin +| | | open cur for +| | | select1from dual +| | | unionall +| | | selectnullfrom +| | | (select a from +| | | (select a from +| | | (selectnull a from dual +| | | ) +| | | ) +| | | ); +| | | end; +| | | +| | |
(2 rows)gaussdb=# select * from pg_thread_wait_status where tid=140567529387776 and sessionid=2;node_name | db_name | thread_name | query_id | tid | sessionid | lwtid | psessionid | tlevel | smpid | wait_status | wait_event | locktag |
lockmode | block_sessionid | global_sessionid
-----------+----------+-------------+------------------+-----------------+-----------+---------+------------+--------+-------+-------------+------------+---------+-
---------+-----------------+------------------primary | postgres | gsql | 5629499534213636 | 140567529387776 | 2 | 2014538 | | 0 | 0 | none | none | || | 0:0#0
2. 尝试 Kill 会话
尝试使用 pg_terminate_session、pg_terminate_backend 和 alter system kill session 等多种方式 Kill 会话,均告失败。会话依然存在。
gaussdb=# select pg_terminate_session(140567529387776,2);
WARNING: tid: 140567529387776, sid: 2 still exists and beginto send sigusr1 tocloseclient socket
CONTEXT: referencedcolumn: pg_terminate_sessionpg_terminate_session
----------------------t
(1row)gaussdb=# select pg_terminate_backend(140567529387776);
WARNING: tid 140567529387776, sid2 still existsandbeginto send sigusr1 tocloseclient socket
CONTEXT: referencedcolumn: pg_terminate_backend
WARNING: tid 140567529387776and sessionid 2donotmatchwith valid active session
CONTEXT: referencedcolumn: pg_terminate_backendpg_terminate_backend
----------------------t
(1row)gaussdb=# alter system kill session '140567529387776,0' immediate;
WARNING: tid 140567529387776, sid2 still existsandbeginto send sigusr1 tocloseclient socket
CONTEXT: referencedcolumn: pg_terminate_backend
WARNING: tid 140567529387776and sessionid 2donotmatchwith valid active session
CONTEXT: referencedcolumn: pg_terminate_backendpg_terminate_backend
----------------------t
(1row)
gaussdb=# select pid,sessionid,state,query from pg_stat_activity where state='active' and application_name='gsql';pid | sessionid | state | query
-----------------+-----------+--------+----------------------------------------------------------------------------------------------------------
140567044945664 | 7 | active | select pid,sessionid,state,queryfrom pg_stat_activity where state='active'and application_name='gsql';140567529387776 | 2 | active | createprocedure test_p_union is +| | | cur sys_refcursor; +| | | begin +| | | open cur for +| | | select1from dual +| | | unionall +| | | selectnullfrom +| | | (select a from +| | | (select a from +| | | (selectnull a from dual +| | | ) +| | | ) +| | | ); +| | | end; +| | | +| | |
(2 rows)
3. 关联数据库会话与操作系统线程
通过查询 pg_catalog.pg_os_threads 视图,确认了高 CPU 占用的线程 2014538 正是执行卡死 SQL 的会话所对应的操作系统线程。
gaussdb=# select * from pg_catalog.pg_os_threads where pid=140567529387776;node_name | pid | lwpid | thread_name | creation_time | dbid
-----------+-----------------+---------+-------------+-------------------------------+-------primary | 140567529387776 | 2014538 | gsql | 2025-06-25 10:20:13.740054+08 | 13130
(1 row)
4. 查看日志
查看数据库日志,没有发现与此问题直接相关的错误或警告信息。
2025-06-2510:35:02.544 primary gaussdb506 postgres localhost
1405658694919680[0:0#0] 0 gs_perf 000000 [BACKEND] LOG: 00000: Auto
perf record failed. Send comman d failed. 2025-06-2510:35:02.544
primary gaussdb506 postgres localhost 1405658694919680[0:0#0] 0
gs_perf 000000 [BACKEND] LOCATION: perf_main, instr_perf_main.cpp:343
2025-06-2510:35:02.955 primary [unknown] [unknown] localhost
1405640916313600[0:0#0] 0 [unknown] 000000 [BACKEND] LOG: 00000:
Start autovacuum on database “postgres” 2025-06-2510:35:02.955 primary
[unknown] [unknown] localhost 1405640916313600[0:0#0] 0 [unknown]
000000 [BACKEND] LOCATION: autovac_worker_main, autovacuum.cpp: 1759
2025-06-2510:35:03.008 primary gaussdb506 postgres localhost
1405640916313600[0:0#0] 0 [unknown] 220230 [BACKEND] LOG: 22023: Spc
open failed. spc_node is: 1663, db_node is 13130
2025-06-2510:35:03.008 primary gaussdb506 postgres localhost
1405640916313600[0:0#0] 0 [unknown] 220230 [BACKEND] CAUSE: segment
is not initialized. 2025-06-2510:35:03.008 primary gaussdb506 postgres
localhost 1405640916313600[0:0#0] 0 [unknown] 220230 [BACKEND] ACTION:
please check segment space status. 2025-06-2510:35:03.008 primary
gaussdb506 postgres localhost 1405640916313600[0:0#0] 0 [unknown]
220230 [BACKEND] LOCATION: gs_local_free_remain_segment_auto, p
gstatfuncs_segment.cpp:3948 2025-06-2510:35:09.601 primary [unknown]
[unknown] localhost 1405663286371840[0:0#0] 0 [unknown] 000000 [UNDO]
LOG: 00000: [recycle_main:632]update global_recycle_xid:
oldest_xmin=856289, recycle_xmin=856273, globalFrozenXid=856289,
global_recycle_xid=856268, newRecycleXid=856270.
2025-06-2510:35:09.601 primary [unknown] [unknown] localhost
1405663286371840[0:0#0] 0 [unknown] 000000 [UNDO] LOCATION:
recycle_main, storage_undo_recycle.cpp: 632 2025-06-2510:35:09.601
primary [unknown] [unknown] localhost 1405663286371840[0:0#0] 0
[unknown] 000000 [UNDO] LOG: 00000: [recycle_main:648]update
old_global_max_recycle_xid: oldest_xmin=856289, recycle_xmin=856273,
globalFrozenXid=856289, old_global_max_recycle_xid=856267,
new_global_max_recycle_xid=856270. 2025-06-2510:35:09.601 primary
[unknown] [unknown] localhost 1405663286371840[0:0#0] 0 [unknown]
000000 [UNDO] LOCATION: recycle_main, storage_undo_recycle.cpp: 648
2025-06-2510:35:09.618 primary [unknown] [unknown] localhost
1405663286371840[0:0#0] 0 [unknown] 000000 [UNDO] LOG: 00000:
[recycle_main:632]update global_recycle_xid: oldest_xmin=856289,
recycle_xmin=856273, globalFrozenXid=856289,
global_recycle_xid=856270, newRecycleXid=856273.
2025-06-2510:35:09.618 primary [unknown] [unknown] localhost
1405663286371840[0:0#0] 0 [unknown] 000000 [UNDO] LOCATION:
recycle_main, storage_undo_recycle.cpp: 632 2025-06-2510:35:09.618
primary [unknown] [unknown] localhost 1405663286371840[0:0#0] 0
[unknown] 000000 [UNDO] LOG: 00000: [recycle_main:648]update
old_global_max_recycle_xid: oldest_xmin=856289, recycle_xmin=856273,
globalFrozenXid=856289, old_global_max_recycle_xid=856270,
new_global_max_recycle_xid=856272. 2025-06-2510:35:09.618 primary
[unknown] [unknown] localhost 1405663286371840[0:0#0] 0 [unknown]
000000 [UNDO] LOCATION: recycle_main, storage_undo_recycle.cpp: 648
2025-06-2510:35:12.626 primary [unknown] [unknown] localhost
1405663286371840[0:0#0] 0 [unknown] 000000 [UNDO] LOG: 00000:
[recycle_main:632]update global_recycle_xid: oldest_xmin=856289,
recycle_xmin=856274, globalFrozenXid=856289,
global_recycle_xid=856273, newRecycleXid=856274.
2025-06-2510:35:12.626 primary [unknown] [unknown] localhost
1405663286371840[0:0#0] 0 [unknown] 000000 [UNDO] LOCATION:
recycle_main, storage_undo_recycle.cpp:632 2025-06-2510:35:12.626
primary [unknown] [unknown] localhost 1405663286371840[0:0#0] 0
[unknown] 000000 [UNDO] LOG: 00000: [recycle_main:648]update
old_global_m ax_recycle_xid: oldest_xmin=856289, recycle_xmin=856274,
globalFrozenXid=856289, old_global_max_recycle_xid=856272,
new_global_max_recycle_xid=856273. 2025-06-2510:35:12.626 primary
[unknown] [unknown] localhost 1405663286371840[0:0#0] 0 [unknown]
000000 [UNDO] LOCATION: recycle_main, storage_undo_recycle.cpp:648
2025-06-2510:35:12.743 primary [unknown] [unknown] localhost
1405639049848320[0:0#0] 0 [unknown] 000000 [BACKEND] LOG: 00000:
Start autovacuum on database "testdb"2025-06-2510:35:12.743 primary
[unknown] [unknown] localhost 1405639049848320[0:0#0] 0 [unknown]
000000 [BACKEND] LOCATION: autovac_worker_main, autovacuum.cpp:1759
2025-06-2510:35:12.772 primary gaussdb506 testdb localhost
1405639049848320[0:0#0] 0 [unknown] 220230 [BACKEND] LOG: 22023: Spc
open failed. spc_node is: 1663, db_node is 16732
2025-06-2510:35:12.772 primary gaussdb506 testdb localhost
1405639049848320[0:0#0] 0 [unknown] 220230 [BACKEND] CAUSE: segment
is not initialized. 2025-06-2510:35:12.772 primary gaussdb506 testdb
localhost 1405639049848320[0:0#0] 0 [unknown] 220230 [BACKEND] ACTION:
please check segment space status. 2025-06-2510:35:12.772 primary
gaussdb506 testdb localhost 1405639049848320[0:0#0] 0 [unknown] 220230
[BACKEND] LOCATION: gs_local_free_remain_segment_auto,
pgstatfuncs_segment.cpp:3948
LOG: 22023: Spc open failed. spc_node is: 1663, db_node is 16732是段页式的定时检查所产生的日志,就算没有使用段页式的表也会持续打印这个日志,与本文排查的问题无关。
三、GDB 深度调试
此时只能通过跟踪内核堆栈来进行定位(生产环境慎用 GDB,误操作容易宕机)。
1. 准备 GDB 调试环境
获取符号表
从官方发布包 DBS-GaussDB-Kylin-Kernel_25.1.30.10.968045292423872.tar.gz 中开始逐层解压,找到对应平台架构的 GaussDB-Kernel_506.0.0.SPC0100_Symbol_X86_Centralized.tar.gz解压,再解压里面的 GaussDB-Kernel_506.0.0.SPC0100_Kylin_64bit_Symbol.tar,再解压得到 symbols 目录,并将其下的 bin 和 lib 目录复制到 $GAUSSHOME 目录下。
获取进程号
使用 ps 命令找到 gaussdb 主进程的 PID。
> [gaussdb506@ky10-sp3 ~]$ ps -ef|grep gaussdb root 1997065 1992968
> 0 09:46 pts/0 00:00:00 su - gaussdb506 gaussdb+ 1997070 1997065 0
> 09:46 pts/0 00:00:00 -bash root 2001709 2001576 0 09:55 pts/1
> 00:00:00 su - gaussdb506 gaussdb+ 2001716 2001709 0 09:55 pts/1
> 00:00:00 -bash root 2008121 2007968 0 10:07 pts/2 00:00:00 su
> - gaussdb506 gaussdb+ 2008126 2008121 0 10:07 pts/2 00:00:00 -bash gaussdb+ 2014479 1 99 10:19 pts/0 00:23:53
> /data/gaussdb506/app/bin/gaussdb gaussdb+ 2014886 2001716 0 10:20
> pts/1 00:00:00 gsql -r gaussdb+ 2022930 2008126 0 10:35 pts/2
> 00:00:00 vi gaussdb-current.log gaussdb+ 2026667 1997070 0 10:42
> pts/0 00:00:00 ps -ef gaussdb+ 2026668 1997070 0 10:42 pts/0
> 00:00:00 grep gaussdb [gaussdb506@ky10-sp3 ~]$
2. 附加 GDB 到进程
[gaussdb506@ky10-sp3 ~]$ gdb -p 2014479
GDB 启动时可能会出现一些 warning,例如 libthread_db 相关的警告,这些通常不影响本次的问题定位。
为了方便调试,可以执行 set pagination off 关闭分页。 为了避免 GDB 被某些信号中断,可以执行以下命令: handle
SIGUSR1 nostop noprint handle SIGUSR2 nostop noprint handle SIGPIPE
nostop 注意:当 GDB 附加到进程后,整个数据库服务都会被暂停,直到 GDB 分离。
3. 分析问题线程堆栈
切换到问题线程 (thread 12),并使用 bt (backtrace) 命令打印其调用堆栈。
(gdb) t 12 [Switching to thread 12 (LWP 2014538)]
#00x000055a54969f631in list_nth_cell(list=0x7fd8604e8ce0, n=0) at /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/nodes/list.cpp:654
654
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/nodes/list.cpp:
No such file or directory. (gdb) bt
#0 0x000055a54969f631 in list_nth_cell(list=0x7fd8604e8ce0, n=0) at /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/nodes/list.cpp:654
#1 0x000055a54969f8c6 in list_nth(list=0x7fd8604e8ce0, n=0) at /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/nodes/list.cpp:693
#2 0x000055a5497b150e in semtc_check_resolve_unknowns_in_unionop(pstate=0x7fd8604e64d8,
node=) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp:359
#3 0x000055a5497b2f5d in semtc_resolve_target_list_unknowns(pstate=0x7fd8604e64d8,
targetlist=) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp:380
#4 0x000055a549748dcd in lexical_select_stmt(pstate=0x7fd8604e64d8, stmt=0x7fd8604e4c80, is_first_node=,
is_create_view=false) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/analyze.cpp:4496
#5 0x000055a54974ecda in lexical_stmt(pstate=0x7fd8604e64d8, parse_tree=0x7fd8604e4c80, is_first_node=true, is_create_view=false)
at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/analyze.cpp:746
#6 0x000055a549750ae3 in lexical_parse_sub_analyze(parse_tree=0x7fd8604e4c80,
parent_parse_state=0x7fd8604e5d70, parent_cte=0x0,
locked_from_parent=, resolve_unknowns=true) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/analyze.cpp:342
#7 0x000055a549761dd4 in semtc_range_subselect(r=0x7fd8604e4da0, pstate=0x7fd8604e5d70) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_clause.cpp:901
#8 semtc_from_clauseItem(pstate=0x7fd8604e5d70, n=0x7fd8604e4da0, top_rte=0x7fd86da416e0, top_rti=0x7fd86da416dc, right_rte=0x0,
right_rti=0x0, relnamespace=0x7fd86da416e8, is_first_node=true,
is_create_view=false, is_merge_into=false, add_update_table=false) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_clause.cpp:1766
#9 0x000055a54976489f in semtc_from_clause(pstate=0x7fd8604e5d70, frm_list=, is_first_node=true, is_create_view=false,
add_update_table=false) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_clause.cpp:178
#10 0x000055a5497488ff in lexical_select_stmt(pstate=0x7fd8604e5d70, stmt=0x7fd8604e4ef0, is_first_node=,
is_create_view=false) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/analyze.cpp:4388
#11 0x000055a54974ecda in lexical_stmt(pstate=0x7fd8604e5d70, parse_tree=0x7fd8604e4ef0, is_first_node=true, is_create_view=false)
at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/analyze.cpp:746
#12 0x000055a549750ae3 in lexical_parse_sub_analyze(parse_tree=0x7fd8604e4ef0,
parent_parse_state=0x7fd85ed461b0, parent_cte=0x0,
locked_from_parent=false, resolve_unknowns=false) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/analyze.cpp:342
#13 0x000055a549750c84 in lexical_set_operation_tree(pstate=0x7fd85ed461b0, stmt=0x7fd8604e4ef0,
is_top_level=false, targetlist=0x7fd86da41b50) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/analyze.cpp:5269
#14 0x000055a549750f69 in lexical_set_operation_tree(pstate=0x7fd85ed461b0, stmt=0x7fd8604e5010,
is_top_level=true, targetlist=0x0) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/analyze.cpp:5357
#15 0x000055a54974f93d in lexical_set_operation_stmt(stmt=0x7fd8604e5010, pstate=0x7fd85ed461b0)
at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/analyze.cpp:4908
#16 lexical_stmt(pstate=0x7fd85ed461b0, parse_tree=0x7fd8604e5010, is_first_node=true, is_create_view=false) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/analyze.cpp:748
#17 0x000055a549752453 in lexical_top_level_stmt(pstate=0x7fd85ed461b0, parse_tree=, is_first_node=true, is_create_view=false) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/analyze.cpp:490
#18 0x000055a54a9e6c4e in pg_analyze_and_rewrite_params(parsetree=0x7fd8604e5010,
query_string=0x7fd84e9b60a8"select 1 from dual\nunion all \nselect
null from \n (select a from \n (select a from \n (select null a
from dual \n )))“, parserSetup=0x55a54c221710
<gsplsql_parser_setup(ParseState*, GSPLSQLExpr*)>,
parserSetupArg=0x7fd8602e84d8, is_parameterized=false,
m_param_info=0x0) at
/usr1/GaussDBKernel/server/opengauss/src/auxiliary/proc/tcop/postgres.cpp:1323
#19 0x000055a54c229bf5 in get_cursor_tuple_desc(expr=0x7fd8602e84d8, isOnlySelect=false, isOnlyParse=true, forceCompile=false) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/pl/plpgsql/src/pl_comp.cpp:3185
#20 0x000055a54c21a26c in gsplsql_yyparse() at gram.y:5911
#21 0x000055a54a03a1e9 in gsplsql_yyparse_save_restore(parent_compile_cxt=0x0) at
/usr1/GaussDBKernel/server/opengauss/src/gausskernel/pl/plsql/pl_comp/pl_comp_yylex.cpp:1029
#22 0x000055a54a00f6eb in do_compile(fcinfo=0x7fd86da45d40, proc_tup=0x7fd8602490e0, func=0x7fd8604be050,
compile_func_head_info=0x7fd86da465d0, for_validator=true,
hashkey=0x7fd86da459d0) at
/usr1/GaussDBKernel/server/opengauss/src/gausskernel/pl/plsql/pl_comp/pl_comp_func_main.cpp:1529
#23 0x000055a54a014f3b in gsplsql_compile(fcinfo=0x7fd86da45d40, compile_func_head_info=0x7fd86da465d0, for_validator=true,
isRecompile=false, func_runtime_state=0x7fd86da45b98) at
/usr1/GaussDBKernel/server/opengauss/src/gausskernel/pl/plsql/pl_comp/pl_comp_func_main.cpp:3106
#24 0x000055a54c243eeb in plpgsql_validator(fcinfo=) at /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/pl/plpgsql/src/pl_handler.cpp:1481
#25 0x000055a549e846cb in OidFunctionCall4Coll(function_id=10790, collation=0, arg1=64306, arg2=0, arg3=0, arg4=140567529154000,
is_null=0x0) at
/usr1/GaussDBKernel/server/opengauss/src/auxiliary/share/fmgr/fmgr.cpp:2512
#26 0x000055a5494e9c26 in ProcedureCreate(procedureName=0x7fd85ed28b88"test_p_union”,
procNamespace=2200, propackageid=0, isOraStyle=,
replace=false, returnsSet=, returnType=2278,
proOwner=10, languageObjectId=10791, languageValidator=10790,
prosrc=0x7fd85ed29d10" DECLARE \ncur sys_refcursor;\nbegin\nopen cur
for\nselect 1 from dual\nunion all \nselect null from \n (select a
from \n (select a from \n (select null a from dual \n )\n
)\n );\n end", probin=, isAgg=false,
isWindowFunc=false, security_definer=true, isLeakProof=false,
isStrict=false, volatility=118’v’, parameterTypes=0x7fd85ed46170,
paramTypDependExt=0x0, retTypDependExt=0x7fd7a190e050,
allParameterTypes=0, parameterModes=0, parameterNames=0,
parameterDefaults=0x0, proconfig=0, procost=100, prorows=0,
prodefaultargpos=0x0, fenced=false, shippable=false,
package=, proIsProcedure=true, proargsrc=0x0,
isPrivate=false, isTrigger=false, stmt=0x7fd85ed28c58,
create_func_type_enum=CREATE_SCHEMA_FUNC,
func_definition=0x7fd85ee62050"create procedure test_p_union is \ncur
sys_refcursor;\nbegin\nopen cur for\nselect 1 from dual\nunion all
\nselect null from \n (select a from \n (select a from \n
(select null a from dual \n )\n )\n"…,
compile_func_head_info=0x7fd86da47210) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/catalog/pg_proc.cpp:2582
#27 0x000055a54a636aa3 in sqlcmd_create_function(stmt=0x7fd85ed28c58, query_string=0x7fd85ee62050"create procedure test_p_union is \ncur
sys_refcursor;\nbegin\nopen cur for\nselect 1 from dual\nunion all
\nselect null from \n (select a from \n (select a from \n
(select null a from dual \n )\n )\n"…,
compile_func_head_info=0x7fd86da47210, pkg_oid=0, isTrigger=false) at
/usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/commands/functioncmds.cpp:1550
#28 0x000055a54aa0dc05 in sqlcmd_standard_process_utility(parse_tree=0x7fd85ed28c58,
query_string=0x7fd85ee62050"create procedure test_p_union is \ncur
sys_refcursor;\nbegin\nopen cur for\nselect 1 from dual\nunion all
\nselect null from \n (select a from \n (select a from \n
(select null a from dual \n )\n )\n"…, params=0x0,
is_top_level=, dest=0x7fd85ee62170,
sent_to_remote=, completion_tag=0x7fd86da4a260"“,
isCTAS=false) at
/usr1/GaussDBKernel/server/opengauss/src/auxiliary/proc/tcop/utility.cpp:6308
#29 0x00007fd9cc19d759 in gsaudit_ProcessUtility_hook(parsetree=0x7fd85ed28c58,
queryString=0x7fd85ee62050"create procedure test_p_union is \ncur
sys_refcursor;\nbegin\nopen cur for\nselect 1 from dual\nunion all
\nselect null from \n (select a from \n (select a from \n
(select null a from dual \n )\n )\n”…, params=0x0,
isTopLevel=, dest=0x7fd85ee62170,
sentToRemote=, completionTag=0x7fd86da4a260"“,
isCTAS=false) at
/usr1/GaussDBKernel/server/opengauss/src/gausskernel/security/security_plugin/security_policy_plugin.cpp:856
#30 0x000055a54af7ff52 in audit_process_utility(parsetree=0x7fd85ed28c58,
query_string=0x7fd85ee62050"create procedure test_p_union is \ncur
sys_refcursor;\nbegin\nopen cur for\nselect 1 from dual\nunion all
\nselect null from \n (select a from \n (select a from \n
(select null a from dual \n )\n )\n”…, params=, is_top_level=, dest=,
sent_to_remote=, completion_tag=0x7fd86da4a260"“,
is_ctas=false) at
/usr1/GaussDBKernel/server/opengauss/src/gausskernel/security/audit/security_auditfuncs.cpp:1512
#31 0x000055a54aa1b71d in sqlcmd_process_utility(parse_tree=0x7fd85ed28c58,
query_string=0x7fd85ee62050"create procedure test_p_union is \ncur
sys_refcursor;\nbegin\nopen cur for\nselect 1 from dual\nunion all
\nselect null from \n (select a from \n (select a from \n
(select null a from dual \n )\n )\n”…, params=0x0,
is_top_level=, dest=,
sent_to_remote=, completion_tag=0x7fd86da4a260"“,
isCTAS=false) at
/usr1/GaussDBKernel/server/opengauss/src/auxiliary/proc/tcop/utility.cpp:1974
#32 0x000055a54a9fc83f in PortalRunUtility(portal=0x7fd8602e4050, utilityStmt=0x7fd85ed28c58, isTopLevel=true, dest=0x7fd85ee62170,
completionTag=0x7fd86da4a260"”) at
/usr1/GaussDBKernel/server/opengauss/src/auxiliary/proc/tcop/pquery.cpp:2140
#33 0x000055a54a9fe0be in PortalRunMulti(portal=0x7fd8602e4050, isTopLevel=true, dest=0x7fd85ee62170, altdest=0x7fd85ee62170,
completionTag=0x7fd86da4a260"“, snapshot=0x0, bii_state=0x0) at
/usr1/GaussDBKernel/server/opengauss/src/auxiliary/proc/tcop/pquery.cpp:2326
#34 0x000055a54aa022dc in PortalRun(portal=0x7fd8602e4050, count=9223372036854775807, isTopLevel=true, dest=0x7fd85ee62170,
altdest=0x7fd85ee62170, completionTag=0x7fd86da4a260"”, snapshot=0x0,
bii_state=0x0) at
/usr1/GaussDBKernel/server/opengauss/src/auxiliary/proc/tcop/pquery.cpp:1501
#35 0x000055a54a9e8276 in exec_simple_query(query_string=, msg=0x7fd86da4a530, messageType=QUERY_MESSAGE) at
/usr1/GaussDBKernel/server/opengauss/src/auxiliary/proc/tcop/postgres.cpp:3513
#36 0x000055a54a9f4e38 in gs_process_command(firstchar=, input_message=0x7fd86da4a530,
send_ready_for_query=0x7fd86da4a526) at
/usr1/GaussDBKernel/server/opengauss/src/auxiliary/proc/tcop/postgres.cpp:11743
#37 0x000055a54a9fa9c0 in PostgresMain(argc=, argv=0x7fd8645f5b20, dbname=, username=)
at
/usr1/GaussDBKernel/server/opengauss/src/auxiliary/proc/tcop/postgres.cpp:11313
#38 0x000055a54a97e2df in backend_run(port=0x7fd86da4a890) at /usr1/GaussDBKernel/server/opengauss/src/auxiliary/proc/postmaster/postmaster.cpp:12482
#39 0x000055a54a9bd1b0 in gauss_db_worker_thread_main<(knl_thread_role)2> (arg=)
at
/usr1/GaussDBKernel/server/opengauss/src/auxiliary/proc/postmaster/postmaster.cpp:19086
#40 0x000055a54a97e39a in internal_thread_func(args=) at
/usr1/GaussDBKernel/server/opengauss/src/auxiliary/proc/postmaster/postmaster.cpp:20196
#41 0x00007fd9cc569f1b in ?? () from /usr/lib64/libpthread.so.0
#42 0x00007fd9cc4a11c0 in clone() from /usr/lib64/libc.so.6 (gdb)
从堆栈信息可以推断出执行流程:
exec_simple_query: 执行SQL语句
ProcedureCreate: 创建存储过程plpgsql_validator-> gsplsql_compile -> gsplsql_yyparse: 对 PL/SQL 代码进行校验、编译和解析
get_cursor_tuple_desc: 获取游标的元组描述(字段信息)
pg_analyze_and_rewrite_params: 解析 SQL 和重写参数
semtc_resolve_target_list_unknowns-> semtc_check_resolve_unknowns_in_unionop: 处理 SELECT 列表和 UNION 操作中的 unknown 类型
list_nth和 list_nth_cell: 最终陷入了对链表的操作
推测:在创建存储过程时,内核需要编译和解析其中的 OPEN CURSOR 语句,以确定游标返回的列类型。当处理到一个包含层层嵌套子查询、并且 UNION ALL 两侧存在 NULL(即 unknown 类型)的复杂 SQL 时,内核在类型推断的逻辑中陷入了死循环。
4. 提取关键 SQL 并确认死循环
从堆栈信息 #18 中可以提取出导致问题的核心 SQL:
select1from dual
unionall
selectnullfrom(select a from(select a from(selectnull a from dual )));
通过在 GDB 中使用 n (next) 命令进行单步调试,可以观察到程序指针在 parse_target.cpp 文件的几行代码之间反复跳转,从而证实了循环的存在。至于是不是死循环,需要继续n,并观察每次循环中变量的变化,篇幅有限就不全部贴出来了。结论是的确存在死循环,但死循环跨越了几个文件,并不是仅在 parse_target.cpp 中循环。
(gdb) t 12
[Switching to thread 12 (LWP 2014538)]
#00x000055a54969f869in list_nth(list=0x7fd8604e8ef8, n=0) at /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/nodes/list.cpp:691
691 /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/nodes/list.cpp: No such file or directory.
(gdb) n
[LWP 2069741 exited]
[New LWP 2069918]
692 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/nodes/list.cpp
(gdb) n
[LWP 2069743 exited]
[New LWP 2069921]
693 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/nodes/list.cpp
(gdb) n
[LWP 2069918 exited]
[New LWP 2069922]
694 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/nodes/list.cpp
(gdb) n
[LWP 2069921 exited]
[LWP 2069922 exited]
semtc_check_resolve_unknowns_in_unionop(pstate=0x7fd8604e64d8, node=<optimized out>) at /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp:354
354 /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp: No such file or directory.
(gdb) n
[New LWP 2070509]
359 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
360 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[New LWP 2070532]
[New LWP 2070533]
338 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070532 exited]
[New LWP 2070535]
342 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[New LWP 2070536]
[LWP 2070533 exited]
346 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[New LWP 2070537]
[LWP 2070535 exited]
350 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
351 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
353 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070536 exited]
352 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070537 exited]
353 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
354 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
359 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070509 exited]
[New LWP 2070587]
360 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
338 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070587 exited]
342 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[New LWP 2070611]
346 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
350 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070611 exited]
351 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
353 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
352 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[New LWP 2070637]
353 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
354 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
359 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[New LWP 2070640]
360 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
338 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070640 exited]
342 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[New LWP 2070663]
[New LWP 2070664]
346 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
350 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
351 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
353 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070663 exited]
352 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
353 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070664 exited]
354 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
359 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[New LWP 2070693]
[New LWP 2070694]
360 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
338 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070693 exited]
[New LWP 2070710]
[New LWP 2070711]
342 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070694 exited]
346 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[New LWP 2070720]
[LWP 2070711 exited]
350 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
351 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
353 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
352 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
353 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
354 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070720 exited]
[New LWP 2070746]
359 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
360 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070710 exited]
338 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070746 exited]
[New LWP 2070771]
[New LWP 2070772]
342 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
346 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070771 exited]
350 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[LWP 2070772 exited]
351 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
353 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
352 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[New LWP 2070797]
353 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb) n
[New LWP 2070800]
354 in /usr1/GaussDBKernel/server/opengauss/src/compatibility/sql_adaptor/parser/parse_target.cpp
(gdb)
最后,按 q 退出 GDB,让数据库恢复运行。(gdb) q
A debugging session is active.Inferior 1 [process 2014479] will be detached.Quit anyway? (y or n) y
Detaching from program: /data/gaussdb506/app/bin/gaussdb, process 2014479
[Inferior 1 (process 2014479) detached]
四、问题根因分析与规避
1. 问题复现与分析
将上述从 GDB 提取的 SQL 单独在 gsql 中执行 EXPLAIN,会话同样会卡死,并且 gaussdb 进程会瞬间占满一个新的 CPU核心。
原始提取出来的SQL有两百多行,比较复杂,可以通过把变量变常量、逐个字段剔除、逐层子查询去掉、并且尝试变更select的内容,并且测试是否会卡住。
分析该 SQL 的特征:
select1from dual --任意查询,常量/变量/字段无关
unionall--带union
selectnullfrom--套子查询,常量/变量/字段无关(select a from--第二次引用null的字段(select a from--第一次引用null的字段(selectnull a from dual --最里层有null)));
使用了 UNION ALL。
UNION ALL的一部分是一个 SELECT NULL,NULL 的类型是未知的 (unknown)。
这个 NULL 被层层嵌套的子查询引用。
这似乎与 GaussDB 506.0 版本的一个特性更新有关。版本说明中提到:
union操作中支持其余类型字段与NULL(text类型)进行合并
规格约束:当union左边为非text值,union右边select语句中的null如果存在于子查询,那么子查询需要跟随from子句
这个更新是为了解决旧版本中 INTEGER(或其他非文本类型) 与 NULL (被隐式当作 TEXT) UNION 时会报错的问题,以增强 Oracle 兼容性。例如,以下 SQL 在旧版本会报错,但在 506.0 版本中可以正常执行:
gaussdb=# select * from (select a from (select 1 a from dual) union all select a from (select null a from dual) );
ERROR: UNION types integer and text cannot be matched
LINE 1: ...ct a from (select1 a from dual) unionallselect a from (se...^
gaussdb=# select version();
version
---------------------------------------------------------------------------------------------------------------------------gaussdb (GaussDB Kernel 505.2.0.SPC0100 build8db8eac8) compiledat2024-11-0219:03:49commit9980last mr 20502release
(1row)gaussdb=#
gaussdb=# select * from (select a from (select 1 a from dual) union all select a from (select null a from dual) );a
---1(2 rows)gaussdb=# select version();version
---------------------------------------------------------------------------------------------------------------gaussdb (GaussDB Kernel 506.0.0.SPC0100 build e324981f) compiled at 2025-04-27 14:27:52 last mr 23420 release
(1row)gaussdb=#
然而,这个为了兼容性所做的修改,在处理更复杂的嵌套子查询引用 NULL 的场景时,意外地引入了类型推断的死循环 Bug。
2. 解决方案与规避
升级数据库内核补丁
这个问题华为云应该已经发现了,但不确定是否已经发布对应补丁
紧急处理
由于会话无法被正常 Kill,且是单进程多线程模型,不能在操作系统层面单独 Kill 线程。因此,唯一恢复服务的办法是重启整个数据库实例。
代码规避
要从根本上避免触发这个问题,必须在 SQL 中对 NULL 进行显式类型转换,杜绝 unknown 类型的出现。可以使用 CAST 函数或者 :: 类型转换语法。
例如,将有问题的 SQL 修改为以下任意一种形式即可:
(这里number类型只是例子,实际类型需要保证和union另一侧的类型一致)
-- 使用 ::number 进行类型转换
select1from dual
unionall
selectnull::numberfrom(select a from(select a from(selectnull::number a from dual )));
-- 使用 CAST AS number 进行类型转换
select1from dual
unionall
selectcast(nullasnumber) from(select a from(select a from(selectcast(nullasnumber) a from dual )));
五、总结
本文详细记录了一次 GaussDB 疑难问题的完整诊断过程,从现象观察、初步诊断、资源监控到最终通过 GDB 进行深度调试,成功定位了内核级别的 Bug。
这个案例表明,对于复杂的数据库问题,仅仅掌握 SQL 和基本运维是远远不够的。深入理解数据库内核的基本原理、熟悉 GDB 等调试工具,对于问题定位至关重要。希望本文的诊断思路能为读者在未来遇到类似问题时提供有价值的参考。
对于软件开发而言,新功能引入新问题非常常见,强如甲骨文公司,在其ORACLE数据库上也曾一个BUG修了十几年发布了n个针对性补丁才修好,所以不能根据一个BUG来评判软。