
A deadlock caused by xfs inode cluster buffer lock order

After an xfs deadlock occurred, we managed to capture a vmcore, and analysis showed the cause to be an xfs_buf lock-order problem. This post records the rough debugging process, in particular the part that goes from a call stack to the holder of an xfs_buf, since this is not the first time we have run into this kind of problem.

With the vmcore in hand, we can dig into the hung tasks. Among the many hung-task call stacks, this one is quite typical:

--------------------------------------------
Tick 0 Task 20909 node in D state
--------------------------------------------
[<0>] down+0x43/0x60
[<0>] xfs_buf_lock+0x2d/0xe0 [xfs]
[<0>] xfs_buf_find+0x17a/0x370 [xfs]
[<0>] xfs_buf_get_map+0x46/0x400 [xfs]
[<0>] xfs_buf_read_map+0x54/0x280 [xfs]
[<0>] xfs_trans_read_buf_map+0x12d/0x300 [xfs]
[<0>] xfs_read_agi+0x8f/0x130 [xfs]
[<0>] xfs_ialloc_read_agi+0x26/0xb0 [xfs]
[<0>] xfs_dialloc+0x162/0x3d0 [xfs]
[<0>] xfs_create+0x3d7/0x5e0 [xfs]
[<0>] xfs_generic_create+0x11e/0x370 [xfs]
[<0>] vfs_mkdir+0x157/0x210
[<0>] ovl_create_real+0x172/0x280 [overlay]
[<0>] ovl_create_or_link+0x124/0x280 [overlay]
[<0>] ovl_create_object+0xe4/0x110 [overlay]
[<0>] vfs_mkdir+0x157/0x210
[<0>] do_mkdirat+0x92/0x130
[<0>] do_syscall_64+0x38/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
--------------------------------------------

We call it typical because xfs frequently runs into agi/agi lock-order problems, so this is the call stack we analyze in depth first.

The first step is to determine the address of the xfs_buf:

crash> dis down
0xffffffffadb54ea0 <down>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffadb54ea5 <down+5>:    push   %rbp
0xffffffffadb54ea6 <down+6>:    mov    %rdi,%rbp
....
0xffffffffadb54ed7 <down+55>:   mov    %rbp,%rdi
                                       ^^^^^^^^^
0xffffffffadb54eda <down+58>:   mov    %rax,(%rsp)
0xffffffffadb54ede <down+62>:   call   0xffffffffae48a870 <__down>

crash> dis __down
0xffffffffae48a870 <__down>:    nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffae48a875 <__down+5>:  push   %rbp
                                ^^^^^^^^^^^^
0xffffffffae48a876 <__down+6>:  mov    %rsp,%rbp
0xffffffffae48a879 <__down+9>:  push   %r13
0xffffffffae48a87b <__down+11>: lea    0x8(%rdi),%r13
0xffffffffae48a87f <__down+15>: push   %r12
0xffffffffae48a881 <__down+17>: mov    %r13,%rdx

From the code of these two functions we can see that the second slot of __down()'s stack frame is the address of xfs_buf.b_sema: down() carries the semaphore address in %rbp, and __down() pushes %rbp right after entry. That is:

crash> bt 20909
PID: 20909    TASK: ff3321d27d310000  CPU: 34   COMMAND: "node"
 #0 [ff3973559bf6f8b0] __schedule at ffffffffae488313
 #1 [ff3973559bf6f928] schedule at ffffffffae4886c3
 #2 [ff3973559bf6f940] schedule_timeout at ffffffffae48cf55
 #3 [ff3973559bf6f9a8] __down at ffffffffae48a8f9
 #4 [ff3973559bf6fa00] down at ffffffffadb54ee3
 #5 [ff3973559bf6fa18] xfs_buf_lock at ffffffffc0a1339d [xfs]
 #6 [ff3973559bf6fa28] xfs_buf_find at ffffffffc0a135ca [xfs]
 #7 [ff3973559bf6fa88] xfs_buf_get_map at ffffffffc0a138c6 [xfs]
 #8 [ff3973559bf6fae8] xfs_buf_read_map at ffffffffc0a148f4 [xfs]
 #9 [ff3973559bf6fb48] xfs_trans_read_buf_map at ffffffffc0a4cddd [xfs]
#10 [ff3973559bf6fba8] xfs_read_agi at ffffffffc09f853f [xfs]
#11 [ff3973559bf6fbf8] xfs_ialloc_read_agi at ffffffffc09f8606 [xfs]
#12 [ff3973559bf6fc20] xfs_dialloc at ffffffffc09f8dc2 [xfs]
#13 [ff3973559bf6fca8] xfs_create at ffffffffc0a2b9e7 [xfs]
#14 [ff3973559bf6fd40] xfs_generic_create at ffffffffc0a26fbe [xfs]
#15 [ff3973559bf6fdc0] vfs_mkdir at ffffffffadd82a77
#16 [ff3973559bf6fdf8] ovl_create_real at ffffffffc0d43e32 [overlay]
#17 [ff3973559bf6fe20] ovl_create_or_link at ffffffffc0d45694 [overlay]
#18 [ff3973559bf6fe68] ovl_create_object at ffffffffc0d458d4 [overlay]
#19 [ff3973559bf6feb0] vfs_mkdir at ffffffffadd82a77
#20 [ff3973559bf6fee8] do_mkdirat at ffffffffadd86302
#21 [ff3973559bf6ff38] do_syscall_64 at ffffffffae47cd18
#22 [ff3973559bf6ff50] entry_SYSCALL_64_after_hwframe at ffffffffae60007c
...
crash> rd ff3973559bf6f940 -s 32
ff3973559bf6f940:  schedule_timeout+277 7fffffffffffffff 
ff3973559bf6f950:  ff3321c17966aa80 __d_add+218      
ff3973559bf6f960:  ff332160a1047200 0000000000000000 
ff3973559bf6f970:  ff3321c17966aa80 0000000000000000 
ff3973559bf6f980:  ff3973559bf6fa58 d_splice_alias+135 
ff3973559bf6f990:  cb55fb6cd198b300 ff3973559bf6f9f8 
ff3973559bf6f9a0:  7fffffffffffffff __down+137       
ff3973559bf6f9b0:  ff397355d84c7780 ff397355b74d7780 
ff3973559bf6f9c0:  ff3321d27d310000 0000000000000000 
ff3973559bf6f9d0:  cb55fb6cd198b300 cb55fb6cd198b300 
ff3973559bf6f9e0:  0000000000000000 ff3973559bf6faa0 
ff3973559bf6f9f0:  ff332153415b9980 ff332153415b99a0 
                                    ^^^^^^^^^^^^^^^^
ff3973559bf6fa00:  down+67          0000000000000246 
ff3973559bf6fa10:  ff332153415b9980 xfs_buf_lock+45  
ff3973559bf6fa20:  ff3321b3a159a000 xfs_buf_find+378 
ff3973559bf6fa30:  ff3321b2523c2400 00000001c0a29566 

Note that the address read straight off the stack needs an offset adjustment before it yields the xfs_buf address:

crash> xfs_buf.b_sema -ox
struct xfs_buf {
  [0x20] struct semaphore b_sema;
}

Now that we have the xfs_buf, how do we find out who is holding its lock?

Here we can follow this chain of relationships:

xfs_buf.b_transp -> xfs_trans
xfs_trans.t_ticket -> xlog_ticket
xlog_ticket.t_task -> task_struct

Following these steps, we identify the process holding the AGI's xfs_buf:

crash> bt 0xff3321d053ca0000
PID: 63047    TASK: ff3321d053ca0000  CPU: 112  COMMAND: "xxxx"
 #0 [ff397355d35e7590] __schedule at ffffffffae488313
 #1 [ff397355d35e7608] schedule at ffffffffae4886c3
 #2 [ff397355d35e7620] schedule_timeout at ffffffffae48cf55
 #3 [ff397355d35e7688] __down at ffffffffae48a8f9
 #4 [ff397355d35e76e8] down at ffffffffadb54ee3
 #5 [ff397355d35e7700] xfs_buf_lock at ffffffffc0a1339d [xfs]
 #6 [ff397355d35e7710] xfs_buf_find at ffffffffc0a135ca [xfs]
 #7 [ff397355d35e7770] xfs_buf_get_map at ffffffffc0a138c6 [xfs]
 #8 [ff397355d35e77d0] xfs_buf_read_map at ffffffffc0a148f4 [xfs]
 #9 [ff397355d35e7830] xfs_trans_read_buf_map at ffffffffc0a4cddd [xfs]
#10 [ff397355d35e7890] xfs_imap_to_bp at ffffffffc09fd20e [xfs]
#11 [ff397355d35e78c0] xfs_trans_log_inode at ffffffffc0a0b076 [xfs]
#12 [ff397355d35e7920] xfs_dir2_sf_addname at ffffffffc09f45ad [xfs]
#13 [ff397355d35e7998] xfs_dir_createname at ffffffffc09e9767 [xfs]
#14 [ff397355d35e79f0] xfs_link at ffffffffc0a2a46a [xfs]
#15 [ff397355d35e7a40] xfs_vn_link at ffffffffc0a25f43 [xfs]
#16 [ff397355d35e7a80] vfs_link at ffffffffadd83012
#17 [ff397355d35e7ad0] ovl_copy_up_tmpfile at ffffffffc0d48f05 [overlay]
#18 [ff397355d35e7b20] ovl_do_copy_up at ffffffffc0d49399 [overlay]
#19 [ff397355d35e7b48] ovl_copy_up_one at ffffffffc0d49674 [overlay]
#20 [ff397355d35e7d30] ovl_copy_up_flags at ffffffffc0d498cb [overlay]
#21 [ff397355d35e7d78] ovl_setattr at ffffffffc0d407af [overlay]
#22 [ff397355d35e7db8] notify_change at ffffffffadd96b34
#23 [ff397355d35e7e28] chown_common at ffffffffadd6fd06
#24 [ff397355d35e7ee0] do_fchownat at ffffffffadd6fe4d
#25 [ff397355d35e7f30] __x64_sys_fchownat at ffffffffadd6febb
#26 [ff397355d35e7f38] do_syscall_64 at ffffffffae47cd18

This process is in turn waiting for the xfs_buf containing an inode, at address ff3321b8fa699980. Using the same method, we locate that xfs_buf's holder:

crash> bt 0xff332175359722c0
PID: 55547    TASK: ff332175359722c0  CPU: 71   COMMAND: "xxxx"
 #0 [ff397355d1bdf660] __schedule at ffffffffae488313
 #1 [ff397355d1bdf6d8] schedule at ffffffffae4886c3
 #2 [ff397355d1bdf6f0] schedule_timeout at ffffffffae48cf55
 #3 [ff397355d1bdf758] __down at ffffffffae48a8f9
 #4 [ff397355d1bdf7b8] down at ffffffffadb54ee3
 #5 [ff397355d1bdf7d0] xfs_buf_lock at ffffffffc0a1339d [xfs]
 #6 [ff397355d1bdf7e0] xfs_buf_find at ffffffffc0a135ca [xfs]
 #7 [ff397355d1bdf840] xfs_buf_get_map at ffffffffc0a138c6 [xfs]
 #8 [ff397355d1bdf8a0] xfs_buf_read_map at ffffffffc0a148f4 [xfs]
 #9 [ff397355d1bdf900] xfs_trans_read_buf_map at ffffffffc0a4cddd [xfs]
#10 [ff397355d1bdf960] xfs_imap_to_bp at ffffffffc09fd20e [xfs]
#11 [ff397355d1bdf990] xfs_trans_log_inode at ffffffffc0a0b076 [xfs]
#12 [ff397355d1bdf9f0] xfs_link at ffffffffc0a2a4a9 [xfs]
#13 [ff397355d1bdfa40] xfs_vn_link at ffffffffc0a25f43 [xfs]
#14 [ff397355d1bdfa80] vfs_link at ffffffffadd83012
#15 [ff397355d1bdfad0] ovl_copy_up_tmpfile at ffffffffc0d48f05 [overlay]
#16 [ff397355d1bdfb20] ovl_do_copy_up at ffffffffc0d49399 [overlay]
#17 [ff397355d1bdfb48] ovl_copy_up_one at ffffffffc0d49674 [overlay]
#18 [ff397355d1bdfd30] ovl_copy_up_flags at ffffffffc0d498cb [overlay]
#19 [ff397355d1bdfd78] ovl_setattr at ffffffffc0d407af [overlay]
#20 [ff397355d1bdfdb8] notify_change at ffffffffadd96b34
#21 [ff397355d1bdfe28] chown_common at ffffffffadd6fd06
#22 [ff397355d1bdfee0] do_fchownat at ffffffffadd6fe4d
#23 [ff397355d1bdff30] __x64_sys_fchownat at ffffffffadd6febb
#24 [ff397355d1bdff38] do_syscall_64 at ffffffffae47cd18
#25 [ff397355d1bdff50] entry_SYSCALL_64_after_hwframe at ffffffffae60007c

The xfs_buf it is waiting for is ff3321c0f1279080. Tracing that buffer's holder with the same method leads right back to 63047, so the problem must lie between these two processes.

Next, using the relationships below, we inspect the log items in each process's xfs_trans:

xfs_trans.t_items -> xfs_log_item.li_trans
xfs_log_item.li_ops -> the type of the log item

If li_ops is xfs_buf_item_ops, the item describes an xfs_buf, which is reached via:

xfs_log_item -> xfs_buf_log_item.bli_item
xfs_buf_log_item.bli_buf -> xfs_buf

This finally yields:

63047     xfs_dir_createname

crash> xfs_log_item.li_ops ff3321d097607140
  li_ops = 0xffffffffc0a6f400 <xfs_inode_item_ops>,
crash> xfs_log_item.li_ops ff3321c58d3cc0c0
  li_ops = 0xffffffffc0a6f400 <xfs_inode_item_ops>,
crash> xfs_log_item.li_ops ff3321c028ea0420
  li_ops = 0xffffffffc0a6efa0 <xfs_buf_item_ops>,
crash> xfs_log_item.li_ops ff33215eba3e1290
  li_ops = 0xffffffffc0a6efa0 <xfs_buf_item_ops>,
crash> xfs_buf_log_item.bli_buf ff3321c028ea0420
  bli_buf = 0xff332153415b9980,
crash> xfs_buf_log_item.bli_buf ff33215eba3e1290
  bli_buf = 0xff3321c0f1279080,
crash> xfs_buf.b_ops 0xff332153415b9980
  b_ops = 0xffffffffc0a6d100 <xfs_agi_buf_ops>
crash> xfs_buf.b_ops 0xff3321c0f1279080
  b_ops = 0xffffffffc0a6d3a0 <xfs_inode_buf_ops>

55547     xfs_trans_log_inode

crash> xfs_log_item.li_ops ff332164cdcae9c0
  li_ops = 0xffffffffc0a6f400 <xfs_inode_item_ops>,
crash> xfs_log_item.li_ops ff332174a1813a40
  li_ops = 0xffffffffc0a6f400 <xfs_inode_item_ops>,
crash> xfs_log_item.li_ops ff332160278cd080
  li_ops = 0xffffffffc0a6efa0 <xfs_buf_item_ops>,
crash> xfs_log_item.li_ops ff3321674a3b0d68
  li_ops = 0xffffffffc0a6efa0 <xfs_buf_item_ops>,
crash> xfs_log_item.li_ops ff3321674a3b2d60
  li_ops = 0xffffffffc0a6efa0 <xfs_buf_item_ops>,
crash> xfs_buf_log_item.bli_buf ff332160278cd080
  bli_buf = 0xff33215411fc8600,
crash> xfs_buf_log_item.bli_buf ff3321674a3b0d68
  bli_buf = 0xff3321b8fa699980,
crash> xfs_buf_log_item.bli_buf ff3321674a3b2d60
  bli_buf = 0xff3321d36e5a1680,
crash> xfs_buf.b_ops 0xff33215411fc8600
  b_ops = 0xffffffffc0a6d100 <xfs_agi_buf_ops>
crash> xfs_buf.b_ops 0xff3321b8fa699980
  b_ops = 0xffffffffc0a6d3a0 <xfs_inode_buf_ops>
crash> xfs_buf.b_ops 0xff3321d36e5a1680
  b_ops = 0xffffffffc0a6cde0 <xfs_dir3_block_buf_ops>

So in the end we have:

63047     xfs_dir_createname
holds     xfs_buf ff3321c0f1279080
requires  xfs_buf ff3321b8fa699980

55547     xfs_trans_log_inode
holds     xfs_buf ff3321b8fa699980
requires  xfs_buf ff3321c0f1279080

The root cause is therefore an ABBA deadlock between two xfs_bufs, formed inside xfs_link().

The deadlock forms because:

  • xfs inodes are stored in inode cluster buffers, i.e. a single xfs_buf may contain several inodes;
  • xfs_link() has two places that log the inode cluster xfs_buf, as the call chains show:

xfs_link()
-> xfs_iunlink_remove()
   -> xfs_iunlink_update_inode()
      -> xfs_imap_to_bp()
-> xfs_dir_createname()
   -> xfs_dir2_sf_addname()
      -> xfs_trans_log_inode()
         -> xfs_imap_to_bp()
-> xfs_trans_log_inode()
   -> xfs_imap_to_bp()

Since the caller of vfs_link() here is ovl_copy_up_tmpfile(), the target inode is a tmpfile, so the xfs_iunlink_remove() call path does apply:
ovl_copy_up_tmpfile()
---
temp = ovl_do_tmpfile(c->workdir, c->stat.mode);
...
upper = lookup_one_len(c->destname.name, c->destdir, c->destname.len);
err = PTR_ERR(upper);
if (!IS_ERR(upper)) {
	err = ovl_do_link(temp, udir, upper);
	                  ^^^^ a tmp file
	dput(upper);
}
---

This problem is fixed by the following commit:

commit 82842fee6e5979ca7e2bf4d839ef890c22ffb7aa
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Jun 5 04:08:27 2023 +1000

    xfs: fix AGF vs inode cluster buffer deadlock
    .....

Although the subject line describes a different deadlock in xfs_link(), the fix also takes the xfs_buf lock-order problem into account; see:

+ * We have to be careful when we grab the inode cluster buffer due to lock
+ * ordering constraints. The unlinked inode modifications (xfs_iunlink_item)
+ * require AGI -> inode cluster buffer lock order. The inode cluster buffer is
+ * not locked until ->precommit, so it happens after everything else has been
+ * modified.
....
__xfs_trans_commit()
-> xfs_trans_run_precommits()
---
list_sort(NULL, &tp->t_items, xfs_trans_precommit_sort);
-> xfs_inode_item_sort()
   -> INODE_ITEM(lip)->ili_inode->i_ino;
...
list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
	...
	if (lip->li_ops->iop_precommit) {
		error = lip->li_ops->iop_precommit(tp, lip);
		-> xfs_inode_item_precommit()
		   -> xfs_imap_to_bp()
		...
	}
}
---

The fix works as follows:

  • xfs_trans_log_inode() and similar functions no longer call xfs_buf_lock() directly; locking the inode cluster buffer is deferred to __xfs_trans_commit() -> ->iop_precommit()
  • xfs_trans_run_precommits() sorts the log items first, so the inode cluster xfs_bufs are always locked in a consistent order and the ABBA lock-order problem cannot occur

