GFP_NOFS in the NFS writeback path
Preface
Recently, I hit the following hung task on a 4.4-based kernel:
[<ffffffff8119592f>] wait_on_page_bit+0xdf/0xf0
[<ffffffff811ab973>] shrink_page_list+0x733/0x7d0
[<ffffffff811ac16a>] shrink_inactive_list+0x1ea/0x5a0
[<ffffffff811aca0c>] shrink_lruvec+0x12c/0x370
[<ffffffff811accec>] shrink_zone+0x9c/0x2c0
[<ffffffff811ad0a6>] do_try_to_free_pages+0x196/0x4e0
[<ffffffff811ad627>] try_to_free_mem_cgroup_pages+0xb7/0x1a0
[<ffffffff81205a95>] try_charge+0x185/0x670
[<ffffffff81209987>] __memcg_kmem_charge_memcg+0x97/0xd0
[<ffffffff811f2bec>] new_slab+0x44c/0x480
[<ffffffff811f4af2>] ___slab_alloc+0x332/0x4c0
[<ffffffff811f4ca0>] __slab_alloc+0x20/0x40
[<ffffffff811f568f>] __kmalloc+0x1bf/0x230
[<ffffffffa04ff361>] nfs_generic_pgio+0x231/0x310 [nfs]
[<ffffffffa04ff4b1>] nfs_generic_pg_pgios+0x71/0xe0 [nfs]
[<ffffffffa04fef7d>] nfs_pageio_doio+0x2d/0x60 [nfs]
[<ffffffffa04ffbca>] __nfs_pageio_add_request+0xba/0x470 [nfs]
[<ffffffffa050052c>] nfs_pageio_add_request+0xac/0x1d0 [nfs]
[<ffffffffa0503ecb>] nfs_do_writepage+0xcb/0x1d0 [nfs]
[<ffffffffa0503fe5>] nfs_writepages_callback+0x15/0x30 [nfs]
[<ffffffff811a0f29>] write_cache_pages+0x1f9/0x4f0
[<ffffffffa0504399>] nfs_writepages+0xb9/0x150 [nfs]
[<ffffffff811a37de>] do_writepages+0x3e/0xa0
[<ffffffff81197946>] __filemap_fdatawrite_range+0xd6/0x120
[<ffffffff81197aab>] filemap_write_and_wait_range+0x2b/0x80
[<ffffffffa054d092>] nfs4_file_fsync+0x72/0x180 [nfsv4]
[<ffffffff8124c86f>] vfs_fsync_range+0x4f/0xb0
[<ffffffff8124c8ec>] vfs_fsync+0x1c/0x20
[<ffffffffa054d55e>] nfs4_file_flush+0x5e/0x80 [nfsv4]
[<ffffffff81214da7>] filp_close+0x37/0x70
[<ffffffff8123674f>] __close_fd+0x9f/0xd0
[<ffffffff81214e01>] SyS_close+0x21/0x50
[<ffffffff817344e1>] entry_SYSCALL_64_fastpath+0x1e/0x9a
[<ffffffffffffffff>] 0xffffffffffffffff
There are two causes of this problem:
- In each writeback pass, i.e. the writepages callback, the filesystem deliberately accumulates a batch of pages so that a single write IO is larger, which improves performance; NFS has this mechanism as well;
- However, in the memory-scarce scenario above, this leads to the following deadlock:
set writeback on page A
set writeback on page B
allocate some memory
  -> shrink_page_list()
     -> wait writeback on page A    DEADLOCK !!!
The write IO for page A has not been submitted yet; it is still being accumulated by the very task that is now waiting on it, so the wait can never complete.
We won't dig into the writeback subsystem here; instead, let's just look at why a page has a writeback state at all. Consider the following paths:
filemap_fault() // fault path
filemap_update_page() // sync read path
-> lock_page()
-> filemap_read_page()
write_cache_pages()
-> lock_page()
-> clear_page_dirty_for_io()
-> (*writepage)(page, wbc, data);
-> set_page_writeback(page);
-> unlock_page(page);
page_endio()
---
if (!is_write) {
...
SetPageUptodate(page);
unlock_page(page);
} else {
...
end_page_writeback(page);
}
---
generic_perform_write()
-> ....
-> block_write_end()
-> __block_commit_write()
-> set_page_dirty()
do_shared_fault()
-> ...
-> block_page_mkwrite()
=> set_page_dirty()
set_page_dirty()
-> __set_page_dirty_nobuffers()
-> __mark_inode_dirty()
A page's state transitions roughly as follows:
- not uptodate: the state of a freshly created page;
- lock: the page is locked and readpage is triggered, see filemap_read_page();
- read IO, device writes to the page: while the device is filling the page, access to it is forbidden, since a reader would see incomplete data;
- unlock and uptodate: once the read IO completes, the page is unlocked and marked uptodate, see page_endio();
- dirty, handed to writeback: after a write, the page is marked dirty and handed to the writeback subsystem, see set_page_dirty() and __mark_inode_dirty();
- lock, clear dirty, set writeback, unlock: see write_cache_pages();
- write IO, device reads from the page: during the write IO the page is unlocked and only carries the writeback flag;
- end page writeback: see end_page_writeback().
While a page is in flight to the device, its state is writeback.
This raises two questions:
- Why not simply keep the page locked, as the read path does? Because that would block readers, and a dirty page is, after all, uptodate; in fact, even while writepages is accumulating pages, the developers chose not to hold the page lock, unlocking each page immediately after setting writeback;
- Why not reuse the dirty flag? Distinguishing dirty from writeback prevents a page from being handed to the writeback subsystem twice.
wait on writeback
There are two classic scenarios that wait on writeback. The first:
block_write_begin()
-> grab_cache_page_write_begin()
-> wait_for_stable_page()
wait_for_stable_page()
---
page = thp_head(page);
if (page->mapping->host->i_sb->s_iflags & SB_I_STABLE_WRITES)
wait_on_page_writeback(page);
---
If a page must not be written to while its write IO is in flight, this path is taken and the write IO is waited on; for example, when a data checksum is generated during the IO, the data must stay intact.
The second scenario is the one relevant to our problem:
shrink_page_list()
---
if (PageWriteback(page)) {
/* Case 1 above */
if (current_is_kswapd() &&
PageReclaim(page) &&
test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
stat->nr_immediate++;
goto activate_locked;
/* Case 2 above */
} else if (writeback_throttling_sane(sc) ||
!PageReclaim(page) || !may_enter_fs) {
SetPageReclaim(page);
stat->nr_writeback++;
goto activate_locked;
/* Case 3 above */
} else {
unlock_page(page);
wait_on_page_writeback(page);
/* then go back and try same page again */
list_add_tail(&page->lru, page_list);
continue;
}
}
---
This is Case 3; the comment in the code explains the scenario:
* 3) Legacy memcg encounters a page that is already marked
* PageReclaim. memcg does not have any dirty pages
* throttling so we could easily OOM just because too many
* pages are in writeback and there is nothing else to
* reclaim. Wait for the writeback to complete.
The mem cgroup of cgroup v1 has no way to throttle a process that keeps producing dirty pages.
Who does? The root mem cgroup, and cgroup v2.
That throttle is balance_dirty_pages_ratelimited(): it caps the ratio or the absolute number of dirty pages, system-wide or per cgroup v2 memcg, via /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_bytes. When the threshold is reached, the writing process is forced to sleep for a while, so that the rate at which it dirties pages roughly matches the write bandwidth of the storage device, and the number of dirty pages stops growing. For details, see the "Dirty Throttle" section of Linux 内存管理_workingset内存-CSDN博客.
Note: as for why cgroup v1 cannot do this, it involves cgroup writeback, which requires a fixed binding from a mem cgroup to a block cgroup. In cgroup v1, mem and blkio are two separate hierarchies, whereas in cgroup v2 both live in the same hierarchy. In practice, though, when cgroup v1 is deployed with a one-to-one correspondence between the mem and blkio hierarchies, a small kernel modification can bind the two and give cgroup v1 the cgroup writeback capability as well.
So why does cgroup writeback need this binding between blkio and the mem cgroup?
Because writeback is usually performed by a background kworker kthread, which only has the page to work with, and a page only carries a memcg pointer; when the IO is submitted, the corresponding blkio cgroup has to be looked up through that memcg.
When shrink_page_list() finds, via the PageReclaim flag, that every page has already been cycled through once, it concludes that there are too many dirty pages and the storage device cannot keep up; to avoid an OOM, it makes the task wait here for writeback to complete, thereby slowing down the production of dirty pages.
The NFS fix
At this point the fix is fairly clear: drop the __GFP_FS flag from the gfp flags so the problematic branch is never taken. This is indeed how mainline fixed it:
ae97aa524ef495b6276fd26f5d5449fb22975d7c NFS: Use GFP_NOIO for two allocations in writeback
875bc3fbf2724134234ddb3069c8e9862b0b19b3 NFS: Ensure NFS writeback allocations don't recurse back into NFS.
The second fix even uses memalloc_nofs_save() to strip __GFP_FS from the entire NFS writeback path. However, in the same year, yet another commit appeared:
commit 2b17d725f9be59a1bfa0583af690c463fca1f385
Author: Trond Myklebust <trond.myklebust@hammerspace.com>
Date: Tue Jun 11 16:49:52 2019 -0400
NFS: Clean up writeback code
Now that the VM promises never to recurse back into the filesystem
layer on writeback, remove all the GFP_NOFS references etc from
the generic writeback code.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This removed both commits above; its message states: "VM promises never to recurse back into the filesystem layer on writeback".
Afterwards, I went through the writeback- and reclaim-related code and could not find the guarantee mentioned there; so I tried to reproduce the problem on a 5.10 kernel. To speed up reproduction, I modified the code several times; the version that finally reproduces it is:
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 17fef6e..b5315df 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -902,6 +902,8 @@ int nfs_generic_pgio(struct nfs_pageio_descriptor *desc,
struct nfs_page_array *pg_array = &hdr->page_array;
unsigned int pagecount, pageused;
gfp_t gfp_flags = nfs_io_gfp_mask();
+ struct page *page[10];
+ int i;
pagecount = nfs_page_array_len(mirror->pg_base, mirror->pg_count);
pg_array->npages = pagecount;
@@ -918,6 +920,9 @@ int nfs_generic_pgio(struct nfs_pageio_descriptor *desc,
}
}
+ for (i = 0; i < 10; i++) {
+ page[i] = alloc_page(gfp_flags | __GFP_ACCOUNT);
+ }
nfs_init_cinfo(&cinfo, desc->pg_inode, desc->pg_dreq);
pages = hdr->page_array.pagevec;
last_page = NULL;
@@ -933,6 +938,12 @@ int nfs_generic_pgio(struct nfs_pageio_descriptor *desc,
*pages++ = last_page = req->wb_page;
}
}
+
+ for (i = 0; i< 10; i++) {
+ if (page[i])
+ put_page(page[i]);
+ }
+
if (WARN_ON_ONCE(pageused != pagecount)) {
nfs_pgio_error(hdr);
desc->pg_error = -EINVAL;
The key to making it reproduce is __GFP_ACCOUNT.
At least on the version I used, there is no such guarantee; the reason the problem did not reproduce at first is:
__alloc_pages()
---
...
if (memcg_kmem_enabled() && (gfp & __GFP_ACCOUNT) && page &&
unlikely(__memcg_kmem_charge_page(page, gfp, order) != 0)) {
__free_pages(page, order);
page = NULL;
}
...
---
Without the __GFP_ACCOUNT flag, this path never enters memcg reclaim, so even when memory reclaim is triggered, shrink_page_list() treats it as the root cgroup and never reaches wait_on_page_writeback(). On the 4.4 kernel where the problem does reproduce, the code is:
alloc_slab_page()
---
flags |= __GFP_NOTRACK;
if (node == NUMA_NO_NODE)
page = alloc_pages(flags, order);
else
page = __alloc_pages_node(node, flags, order);
if (page && memcg_charge_slab(page, flags, order, s)) {
__free_pages(page, order);
page = NULL;
}
---
Here memcg_charge_slab() is entered unconditionally; the difference is that the opt-in __GFP_ACCOUNT later replaced the opt-out __GFP_NOACCOUNT.
The reproduction log is as follows:
Tick 1 Task 3954 dd in D state
--------------------------------------------
[<0>] wait_on_page_bit+0x11f/0x330
[<0>] wait_on_page_writeback+0x25/0xb0
[<0>] shrink_page_list+0xb3f/0x18b0
[<0>] shrink_inactive_list+0x200/0x3c0
[<0>] shrink_lruvec+0x227/0x310
[<0>] shrink_node_memcgs+0x10d/0x1f0
[<0>] shrink_node+0x19a/0x530
[<0>] shrink_zones+0x7c/0x210
[<0>] do_try_to_free_pages+0x65/0x230
[<0>] try_to_free_mem_cgroup_pages+0x10b/0x1e0
[<0>] try_charge+0x1ef/0x6b0
[<0>] obj_cgroup_charge_pages+0x3b/0x130
[<0>] __memcg_kmem_charge_page+0xf4/0x240
[<0>] __alloc_pages+0x259/0x13a0
[<0>] nfs_generic_pgio+0xf0/0x480 [nfs]
[<0>] nfs_generic_pg_pgios+0x4e/0xb0 [nfs]
[<0>] nfs_pageio_doio+0x3e/0x90 [nfs]
[<0>] nfs_pageio_complete+0x79/0x130 [nfs]
[<0>] nfs_writepages+0x129/0x170 [nfs]
[<0>] do_writepages+0x34/0xc0
[<0>] __filemap_fdatawrite_range+0xcd/0x110
[<0>] filemap_write_and_wait_range+0x26/0x70
[<0>] nfs_wb_all+0x25/0x100 [nfs]
[<0>] filp_close+0x32/0x70
[<0>] __x64_sys_close+0x1e/0x60
[<0>] do_syscall_64+0x40/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x62/0xc7
So it seems there is no guarantee of never recursing back into the writeback wait; rather, the current NFS writeback path simply does not use __GFP_ACCOUNT.
I will investigate this problem further.