git://git.whamcloud.com - fs/lustre-release.git/commit

LU-16713 llite: writeback/commit pages under memory pressure

Lustre buffered I/O does not work well with restrictive memcg
control. This may result in OOM when the system is under memory
pressure.

Lustre has implemented unstable pages support similar to NFS,
but it is disabled by default for performance reasons.

In Lustre, a client pins the cache pages for writes until the
write transaction is committed on the server (OST), even if
writeback of these pinned pages has already finished. The
server starts a transaction commit either because the commit
interval (5 seconds by default) for the backend storage (i.e.
OST/ldiskfs) has been reached or because there is not enough
room in the journal for a particular handle to start. Until
the write transaction has been committed and the client has
been notified, these pages are pinned and cannot be flushed by
the kernel in any way.
This means that when a client hits memory pressure there can
be a large number of unfreeable (pinned and uncommitted) pages,
so the application on the client ends up OOM-killed because it
cannot free memory when asked to.
This is particularly common with cgroups: when cgroups are in
use, the memory limit is generally much lower than the total
system memory, so the limit is more likely to be reached.
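
The pinning behaviour described above can be illustrated with a
small user-space model. This is only a sketch: the state names
and helper functions are hypothetical and are not Lustre or
kernel identifiers. It assumes that only pages whose write has
been committed on the server count as freeable.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified client page states; names are illustrative only. */
enum page_state_model {
	PG_MODEL_CLEAN,     /* committed on the server, freeable */
	PG_MODEL_DIRTY,     /* modified, not yet sent to the OST */
	PG_MODEL_WRITEBACK, /* write RPC in flight */
	PG_MODEL_UNSTABLE,  /* written to the OST, not yet committed */
};

/* Count the pages that reclaim could free right now. */
size_t count_freeable(const enum page_state_model *pages, size_t n)
{
	size_t i, freeable = 0;

	assert(pages != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (pages[i] == PG_MODEL_CLEAN)
			freeable++;
	return freeable;
}

/*
 * Model a commit notification from the server: every unstable
 * page is unpinned and becomes clean. Returns the number of
 * pages released.
 */
size_t commit_unstable(enum page_state_model *pages, size_t n)
{
	size_t i, released = 0;

	assert(pages != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (pages[i] == PG_MODEL_UNSTABLE) {
			pages[i] = PG_MODEL_CLEAN;
			released++;
		}
	}
	return released;
}
```

In this model a cache holding mostly unstable pages reports few
freeable pages until commit_unstable() runs, which mirrors why
reclaim makes no progress before the server commits.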

The Linux kernel has a mature memory reclaim mechanism to avoid
OOM even with cgroups.
After a buffered write dirties a page, the kernel calls
@balance_dirty_pages(). If the dirtied and uncommitted pages
exceed the background threshold for the global memory limits or
the memory cgroup limits, the writeback threads are woken to
perform some writeout.
When allocating a new page for I/O under memory pressure, the
kernel tries direct reclaim before allocating. For a cgroup,
it tries to reclaim pages from the memory cgroup that is over
its soft limit. The slow page allocation path with direct
reclaim calls @wakeup_flusher_threads() with WB_REASON_VMSCAN
to start writeback of dirty pages.
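
The background threshold check described above can be sketched
in user space as follows. The struct and function names are
hypothetical, and the real accounting in @balance_dirty_pages()
is considerably more involved; the sketch only shows that
exceeding the threshold in either the global domain or the
memory cgroup domain is enough to wake the writeback threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Page counters for one domain (global or a memory cgroup). */
struct dirty_domain_model {
	unsigned long nr_dirty;    /* dirtied, not yet written */
	unsigned long nr_unstable; /* written but uncommitted */
	unsigned long bg_thresh;   /* background threshold, in pages */
};

/* True when dirtied plus uncommitted pages exceed the threshold. */
bool over_bg_thresh_model(const struct dirty_domain_model *d)
{
	assert(d != NULL);
	return d->nr_dirty + d->nr_unstable > d->bg_thresh;
}

/*
 * Decide whether background writeout should start. memcg may be
 * NULL when the writer is not under memcg control.
 */
bool should_wake_flusher_model(const struct dirty_domain_model *global,
			       const struct dirty_domain_model *memcg)
{
	if (over_bg_thresh_model(global))
		return true;
	return memcg != NULL && over_bg_thresh_model(memcg);
}
```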

Our solution uses the page reclaim mechanism in the kernel
directly.
On completion of page writeback (in @brw_interpret), call
@__mark_inode_dirty() to add the dirty inode, which has pinned
uncommitted pages, to the @bdi_writeback. Each memory cgroup
has its own @bdi_writeback to control the writeback of buffered
writes within it.
Thus under memory pressure the writeback threads are woken up
and call @ll_writepages() to write out data.
For background writeout (over the background dirty threshold)
or writeback with WB_REASON_VMSCAN for direct reclaim, we first
flush the dirtied pages to the OSTs, then sync them on the OSTs
and force a commit so that these pages are released quickly.

When a cgroup is under memory pressure, the kernel requests
writeback, which then does an fsync to the OSTs. This commits
the uncommitted/unstable pages so that the kernel can finally
free them.
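
The writeback decision described above can be modelled roughly
as below. This is a user-space sketch, not the code in
@ll_writepages(): the enum values and the function name are
hypothetical, and the treatment of the other writeback reasons
(flush only) is an assumption made for illustration.

```c
#include <assert.h>

/* Hypothetical writeback reasons, loosely modelled on the kernel's. */
enum wb_reason_model {
	WB_MODEL_SYNC,
	WB_MODEL_PERIODIC,
	WB_MODEL_BACKGROUND, /* over the background dirty threshold */
	WB_MODEL_VMSCAN,     /* direct reclaim under memory pressure */
};

/* Actions taken for one writeback request, as bit flags. */
enum wb_action_model {
	WB_ACT_FLUSH  = 1 << 0, /* write dirtied pages to the OSTs */
	WB_ACT_COMMIT = 1 << 1, /* sync on the OSTs and force a commit */
};

/*
 * Every request flushes dirtied pages. Background writeout and
 * reclaim-driven writeback additionally force a commit so that
 * the pinned pages can be released quickly.
 */
unsigned int writeback_actions_model(enum wb_reason_model reason)
{
	unsigned int actions = WB_ACT_FLUSH;

	assert(reason <= WB_MODEL_VMSCAN);
	if (reason == WB_MODEL_BACKGROUND || reason == WB_MODEL_VMSCAN)
		actions |= WB_ACT_COMMIT;
	return actions;
}
```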

The following are some performance results.
The client has 512G of memory in total.
1. dd if=/dev/zero of=$test bs=1M count=$size
I/O size          128G  256G  512G  1024G
unpatched (GB/s)  2.2   2.2   2.1   2.0
patched (GB/s)    2.2   2.2   2.1   2.0
There is no performance regression after enabling unstable page
accounting with the patch.

2. One process under different memcg limits, with the total I/O
size varying from 2X memlimit to 0.5X memlimit:
dd if=/dev/zero of=$file bs=1M count=$((memlimit_mb * time))
memcg limits          1G   4G   16G  64G
2X memlimit (GB/s)    1.7  1.6  1.8  1.7
1X memlimit (GB/s)    1.9  1.9  2.2  2.2
.5X memlimit (GB/s)   2.3  2.3  2.2  2.3
Without this patch, dd with I/O size > memcg limit will be
OOM-killed.

3. Multiple cgroups testing:
8 cgroups in total, each with a memory limit of 8G.
Run a dd write in each cgroup with an I/O size of 2X the memory
limit (16G).
17179869184 bytes (17 GB, 16 GiB) copied, 12.7842 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 12.7889 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 12.9504 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 12.9577 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 13.4066 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 13.5397 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 13.5769 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 13.6605 s, 1.3 GB/s
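
The rates reported by dd above can be cross-checked from the
byte count and the elapsed time. The helper below is a
hypothetical convenience, not part of the patch; it assumes
dd's decimal unit, where 1 GB is 10^9 bytes.

```c
#include <assert.h>

/* Throughput in decimal GB/s for a transfer of bytes in seconds. */
double throughput_gbps(unsigned long long bytes, double seconds)
{
	assert(seconds > 0.0);
	return (double)bytes / seconds / 1e9;
}
```

For the fastest writer, 17179869184 bytes in 12.7842 s gives
roughly 1.34 GB/s, and for the slowest, 13.6605 s gives roughly
1.26 GB/s; dd rounds both to 1.3 GB/s.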

4. Two dd writers: one (A) is under memcg control and the other
(B) is not. The total data written is 128G. The memcg limit
varies from 1G to 128G.
cmd: ./t2p.sh $memlimit_mb
memlimit  dd writer (A)  dd writer (B)
1G        1.3GB/s        2.2GB/s
4G        1.3GB/s        2.2GB/s
16G       1.4GB/s        2.2GB/s
32G       1.5GB/s        2.2GB/s
64G       1.8GB/s        2.2GB/s
128G      2.1GB/s        2.1GB/s

The results demonstrate that the process with memcg limits
has almost no impact on the performance of the process without
limits.

Test-Parameters: clientdistro=el8.7 testlist=sanity env=ONLY=411b,ONLY_REPEAT=10
Test-Parameters: clientdistro=el9.1 testlist=sanity env=ONLY=411b,ONLY_REPEAT=10
Signed-off-by: Qian Yingjin <qian@ddn.com>
Change-Id: I7b548dcc214995c9f00d57817028ec64fd917eab
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50544
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Alex Deiter <alex.deiter@gmail.com>

author	Qian Yingjin <qian@ddn.com>
	Tue, 6 Jun 2023 08:11:30 +0000 (15:11 +0700)
committer	Oleg Drokin <green@whamcloud.com>
	Tue, 26 Sep 2023 14:33:35 +0000 (14:33 +0000)
commit	8aa231a994683a9224d42c0e7ae48aaebe2f583c
tree	0fba1c1cda594897c69edfb3279daf1639c67469
parent	2ddb1d33245c23c4cafe64fb917323bdf567c81f

lustre/autoconf/lustre-core.m4
lustre/include/cl_object.h
lustre/include/lustre_compat.h
lustre/llite/file.c
lustre/llite/llite_lib.c
lustre/llite/rw.c
lustre/llite/vvp_object.c
lustre/mdc/mdc_dev.c
lustre/obdclass/cl_object.c
lustre/obdclass/cl_page.c
lustre/osc/osc_io.c
lustre/osc/osc_request.c
lustre/tests/sanity.sh