From 79950e5d4c356fd4452c61f1097d53d30e779e7b Mon Sep 17 00:00:00 2001
From: Qian Yingjin
Date: Tue, 6 Jun 2023 15:11:30 +0700
Subject: [PATCH] LU-16713 llite: writeback/commit pages under memory pressure

Lustre buffered I/O does not work well with restrictive memcg control,
which may result in OOM when the system is under memory pressure.

Lustre implements unstable pages support similar to NFS, but it is
disabled by default for performance reasons. In Lustre, a client pins
the cached pages for a write until the write transaction has been
committed on the server (OST), even after these pinned pages have
finished writeback. The server starts a transaction commit either
because the commit interval (5 seconds by default) for the backend
storage (i.e. OST/ldiskfs) has been reached, or because there is not
enough room in the journal for a particular handle to start. Until the
write transaction has been committed and the client notified, these
pages are pinned and cannot be flushed by the kernel in any way. This
means that when a client hits memory pressure there can be a large
number of unfreeable (pinned and uncommitted) pages, and the
application on the client can end up OOM-killed because it cannot free
memory when asked to. This is particularly common with cgroups: when
cgroups are in use, the memory limit is generally much lower than the
total system memory, so the limit is reached much more easily.

The Linux kernel has a mature memory reclaim mechanism to avoid OOM,
even with cgroups. After dirtying a page, the kernel calls
@balance_dirty_pages(). If the dirtied and uncommitted pages exceed
the background threshold for the global memory limits or the memory
cgroup limits, the writeback threads are woken up to perform some
writeout. When allocating a new page for I/O under memory pressure,
the kernel tries direct reclaim before allocating. For a cgroup, it
tries to reclaim pages from the memory cgroups over their soft limit.
The slow page allocation path with direct reclaim calls
@wakeup_flusher_threads() with WB_REASON_VMSCAN to start writing back
dirty pages.

Our solution is to use the kernel's page reclaim mechanism directly.
On completion of page writeback (in @brw_interpret), call
@__mark_inode_dirty() to add the dirty inode, which still has pinned
uncommitted pages, to the corresponding @bdi_writeback; each memory
cgroup has its own @bdi_writeback to control the writeback for the
buffered writes within it. Thus under memory pressure the writeback
threads are woken up and call @ll_writepages() to write out data. For
background writeout (over the background dirty threshold) or writeback
with WB_REASON_VMSCAN for direct reclaim, we first flush the dirty
pages to the OSTs and then sync them, forcing a commit so that these
pages can be released quickly. When a cgroup is under memory pressure,
the kernel asks for writeback and then performs an fsync to the OSTs.
This commits the uncommitted/unstable pages, after which the kernel
can finally free them.

Some performance results follow. The client has 512G of memory in
total.

1. dd if=/dev/zero of=$test bs=1M count=$size

   I/O size           128G  256G  512G  1024G
   unpatched (GB/s)    2.2   2.2   2.1    2.0
   patched (GB/s)      2.2   2.2   2.1    2.0

   There is no performance regression after enabling unstable page
   accounting with the patch.

2. One process under various memcg limits, with the total I/O size
   varying from 2x the memory limit down to 0.5x the memory limit:
   dd if=/dev/zero of=$file bs=1M count=$((memlimit_mb * time))

   memcg limit           1G    4G   16G   64G
   2x  memlimit (GB/s)   1.7   1.6   1.8   1.7
   1x  memlimit (GB/s)   1.9   1.9   2.2   2.2
   .5x memlimit (GB/s)   2.3   2.3   2.2   2.3

   Without this patch, a dd with an I/O size larger than the memcg
   limit is OOM-killed.

3. Multiple-cgroup testing: 8 cgroups in total, each with a memory
   limit of 8G. Run a dd write in each cgroup with an I/O size of 2x
   the memory limit (16G):

   17179869184 bytes (17 GB, 16 GiB) copied, 12.7842 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 12.7889 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 12.9504 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 12.9577 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 13.4066 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 13.5397 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 13.5769 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 13.6605 s, 1.3 GB/s

4. Two dd writers: one (A) is under memcg control and the other (B)
   is not. The total write data is 128G. The memcg limit varies from
   1G to 128G.
   cmd: ./t2p.sh $memlimit_mb

   memlimit   dd writer (A)   dd writer (B)
   1G         1.3GB/s         2.2GB/s
   4G         1.3GB/s         2.2GB/s
   16G        1.4GB/s         2.2GB/s
   32G        1.5GB/s         2.2GB/s
   64G        1.8GB/s         2.2GB/s
   128G       2.1GB/s         2.1GB/s

   The results demonstrate that the memcg-limited process has nearly
   no impact on the performance of the process without limits.
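For reference, the memcg-limited runs above can be reproduced with a
minimal sketch along the following lines (cgroup v1 memory controller
assumed, as in test_411b below; the cgroup name "lustre_wb_test", the
mount point /mnt/lustre, and the 1G limit are illustrative, not part
of this patch):

  # Run a buffered write of 2x the memcg limit inside a memory cgroup.
  # Without this patch, such a write is likely to be OOM-killed because
  # the pinned, uncommitted pages cannot be reclaimed.
  memlimit_mb=1024                            # illustrative 1G limit
  cgdir=/sys/fs/cgroup/memory/lustre_wb_test  # illustrative cgroup name
  mkdir -p $cgdir
  echo $((memlimit_mb * 1024 * 1024)) > $cgdir/memory.limit_in_bytes
  echo $$ > $cgdir/tasks                      # move this shell into the cgroup
  dd if=/dev/zero of=/mnt/lustre/wb_test bs=1M count=$((memlimit_mb * 2))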
Lustre-change: https://review.whamcloud.com/50544
Lustre-commit: 8aa231a994683a9224d42c0e7ae48aaebe2f583c

Test-Parameters: clientdistro=el8.7 testlist=sanity env=ONLY=411b,ONLY_REPEAT=10
Test-Parameters: clientdistro=el9.1 testlist=sanity env=ONLY=411b,ONLY_REPEAT=10
Signed-off-by: Qian Yingjin
Change-Id: I7b548dcc214995c9f00d57817028ec64fd917eab
Reviewed-by: Shaun Tancheff
Reviewed-by: Patrick Farrell
Reviewed-by: Alex Deiter
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/52527
Tested-by: jenkins
Tested-by: Maloo
Reviewed-by: Andreas Dilger
---
 lustre/autoconf/lustre-core.m4 | 23 ++++++++++
 lustre/include/cl_object.h     | 19 +++++++--
 lustre/include/lustre_compat.h |  4 ++
 lustre/llite/file.c            |  3 +-
 lustre/llite/llite_lib.c       | 15 +------
 lustre/llite/rw.c              | 32 ++++++++++++++
 lustre/llite/vvp_object.c      |  8 ++++
 lustre/mdc/mdc_dev.c           | 26 +++++++++--
 lustre/obdclass/cl_object.c    | 19 +++++++++
 lustre/obdclass/cl_page.c      |  3 +-
 lustre/osc/osc_io.c            | 54 +++++++++++++++++-------
 lustre/osc/osc_request.c       | 12 +++++-
 lustre/tests/sanity-sec.sh     | 74 +++++++++++++++++++-------------
 lustre/tests/sanity.sh         | 95 ++++++++++++++++++++++++++++++++++++++++--
 14 files changed, 311 insertions(+), 76 deletions(-)

diff --git a/lustre/autoconf/lustre-core.m4 b/lustre/autoconf/lustre-core.m4
index 4031274..f98b7b6 100644
--- a/lustre/autoconf/lustre-core.m4
+++ b/lustre/autoconf/lustre-core.m4
@@ -3046,6 +3046,28 @@ LB_CHECK_EXPORT([delete_from_page_cache], [mm/filemap.c],
 	[delete_from_page_cache is exported])])
 ]) # LC_EXPORTS_DELETE_FROM_PAGE_CACHE
+
+#
+# LC_HAVE_WB_STAT_MOD
+#
+# Kernel 5.16-rc1 bd3488e7b4d61780eb3dfaca1cc6f4026bcffd48
+# mm/writeback: Rename __add_wb_stat() to wb_stat_mod()
+#
+AC_DEFUN([LC_HAVE_WB_STAT_MOD], [
+tmp_flags="$EXTRA_KCFLAGS"
+EXTRA_KCFLAGS="-Werror"
+LB_CHECK_COMPILE([if wb_stat_mod() exists],
+wb_stat_mode, [
+	#include <linux/backing-dev.h>
+],[
+	wb_stat_mod(NULL, WB_WRITEBACK, 1);
+],[
+	AC_DEFINE(HAVE_WB_STAT_MOD, 1,
+		[wb_stat_mod() exists])
+])
+EXTRA_KCFLAGS="$tmp_flags"
+]) # LC_HAVE_WB_STAT_MOD
+
 #
 # LC_HAVE_INVALIDATE_FOLIO
 #
@@ -4050,6 +4072,7 @@ AC_DEFUN([LC_PROG_LINUX_RESULTS], [
 	# 5.16
 	LC_HAVE_KIOCB_COMPLETE_2ARGS
 	LC_EXPORTS_DELETE_FROM_PAGE_CACHE
+	LC_HAVE_WB_STAT_MOD
 
 	# 5.17
 	LC_HAVE_INVALIDATE_FOLIO
diff --git a/lustre/include/cl_object.h b/lustre/include/cl_object.h
index 7a9182e..ae4de53 100644
--- a/lustre/include/cl_object.h
+++ b/lustre/include/cl_object.h
@@ -382,6 +382,14 @@ struct cl_object_operations {
 	 */
 	int (*coo_attr_update)(const struct lu_env *env, struct cl_object *obj,
 			       const struct cl_attr *attr, unsigned valid);
+	/**
+	 * Mark the inode dirty. In this way, the inode will be added to the
+	 * writeback list of the corresponding @bdi_writeback, and the dirty
+	 * pages will later be written out to OSTs via the kernel writeback
+	 * mechanism.
+	 */
+	void (*coo_dirty_for_sync)(const struct lu_env *env,
+				   struct cl_object *obj);
 	/**
 	 * Update object configuration. Called top-to-bottom to modify object
 	 * configuration.
@@ -1795,14 +1803,16 @@ enum cl_io_lock_dmd {
 
 enum cl_fsync_mode {
 	/** start writeback, do not wait for them to finish */
-	CL_FSYNC_NONE	= 0,
+	CL_FSYNC_NONE		= 0,
 	/** start writeback and wait for them to finish */
-	CL_FSYNC_LOCAL	= 1,
+	CL_FSYNC_LOCAL		= 1,
 	/** discard all of dirty pages in a specific file range */
-	CL_FSYNC_DISCARD	= 2,
+	CL_FSYNC_DISCARD	= 2,
 	/** start writeback and make sure they have reached storage before
 	 * return. OST_SYNC RPC must be issued and finished */
-	CL_FSYNC_ALL	= 3
+	CL_FSYNC_ALL		= 3,
+	/** start writeback, thus the kernel can reclaim some memory */
+	CL_FSYNC_RECLAIM	= 4,
 };
 
 struct cl_io_rw_common {
@@ -2241,6 +2251,7 @@ int cl_object_attr_get(const struct lu_env *env, struct cl_object *obj,
 			struct cl_attr *attr);
 int cl_object_attr_update(const struct lu_env *env, struct cl_object *obj,
 			  const struct cl_attr *attr, unsigned valid);
+void cl_object_dirty_for_sync(const struct lu_env *env, struct cl_object *obj);
 int cl_object_glimpse    (const struct lu_env *env, struct cl_object *obj,
 			  struct ost_lvb *lvb);
 int cl_conf_set          (const struct lu_env *env, struct cl_object *obj,
diff --git a/lustre/include/lustre_compat.h b/lustre/include/lustre_compat.h
index f6b951e..6df1bc0e 100644
--- a/lustre/include/lustre_compat.h
+++ b/lustre/include/lustre_compat.h
@@ -573,6 +573,10 @@ static inline int ll_vfs_removexattr(struct dentry *dentry, struct inode *inode,
 # endif
 #endif
 
+#ifdef HAVE_WB_STAT_MOD
+#define __add_wb_stat(wb, item, amount)	wb_stat_mod(wb, item, amount)
+#endif
+
 #ifdef HAVE_SEC_RELEASE_SECCTX_1ARG
 #ifndef HAVE_LSMCONTEXT_INIT
 /* Ubuntu 5.19 */
diff --git a/lustre/llite/file.c b/lustre/llite/file.c
index 8579f0b..e53b9cf 100644
--- a/lustre/llite/file.c
+++ b/lustre/llite/file.c
@@ -4948,7 +4948,8 @@ int cl_sync_file_range(struct inode *inode, loff_t start, loff_t end,
 	ENTRY;
 
 	if (mode != CL_FSYNC_NONE && mode != CL_FSYNC_LOCAL &&
-	    mode != CL_FSYNC_DISCARD && mode != CL_FSYNC_ALL)
+	    mode != CL_FSYNC_DISCARD && mode != CL_FSYNC_ALL &&
+	    mode != CL_FSYNC_RECLAIM)
 		RETURN(-EINVAL);
 
 	env = cl_env_get(&refcheck);
diff --git a/lustre/llite/llite_lib.c b/lustre/llite/llite_lib.c
index 22127bc..f0598ee 100644
--- a/lustre/llite/llite_lib.c
+++ b/lustre/llite/llite_lib.c
@@ -1374,8 +1374,8 @@ void ll_put_super(struct super_block *sb)
 	struct ll_sb_info *sbi = ll_s2sbi(sb);
 	char *profilenm = get_profile_name(sb);
 	unsigned long cfg_instance = ll_get_cfg_instance(sb);
-	long ccc_count;
-	int next, force = 1, rc = 0;
+	int next, force = 1;
+
 	ENTRY;
 
 	if (IS_ERR(sbi))
@@ -1397,17 +1397,6 @@ void ll_put_super(struct super_block *sb)
 		force = obd->obd_force;
 	}
 
-	/* Wait for unstable pages to be committed to stable storage */
-	if (force == 0) {
-		rc = l_wait_event_abortable(
-			sbi->ll_cache->ccc_unstable_waitq,
-			atomic_long_read(&sbi->ll_cache->ccc_unstable_nr) == 0);
-	}
-
-	ccc_count = atomic_long_read(&sbi->ll_cache->ccc_unstable_nr);
-	if (force == 0 && rc != -ERESTARTSYS)
-		LASSERTF(ccc_count == 0, "count: %li\n", ccc_count);
-
 	/* We need to set force before the lov_disconnect in
 	 * lustre_common_put_super, since l_d cleans up osc's as well.
 	 */
diff --git a/lustre/llite/rw.c b/lustre/llite/rw.c
index f628905..00b4d3d 100644
--- a/lustre/llite/rw.c
+++ b/lustre/llite/rw.c
@@ -1569,6 +1569,7 @@ int ll_writepages(struct address_space *mapping, struct writeback_control *wbc)
 	enum cl_fsync_mode mode;
 	int range_whole = 0;
 	int result;
+
 	ENTRY;
 
 	if (wbc->range_cyclic) {
@@ -1587,6 +1588,37 @@ int ll_writepages(struct address_space *mapping, struct writeback_control *wbc)
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		mode = CL_FSYNC_LOCAL;
 
+	if (wbc->sync_mode == WB_SYNC_NONE) {
+#ifdef SB_I_CGROUPWB
+		struct bdi_writeback *wb;
+
+		/*
+		 * As it may break full stripe writes on the inode,
+		 * disable periodic kupdate writeback (@wbc->for_kupdate)?
+		 */
+
+		/*
+		 * The system is under memory pressure and it is now reclaiming
+		 * cache pages.
+		 */
+		wb = inode_to_wb(inode);
+		if (wbc->for_background ||
+		    (wb->start_all_reason == WB_REASON_VMSCAN &&
+		     test_bit(WB_start_all, &wb->state)))
+			mode = CL_FSYNC_RECLAIM;
+#else
+		/*
+		 * We have no idea about the writeback reason for memory
+		 * reclaim: it is WB_REASON_TRY_TO_FREE_PAGES in old kernels
+		 * such as RHEL7 (WB_REASON_VMSCAN in newer kernels)...
+		 * Here we forcibly set the mode to CL_FSYNC_RECLAIM on old
+		 * kernels.
+		 */
+		if (!wbc->for_kupdate)
+			mode = CL_FSYNC_RECLAIM;
+#endif
+	}
+
 	if (ll_i2info(inode)->lli_clob == NULL)
 		RETURN(0);
diff --git a/lustre/llite/vvp_object.c b/lustre/llite/vvp_object.c
index ea03a84..d26a984 100644
--- a/lustre/llite/vvp_object.c
+++ b/lustre/llite/vvp_object.c
@@ -128,6 +128,13 @@ static int vvp_attr_update(const struct lu_env *env, struct cl_object *obj,
 	return 0;
 }
 
+static void vvp_dirty_for_sync(const struct lu_env *env, struct cl_object *obj)
+{
+	struct inode *inode = vvp_object_inode(obj);
+
+	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
+}
+
 static int vvp_conf_set(const struct lu_env *env, struct cl_object *obj,
 			const struct cl_object_conf *conf)
 {
@@ -288,6 +295,7 @@ static const struct cl_object_operations vvp_ops = {
 	.coo_io_init = vvp_io_init,
 	.coo_attr_get = vvp_attr_get,
 	.coo_attr_update = vvp_attr_update,
+	.coo_dirty_for_sync = vvp_dirty_for_sync,
 	.coo_conf_set = vvp_conf_set,
 	.coo_prune = vvp_prune,
 	.coo_glimpse = vvp_object_glimpse,
diff --git a/lustre/mdc/mdc_dev.c b/lustre/mdc/mdc_dev.c
index 7bfe97f..2b4d667 100644
--- a/lustre/mdc/mdc_dev.c
+++ b/lustre/mdc/mdc_dev.c
@@ -1171,6 +1171,16 @@ int mdc_io_fsync_start(const struct lu_env *env,
 
 	ENTRY;
 
+	if (fio->fi_mode == CL_FSYNC_RECLAIM) {
+		struct client_obd *cli = osc_cli(osc);
+
+		if (!atomic_long_read(&cli->cl_unstable_count)) {
+			/* Stop flush when there are no unstable pages? */
+			CDEBUG(D_CACHE, "unstable count is zero\n");
+			RETURN(0);
+		}
+	}
+
 	/* a MDC lock always covers whole object, do sync for whole
 	 * possible range despite of supplied start/end values.
 	 */
@@ -1180,19 +1190,25 @@ int mdc_io_fsync_start(const struct lu_env *env,
 		fio->fi_nr_written += result;
 		result = 0;
 	}
-	if (fio->fi_mode == CL_FSYNC_ALL) {
+	if (fio->fi_mode == CL_FSYNC_ALL || fio->fi_mode == CL_FSYNC_RECLAIM) {
+		struct osc_io *oio = cl2osc_io(env, slice);
+		struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
 		int rc;
 
-		rc = osc_cache_wait_range(env, osc, 0, CL_PAGE_EOF);
-		if (result == 0)
-			result = rc;
+		if (fio->fi_mode == CL_FSYNC_ALL) {
+			rc = osc_cache_wait_range(env, osc, 0, CL_PAGE_EOF);
+			if (result == 0)
+				result = rc;
+		}
+
 		/* Use OSC sync code because it is asynchronous.
 		 * It is to be added into MDC and avoid the using of
 		 * OST_SYNC at both MDC and MDT.
 		 */
 		rc = osc_fsync_ost(env, osc, fio);
-		if (result == 0)
+		if (result == 0) {
+			cbargs->opc_rpc_sent = 1;
 			result = rc;
+		}
 	}
 
 	RETURN(result);
diff --git a/lustre/obdclass/cl_object.c b/lustre/obdclass/cl_object.c
index 0c392db..e91f149 100644
--- a/lustre/obdclass/cl_object.c
+++ b/lustre/obdclass/cl_object.c
@@ -259,6 +259,25 @@ int cl_object_attr_update(const struct lu_env *env, struct cl_object *obj,
 EXPORT_SYMBOL(cl_object_attr_update);
 
 /**
+ * Mark the inode as dirty when the inode has uncommitted (unstable) pages.
+ * Thus when the system is under memory pressure, it will trigger writeback
+ * in the background to commit and unpin the pages.
+ */
+void cl_object_dirty_for_sync(const struct lu_env *env, struct cl_object *top)
+{
+	struct cl_object *obj;
+
+	ENTRY;
+
+	cl_object_for_each(obj, top) {
+		if (obj->co_ops->coo_dirty_for_sync != NULL)
+			obj->co_ops->coo_dirty_for_sync(env, obj);
+	}
+	EXIT;
+}
+EXPORT_SYMBOL(cl_object_dirty_for_sync);
+
+/**
  * Notifies layers (bottom-to-top) that glimpse AST was received.
  *
  * Layers have to fill \a lvb fields with information that will be shipped
diff --git a/lustre/obdclass/cl_page.c b/lustre/obdclass/cl_page.c
index b179fc4..85319f6 100644
--- a/lustre/obdclass/cl_page.c
+++ b/lustre/obdclass/cl_page.c
@@ -1267,8 +1267,7 @@ struct cl_client_cache *cl_cache_init(unsigned long lru_page_max)
 	spin_lock_init(&cache->ccc_lru_lock);
 	INIT_LIST_HEAD(&cache->ccc_lru);
 
-	/* turn unstable check off by default as it impacts performance */
-	cache->ccc_unstable_check = 0;
+	cache->ccc_unstable_check = 1;
 	atomic_long_set(&cache->ccc_unstable_nr, 0);
 	init_waitqueue_head(&cache->ccc_unstable_waitq);
 	mutex_init(&cache->ccc_max_cache_mb_lock);
diff --git a/lustre/osc/osc_io.c b/lustre/osc/osc_io.c
index 97fc034..8f3d5c2 100644
--- a/lustre/osc/osc_io.c
+++ b/lustre/osc/osc_io.c
@@ -941,15 +941,26 @@ EXPORT_SYMBOL(osc_fsync_ost);
 int osc_io_fsync_start(const struct lu_env *env,
 		       const struct cl_io_slice *slice)
 {
-	struct cl_io *io = slice->cis_io;
+	struct cl_io *io = slice->cis_io;
 	struct cl_fsync_io *fio = &io->u.ci_fsync;
-	struct cl_object *obj = slice->cis_obj;
-	struct osc_object *osc = cl2osc(obj);
-	pgoff_t start = cl_index(obj, fio->fi_start);
-	pgoff_t end = cl_index(obj, fio->fi_end);
-	int result = 0;
+	struct cl_object *obj = slice->cis_obj;
+	struct osc_object *osc = cl2osc(obj);
+	pgoff_t start = cl_index(obj, fio->fi_start);
+	pgoff_t end = cl_index(obj, fio->fi_end);
+	int result = 0;
+
 	ENTRY;
 
+	if (fio->fi_mode == CL_FSYNC_RECLAIM) {
+		struct client_obd *cli = osc_cli(osc);
+
+		if (!atomic_long_read(&cli->cl_unstable_count)) {
+			/* Stop flush when there are no unstable pages? */
+			CDEBUG(D_CACHE, "unstable count is zero\n");
+			RETURN(0);
+		}
+	}
+
 	if (fio->fi_end == OBD_OBJECT_EOF)
 		end = CL_PAGE_EOF;
 
@@ -959,20 +970,30 @@ int osc_io_fsync_start(const struct lu_env *env,
 		fio->fi_nr_written += result;
 		result = 0;
 	}
-	if (fio->fi_mode == CL_FSYNC_ALL) {
+	if (fio->fi_mode == CL_FSYNC_ALL || fio->fi_mode == CL_FSYNC_RECLAIM) {
+		struct osc_io *oio = cl2osc_io(env, slice);
+		struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
 		int rc;
 
 		/* we have to wait for writeback to finish before we can
 		 * send OST_SYNC RPC. This is bad because it causes extents
 		 * to be written osc by osc. However, we usually start
 		 * writeback before CL_FSYNC_ALL so this won't have any real
-		 * problem. */
-		rc = osc_cache_wait_range(env, osc, start, end);
-		if (result == 0)
-			result = rc;
+		 * problem.
+		 * We do not have to wait for writeback to finish in the
+		 * memory reclaim environment.
+		 */
+		if (fio->fi_mode == CL_FSYNC_ALL) {
+			rc = osc_cache_wait_range(env, osc, start, end);
+			if (result == 0)
+				result = rc;
+		}
+
 		rc = osc_fsync_ost(env, osc, fio);
-		if (result == 0)
+		if (result == 0) {
+			cbargs->opc_rpc_sent = 1;
 			result = rc;
+		}
 	}
 
 	RETURN(result);
@@ -982,16 +1003,17 @@ void osc_io_fsync_end(const struct lu_env *env,
 		      const struct cl_io_slice *slice)
 {
 	struct cl_fsync_io *fio = &slice->cis_io->u.ci_fsync;
-	struct cl_object *obj = slice->cis_obj;
+	struct cl_object *obj = slice->cis_obj;
+	struct osc_io *oio = cl2osc_io(env, slice);
+	struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
 	pgoff_t start = cl_index(obj, fio->fi_start);
 	pgoff_t end = cl_index(obj, fio->fi_end);
 	int result = 0;
 
 	if (fio->fi_mode == CL_FSYNC_LOCAL) {
 		result = osc_cache_wait_range(env, cl2osc(obj), start, end);
-	} else if (fio->fi_mode == CL_FSYNC_ALL) {
-		struct osc_io *oio = cl2osc_io(env, slice);
-		struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
+	} else if (cbargs->opc_rpc_sent && (fio->fi_mode == CL_FSYNC_ALL ||
+		   fio->fi_mode == CL_FSYNC_RECLAIM)) {
 
 		wait_for_completion(&cbargs->opc_sync);
 		if (result == 0)
diff --git a/lustre/osc/osc_request.c b/lustre/osc/osc_request.c
index c7b9d50..75a3a36 100644
--- a/lustre/osc/osc_request.c
+++ b/lustre/osc/osc_request.c
@@ -2588,6 +2588,7 @@ static int brw_interpret(const struct lu_env *env,
 	struct osc_extent *tmp;
 	struct client_obd *cli = aa->aa_cli;
 	unsigned long transferred = 0;
+	struct cl_object *obj = NULL;
 
 	ENTRY;
 
@@ -2628,7 +2629,6 @@ static int brw_interpret(const struct lu_env *env,
 		struct obdo *oa = aa->aa_oa;
 		struct cl_attr *attr = &osc_env_info(env)->oti_attr;
 		unsigned long valid = 0;
-		struct cl_object *obj;
 		struct osc_async_page *last;
 		if (aa->aa_ncppga)
 			last = brw_page2oap(aa->aa_ncppga[aa->aa_ncpage_count - 1]);
@@ -2681,8 +2681,16 @@ static int brw_interpret(const struct lu_env *env,
 	OBD_SLAB_FREE_PTR(aa->aa_oa, osc_obdo_kmem);
 	aa->aa_oa = NULL;
 
-	if (lustre_msg_get_opc(req->rq_reqmsg) == OST_WRITE && rc == 0)
+	if (lustre_msg_get_opc(req->rq_reqmsg) == OST_WRITE && rc == 0) {
 		osc_inc_unstable_pages(req);
+		/*
+		 * If req->rq_committed is set, it means that the dirty pages
+		 * have already been committed to stable storage on the OSTs
+		 * (i.e. Direct I/O).
+		 */
+		if (!req->rq_committed)
+			cl_object_dirty_for_sync(env, cl_object_top(obj));
+	}
 
 	list_for_each_entry_safe(ext, tmp, &aa->aa_exts, oe_link) {
 		list_del_init(&ext->oe_link);
diff --git a/lustre/tests/sanity-sec.sh b/lustre/tests/sanity-sec.sh
index cb8abf1..c26e8f9 100755
--- a/lustre/tests/sanity-sec.sh
+++ b/lustre/tests/sanity-sec.sh
@@ -2812,10 +2812,15 @@ insert_enc_key() {
 }
 
 remove_enc_key() {
-	cancel_lru_locks
+	local dummy_key
+
+	$LCTL set_param -n ldlm.namespaces.*.lru_size=clear
 	sync ; echo 3 > /proc/sys/vm/drop_caches
-	keyctl revoke $(keyctl show | awk '$7 ~ "^fscrypt:" {print $1}')
-	keyctl reap
+	dummy_key=$(keyctl show | awk '$7 ~ "^fscrypt:" {print $1}')
+	if [ -n "$dummy_key" ]; then
+		keyctl revoke $dummy_key
+		keyctl reap
+	fi
 }
 
 wait_ssk() {
@@ -2830,6 +2835,39 @@ wait_ssk() {
 	fi
 }
 
+remount_client_normally() {
+	# remount client without dummy encryption key
+	if is_mounted $MOUNT; then
+		umount_client $MOUNT || error "umount $MOUNT failed"
+	fi
+	mount_client $MOUNT ${MOUNT_OPTS} ||
+		error "remount failed"
+
+	if is_mounted $MOUNT2; then
+		umount_client $MOUNT2 || error "umount $MOUNT2 failed"
+	fi
+	if [ "$MOUNT_2" ]; then
+		mount_client $MOUNT2 ${MOUNT_OPTS} ||
+			error "remount failed"
+	fi
+
+	remove_enc_key
+	wait_ssk
+}
+
+remount_client_dummykey() {
+	insert_enc_key
+
+	# remount client with dummy encryption key
+	if is_mounted $MOUNT; then
+		umount_client $MOUNT || error "umount $MOUNT failed"
+	fi
+	mount_client $MOUNT ${MOUNT_OPTS},test_dummy_encryption ||
+		error "remount failed"
+
+	wait_ssk
+}
+
 setup_for_enc_tests() {
 	# remount client with test_dummy_encryption option
 	if is_mounted $MOUNT; then
@@ -2845,31 +2883,9 @@ setup_for_enc_tests() {
 }
 
 cleanup_for_enc_tests() {
-	local dummy_key
-
 	rm -rf $DIR/$tdir $*
 
-	# remount client normally
-	if is_mounted $MOUNT; then
-		umount_client $MOUNT || error "umount $MOUNT failed"
-	fi
-	mount_client $MOUNT ${MOUNT_OPTS} ||
-		error "remount failed"
-
-	if is_mounted $MOUNT2; then
-		umount_client $MOUNT2 || error "umount $MOUNT2 failed"
-	fi
-	if [ "$MOUNT_2" ]; then
-		mount_client $MOUNT2 ${MOUNT_OPTS} ||
-			error "remount failed"
-	fi
-
-	# remove fscrypt key from keyring
-	dummy_key=$(keyctl show | awk '$7 ~ "^fscrypt:" {print $1}')
-	if [ -n "$dummy_key" ]; then
-		keyctl revoke $dummy_key
-		keyctl reap
-	fi
+	remount_client_normally
 }
 
 cleanup_nodemap_after_enc_tests() {
@@ -3604,10 +3620,8 @@ test_46() {
 	fi
 	sync ; echo 3 > /proc/sys/vm/drop_caches
 
-	# remove fscrypt key from keyring
-	keyctl revoke $(keyctl show | awk '$7 ~ "^fscrypt:" {print $1}')
-	keyctl reap
-	cancel_lru_locks
+	# remount without dummy encryption key
+	remount_client_normally
 
 	# this is $testdir2
 	scrambleddir=$(find $DIR/$tdir/ -maxdepth 1 -mindepth 1 -inum $inum)
diff --git a/lustre/tests/sanity.sh b/lustre/tests/sanity.sh
index 1b7ce6c..003b27a 100755
--- a/lustre/tests/sanity.sh
+++ b/lustre/tests/sanity.sh
@@ -69,7 +69,12 @@ fi
 # skip cgroup tests on RHEL8.1 kernels until they are fixed
 if (( $LINUX_VERSION_CODE >= $(version_code 4.18.0) &&
      $LINUX_VERSION_CODE < $(version_code 5.4.0) )); then
-	always_except LU-13063 411
+	always_except LU-13063 411a
+fi
+
+# skip cgroup tests for kernels < v4.18.0
+if (( $LINUX_VERSION_CODE < $(version_code 4.18.0) )); then
+	always_except LU-13063 411b
 fi
 
 # 5          12     8   12  15   (min)"
@@ -26007,10 +26012,11 @@ run_test 410 "Test inode number returned from kernel thread"
 
 cleanup_test411_cgroup() {
 	trap 0
+	cat $1/memory.stat
 	rmdir "$1"
 }
 
-test_411() {
+test_411a() {
 	local cg_basedir=/sys/fs/cgroup/memory
 	# LU-9966
 	test -f "$cg_basedir/memory.kmem.limit_in_bytes" ||
 		skip "no setup for cgroup"
@@ -26035,7 +26041,90 @@
 	return 0
 }
-run_test 411 "Slab allocation error with cgroup does not LBUG"
+run_test 411a "Slab allocation error with cgroup does not LBUG"
+
+test_411b() {
+	local cg_basedir=/sys/fs/cgroup/memory
+	# LU-9966
+	[ -e "$cg_basedir/memory.kmem.limit_in_bytes" ] ||
+		skip "no setup for cgroup"
+
+	$LFS setstripe -c 2 $DIR/$tfile || error "unable to setstripe"
+	# testing suggests we can't reliably avoid OOM with a 64M limit, but it
+	# seems reasonable to ask that we have at least 128M in the cgroup
+	local memlimit_mb=256
+
+	# Create a cgroup and set memory limit
+	# (tfile is used as an easy way to get a recognizable cgroup name)
+	local cgdir=$cg_basedir/$tfile
+	mkdir $cgdir || error "cgroup mkdir '$cgdir' failed"
+	stack_trap "cleanup_test411_cgroup $cgdir" EXIT
+	echo $((memlimit_mb * 1024 * 1024)) > $cgdir/memory.limit_in_bytes
+
+	echo "writing first file"
+	# Write a file 4x the memory limit in size
+	bash -c "echo \$$ > $cgdir/tasks && dd if=/dev/zero of=$DIR/$tfile bs=1M count=$((memlimit_mb * 4))" ||
+		error "(1) failed to write successfully"
+
+	sync
+	cancel_lru_locks osc
+
+	rm -f $DIR/$tfile
+	$LFS setstripe -c 2 $DIR/$tfile || error "unable to setstripe"
+
+	# Try writing at a larger block size
+	# NB: if block size is >= 1/2 cgroup size, we sometimes get OOM killed
+	# so test with 1/4 cgroup size (this seems reasonable to me - we do
+	# need *some* memory to do IO in)
+	echo "writing at larger block size"
+	bash -c "echo \$$ > $cgdir/tasks && dd if=/dev/zero of=$DIR/$tfile bs=64M count=$((memlimit_mb * 4 / 128))" ||
+		error "(3) failed to write successfully"
+
+	sync
+	cancel_lru_locks osc
+	rm -f $DIR/$tfile
+	$LFS setstripe -c 2 $DIR/$tfile.{1..4} || error "unable to setstripe"
+
+	# Try writing multiple files at once
+	echo "writing multiple files"
+	bash -c "echo \$$ > $cgdir/tasks && dd if=/dev/zero of=$DIR/$tfile.1 bs=32M count=$((memlimit_mb * 4 / 64))" &
+	local pid1=$!
+	bash -c "echo \$$ > $cgdir/tasks && dd if=/dev/zero of=$DIR/$tfile.2 bs=32M count=$((memlimit_mb * 4 / 64))" &
+	local pid2=$!
+	bash -c "echo \$$ > $cgdir/tasks && dd if=/dev/zero of=$DIR/$tfile.3 bs=32M count=$((memlimit_mb * 4 / 64))" &
+	local pid3=$!
+	bash -c "echo \$$ > $cgdir/tasks && dd if=/dev/zero of=$DIR/$tfile.4 bs=32M count=$((memlimit_mb * 4 / 64))" &
+	local pid4=$!
+
+	wait $pid1
+	local rc1=$?
+	wait $pid2
+	local rc2=$?
+	wait $pid3
+	local rc3=$?
+	wait $pid4
+	local rc4=$?
+	if (( rc1 != 0 )); then
+		error "error writing to file from $pid1"
+	fi
+	if (( rc2 != 0 )); then
+		error "error writing to file from $pid2"
+	fi
+	if (( rc3 != 0 )); then
+		error "error writing to file from $pid3"
+	fi
+	if (( rc4 != 0 )); then
+		error "error writing to file from $pid4"
+	fi
+
+	sync
+	cancel_lru_locks osc
+
+	# These files can be large-ish (~1 GiB total), so delete them rather
+	# than leave for later cleanup
+	rm -f $DIR/$tfile.*
+	return 0
+}
+run_test 411b "confirm Lustre can avoid OOM with reasonable cgroups limits"
 
 test_412() {
 	(( $MDSCOUNT > 1 )) || skip_env "needs >= 2 MDTs"
-- 
1.8.3.1