LU-17872 ldlm: switch to read_positive in reclaim_full
Checking for reclaim-full on every lock request is expensive:
summing the per-CPU lock counter takes a global spinlock and
can completely clog the MDS CPUs on larger systems.
If we switch our counter read to percpu_counter_read_positive()
rather than percpu_counter_sum_positive(), we avoid this
spinlock at the cost of being off by as much as NR_CPU * 32.
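For illustration, a minimal sketch of the change; the counter and
threshold names below approximate the ldlm reclaim code and are
not copied verbatim:

    #include <linux/percpu_counter.h>

    /* Approximate names; the real symbols live in lustre/ldlm. */
    static struct percpu_counter ldlm_granted_total;
    static u64 ldlm_lock_reclaim_threshold;

    static bool ldlm_reclaim_full(void)
    {
        /*
         * Before: percpu_counter_sum_positive() folds in every
         * CPU's local delta under the counter's spinlock, so each
         * lock enqueue serializes on one lock:
         *
         *     return ldlm_lock_reclaim_threshold != 0 &&
         *            percpu_counter_sum_positive(&ldlm_granted_total) >
         *            ldlm_lock_reclaim_threshold;
         *
         * After: read the shared total without taking the lock; it
         * can lag the true count by up to NR_CPU * batch, which is
         * noise against a threshold of hundreds of thousands of
         * locks.
         */
        return ldlm_lock_reclaim_threshold != 0 &&
               percpu_counter_read_positive(&ldlm_granted_total) >
               ldlm_lock_reclaim_threshold;
    }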
Since the counter tracks hundreds of thousands to millions of
items and is only used to trigger memory reclaim, this level of
error is completely acceptable.
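For scale: even on the 384-core system tested below, the
worst-case drift is 384 * 32 = 12,288 locks, a few percent at
most of a count in the hundreds of thousands to millions.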
This resolves the contention issue. On an OCI system with
384 cores, here is our mdtest comparison (rates in ops/sec):
Operation | Without Patch | With Patch | %Change
---------------------|---------------|-------------|-------
Directory creation | 69481.994 | 64373.060 | -7%
Directory stat | 87942.757 | 274670.454 | 212%
Directory rename | 78127.922 | 92592.239 | 19%
Directory removal | 69901.490 | 89560.415 | 28%
File creation | 62789.774 | 107294.450 | 71%
File stat | 88039.061 | 480469.711 | 446%
File read | 82192.370 | 151117.380 | 84%
File removal | 146690.828 | 127589.655 | -13%
Tree creation | 46.549 | 56.992 | 22%
Tree removal | 51.531 | 53.967 | 5%
Note the *446%* improvement in file stat and the 71% and 84%
gains in file creation and file read.
Note that this issue is likely much worse on systems with higher
core counts, since the cost of summing the counter scales with
the number of CPUs; this may be why it has not been reported
before.
Lustre-change: https://review.whamcloud.com/55141
Lustre-commit: 0c16987b2233c32d775f0e3e6f6503c4b7825e02
Signed-off-by: Patrick Farrell <patrick.farrell@oracle.com>
Signed-off-by: Xing Huang <hxing@ddn.com>
Change-Id: I01a39abf5e6f0829156b413b1f44001e2c504be2
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: wangdi <di.d.wang@oracle.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/55479
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>