LU-14518 ptlrpc: avoid server STONITH for slow request 25/53225/11
author Andreas Dilger <adilger@whamcloud.com>
Fri, 12 Mar 2021 22:32:45 +0000 (15:32 -0700)
committer Oleg Drokin <green@whamcloud.com>
Sun, 8 Sep 2024 16:03:08 +0000 (16:03 +0000)
commit bc927408f63bc4b64a81f9f25c95445005bb8f66
tree 4e4ae84dd8438c57ea3f44daffc971d29a776261
parent 05af6c1b8ef31fe30863359c8c06431ebd159e9e
LU-14518 ptlrpc: avoid server STONITH for slow request

If a service is temporarily overloaded, request processing times
may exceed at_max for a short time.  We don't want to increase
at_max excessively, since that slows down client RPC resend and
recovery, but we also want to avoid server STONITH because at_max
is used directly by ptlrpc_svcpt_health_check() to determine if
the service reports "NOT HEALTHY" and forces HA takeover.

Slow request processing is not as serious as, say, an LBUG, so
add a configurable parameter at_unhealthy_factor that lets
requests exceed at_max before a service is considered unhealthy.
The factor defaults to 3 (i.e. 3x at_max); 0 disables service
health checks.
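
As an illustration of the threshold described above, here is a minimal
sketch in C.  It is not the code from this patch: the helper names
unhealthy_limit() and exceeds_limit() are hypothetical, and the value
600 used for at_max in the usage note is only an example.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not the Lustre implementation): seconds a
 * request may take before the service would be treated as unhealthy.
 * A result of 0 means health checking is disabled.
 */
static unsigned int unhealthy_limit(unsigned int at_max,
				    unsigned int at_unhealthy_factor)
{
	return at_max * at_unhealthy_factor;
}

/*
 * Hypothetical helper: true only if checking is enabled (factor != 0)
 * and the elapsed time is beyond at_max * at_unhealthy_factor.
 */
static bool exceeds_limit(unsigned int elapsed, unsigned int at_max,
			  unsigned int at_unhealthy_factor)
{
	unsigned int limit = unhealthy_limit(at_max, at_unhealthy_factor);

	if (limit == 0)		/* factor 0: health check disabled */
		return false;

	return elapsed > limit;
}
```

With at_max = 600 and the default factor of 3, a request that has run
for 700 seconds is past at_max but still inside the 1800 second limit,
so under this sketch it would not mark the service unhealthy.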

Also, importantly, it shouldn't be considered an error if the
RPC *requests* are waiting a long time (that can happen if the
server is overloaded, or NRS is delaying some RPCs), but only
if the *service* is unable to process *any* requests for a long
time.  Otherwise, only a warning about delayed RPCs is printed.
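
The distinction between delayed requests and a stalled service can be
sketched roughly as follows.  This is an assumption-laden illustration,
not the logic of ptlrpc_svcpt_health_check(): the enum, the function
classify_service(), and its inputs (age of the oldest queued request,
time since the service last completed any request) are hypothetical.
The sketch also assumes that a factor of 0 suppresses only the
unhealthy verdict, not the warning.

```c
/* Hypothetical result codes for this sketch only. */
enum svc_health {
	SVC_OK,			/* nothing is waiting too long */
	SVC_WARN_DELAYED,	/* RPCs are delayed, service still works */
	SVC_UNHEALTHY,		/* service has processed nothing for too long */
};

/*
 * oldest_wait:     seconds the oldest queued request has waited
 * since_last_done: seconds since the service last finished any request
 *
 * A long wait alone only warns; the service is reported unhealthy only
 * when it has also made no progress for at_max * at_unhealthy_factor.
 */
static enum svc_health classify_service(unsigned int oldest_wait,
					unsigned int since_last_done,
					unsigned int at_max,
					unsigned int at_unhealthy_factor)
{
	if (oldest_wait <= at_max)
		return SVC_OK;

	/* Assumption: factor 0 disables only the unhealthy verdict. */
	if (at_unhealthy_factor != 0 &&
	    since_last_done > at_max * at_unhealthy_factor)
		return SVC_UNHEALTHY;

	return SVC_WARN_DELAYED;
}
```

For example, with at_max = 600 and a factor of 3, a request queued for
2000 seconds on a service that completed another request 10 seconds ago
yields SVC_WARN_DELAYED in this sketch rather than SVC_UNHEALTHY.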

Add sanityn.sh test_200 to exercise the related health check.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: Ifaf4454efacf5f5ec8fc24f75a49e17e5a3ebbe5
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/53225
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-by: Sergey Cheremencev <scherementsev@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
12 files changed:
lustre/include/lustre_net.h
lustre/include/obd.h
lustre/include/obd_support.h
lustre/mdc/lproc_mdc.c
lustre/mdt/mdt_lproc.c
lustre/mgs/lproc_mgs.c
lustre/obdclass/class_obd.c
lustre/obdclass/obd_sysfs.c
lustre/ptlrpc/ptlrpc_internal.h
lustre/ptlrpc/service.c
lustre/tests/sanityn.sh
lustre/tests/test-framework.sh