LU-14518 ptlrpc: avoid server STONITH for slow request
If a service is temporarily overloaded, request processing times
may exceed at_max for a short time. We don't want to increase
at_max excessively, since that slows down client RPC resend and
recovery. However, at_max is also used directly by
ptlrpc_svcpt_health_check() to determine whether the service
reports "NOT HEALTHY", which forces HA takeover and server STONITH.
Slow request processing is not as serious as, say, an LBUG, so
add a configurable parameter at_unhealthy_factor that lets
requests exceed at_max before a service is considered unhealthy.
The threshold defaults to 3x at_max, and 0 disables service health checks.
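As a rough illustration of the threshold described above, the following
sketch computes the delay after which a service would be treated as
unhealthy. The helper name unhealthy_limit_sketch() and the macro are
hypothetical and are not taken from the patch. The example values assume
at_max is expressed in seconds.

```c
/*
 * Illustrative sketch only; the helper and macro names are hypothetical
 * and do not appear in the ptlrpc sources.
 */
#define AT_UNHEALTHY_FACTOR_DEFAULT_SKETCH 3	/* "3x at_max" per the commit message */

/*
 * Return the delay (same unit as at_max) beyond which the service would
 * be considered unhealthy, or 0 if the health check is disabled.
 */
static unsigned int unhealthy_limit_sketch(unsigned int at_max,
					   unsigned int at_unhealthy_factor)
{
	/* A factor of 0 disables the service health check entirely. */
	if (at_unhealthy_factor == 0)
		return 0;

	return at_max * at_unhealthy_factor;
}
```

For example, assuming at_max = 600, the default factor of 3 gives a limit
of 1800, while a factor of 0 returns 0 to mean "check disabled".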
Also, importantly, it shouldn't be considered an error if the
RPC *requests* are waiting a long time (that can happen if the
server is overloaded, or NRS is delaying some RPCs), but only
if the *service* is unable to process *any* requests in a long
time. Otherwise, only a warning about delayed RPCs is printed.
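The distinction between delayed requests and a stalled service could be
sketched roughly as below. This is not the code from the patch: the enum,
the function name, and both time arguments are hypothetical. It also
assumes that the same limit applies to the queued-request wait and to the
time since the service last handled a request, and that a limit of 0
means the check is disabled.

```c
/*
 * Illustrative sketch only; all names here are hypothetical and are not
 * the identifiers used by ptlrpc_svcpt_health_check().
 */
enum svc_state_sketch {
	SVC_OK_SKETCH,			/* nothing to report */
	SVC_WARN_DELAYED_SKETCH,	/* requests delayed, service still working */
	SVC_NOT_HEALTHY_SKETCH,		/* service has processed nothing for too long */
};

static enum svc_state_sketch
classify_service_sketch(unsigned int unhealthy_limit,
			unsigned int oldest_queued_wait,
			unsigned int since_last_handled)
{
	/* Assumption: a limit of 0 means health checks are disabled. */
	if (unhealthy_limit == 0)
		return SVC_OK_SKETCH;

	/* No request has waited past the limit: nothing to report. */
	if (oldest_queued_wait <= unhealthy_limit)
		return SVC_OK_SKETCH;

	/* Requests are delayed AND the service handled nothing recently. */
	if (since_last_handled > unhealthy_limit)
		return SVC_NOT_HEALTHY_SKETCH;

	/* Requests are delayed but the service is still making progress. */
	return SVC_WARN_DELAYED_SKETCH;
}
```

With a limit of 1800, a request queued for 2000 while the service last
handled an RPC 5 ago would only produce the warning state. The unhealthy
state requires the service itself to have been idle past the limit.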
Add sanityn.sh test_200 to exercise the related health check.
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: Ifaf4454efacf5f5ec8fc24f75a49e17e5a3ebbe5
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/53225
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-by: Sergey Cheremencev <scherementsev@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
12 files changed: