LU-17854 lnet: Router should not drop msg past deadline
It has been observed that messages can become queued in LNet on
router nodes so long that they exceed their message deadlines. These
messages will currently be dropped, even if the target peer is alive.
PtlRPC adaptive timeouts can dynamically increase to account for the
increased network latency, but if the RPCs are dropped on routers then
these operations will fail. Routers should only drop messages when
the router peer health feature determines the target is down. This
gives Lustre the best chance to complete operations during periods of
increased network latency.
A bug in sanity-lnet/do_route_del() is fixed. The lnetctl route show
output was stored in a variable named "output", but the variable
"lnetctl_text" was checked to determine if the route needed to be
deleted.
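The variable mismatch can be illustrated with a minimal sketch. This is
not the actual body of do_route_del(); show_routes() is a hypothetical
stand-in for the lnetctl route show call, and route_present_buggy() and
route_present_fixed() are names invented here for illustration only.

```shell
# Hypothetical stand-in for "lnetctl route show"; the real command
# queries LNet, this one just prints a fixed sample.
show_routes() {
	printf 'route:\n    - net: tcp1\n      gateway: 10.0.0.1@tcp\n'
}

# Buggy shape: the output is captured in "output", but "lnetctl_text"
# is the variable that gets checked. Since lnetctl_text is never set
# here, the match fails even when the route exists.
route_present_buggy() {
	local net="$1"
	local output
	output=$(show_routes)
	# "${lnetctl_text-}" keeps the sketch runnable under "set -u"
	echo "${lnetctl_text-}" | grep -q "net: $net"
}

# Fixed shape: check the same variable that holds the captured output.
route_present_fixed() {
	local net="$1"
	local output
	output=$(show_routes)
	echo "$output" | grep -q "net: $net"
}
```

With the buggy shape the check never finds the route, so a caller
would skip the delete; the fixed shape reports the route as present
only when it appears in the captured output.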
test_102() was also modified to call cleanup_router_test(). A
comment there indicated it was not needed because the routes were
already deleted, but cleanup_router_test() does more than just
delete the route entries: it also unloads modules on all nodes.
Test-Parameters: trivial testlist=sanity-lnet
HPE-bug-id: LUS-12153
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I1e6966d4a3a2b10dd7b99620774d5c32b7eccd1f
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/55131
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>