LU-17855 lnet: Set peer NI down on lnet_notify
The LNet router peer health feature is intended to allow LNet routers
to drop messages for peer NIs that they consider down/unreachable so
that resources can be freed to forward messages to peer NIs that are
up/reachable.
This feature was integrated with the LNet health feature under
LU-11300, and, as a result, routers only consider a peer NI
down/unreachable if two criteria are met:
1. The router hasn't received a message from the peer NI within the
LND's "peer_timeout" value (default 180 seconds).
2. The health value of the peer NI has been decremented or the cached
peer NI status is LNET_NI_STATUS_DOWN.
(1) is problematic because a lot of messages can be queued to a down
peer while we wait for the peer_timeout to expire. This can
introduce latency for messages being forwarded to other peers.
(2) is problematic because there are some cases where LNet health
will not be decremented (namely single-rail peers), and the cached
peer NI status can only be set to LNET_NI_STATUS_DOWN if the router
receives a discovery push from the peer. If the peer loses all
connectivity to the router then it is possible the router will never
consider it down.
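
As a rough illustration of the pre-patch rule described above, the
two criteria can be modeled as a single predicate. This is only a
sketch: the struct, its fields, and old_peer_ni_is_down() are
hypothetical names made up for this example, not the actual LNet
data structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of the per-peer-NI state a router
 * consults. Field names are assumptions for illustration only. */
struct peer_ni_model {
	int64_t last_alive;      /* time (seconds) the last message arrived */
	int health;              /* current health value */
	int max_health;          /* value of a fully healthy peer NI */
	bool cached_status_down; /* cached status is LNET_NI_STATUS_DOWN */
};

/* Pre-patch rule as described in the text: BOTH criteria must hold. */
static bool old_peer_ni_is_down(const struct peer_ni_model *p,
				int64_t now, int64_t peer_timeout)
{
	assert(p != NULL);

	/* (1) nothing received within the LND's peer_timeout */
	bool timed_out = (now - p->last_alive) > peer_timeout;
	/* (2) health decremented, or cached status says DOWN */
	bool unhealthy = p->health < p->max_health ||
			 p->cached_status_down;

	return timed_out && unhealthy;
}
```

In this model, a single-rail peer whose health is never decremented
and which never sends a discovery push stays "up" no matter how long
it has been silent, which is the failure mode described for (2).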
To address the problems with (1), the requirement is dropped
completely.
To address the problems with (2), LNet routers will now decrement
health values of single-rail peers and lnet_notify() is modified to
set the peer NI status UP/DOWN according to the aliveness information
provided by the LND.
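
The resulting behavior can be sketched roughly as follows. This is a
simplified model under stated assumptions, not the code in this
patch: struct peer_ni_sketch and the sketch_*() helpers are
hypothetical names, and the real lnet_notify() takes different
arguments and does considerably more work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified per-peer-NI state (illustration only). */
struct peer_ni_sketch {
	int health;       /* current health value */
	int max_health;   /* value of a fully healthy peer NI */
	bool status_down; /* models a status of LNET_NI_STATUS_DOWN */
};

/* Models the modified notify path: the status follows the aliveness
 * information reported by the LND. */
static void sketch_notify(struct peer_ni_sketch *p, bool alive)
{
	assert(p != NULL);
	p->status_down = !alive;
}

/* Models a health decrement, now also applied to single-rail peers.
 * The amount is an assumption; the value is clamped at zero. */
static void sketch_health_decrement(struct peer_ni_sketch *p, int amount)
{
	assert(p != NULL);
	p->health = p->health > amount ? p->health - amount : 0;
}

/* Post-patch rule as described: no peer_timeout requirement. */
static bool sketch_peer_ni_is_down(const struct peer_ni_sketch *p)
{
	assert(p != NULL);
	return p->health < p->max_health || p->status_down;
}
```

In this model, a call such as sketch_notify(&p, false) makes the
peer NI count as down immediately, without waiting for any timeout,
and sketch_notify(&p, true) clears that status again.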
HPE-bug-id: LUS-12209
Test-Parameters: trivial testlist=sanity-lnet
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I7823cc7ae73bcb0b6b52db8d4f84cff7b999d8c0
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/55342
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>