git://git.whamcloud.com - fs/lustre-release.git/commit

LU-16214 kfilnd: Keep stale peer entries

A peer is currently removed from the cache whenever there is a network
failure associated with the peer. This leads to situations where
incoming messages from that peer will be dropped until a handshake
can be completed.

If we instead keep these stale peer entries then we at least have a
chance of completing future transactions with the peer.

To accomplish this, we add a peer state to struct kfilnd_peer.

When a kfilnd_peer is newly allocated, it is assigned a state of
KP_STATE_NEW. kfilnd_peer_is_new_peer() is modified to check for this
state rather than checking whether kp_version is set.

When a handshake completes, the peer is assigned a state of
KP_STATE_UPTODATE.

When an up-to-date peer experiences a failed network operation, it is
assigned a state of KP_STATE_STALE. kfilnd_peer_stale() is
introduced to set this state. Existing callers of kfilnd_peer_down()
are converted to call kfilnd_peer_stale(). kfilnd_peer_down() is
renamed to kfilnd_peer_del().

We initiate a handshake with any peer that is in either
KP_STATE_NEW or KP_STATE_STALE. kfilnd_peer_needs_hello() is
modified accordingly.
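The state test reduces to a simple switch. peer_needs_hello() is a hypothetical stand-in covering only the state check described here; the real kfilnd_peer_needs_hello() may consider other conditions not covered by this commit message:

```c
#include <assert.h>
#include <stdbool.h>

enum kp_state { KP_STATE_NEW, KP_STATE_UPTODATE, KP_STATE_STALE };

/* A handshake is initiated for NEW and STALE peers only. */
static inline bool peer_needs_hello(enum kp_state state)
{
	switch (state) {
	case KP_STATE_NEW:
	case KP_STATE_STALE:
		return true;
	case KP_STATE_UPTODATE:
	default:
		return false;
	}
}
```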

struct kfilnd_peer::kp_last_alive is checked by kfilnd_peer_stale().
If we haven't heard from a stale peer within five LND timeout periods,
then that peer is deleted.
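A sketch of the decision kfilnd_peer_stale() makes on a failed operation, assuming timestamps in seconds and treating "more than five LND timeouts since kp_last_alive" as the deletion condition. enum stale_action, STALE_PEER_TIMEOUT_PERIODS and peer_stale_decide() are hypothetical names; returning an action instead of acting directly is only for illustration, and how NEW peers are handled is an assumption:

```c
#include <assert.h>
#include <time.h>

enum kp_state { KP_STATE_NEW, KP_STATE_UPTODATE, KP_STATE_STALE };

/* Hypothetical outcome of handling a failed network operation. */
enum stale_action {
	STALE_ACTION_NONE,		/* keep the peer as it is */
	STALE_ACTION_MARK_STALE,	/* up-to-date peer becomes stale */
	STALE_ACTION_DELETE,		/* stale peer silent for too long */
};

/* Five LND timeout periods, per the commit message. */
#define STALE_PEER_TIMEOUT_PERIODS 5

static inline enum stale_action
peer_stale_decide(enum kp_state state, time_t last_alive, time_t now,
		  time_t lnd_timeout)
{
	if (state == KP_STATE_UPTODATE)
		return STALE_ACTION_MARK_STALE;

	/* Assumed boundary: strictly more than five periods of silence. */
	if (state == KP_STATE_STALE &&
	    now - last_alive > STALE_PEER_TIMEOUT_PERIODS * lnd_timeout)
		return STALE_ACTION_DELETE;

	/* NEW peers, and stale peers heard from recently, are kept here. */
	return STALE_ACTION_NONE;
}
```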

An additional kfilnd_peer_alive() call is added to
kfilnd_tn_state_idle() for the TN_EVENT_RX_HELLO case, so that
peer aliveness is updated when we receive a hello request or response.
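The effect of the extra kfilnd_peer_alive() call can be sketched as refreshing kp_last_alive whenever a hello request or response arrives, which pushes back the stale-peer deletion deadline. peer_alive(), peer_rx_hello() and peer_silent_too_long() are hypothetical stand-ins, and timestamps are assumed to be in seconds:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in holding only the aliveness timestamp. */
struct peer_sketch {
	time_t kp_last_alive;	/* last time we heard from the peer */
};

static inline void peer_alive(struct peer_sketch *kp, time_t now)
{
	kp->kp_last_alive = now;
}

/* Hypothetical TN_EVENT_RX_HELLO handling in the idle state: update
 * aliveness first; processing of the hello itself is omitted. */
static inline void peer_rx_hello(struct peer_sketch *kp, time_t now)
{
	peer_alive(kp, now);
}

/* True if the peer has been silent for more than five LND timeouts. */
static inline bool peer_silent_too_long(const struct peer_sketch *kp,
					time_t now, time_t lnd_timeout)
{
	return now - kp->kp_last_alive > 5 * lnd_timeout;
}
```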

HPE-bug-id: LUS-11125
Test-Parameters: trivial
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: Icfb722e58fa334d983df02742dc456a55ac2abc3
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/48785
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Ian Ziemba <ian.ziemba@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Ron Gredvig <ron.gredvig@hpe.com>

author	Chris Horn <chris.horn@hpe.com>
	Fri, 19 Aug 2022 20:27:26 +0000 (14:27 -0600)
committer	Oleg Drokin <green@whamcloud.com>
	Thu, 19 Jan 2023 15:30:35 +0000 (15:30 +0000)
commit	c1f7eaa24f14aa567b80d99676c765db2b331d40
tree	83f29af66d684f1af0eced70a1d7a2e0d3c64979
parent	08bbe9e562c403f247a74e99101d238398df6351

lnet/klnds/kfilnd/kfilnd.c
lnet/klnds/kfilnd/kfilnd.h
lnet/klnds/kfilnd/kfilnd_peer.c
lnet/klnds/kfilnd/kfilnd_peer.h
lnet/klnds/kfilnd/kfilnd_tn.c