LU-16214 kfilnd: Keep stale peer entries
Currently, a peer is removed from the cache whenever a network failure
is associated with it. As a result, incoming messages from that peer
are dropped until a handshake can be completed.
If we instead keep these stale peer entries, we at least have a
chance of completing future transactions with the peer.
To accomplish this, we introduce states to struct kfilnd_peer.
When a kfilnd_peer is newly allocated, it is assigned a state of
KP_STATE_NEW. kfilnd_peer_is_new_peer() is modified to check for this
state rather than checking whether kp_version is set.
When a handshake is completed, the peer is assigned a state of
KP_STATE_UPTODATE.
When an up-to-date peer experiences a failed network operation, it is
assigned a state of KP_STATE_STALE. kfilnd_peer_stale() is introduced
to set this state. Existing callers of kfilnd_peer_down() are
converted to call kfilnd_peer_stale(). kfilnd_peer_down() is renamed
to kfilnd_peer_del().
We initiate a handshake with any peer that is in either
KP_STATE_NEW or KP_STATE_STALE. kfilnd_peer_needs_hello() is
modified accordingly.
struct kfilnd_peer::kp_last_alive is checked by kfilnd_peer_stale().
If we haven't heard from a stale peer within five LND timeout periods,
then that peer is deleted.
An additional kfilnd_peer_alive() call is added to
kfilnd_tn_state_idle() for the TN_EVENT_RX_HELLO case, so that
peer aliveness is updated when we receive a hello request or response.
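
The state transitions described above can be sketched in userspace C
as follows. This is an illustration only, not the code in this patch:
the KP_STATE_* names and kp_last_alive come from the description
above, while struct sketch_peer, the sketch_* helpers, the deleted
flag, plain integer timestamps, and the lnd_timeout parameter are all
hypothetical. It also assumes that a failure on a KP_STATE_NEW peer
leaves the state unchanged and that a stale peer is deleted only once
strictly more than five timeout periods have elapsed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State names are from the commit description; the enum tag is hypothetical. */
enum sketch_kp_state {
	KP_STATE_NEW,
	KP_STATE_UPTODATE,
	KP_STATE_STALE,
};

/* Hypothetical stand-in for struct kfilnd_peer. */
struct sketch_peer {
	enum sketch_kp_state kp_state;
	long kp_last_alive;	/* seconds; last time we heard from the peer */
	bool deleted;		/* stands in for removal from the peer cache */
};

/* A stale peer is deleted after this many LND timeout periods of silence. */
#define SKETCH_STALE_PERIODS 5

/* A newly allocated peer starts in KP_STATE_NEW. */
static void sketch_peer_init(struct sketch_peer *kp, long now)
{
	assert(kp != NULL);
	kp->kp_state = KP_STATE_NEW;
	kp->kp_last_alive = now;
	kp->deleted = false;
}

/* A completed handshake moves the peer to KP_STATE_UPTODATE. */
static void sketch_handshake_done(struct sketch_peer *kp)
{
	kp->kp_state = KP_STATE_UPTODATE;
}

/* Record that we heard from the peer, e.g. on a hello request or response. */
static void sketch_peer_alive(struct sketch_peer *kp, long now)
{
	kp->kp_last_alive = now;
}

/* A handshake is initiated for peers that are new or stale. */
static bool sketch_peer_needs_hello(const struct sketch_peer *kp)
{
	return kp->kp_state == KP_STATE_NEW ||
	       kp->kp_state == KP_STATE_STALE;
}

/*
 * Called on a failed network operation. An up-to-date peer becomes stale
 * but is kept. An already-stale peer is deleted only if it has been silent
 * for more than SKETCH_STALE_PERIODS timeout periods (assumed boundary).
 */
static void sketch_peer_stale(struct sketch_peer *kp, long now,
			      long lnd_timeout)
{
	if (kp->kp_state == KP_STATE_UPTODATE) {
		kp->kp_state = KP_STATE_STALE;
		return;
	}

	if (kp->kp_state == KP_STATE_STALE &&
	    now - kp->kp_last_alive > SKETCH_STALE_PERIODS * lnd_timeout)
		kp->deleted = true;
}
```

In this sketch, a first failure on an up-to-date peer only marks it
stale, so the entry stays available for incoming messages. Deletion
requires a later failure after the peer has been silent for the full
window, and any sketch_peer_alive() call in between restarts that
window.
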
HPE-bug-id: LUS-11125
Test-Parameters: trivial
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: Icfb722e58fa334d983df02742dc456a55ac2abc3
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/48785
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Ian Ziemba <ian.ziemba@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Ron Gredvig <ron.gredvig@hpe.com>