LU-17838 kfilnd: Prevent simultaneous hellos
author Chris Horn <chris.horn@hpe.com>
Tue, 14 Nov 2023 16:35:15 +0000 (09:35 -0700)
committer Oleg Drokin <green@whamcloud.com>
Wed, 29 May 2024 04:49:24 +0000 (04:49 +0000)
commit 3e826ccfce3a42fa75ed0b63518eb8c5b1f599cf
tree 11f4bb1ea1a6e9abe48ea8081306b0fd5edc15cf
parent d1fd0115a4af30356b812d7cb49dec6a76c4cb72
LU-17838 kfilnd: Prevent simultaneous hellos

There is a race condition with checking, setting and clearing the
kp_hello_pending flag that can result in multiple hello requests being
sent for the same peer. If no hello response is received after the
LND timeout then multiple threads can race with each other in
clearing the kp_hello_pending flag and posting a new hello request
message.

Thread 1: sets kp_hello_pending and posts hello request message
<No hello response received after LND timeout>
Thread 2: Clears kp_hello_pending, then sets kp_hello_pending
Thread 3: Clears kp_hello_pending, then sets kp_hello_pending
Thread 2/3: Both post a hello request message
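
The interleaving above can be modeled in a few lines of userspace C.
This is only an illustrative sketch, not the kfilnd code: struct
peer_model, set_check_hello_pending(), retry_after_timeout() and
hellos_posted_after_timeout() are hypothetical names, and the two
retrying threads are simulated by two sequential calls rather than
real concurrency.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-peer state; not the kfilnd struct. */
struct peer_model {
	bool hello_pending;	/* models the old binary kp_hello_pending */
	int hellos_posted;	/* counts hello requests posted */
};

/* Set the flag and return its previous value. */
static bool set_check_hello_pending(struct peer_model *p)
{
	bool old = p->hello_pending;

	p->hello_pending = true;
	return old;
}

/* What each thread does once it sees the LND timeout expire. */
static void retry_after_timeout(struct peer_model *p)
{
	p->hello_pending = false;		/* clear the stale flag */
	if (!set_check_hello_pending(p))	/* flag was clear, so proceed */
		p->hellos_posted++;		/* post a hello request */
}

/* Returns how many hellos threads 2 and 3 post after the timeout. */
int hellos_posted_after_timeout(void)
{
	struct peer_model p = { false, 0 };
	int before;

	/* Thread 1: sets the flag and posts the first hello. */
	if (!set_check_hello_pending(&p))
		p.hellos_posted++;
	before = p.hellos_posted;

	/* No response arrives; threads 2 and 3 each run the retry path. */
	retry_after_timeout(&p);
	retry_after_timeout(&p);

	return p.hellos_posted - before;
}
```

Because each retrying thread clears the flag unconditionally, the
second thread undoes the first thread's claim, and in this model both
of them post a hello.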

To resolve this issue we change kp_hello_pending from a simple binary
flag into a field that tracks three states of a hello request:
KP_HELLO_NONE, KP_HELLO_INIT, and KP_HELLO_SENT. State is NONE when
there is no hello in the process of being sent. State is INIT when a
thread is allocating a HELLO request in preparation for sending. State
is SENT when the HELLO request is being posted. Now, when several
threads detect that no hello response has been received within the LND
timeout, only one of them is able to transition the hello state from
SENT -> NONE.
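
A minimal userspace sketch of the three-state idea, assuming the
transitions are made with an atomic compare-and-swap. The state names
come from this commit; struct peer_sketch, hello_transition() and
count_timeout_winners() are hypothetical, and the patch itself may use
different kernel primitives to guard the state.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* State names taken from the commit message. */
enum hello_state {
	KP_HELLO_NONE,	/* no hello in the process of being sent */
	KP_HELLO_INIT,	/* a thread is allocating the HELLO request */
	KP_HELLO_SENT,	/* the HELLO request is being posted */
};

/* Hypothetical stand-in for the per-peer state; not the kfilnd struct. */
struct peer_sketch {
	_Atomic int hello_state;
};

/*
 * Move the peer from state 'from' to state 'to'.
 * Returns true only for the caller that actually made the transition.
 */
static bool hello_transition(struct peer_sketch *p, int from, int to)
{
	int expected = from;

	return atomic_compare_exchange_strong(&p->hello_state, &expected, to);
}

/*
 * Model 'attempts' callers that all noticed the LND timeout and try to
 * move SENT -> NONE. Returns how many of them succeeded.
 */
int count_timeout_winners(struct peer_sketch *p, int attempts)
{
	int winners = 0;
	int i;

	for (i = 0; i < attempts; i++)
		if (hello_transition(p, KP_HELLO_SENT, KP_HELLO_NONE))
			winners++;
	return winners;
}
```

With the peer in KP_HELLO_SENT, count_timeout_winners(&peer, 3) counts
one successful transition: the first compare-and-swap moves the state
to NONE and the remaining ones find it no longer equal to SENT. The
same holds when the attempts come from different threads, since the
compare and the store happen as a single atomic step.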

Add CFS_KFI_REPLAY_IDLE_EVENT fail_loc that can be used to delay
processing of TNs in the idle state depending on the TN event
value specified in fail_val.

HPE-bug-id: LUS-11974
Test-Parameters: trivial
Fixes: 11a32d886b ("LU-16213 kfilnd: Allow one HELLO in-flight per peer")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I4dddf57971848a80a550df7523d55ad03f4a083e
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/55069
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Ian Ziemba <ian.ziemba@hpe.com>
Reviewed-by: Ron Gredvig <ron.gredvig@hpe.com>
Reviewed-by: Caleb Carlson <caleb.carlson@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
lnet/klnds/kfilnd/kfilnd.c
lnet/klnds/kfilnd/kfilnd.h
lnet/klnds/kfilnd/kfilnd_peer.c
lnet/klnds/kfilnd/kfilnd_tn.c