LU-17838 kfilnd: Prevent simultaneous hellos
There is a race condition with checking, setting and clearing the
kp_hello_pending flag that can result in multiple hello requests being
sent for the same peer. If no hello response is received after the
LND timeout then multiple threads can race with each other in
clearing the kp_hello_pending flag and posting a new hello request
message.
Thread 1: sets kp_hello_pending and posts hello request message
<No hello response received after LND timeout>
Thread 2: Clears kp_hello_pending, then sets kp_hello_pending
Thread 3: Clears kp_hello_pending, then sets kp_hello_pending
Thread 2/3: Both post hello request message
To resolve this issue, change kp_hello_pending from a simple binary
flag to a value that tracks three states of a hello request:
KP_HELLO_NONE, KP_HELLO_INIT, and KP_HELLO_SENT. State is NONE when
there is no hello in the process of being sent. State is INIT when a
thread is allocating a HELLO request in preparation for sending. State
is SENT when the HELLO request is being posted. Now, when multiple
threads detect that no hello response has been received after LND
timeout seconds, only one of them will be able to transition the
hello state from SENT -> NONE.
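The state transitions described above can be sketched in user-space C
with C11 atomics. This is only an illustration under stated
assumptions, not the code in this patch: struct peer_sketch and the
hello_try_begin(), hello_mark_sent() and hello_try_expire() helpers
are hypothetical names, and the kernel code would use its own atomic
primitives rather than <stdatomic.h>. Only the three state names come
from this change.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* State names from this change; the numeric values are an assumption. */
enum kp_hello_state {
	KP_HELLO_NONE = 0,	/* no hello in the process of being sent */
	KP_HELLO_INIT,		/* a thread is allocating the hello request */
	KP_HELLO_SENT,		/* the hello request is being posted */
};

/* Hypothetical stand-in for the peer structure. */
struct peer_sketch {
	_Atomic int hello_state;
};

/* True only for the single caller that wins the NONE -> INIT race. */
static bool hello_try_begin(struct peer_sketch *kp)
{
	int expected = KP_HELLO_NONE;

	return atomic_compare_exchange_strong(&kp->hello_state, &expected,
					      KP_HELLO_INIT);
}

/* Called by the thread that owns the INIT state once it posts the hello. */
static void hello_mark_sent(struct peer_sketch *kp)
{
	atomic_store(&kp->hello_state, KP_HELLO_SENT);
}

/* True only for the single caller that wins the SENT -> NONE race. */
static bool hello_try_expire(struct peer_sketch *kp)
{
	int expected = KP_HELLO_SENT;

	return atomic_compare_exchange_strong(&kp->hello_state, &expected,
					      KP_HELLO_NONE);
}
```

Because each transition is a single compare-and-exchange, a second
thread that observes the same timeout finds the state already changed
and its exchange fails, so it does not go on to post a duplicate hello.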
Add CFS_KFI_REPLAY_IDLE_EVENT fail_loc that can be used to delay
processing of TNs in the idle state depending on the TN event
value specified in fail_val.
HPE-bug-id: LUS-11974
Test-Parameters: trivial
Fixes: 11a32d886b ("LU-16213 kfilnd: Allow one HELLO in-flight per peer")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I4dddf57971848a80a550df7523d55ad03f4a083e
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/55069
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Ian Ziemba <ian.ziemba@hpe.com>
Reviewed-by: Ron Gredvig <ron.gredvig@hpe.com>
Reviewed-by: Caleb Carlson <caleb.carlson@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>