LU-16149 lnet: Discovery queue and deletion race

lnet_peer_deletion() can race with another thread calling
lnet_peer_queue_for_discovery().

Discovery thread:
 - Calls lnet_peer_deletion():
   - Clears LNET_PEER_DISCOVERING bit from lnet_peer::lp_state
   - Releases lnet_peer::lp_lock

Another thread:
 - Acquires lnet_net_lock/EX
 - Calls lnet_peer_queue_for_discovery()
   - Takes lnet_peer::lp_lock
   - Sets LNET_PEER_DISCOVERING bit
   - Releases lnet_peer::lp_lock
   - Sees lnet_peer::lp_dc_list is not empty, so it does not add peer
     to dc request queue
 - lnet_peer_queue_for_discovery() returns, lnet_net_lock/EX is released

Discovery thread:
 - Acquires lnet_net_lock/EX
 - Deletes peer from ln_dc_working list
 - Performs the peer deletion

At this point, the peer is not on any discovery list, and it has
LNET_PEER_DISCOVERING bit set. This peer is now stranded, and any
messages on the peer's lnet_peer::lp_dc_pendq are likewise stranded.
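
The interleaving above can be replayed deterministically in a small
single-threaded model. This is only a sketch: struct toy_peer, its
fields and the toy_* functions are hypothetical stand-ins for the real
LNet structures, and all locking is elided because the steps are run
in the problematic order by hand.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for LNET_PEER_DISCOVERING. */
#define TOY_PEER_DISCOVERING 0x1u

/* Hypothetical model of the parts of lnet_peer that matter here. */
struct toy_peer {
	unsigned int state;     /* models lnet_peer::lp_state */
	bool on_dc_list;        /* models a non-empty lnet_peer::lp_dc_list */
	bool on_request_queue;  /* models membership of the dc request queue */
};

/* Models lnet_peer_queue_for_discovery(): set the bit, then enqueue
 * only if the peer is not already on a discovery list. */
static void toy_queue_for_discovery(struct toy_peer *lp)
{
	lp->state |= TOY_PEER_DISCOVERING;
	if (!lp->on_dc_list) {
		lp->on_dc_list = true;
		lp->on_request_queue = true;
	}
}

/* Replays the racy interleaving. Returns true if the peer ends up
 * stranded: DISCOVERING set, but on no discovery list. */
static bool toy_replay_race(void)
{
	/* Peer is being discovered and sits on the working list. */
	struct toy_peer lp = { TOY_PEER_DISCOVERING, true, false };

	/* Discovery thread: old lnet_peer_deletion() clears the bit
	 * and then drops lp_lock. */
	lp.state &= ~TOY_PEER_DISCOVERING;

	/* Another thread: sets the bit again, but does not enqueue
	 * because the peer is still on the working list. */
	toy_queue_for_discovery(&lp);

	/* Discovery thread: removes the peer from the working list
	 * and performs the deletion. */
	lp.on_dc_list = false;

	return (lp.state & TOY_PEER_DISCOVERING) &&
	       !lp.on_dc_list && !lp.on_request_queue;
}
```

In this model toy_replay_race() returns true, which corresponds to the
stranded state described above.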

To solve this, we modify lnet_peer_deletion() so that it waits to
clear the LNET_PEER_DISCOVERING bit until it has completed deleting
the peer and re-acquired the lnet_peer::lp_lock. This ensures we
cannot race with any other thread that may add the
LNET_PEER_DISCOVERING bit back to the peer. We also avoid deleting
the peer from the ln_dc_working list in lnet_peer_deletion(). This is
already done by lnet_peer_discovery_complete().
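
The reordered sequence can be sketched in the same kind of
single-threaded model. Again, struct toy_peer and the toy_* functions
are hypothetical stand-ins, locking is elided, and the queueing thread
is run by hand inside the window where lp_lock is dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for LNET_PEER_DISCOVERING. */
#define TOY_PEER_DISCOVERING 0x1u

/* Hypothetical model of the parts of lnet_peer that matter here. */
struct toy_peer {
	unsigned int state;     /* models lnet_peer::lp_state */
	bool on_dc_list;        /* models a non-empty lnet_peer::lp_dc_list */
	bool on_request_queue;  /* models membership of the dc request queue */
};

/* Models lnet_peer_queue_for_discovery(): set the bit, then enqueue
 * only if the peer is not already on a discovery list. */
static void toy_queue_for_discovery(struct toy_peer *lp)
{
	lp->state |= TOY_PEER_DISCOVERING;
	if (!lp->on_dc_list) {
		lp->on_dc_list = true;
		lp->on_request_queue = true;
	}
}

/* Replays the same interleaving with the reordered deletion.
 * Returns true if the peer ends in a consistent state: DISCOVERING
 * clear and not on any discovery list. */
static bool toy_replay_fixed(void)
{
	/* Peer is being discovered and sits on the working list. */
	struct toy_peer lp = { TOY_PEER_DISCOVERING, true, false };

	/* Discovery thread: lnet_peer_deletion() drops lp_lock but
	 * leaves the DISCOVERING bit set. */

	/* Another thread: the bit is already set and the peer is on a
	 * list, so this call changes nothing. */
	toy_queue_for_discovery(&lp);

	/* Discovery thread: deletion is done, lp_lock is re-acquired,
	 * and only now is the bit cleared. */
	lp.state &= ~TOY_PEER_DISCOVERING;

	/* Models lnet_peer_discovery_complete() taking the peer off
	 * the working list. */
	lp.on_dc_list = false;

	return !(lp.state & TOY_PEER_DISCOVERING) &&
	       !lp.on_dc_list && !lp.on_request_queue;
}
```

Because the bit stays set for the whole deletion, the concurrent
queueing attempt is a no-op in this model and the peer is left neither
marked as discovering nor on a list.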

There is another window where the LNET_PEER_DISCOVERING bit can be
added when the discovery thread drops the lp_lock just before
acquiring the net_lock/EX and calling lnet_peer_discovery_complete().
Have lnet_peer_discovery_complete() clear LNET_PEER_DISCOVERING to
deal with this (it already does this for the case where discovery hit
an error). Also move the deletion of lp_dc_list to after we clear the
DISCOVERING bit. This is to mirror the behavior of
lnet_peer_queue_for_discovery() which sets the DISCOVERING bit and
then manipulates the lp_dc_list.
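
The mirrored ordering can be illustrated with a minimal model. It does
not exercise any concurrency; it only shows that the queueing side and
the completion side both touch the state bit first and the list
second. All toy_* names are hypothetical and locking is elided.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for LNET_PEER_DISCOVERING. */
#define TOY_PEER_DISCOVERING 0x1u

/* Hypothetical model of the parts of lnet_peer that matter here. */
struct toy_peer {
	unsigned int state;  /* models lnet_peer::lp_state */
	bool on_dc_list;     /* models a non-empty lnet_peer::lp_dc_list */
};

/* Queueing side: set the bit first, then manipulate the list.
 * Returns false if the peer was already on a discovery list. */
static bool toy_queue_for_discovery(struct toy_peer *lp)
{
	lp->state |= TOY_PEER_DISCOVERING;
	if (lp->on_dc_list)
		return false;
	lp->on_dc_list = true;
	return true;
}

/* Completion side: clear the bit first, then manipulate the list,
 * in the same bit-then-list order as the queueing side. */
static void toy_discovery_complete(struct toy_peer *lp)
{
	lp->state &= ~TOY_PEER_DISCOVERING;
	lp->on_dc_list = false;
}
```

After toy_discovery_complete() the model peer has no bit set and is on
no list, so a later toy_queue_for_discovery() call queues it again.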

Also tweak the logic in lnet_peer_deletion() to call
lnet_peer_del_locked() in order to avoid extra calls to
lnet_net_lock()/lnet_net_unlock().
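
The saving from calling a _locked variant directly can be sketched with
a toy lock that counts acquisitions. Everything here is hypothetical:
struct toy_lock stands in for lnet_net_lock/EX, toy_peer_del() for a
wrapper that takes the lock itself, and toy_peer_del_locked() for a
variant that expects the caller to already hold it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock that only tracks state and acquisition count. */
struct toy_lock {
	bool held;
	int acquisitions;
};

struct toy_peer {
	bool deleted;
};

static void toy_lock_take(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
	l->acquisitions++;
}

static void toy_lock_drop(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Caller must already hold the lock. */
static void toy_peer_del_locked(struct toy_lock *l, struct toy_peer *lp)
{
	assert(l->held);
	lp->deleted = true;
}

/* Wrapper that takes and drops the lock around the deletion. */
static void toy_peer_del(struct toy_lock *l, struct toy_peer *lp)
{
	toy_lock_take(l);
	toy_peer_del_locked(l, lp);
	toy_lock_drop(l);
}

/* Deletion via the wrapper, followed by work that needs the lock
 * again. Returns the number of lock acquisitions. */
static int toy_delete_with_wrapper(void)
{
	struct toy_lock l = { false, 0 };
	struct toy_peer lp = { false };

	toy_peer_del(&l, &lp);
	toy_lock_take(&l);      /* re-take for the follow-up work */
	toy_lock_drop(&l);
	return l.acquisitions;
}

/* Same work done under a single lock hold. */
static int toy_delete_with_locked_variant(void)
{
	struct toy_lock l = { false, 0 };
	struct toy_peer lp = { false };

	toy_lock_take(&l);
	toy_peer_del_locked(&l, &lp);
	/* follow-up work happens here, still under the lock */
	toy_lock_drop(&l);
	return l.acquisitions;
}
```

In this model the wrapper path takes the lock twice and the _locked
path takes it once.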

HPE-bug-id: LUS-11237
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: Ifcfef1d49f216af4ddfcdaf928024e8ee3952555
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/48532
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Cyril Bordage <cbordage@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>