LU-13362 lnet: Disc reply race with finalize and routed recv 37/37937/2
author Chris Horn <hornc@cray.com>
Fri, 13 Mar 2020 19:34:23 +0000 (14:34 -0500)
committer Oleg Drokin <green@whamcloud.com>
Tue, 24 Mar 2020 05:16:25 +0000 (05:16 +0000)
commit c700b4a410ab5542391d006ce541023ecf9b7a5d
tree 6750f49abab98458175f275c43cf090b2b6d4cee
parent 6742ac7c8ad50cb18d371b473869f2bd26a6d79a
LU-13362 lnet: Disc reply race with finalize and routed recv

A race exists between a thread handling a discovery reply and
another thread in the lnet_finalize() call path, or in any of
the code paths that lead to lnet_post_routed_recv_locked().

The discovery reply handler takes the lp_lock, and while holding
that lock, tries to acquire the lpni_lock for each lpni associated
with the lnet_peer object.

In lnet_return_rx_credits_locked() (lnet_finalize() code path) and
lnet_post_routed_recv_locked() (called via a couple of different code
paths), the lpni_lock is taken first, and then the lp_lock is taken for
the associated lnet_peer object.

Thread A: spin_lock(lp_lock)
Thread B: spin_lock(lpni_lock)
Thread B: spin_lock(lp_lock)
Thread A: spin_lock(lpni_lock)

This results in deadlock: each thread holds the lock the other is
waiting for. The lp_lock and lpni_lock do not need to be held at the
same time in lnet_return_rx_credits_locked() nor in
lnet_post_routed_recv_locked().
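
The lock ordering described above can be illustrated with a minimal
userspace sketch. This is not the LNet code: it assumes pthread mutexes
in place of kernel spinlocks, and the demo_* types, fields, and functions
are hypothetical stand-ins. It shows one way to avoid the inversion,
namely dropping the lpni-level lock before taking the peer-level lock so
the two are never held together on the credit-return side.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for lnet_peer and lnet_peer_ni (assumption). */
struct demo_peer {
	pthread_mutex_t lp_lock;	/* models lp_lock */
	unsigned int state;		/* illustrative peer state bits */
};

struct demo_peer_ni {
	pthread_mutex_t lpni_lock;	/* models lpni_lock */
	int credits;			/* illustrative credit counter */
	bool seen_reply;		/* illustrative per-NI flag */
	struct demo_peer *peer;		/* owning peer */
};

void demo_init(struct demo_peer *lp, struct demo_peer_ni *lpni, int credits)
{
	pthread_mutex_init(&lp->lp_lock, NULL);
	lp->state = 0;
	pthread_mutex_init(&lpni->lpni_lock, NULL);
	lpni->credits = credits;
	lpni->seen_reply = false;
	lpni->peer = lp;
}

/* Discovery-reply style path: lp_lock is outer, lpni_lock is inner. */
void demo_discovery_reply(struct demo_peer *lp, struct demo_peer_ni *lpni,
			  unsigned int flag)
{
	pthread_mutex_lock(&lp->lp_lock);
	lp->state |= flag;
	pthread_mutex_lock(&lpni->lpni_lock);
	lpni->seen_reply = true;
	pthread_mutex_unlock(&lpni->lpni_lock);
	pthread_mutex_unlock(&lp->lp_lock);
}

/*
 * Credit-return style path: lpni_lock is released before lp_lock is
 * taken, so this path never waits for lp_lock while holding lpni_lock.
 * Returns true if any peer state bit is set.
 */
bool demo_return_credit(struct demo_peer_ni *lpni)
{
	struct demo_peer *lp;
	bool flagged;

	pthread_mutex_lock(&lpni->lpni_lock);
	lpni->credits++;
	lp = lpni->peer;
	pthread_mutex_unlock(&lpni->lpni_lock);

	assert(lp != NULL);
	pthread_mutex_lock(&lp->lp_lock);
	flagged = lp->state != 0;
	pthread_mutex_unlock(&lp->lp_lock);

	return flagged;
}
```

Because demo_return_credit() holds at most one of the two locks at any
time, a thread running it concurrently with demo_discovery_reply() cannot
form the Thread A / Thread B cycle shown above.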

Cray-bug-id: LUS-8607
Signed-off-by: Chris Horn <hornc@cray.com>
Change-Id: Ie4e9a172b4705d9f5723a6da1ff251b380ad47ac
Reviewed-on: https://review.whamcloud.com/37937
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
lnet/lnet/lib-move.c