From: Chris Horn Date: Mon, 25 Sep 2023 19:03:20 +0000 (-0500) Subject: LU-17103 lnet: Avoid deadlock when updating ping buffer X-Git-Tag: 2.15.59~55 X-Git-Url: https://git.whamcloud.com/?a=commitdiff_plain;h=3ca6ba39a21cfebc81bbe7f889c486bb82bb563a;p=fs%2Flustre-release.git LU-17103 lnet: Avoid deadlock when updating ping buffer lnet_peer_send_push() adds a reference to the the_lnet.ln_ping_target lnet_ping_buffer. This reference is dropped by lnet_discovery_event_handler. When the LNet configuration is modified the ln_api_mutex is held and lnet_ping_target_update() is called to update the ln_ping_target to reflect the new configuration. While holding the ln_api_mutex, lnet_ping_target_update() will wait until all refs on the old ping buffer are dropped. This can result in a deadlock if the ln_api_mutex is required to complete the push. Here is one scenario where this can happen: 1. PUSH is sent by discovery thread 2. LNet configuration is modified. lnetctl process is holding ln_api_mutex and waiting in lnet_ping_target_update() 3. Local NI goes into recovery 4. Monitor thread wakes and attempts to send ping to local NI. If this is the first ping sent to this NI then monitor thread needs ln_api_mutex to create peer NI object for local NI. (LNetGet-> lnet_send-> lnet_select_pathway-> lnet_peerni_by_nid_locked-> mutex_lock(&the_lnet.ln_api_mutex)) 5. PUSH (1) fails with local timeout. It is placed on monitor thread resend queue. 6. monitor thread cannot process resend queue until it acquires ln_api_mutex. ln_api_mutex cannot be acquired until monitor thread processes resend queue. Deadlock. Fix is to drop ln_api_mutex before calling lnet_ping_md_unlink() in lnet_ping_target_update(). This should ensure that updates to the ping target are still synchronized via ln_api_mutex as intended, but we're able to clear refs on the old ping buffer as needed. Test-Parameters: trivial Test-Parameters: testlist=sanity-lnet env=ONLY=207,ONLY_REPEAT=50 Signed-off-by: Chris Horn Change-Id: I20cda185a865192f1ad162eaef1b8b4e5d751b2c Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/52479 Tested-by: jenkins Tested-by: Maloo Reviewed-by: James Simmons Reviewed-by: Oleg Drokin Reviewed-by: Frank Sehr Reviewed-by: Serguei Smirnov --- diff --git a/lnet/lnet/api-ni.c b/lnet/lnet/api-ni.c index 267f4cc..c36f004 100644 --- a/lnet/lnet/api-ni.c +++ b/lnet/lnet/api-ni.c @@ -2102,6 +2102,7 @@ lnet_ping_target_install_locked(struct lnet_ping_buffer *pbuf) static void lnet_ping_target_update(struct lnet_ping_buffer *pbuf, struct lnet_handle_md ping_mdh) +__must_hold(&the_lnet.ln_api_mutex) { struct lnet_ping_buffer *old_pbuf = NULL; struct lnet_handle_md old_ping_md; @@ -2130,8 +2131,15 @@ lnet_ping_target_update(struct lnet_ping_buffer *pbuf, lnet_net_unlock(LNET_LOCK_EX); if (old_pbuf) { - /* unlink and free the old ping info */ + /* unlink and free the old ping info. + * There may be outstanding traffic on this MD, and + * ln_api_mutex may be required to finalize that + * traffic. Release ln_api_mutex while we wait for + * refs on this ping buffer to drop + */ + mutex_unlock(&the_lnet.ln_api_mutex); lnet_ping_md_unlink(old_pbuf, &old_ping_md); + mutex_lock(&the_lnet.ln_api_mutex); lnet_ping_buffer_decref(old_pbuf); }