LU-13892 lnet: lock-up during router check 72/40172/2
author Serguei Smirnov <ssmirnov@whamcloud.com>
Wed, 7 Oct 2020 22:51:06 +0000 (18:51 -0400)
committer Oleg Drokin <green@whamcloud.com>
Thu, 22 Oct 2020 06:19:05 +0000 (06:19 +0000)
Fix an LNet lock-up that occurs while waiting for routers to
become active when the check_routers_before_use option is
set. Release ln_api_mutex while waiting so that incoming
connections can be handled.

Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Change-Id: I63b1d1ce5ee2b27a3bd2cea78713fc6fc7502cf7
Reviewed-on: https://review.whamcloud.com/40172
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Olaf Faaland-LLNL <faaland1@llnl.gov>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
lnet/lnet/router.c

index 806cf8c..e2966cf 100644
@@ -842,6 +842,7 @@ lnet_wait_known_routerstate(void)
 
        LASSERT(the_lnet.ln_mt_state == LNET_MT_STATE_RUNNING);
 
+       /* the_lnet.ln_api_mutex must be locked */
        for (;;) {
                int cpt = lnet_net_lock_current();
 
@@ -865,8 +866,10 @@ lnet_wait_known_routerstate(void)
                if (all_known)
                        return;
 
+               mutex_unlock(&the_lnet.ln_api_mutex);
                set_current_state(TASK_UNINTERRUPTIBLE);
                schedule_timeout(cfs_time_seconds(1));
+               mutex_lock(&the_lnet.ln_api_mutex);
        }
 }
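
The locking pattern in this change can be illustrated outside the kernel. The following is a minimal user-space sketch using POSIX threads, not the LNet code itself: `api_mutex`, `all_known`, `incoming_handler`, `wait_known_state` and `run_demo` are hypothetical stand-ins, and `nanosleep` replaces `schedule_timeout`. It assumes, as the commit message describes, that the work which makes the state "known" needs the same mutex the waiter holds, so the waiter has to drop the mutex around its sleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins for the_lnet.ln_api_mutex and the router state. */
static pthread_mutex_t api_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool all_known;                  /* protected by api_mutex */

/* Plays the role of incoming-connection handling: it needs api_mutex
 * before it can record that the state is known. */
static void *incoming_handler(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&api_mutex);
        all_known = true;
        pthread_mutex_unlock(&api_mutex);
        return NULL;
}

/* Caller must hold api_mutex; it is held again on return.
 * Returns how many times the waiter had to sleep. */
static int wait_known_state(void)
{
        const struct timespec delay = { 0, 10 * 1000 * 1000 }; /* 10 ms */
        int sleeps = 0;

        for (;;) {
                if (all_known)
                        return sleeps;

                /* Drop the mutex while sleeping so the handler can run.
                 * Sleeping with the mutex held would block it forever. */
                pthread_mutex_unlock(&api_mutex);
                nanosleep(&delay, NULL);
                pthread_mutex_lock(&api_mutex);
                sleeps++;
        }
}

/* Returns 1 if the waiter slept at least once and then saw the state
 * change, 0 if the helper thread could not be created. */
int run_demo(void)
{
        pthread_t tid;
        int sleeps;

        pthread_mutex_lock(&api_mutex);
        all_known = false;
        if (pthread_create(&tid, NULL, incoming_handler, NULL) != 0) {
                pthread_mutex_unlock(&api_mutex);
                return 0;
        }
        sleeps = wait_known_state();
        assert(all_known);              /* still holding api_mutex here */
        pthread_mutex_unlock(&api_mutex);
        pthread_join(tid, NULL);
        return sleeps >= 1;
}
```

Calling `run_demo()` from a test program returns 1 once the handler thread has been let in. Removing the unlock/lock pair around `nanosleep` makes the sketch hang, which is the user-space analogue of the lock-up described above.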