LU-13001 lnet: Wait for single discovery attempt of routers
author Chris Horn <hornc@cray.com>
Fri, 22 Nov 2019 20:19:03 +0000 (14:19 -0600)
committer Oleg Drokin <green@whamcloud.com>
Sat, 14 Dec 2019 05:58:04 +0000 (05:58 +0000)
Historically, check_routers_before_use would cause LNet
initialization to pause until all routers had been pinged once.

This behavior was changed in commit
fe17e9b8370affe063769b880f02b9190584baaa from LU-11298. Since then,
LNet waits indefinitely until discovery completes on all routers.
This is problematic, because if even one router is down, LNet
stalls forever.

Introduce a new lnet_peer state to indicate whether a router has
been discovered (either successfully or not) to restore the historic
behavior.
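
The intended state handling can be sketched in user space as follows.
This is only an illustration under stated assumptions: the two flag
values are taken from this patch, but struct sketch_peer and the
sketch_* helpers are hypothetical stand-ins, and the lp_lock locking
used by the real code is omitted.

```c
/* Flag values as defined in lib-types.h by this patch. */
#define LNET_PEER_RTR_DISCOVERY  (1 << 16)
#define LNET_PEER_RTR_DISCOVERED (1 << 17)

/* Hypothetical, stripped-down peer; the real struct lnet_peer
 * carries many more fields and protects lp_state with lp_lock. */
struct sketch_peer {
	unsigned int lp_state;
};

/* Models the state update in lnet_router_discovery_complete():
 * runs when a discovery attempt ends, whether or not it succeeded. */
static void sketch_discovery_complete(struct sketch_peer *lp)
{
	lp->lp_state &= ~LNET_PEER_RTR_DISCOVERY;
	lp->lp_state |= LNET_PEER_RTR_DISCOVERED;
}

/* Models the check in lnet_wait_known_routerstate(): returns 1 once
 * every router has finished at least one discovery attempt. */
static int sketch_all_routers_known(const struct sketch_peer *rtrs, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if ((rtrs[i].lp_state & LNET_PEER_RTR_DISCOVERED) == 0)
			return 0;
	}
	return 1;
}
```

In this sketch, a router that is down still gets marked once its first
attempt finishes, so the wait condition becomes true without any
attempt having to succeed.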

Fixes: fe17e9b8370a ("LU-11298 lnet: use peer for gateway")

Test-Parameters: trivial
Cray-bug-id: LUS-8184
Signed-off-by: Chris Horn <hornc@cray.com>
Change-Id: Ia064ffeb3e918cdb8d5a6150f443c48aa14e7a7c
Reviewed-on: https://review.whamcloud.com/36820
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
lnet/include/lnet/lib-types.h
lnet/lnet/router.c

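diff --git a/lnet/include/lnet/lib-types.h b/lnet/include/lnet/lib-types.h
index cc02d45..5590503 100644
--- a/lnet/include/lnet/lib-types.h
+++ b/lnet/include/lnet/lib-types.h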
@@ -747,6 +747,8 @@ struct lnet_peer {
 
 /* gw undergoing alive discovery */
 #define LNET_PEER_RTR_DISCOVERY (1 << 16)
+/* gw has undergone discovery (does not indicate success or failure) */
+#define LNET_PEER_RTR_DISCOVERED (1 << 17)
 
 struct lnet_peer_net {
        /* chain on lp_peer_nets */
diff --git a/lnet/lnet/router.c b/lnet/lnet/router.c
index c7368ca..255a30e 100644
--- a/lnet/lnet/router.c
+++ b/lnet/lnet/router.c
@@ -427,6 +427,7 @@ lnet_router_discovery_complete(struct lnet_peer *lp)
 
        spin_lock(&lp->lp_lock);
        lp->lp_state &= ~LNET_PEER_RTR_DISCOVERY;
+       lp->lp_state |= LNET_PEER_RTR_DISCOVERED;
        spin_unlock(&lp->lp_lock);
 
        /*
@@ -924,7 +925,7 @@ lnet_wait_known_routerstate(void)
 
                        spin_lock(&rtr->lp_lock);
 
-                       if ((rtr->lp_state & LNET_PEER_DISCOVERED) == 0) {
+                       if ((rtr->lp_state & LNET_PEER_RTR_DISCOVERED) == 0) {
                                all_known = 0;
                                spin_unlock(&rtr->lp_lock);
                                break;