From: Chris Horn
Date: Fri, 22 Nov 2019 20:19:03 +0000 (-0600)
Subject: LU-13001 lnet: Wait for single discovery attempt of routers
X-Git-Tag: 2.13.51~104
X-Git-Url: https://git.whamcloud.com/?p=fs%2Flustre-release.git;a=commitdiff_plain;h=d45a032d9a5c6929f62e00e75d8fb0103cc0fbb4

LU-13001 lnet: Wait for single discovery attempt of routers

Historically, check_routers_before_use would cause LNet
initialization to pause until all routers had been ping'd once. This
behavior was changed in commit
fe17e9b8370affe063769b880f02b9190584baaa from LU-11298. Now, LNet
will wait indefinitely until discovery completes on all routers. This
is problematic, because if even one router is down then LNet will
stall forever.

Introduce a new lnet_peer state to indicate whether a router has been
discovered (either successfully or not) to restore the historic
behavior.

Fixes fe17e9b8370a ("LU-11298 lnet: use peer for gateway")
Test-Parameters: trivial
Cray-bug-id: LUS-8184
Signed-off-by: Chris Horn
Change-Id: Ia064ffeb3e918cdb8d5a6150f443c48aa14e7a7c
Reviewed-on: https://review.whamcloud.com/36820
Tested-by: jenkins
Reviewed-by: Amir Shehata
Tested-by: Maloo
Reviewed-by: Neil Brown
Reviewed-by: Oleg Drokin
---

diff --git a/lnet/include/lnet/lib-types.h b/lnet/include/lnet/lib-types.h
index cc02d45..5590503 100644
--- a/lnet/include/lnet/lib-types.h
+++ b/lnet/include/lnet/lib-types.h
@@ -747,6 +747,8 @@ struct lnet_peer {
 
 /* gw undergoing alive discovery */
 #define LNET_PEER_RTR_DISCOVERY (1 << 16)
+/* gw has undergone discovery (does not indicate success or failure) */
+#define LNET_PEER_RTR_DISCOVERED (1 << 17)
 
 struct lnet_peer_net {
 	/* chain on lp_peer_nets */
diff --git a/lnet/lnet/router.c b/lnet/lnet/router.c
index c7368ca..255a30e 100644
--- a/lnet/lnet/router.c
+++ b/lnet/lnet/router.c
@@ -427,6 +427,7 @@ lnet_router_discovery_complete(struct lnet_peer *lp)
 
 	spin_lock(&lp->lp_lock);
 	lp->lp_state &= ~LNET_PEER_RTR_DISCOVERY;
+	lp->lp_state |= LNET_PEER_RTR_DISCOVERED;
 	spin_unlock(&lp->lp_lock);
 
 	/*
@@ -924,7 +925,7 @@ lnet_wait_known_routerstate(void)
 
 			spin_lock(&rtr->lp_lock);
 
-			if ((rtr->lp_state & LNET_PEER_DISCOVERED) == 0) {
+			if ((rtr->lp_state & LNET_PEER_RTR_DISCOVERED) == 0) {
 				all_known = 0;
 				spin_unlock(&rtr->lp_lock);
 				break;
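
Illustration (not part of the patch): a minimal userspace sketch of why the wait loop terminates with the new flag. It assumes a simplified model of the peer state. The two LNET_PEER_RTR_* values are copied from the diff above; struct model_peer, model_discovery_complete(), model_all_known() and the MODEL_PEER_DISCOVERED value are hypothetical stand-ins, and locking is omitted because the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values of the two router flags as they appear in the patch. */
#define LNET_PEER_RTR_DISCOVERY   (1u << 16)
#define LNET_PEER_RTR_DISCOVERED  (1u << 17)
/* Hypothetical stand-in for "discovery succeeded"; value is illustrative. */
#define MODEL_PEER_DISCOVERED     (1u << 3)

/* Hypothetical, stripped-down stand-in for struct lnet_peer. */
struct model_peer {
	unsigned int lp_state;
};

/*
 * Models the end of one discovery attempt: the "in progress" bit is
 * cleared and the "attempt finished" bit is set whether or not the
 * attempt succeeded. Only a successful attempt sets the success bit.
 */
static void model_discovery_complete(struct model_peer *lp, bool success)
{
	assert(lp != NULL);
	lp->lp_state &= ~LNET_PEER_RTR_DISCOVERY;
	lp->lp_state |= LNET_PEER_RTR_DISCOVERED;
	if (success)
		lp->lp_state |= MODEL_PEER_DISCOVERED;
}

/* Returns true when every router in the array has `flag` set. */
static bool model_all_known(const struct model_peer *rtrs, size_t n,
			    unsigned int flag)
{
	for (size_t i = 0; i < n; i++) {
		if ((rtrs[i].lp_state & flag) == 0)
			return false;
	}
	return true;
}
```

With two modelled routers, one completing discovery successfully and one failing because it is down, model_all_known() with MODEL_PEER_DISCOVERED stays false however many times it is polled, while the same call with LNET_PEER_RTR_DISCOVERED returns true as soon as both attempts have finished. That is the behaviour the commit message describes as restored.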