LU-17480 o2iblnd: add a timeout for rdma_connect
On a RoCE network, if an RDMA connection request is sent to an
unreachable node, the CM can take more than 4 minutes to return
RDMA_CM_EVENT_UNREACHABLE.
This hangs lustre_rmmod if a Lustre router is down.
This patch tracks connection requests and applies a timeout of
lnd_timeout/4 (with a minimum of 5s) to destroy a hanging
connection.
The patch also decreases the timeout for
rdma_resolve_addr()/rdma_resolve_route() to 5s (like most of
the upstream drivers, e.g. sunrpc and smb).
With the default settings, the timeouts work out to (integer
division):
  lnd_timeout = (transaction_timeout - 1) / (retry_count + 1)
  lnd_timeout = (150 - 1) / 3 = 49s
  lnd_connreq_timeout = max(5, lnd_timeout / 4) = 12s
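The arithmetic above can be sketched in C as follows. This is only an
illustration of the formulas in this message, not the code from the
patch: the function and macro names are hypothetical, and values are
in seconds with integer division assumed.

```c
/* Hypothetical names; only the formulas come from the commit message. */
#define SKETCH_CONNREQ_TIMEOUT_MIN 5	/* seconds, minimum stated above */

/* lnd_timeout = (transaction_timeout - 1) / (retry_count + 1) */
static int sketch_lnd_timeout(int transaction_timeout, int retry_count)
{
	return (transaction_timeout - 1) / (retry_count + 1);
}

/* lnd_connreq_timeout = max(5, lnd_timeout / 4) */
static int sketch_lnd_connreq_timeout(int lnd_timeout_sec)
{
	int t = lnd_timeout_sec / 4;

	return t > SKETCH_CONNREQ_TIMEOUT_MIN ? t : SKETCH_CONNREQ_TIMEOUT_MIN;
}
```

With transaction_timeout = 150 and a divisor of 3 (retry_count = 2),
the first helper gives 49 and the second gives 12, matching the values
listed above. The 5s floor only applies once lnd_timeout drops below
20s.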
Test-Parameters: trivial testlist=sanity-lnet
Signed-off-by: Etienne AUJAMES <eaujames@ddn.com>
Change-Id: I09e40ffaa75424c4acca1d0cf986e1ff9c6dc96b
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/53986
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>