LU-17480 o2iblnd: add a timeout for rdma_connect 86/53986/7
author Etienne AUJAMES <etienne.aujames@cea.fr>
Mon, 5 Feb 2024 14:12:20 +0000 (15:12 +0100)
committer Oleg Drokin <green@whamcloud.com>
Tue, 25 Jun 2024 03:26:05 +0000 (03:26 +0000)
commit 0b8c18d8c86357c557e959779e219ca7fd24d5d8
tree bfa516084d7a0f5a77b97d14f8ad258db7df3c39
parent 1fa633c2031c3e34072d5486aeedfbf222091ac7
LU-17480 o2iblnd: add a timeout for rdma_connect

On a RoCE network, if an RDMA connection request is sent to an
unreachable node, the CM can take more than 4 minutes to return
CM_EVENT_UNREACHABLE.
This hangs lustre_rmmod if a Lustre router is down.

This patch tracks connection requests and applies a timeout of
lnd_timeout/4 (with a minimum of 5s), after which the hanging
connection is destroyed.
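
The tracking idea can be sketched in plain userspace C. This is only
an illustration: struct connreq, connreq_track() and connreq_reap()
are hypothetical names, and the commit message does not describe the
driver's actual data structures or teardown path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record for one in-flight connection request. */
struct connreq {
	long            deadline;   /* absolute expiry time, in seconds */
	bool            destroyed;  /* set once the request has been reaped */
	struct connreq *next;
};

/* Start tracking a request; it expires at now + timeout. */
void connreq_track(struct connreq **head, struct connreq *req,
		   long now, long timeout)
{
	assert(head != NULL && req != NULL);
	req->deadline = now + timeout;
	req->destroyed = false;
	req->next = *head;
	*head = req;
}

/*
 * Unlink every request whose deadline has passed and mark it as
 * destroyed. Returns the number of requests reaped. A real driver
 * would tear down the connection here instead of setting a flag.
 */
size_t connreq_reap(struct connreq **head, long now)
{
	size_t reaped = 0;
	struct connreq **pp = head;

	while (*pp != NULL) {
		struct connreq *req = *pp;

		if (now >= req->deadline) {
			*pp = req->next;      /* unlink from the list */
			req->next = NULL;
			req->destroyed = true;
			reaped++;
		} else {
			pp = &req->next;      /* keep it, move on */
		}
	}
	return reaped;
}
```

A periodic caller would invoke connreq_reap() with the current time,
so a request that never gets a CM event is removed once its deadline
passes rather than waiting for the CM to give up.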

The patch also decreases the timeout for
rdma_resolve_addr()/rdma_resolve_route() to 5s (as in most of
the upstream drivers: sunrpc, smb).

The default timeouts should be:

lnd_timeout = (transaction_timeout - 1) / (retry_count + 1)
lnd_timeout = (150 - 1) / 3 = 49s
lnd_connreq_timeout = max(5, lnd_timeout / 4) = 12s
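
As a quick check of the arithmetic above, here is a small C sketch
using integer division. The function names are hypothetical, and
retry_count = 2 is inferred from the divisor of 3 in the example
rather than stated in this message.

```c
#include <assert.h>

/* lnd_timeout = (transaction_timeout - 1) / (retry_count + 1), in seconds */
int lnd_timeout_secs(int transaction_timeout, int retry_count)
{
	assert(retry_count >= 0);
	return (transaction_timeout - 1) / (retry_count + 1);
}

/* lnd_connreq_timeout = max(5, lnd_timeout / 4), in seconds */
int lnd_connreq_timeout_secs(int lnd_timeout)
{
	int t = lnd_timeout / 4;

	return t < 5 ? 5 : t;
}
```

With transaction_timeout = 150 and retry_count = 2, integer division
gives 149 / 3 = 49 and then 49 / 4 = 12. The 5s floor only applies
once lnd_timeout drops below 20s.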

Test-Parameters: trivial testlist=sanity-lnet
Signed-off-by: Etienne AUJAMES <eaujames@ddn.com>
Change-Id: I09e40ffaa75424c4acca1d0cf986e1ff9c6dc96b
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/53986
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
lnet/klnds/o2iblnd/o2iblnd.c
lnet/klnds/o2iblnd/o2iblnd.h
lnet/klnds/o2iblnd/o2iblnd_cb.c