LU-17505 socklnd: return NETWORK_TIMEOUT to LNet on ETIMEDOUT 30/53930/2
author: Serguei Smirnov <ssmirnov@whamcloud.com>
Mon, 5 Feb 2024 23:27:15 +0000 (15:27 -0800)
committer: Oleg Drokin <green@whamcloud.com>
Fri, 23 Feb 2024 07:16:04 +0000 (07:16 +0000)
Returning LNET_MSG_STATUS_LOCAL_TIMEOUT to LNet on ETIMEDOUT
causes LNet to only decrement the local NI health score,
while the issue may actually be with the remote NI.

Changing this to return LNET_MSG_STATUS_NETWORK_TIMEOUT
causes LNet to decrement both local NI and peer NI health.
If the local NI is OK, it will recover its health score quickly,
but the affected peer NI's health stays lowered until the peer NI
recovers. This helps LNet select healthy NIs of the same peer in the
meantime.
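
The difference between the two statuses can be sketched in a small userspace model. This is only an illustration of the behavior described above, not LNet code: every `sketch_` name, the health ceiling of 1000, and the per-failure decrement of 100 are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of the health accounting described in
 * the commit message. None of these names exist in LNet. */
enum sketch_status {
	SKETCH_STATUS_OK,
	SKETCH_STATUS_LOCAL_TIMEOUT,
	SKETCH_STATUS_NETWORK_TIMEOUT,
};

#define SKETCH_MAX_HEALTH	1000	/* assumed ceiling for a healthy NI */
#define SKETCH_SENSITIVITY	100	/* assumed decrement per failure */

struct sketch_ni {
	int health;
};

/* Lower one NI's health score, never going below zero. */
static void sketch_dec_health(struct sketch_ni *ni)
{
	assert(ni != NULL);
	ni->health -= SKETCH_SENSITIVITY;
	if (ni->health < 0)
		ni->health = 0;
}

/* Apply a message status to the local NI and the peer NI. */
static void sketch_handle_status(enum sketch_status status,
				 struct sketch_ni *local,
				 struct sketch_ni *peer)
{
	switch (status) {
	case SKETCH_STATUS_LOCAL_TIMEOUT:
		/* Old behavior on ETIMEDOUT: only the local NI is penalized. */
		sketch_dec_health(local);
		break;
	case SKETCH_STATUS_NETWORK_TIMEOUT:
		/* New behavior: the fault may be on either side, so both
		 * the local NI and the peer NI are penalized. */
		sketch_dec_health(local);
		sketch_dec_health(peer);
		break;
	case SKETCH_STATUS_OK:
		break;
	}
}
```

In this model, a peer NI that keeps timing out accumulates decrements under `SKETCH_STATUS_NETWORK_TIMEOUT`, so a selection routine that prefers the highest health score would move to another NI of the same peer.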

Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Change-Id: I916772477d1fd63571447262880a33830746f002
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/53930
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Cyril Bordage <cbordage@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
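diff --git a/lnet/klnds/socklnd/socklnd_cb.c b/lnet/klnds/socklnd/socklnd_cb.c
index 51e1dba..30e4771 100644
--- a/lnet/klnds/socklnd/socklnd_cb.c
+++ b/lnet/klnds/socklnd/socklnd_cb.c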
@@ -437,7 +437,7 @@ ksocknal_txlist_done(struct lnet_ni *ni, struct list_head *txlist, int error)
                if (tx->tx_hstatus == LNET_MSG_STATUS_OK) {
                        if (error == -ETIMEDOUT)
                                tx->tx_hstatus =
-                                 LNET_MSG_STATUS_LOCAL_TIMEOUT;
+                                       LNET_MSG_STATUS_NETWORK_TIMEOUT;
                        else if (error == -ENETDOWN ||
                                 error == -EHOSTUNREACH ||
                                 error == -ENETUNREACH ||
@@ -2418,7 +2418,7 @@ ksocknal_find_timed_out_conn(struct ksock_peer_ni *peer_ni)
                        list_for_each_entry(tx, &conn->ksnc_tx_queue,
                                            tx_list)
                                tx->tx_hstatus =
-                                       LNET_MSG_STATUS_LOCAL_TIMEOUT;
+                                       LNET_MSG_STATUS_NETWORK_TIMEOUT;
                        CNETERR("Timeout sending data to %s (%pIScp) the network or that node may be down.\n",
                                libcfs_idstr(&peer_ni->ksnp_id),
                                &conn->ksnc_peeraddr);
@@ -2445,7 +2445,7 @@ ksocknal_flush_stale_txs(struct ksock_peer_ni *peer_ni)
                if (ktime_get_seconds() < tx->tx_deadline)
                        break;
 
-               tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_TIMEOUT;
+               tx->tx_hstatus = LNET_MSG_STATUS_NETWORK_TIMEOUT;
 
                list_move_tail(&tx->tx_list, &stale_txs);
        }
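
The last hunk marks expired transmits before moving them to a stale list. A minimal userspace sketch of that pattern follows, assuming the queue is ordered by deadline (which the early `break` in the hunk suggests). The array-based queue and all `sketch_` names are hypothetical stand-ins for the kernel list and the socklnd structures.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-ins for the tx health status and tx descriptor. */
enum sketch_hstatus {
	SKETCH_HSTATUS_OK,
	SKETCH_HSTATUS_NETWORK_TIMEOUT,
};

struct sketch_tx {
	time_t deadline;		/* time after which the tx is stale */
	enum sketch_hstatus hstatus;	/* status later reported upward */
};

/*
 * Walk a queue assumed to be sorted by ascending deadline. Every tx whose
 * deadline has passed is marked with the network-timeout status; the walk
 * stops at the first tx that is still within its deadline.
 * Returns the number of stale txs found at the head of the queue.
 */
static size_t sketch_flush_stale(struct sketch_tx *queue, size_t count,
				 time_t now)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (now < queue[i].deadline)
			break;
		queue[i].hstatus = SKETCH_HSTATUS_NETWORK_TIMEOUT;
	}
	return i;
}
```

A caller would then treat the first `i` entries as the stale set, which corresponds to the `list_move_tail()` onto `stale_txs` in the hunk above.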