From: Chris Horn
Date: Sat, 29 Oct 2022 22:30:17 +0000 (-0600)
Subject: LU-16452 kfilnd: Check replay deadline before send
X-Git-Tag: 2.15.54~68
X-Git-Url: https://git.whamcloud.com/?a=commitdiff_plain;h=3049ba6ba1241770adeeeffbdfb6fef82bbf0b92;p=fs%2Flustre-release.git

LU-16452 kfilnd: Check replay deadline before send

The LND timeout needs to account for the total time needed for bulk
operations to complete. On cassini, this can be ~120 seconds due to
the CXI retry-handler timeout on both the sender and target. That is,
the LND timeout is really the max round trip time, and (LND timeout)/2
is the max one-way trip time.

When we replay a transaction we want to at least ensure we have enough
time to deliver the message to the receiver, as this gives us a chance
at still completing transactions. We should ensure that we still have
(LND timeout)/2 seconds remaining before posting a new transaction.

Introduce kfilnd_transaction::tn_replay_deadline, which is set to the
transaction deadline minus (LND timeout)/2.

Check the replay deadline in kfilnd_tn_state_idle() before attempting
to post the transaction. If we've exceeded that deadline then fail the
transaction with -ETIMEDOUT and set a NETWORK_TIMEOUT health status.

Modify the throttle check in kfilnd_tn_state_idle() to check
kfilnd_transaction::tn_replay_deadline instead of
kfilnd_transaction::deadline to determine when we should time out a
transaction that is being throttled. Note, this check is switched to
using ktime_before() rather than ktime_after() since the case is about
checking whether we are currently before the deadline rather than
after it. The current code isn't wrong. It is just grammatically
awkward.

HPE-bug-id: LUS-11304
Test-Parameters: trivial
Signed-off-by: Chris Horn
Change-Id: I1911d51cee4acea20577e3fc45c99b8948b79523
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49593
Reviewed-by: Ron Gredvig
Reviewed-by: Ian Ziemba
Reviewed-by: Oleg Drokin
Tested-by: jenkins
Tested-by: Maloo
---

diff --git a/lnet/klnds/kfilnd/kfilnd.h b/lnet/klnds/kfilnd/kfilnd.h
index 3173959..0939877 100644
--- a/lnet/klnds/kfilnd/kfilnd.h
+++ b/lnet/klnds/kfilnd/kfilnd.h
@@ -749,6 +749,8 @@ struct kfilnd_transaction {
 	/* Transaction deadline. */
 	ktime_t deadline;
+	/* Transaction replay deadline. */
+	ktime_t tn_replay_deadline;
 	ktime_t tn_alloc_ts;
 	ktime_t tn_state_ts;
diff --git a/lnet/klnds/kfilnd/kfilnd_tn.c b/lnet/klnds/kfilnd/kfilnd_tn.c
index a442b70..1056e03 100644
--- a/lnet/klnds/kfilnd/kfilnd_tn.c
+++ b/lnet/klnds/kfilnd/kfilnd_tn.c
@@ -612,7 +612,7 @@ static int kfilnd_tn_state_idle(struct kfilnd_transaction *tn,
 				bool *tn_released)
 {
 	struct kfilnd_msg *msg;
-	int rc;
+	int rc = 0;
 	bool finalize = false;
 	struct lnet_hdr hdr;
 	struct lnet_nid srcnid;
@@ -644,12 +644,13 @@ static int kfilnd_tn_state_idle(struct kfilnd_transaction *tn,
 			 */
 			rc = -ECANCELED;
 			KFILND_TN_DEBUG(tn, "Cancel throttled TN");
-		} else if (ktime_after(tn->deadline, ktime_get_seconds())) {
-			/* If transaction deadline has not been met, return
-			 * -EAGAIN. This will cause this transaction event to be
-			 * replayed. During this time, an async message from the
-			 * peer should occur at which point the kfilnd version
-			 * should be negotiated.
+		} else if (ktime_before(ktime_get_seconds(),
+					tn->tn_replay_deadline)) {
+			/* If the transaction replay deadline has not been met,
+			 * then return -EAGAIN. This will cause this transaction
+			 * event to be replayed. During this time, an async
+			 * hello message from the peer should occur at which
+			 * point we can resume sending new messages to this peer
 			 */
 			KFILND_TN_DEBUG(tn, "hello response pending");
 			return -EAGAIN;
@@ -663,6 +664,14 @@ static int kfilnd_tn_state_idle(struct kfilnd_transaction *tn,
 		goto out;
 	}
+	if ((event == TN_EVENT_INIT_IMMEDIATE || event == TN_EVENT_INIT_BULK) &&
+	    ktime_after(ktime_get_seconds(), tn->tn_replay_deadline)) {
+		kfilnd_tn_status_update(tn, -ETIMEDOUT,
+					LNET_MSG_STATUS_NETWORK_TIMEOUT);
+		rc = 0;
+		goto out;
+	}
+
 	switch (event) {
 	case TN_EVENT_INIT_IMMEDIATE:
 	case TN_EVENT_TX_HELLO:
@@ -1511,6 +1520,8 @@ struct kfilnd_transaction *kfilnd_tn_alloc_for_peer(struct kfilnd_dev *dev,
 	tn->tn_state = TN_STATE_IDLE;
 	tn->hstatus = LNET_MSG_STATUS_OK;
 	tn->deadline = ktime_get_seconds() + lnet_get_lnd_timeout();
+	tn->tn_replay_deadline = ktime_sub(tn->deadline,
+					   (lnet_get_lnd_timeout() / 2));
 	tn->is_initiator = is_initiator;
 	INIT_WORK(&tn->timeout_work, kfilnd_tn_timeout_work);
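
The deadline arithmetic in this patch can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kfilnd code itself: `struct tn_deadlines`, `tn_deadlines_init()`, `tn_may_replay()` and `tn_replay_expired()` are hypothetical names invented for illustration, and plain `int64_t` seconds stand in for the kernel time types. It assumes the strict comparisons that `ktime_before()` and `ktime_after()` perform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the two deadline fields of
 * struct kfilnd_transaction. All times are absolute seconds.
 */
struct tn_deadlines {
	int64_t deadline;        /* when the transaction as a whole expires */
	int64_t replay_deadline; /* last point at which a post is still allowed */
};

/* Mirrors the allocation-time arithmetic in the patch:
 * deadline = now + timeout, replay deadline = deadline - timeout / 2.
 */
static struct tn_deadlines tn_deadlines_init(int64_t now, int64_t lnd_timeout)
{
	struct tn_deadlines d;

	assert(lnd_timeout > 0);
	d.deadline = now + lnd_timeout;
	d.replay_deadline = d.deadline - lnd_timeout / 2;
	return d;
}

/* Throttled path: true while the transaction should be replayed
 * (the -EAGAIN case), i.e. now is strictly before the replay deadline.
 */
static bool tn_may_replay(const struct tn_deadlines *d, int64_t now)
{
	return now < d->replay_deadline;
}

/* Post path: true once a new post should be failed with -ETIMEDOUT,
 * i.e. now is strictly after the replay deadline.
 */
static bool tn_replay_expired(const struct tn_deadlines *d, int64_t now)
{
	return now > d->replay_deadline;
}
```

With an assumed 120 second LND timeout and a transaction allocated at t = 1000, the deadline is 1120 and the replay deadline is 1060. At t = 1059 the transaction may still be replayed; at t = 1061 a new post would be failed. Because both comparisons are strict, neither predicate is true at exactly t = 1060.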