LU-10931 lnet: handle unlink before send completes 98/45898/2
author Amir Shehata <ashehata@whamcloud.com>
Mon, 8 Jul 2019 19:33:31 +0000 (12:33 -0700)
committer Oleg Drokin <green@whamcloud.com>
Sun, 30 Jan 2022 03:42:09 +0000 (03:42 +0000)
If LNetMDUnlink() is called on an MD with md->md_refcount > 0, the
EQ callback is not called.
This matters when the response times out before the send completes.
The in-flight send still holds a reference on the MD, so the unlink
callback is dropped on the floor. When the send then completes, the
REPLY for the GET is dropped because the timeout has already fired.
The peer is left in the following state:
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERING
LNET_PEER_PING_SENT
No further events arrive for the peer, so discovery never
completes.

The same race can also leave RPCs stuck when the response times out
before the send completes.

The solution is to set the event status to -ETIMEDOUT to inform
the send event handler that it should not expect a reply.
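
As a rough illustration of that rule, the standalone sketch below models
only the status selection. It is not the LNet code: struct sketch_md,
SKETCH_MD_FLAG_ABORTED, sketch_event_status() and sketch_expect_reply()
are hypothetical names, and the flag value is an assumption standing in
for LNET_MD_FLAG_ABORTED.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for LNET_MD_FLAG_ABORTED; the value is assumed. */
#define SKETCH_MD_FLAG_ABORTED (1u << 0)

/* Minimal model of an MD: only the flags word matters for this sketch. */
struct sketch_md {
	unsigned int flags;
};

/*
 * Pick the status to report in the event.
 * An MD that was already unlinked (aborted) while the send was in flight
 * reports -ETIMEDOUT instead of success; a real error is passed through
 * unchanged so that it is not masked.
 */
static int sketch_event_status(const struct sketch_md *md, int status)
{
	if ((md->flags & SKETCH_MD_FLAG_ABORTED) && status == 0)
		return -ETIMEDOUT;
	return status;
}

/*
 * Hypothetical handler-side check: after a send event, keep waiting for
 * a reply only if the send was reported as successful.
 */
static bool sketch_expect_reply(int send_event_status)
{
	return send_event_status == 0;
}
```

With this model, a successful send on an aborted MD yields -ETIMEDOUT, so
a handler built like sketch_expect_reply() stops waiting instead of
leaving the peer in its discovering state.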

Lustre-commit: d8fc5c23fe541e0ff6ce5bec6302957714c3f69f
Lustre-change: https://review.whamcloud.com/35444

Test-Parameters: trivial
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Change-Id: Ica0e1a823d0d1200bb8cc42a6e058785da1d4fa4
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-on: https://review.whamcloud.com/45898
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
lnet/lnet/lib-msg.c

index 959c370..9e52200 100644
@@ -864,7 +864,12 @@ lnet_msg_detach_md(struct lnet_msg *msg, int cpt, int status)
 
        unlink = lnet_md_unlinkable(md);
        if (md->md_eq != NULL) {
-               msg->msg_ev.status   = status;
+               if ((md->md_flags & LNET_MD_FLAG_ABORTED) && !status) {
+                       msg->msg_ev.status   = -ETIMEDOUT;
+                       CDEBUG(D_NET, "md 0x%p already unlinked\n", md);
+               } else {
+                       msg->msg_ev.status   = status;
+               }
                msg->msg_ev.unlinked = unlink;
                lnet_eq_enqueue_event(md->md_eq, &msg->msg_ev);
        }