LU-12344 lnet: handle remote health error
author Amir Shehata <ashehata@whamcloud.com>
Mon, 27 May 2019 17:43:10 +0000 (10:43 -0700)
committer Oleg Drokin <green@whamcloud.com>
Tue, 8 Oct 2019 13:25:29 +0000 (13:25 +0000)
When a peer is dead, set the health status to REMOTE_DROPPED
so that health is handled properly for that peer.
When dropping a routed message, set REMOTE_ERROR. Routed messages
are dropped when the routing feature is turned off, which can
be considered a configuration error if it happens in the middle
of traffic. It is therefore better to flag the issue at that
point without resending the message.

Lustre-change: https://review.whamcloud.com/34967
Lustre-commit: b45e3d96fc4d82ebf5b1bb3ef0b5a59e8ff86e75

Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Change-Id: I131263215a68fc8607582643a47007ce4d04abbc
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36030
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
lnet/lnet/lib-move.c

index 27c91d7..8d73fa7 100644
@@ -959,7 +959,7 @@ lnet_post_send_locked(struct lnet_msg *msg, int do_send)
 
                CNETERR("Dropping message for %s: peer not alive\n",
                        libcfs_id2str(msg->msg_target));
-               msg->msg_health_status = LNET_MSG_STATUS_LOCAL_DROPPED;
+               msg->msg_health_status = LNET_MSG_STATUS_REMOTE_DROPPED;
                if (do_send)
                        lnet_finalize(msg, -EHOSTUNREACH);
 
@@ -976,6 +976,8 @@ lnet_post_send_locked(struct lnet_msg *msg, int do_send)
                        libcfs_id2str(msg->msg_target));
                if (do_send) {
                        msg->msg_no_resend = true;
+                       CDEBUG(D_NET, "msg %p to %s canceled and will not be resent\n",
+                              msg, libcfs_id2str(msg->msg_target));
                        lnet_finalize(msg, -ECANCELED);
                }
 
@@ -1254,6 +1256,7 @@ lnet_drop_routed_msgs_locked(struct list_head *list, int cpt)
                             0, 0, 0, msg->msg_hdr.payload_length);
                list_del_init(&msg->msg_list);
                msg->msg_no_resend = true;
+               msg->msg_health_status = LNET_MSG_STATUS_REMOTE_ERROR;
                lnet_finalize(msg, -ECANCELED);
        }
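
The two status assignments in this change can be read as a small decision rule: a dead peer yields REMOTE_DROPPED, and a routed message dropped because routing is off yields REMOTE_ERROR with resending disabled. The following is a minimal standalone sketch of that rule, not the LNet code itself. Every `sketch_`/`SKETCH_` name is hypothetical and only stands in for the fields and constants visible in the hunks above, and the `do_send` guard from `lnet_post_send_locked()` is deliberately left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the LNet health status values named in the patch. */
enum sketch_hstatus {
	SKETCH_STATUS_OK = 0,
	SKETCH_STATUS_REMOTE_ERROR,
	SKETCH_STATUS_REMOTE_DROPPED,
};

/* Hypothetical drop reasons; the real code has no such enum. */
enum sketch_drop_reason {
	SKETCH_DROP_PEER_DEAD,   /* peer not alive */
	SKETCH_DROP_ROUTING_OFF, /* routed message dropped, routing disabled */
};

/* Minimal stand-in for struct lnet_msg: only what the patch touches. */
struct sketch_msg {
	enum sketch_hstatus health_status;
	bool no_resend;
	int rc; /* assumption: models the status passed to lnet_finalize() */
};

/* Record the health status and completion code for a dropped message. */
static void sketch_drop(struct sketch_msg *msg, enum sketch_drop_reason why)
{
	switch (why) {
	case SKETCH_DROP_PEER_DEAD:
		/* The fault is on the remote side, so blame the peer. */
		msg->health_status = SKETCH_STATUS_REMOTE_DROPPED;
		msg->rc = -EHOSTUNREACH;
		break;
	case SKETCH_DROP_ROUTING_OFF:
		/* Likely a configuration error: flag it and do not resend. */
		msg->no_resend = true;
		msg->health_status = SKETCH_STATUS_REMOTE_ERROR;
		msg->rc = -ECANCELED;
		break;
	}
}
```

Calling `sketch_drop()` on a zero-initialized `struct sketch_msg` with `SKETCH_DROP_PEER_DEAD` leaves `no_resend` false, whereas `SKETCH_DROP_ROUTING_OFF` sets it. This mirrors the patch, where only the routed-message path sets `msg_no_resend`.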