LU-13802 ptlrpc: correctly remove inflight request 99/54099/8
author Patrick Farrell <paf0187@gmail.com>
Wed, 13 Mar 2024 14:46:12 +0000 (10:46 -0400)
committer Oleg Drokin <green@whamcloud.com>
Tue, 23 Apr 2024 19:44:57 +0000 (19:44 +0000)
When removing a request from the active set on error, we
must also decrement the import's inflight count; otherwise
the count never reaches zero and cleanup hangs.

This bug has been latent for some time, but running sanity
test 414 with hybrid IO tends to trigger it.

Signed-off-by: Patrick Farrell <patrick.farrell@oracle.com>
Change-Id: Ib73980724f6e2f5a74400a39840df2e8835a6e23
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/54099
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
lustre/ptlrpc/client.c
lustre/tests/sanity.sh

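diff --git a/lustre/ptlrpc/client.c b/lustre/ptlrpc/client.c
index fcfd417..a3c593e 100644
--- a/lustre/ptlrpc/client.c
+++ b/lustre/ptlrpc/client.c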
@@ -2127,8 +2127,11 @@ int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
                                rc = ptl_send_rpc(req, 0);
                                if (rc == -ENOMEM) {
                                        spin_lock(&imp->imp_lock);
-                                       if (!list_empty(&req->rq_list))
+                                       if (!list_empty(&req->rq_list)) {
                                                list_del_init(&req->rq_list);
+                                               if (atomic_dec_and_test(&imp->imp_inflight))
+                                                       wake_up(&imp->imp_recovery_waitq);
+                                       }
                                        spin_unlock(&imp->imp_lock);
                                        ptlrpc_rqphase_move(req, RQ_PHASE_NEW);
                                        continue;
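diff --git a/lustre/tests/sanity.sh b/lustre/tests/sanity.sh
index ae3b3c4..e5a43f5 100755
--- a/lustre/tests/sanity.sh
+++ b/lustre/tests/sanity.sh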
@@ -29609,6 +29609,11 @@ test_414() {
        $LCTL set_param fail_loc=0x80000521
        dd if=/dev/zero of=$DIR/$tfile bs=2M count=1 oflag=sync
        rm -f $DIR/$tfile
+       # This error path has sometimes left inflight requests dangling, so
+       # test for this by remounting the client (umount will hang if there's
+       # a dangling request)
+       umount_client $MOUNT
+       mount_client $MOUNT
 }
 run_test 414 "simulate ENOMEM in ptlrpc_register_bulk()"
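
The accounting problem fixed in ptlrpc_check_set() above can be illustrated with a small userspace model. This is only a sketch, not Lustre code: the toy_* types and functions are hypothetical, C11 atomics stand in for the kernel's atomic_t, a boolean stands in for list membership, and it assumes the counter was incremented when the request was queued for sending. With the decrement in place the counter returns to zero after a failed send; without it, one request is left counted forever, which is the condition that makes cleanup wait indefinitely.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the import; only the counter is modeled. */
struct toy_import {
	atomic_int inflight;	/* models imp->imp_inflight */
	bool waiter_woken;	/* models wake_up(&imp->imp_recovery_waitq) */
};

/* Hypothetical stand-in for a request; only list membership is modeled. */
struct toy_request {
	bool on_list;		/* models !list_empty(&req->rq_list) */
};

/* Queue a request for sending and count it as inflight (assumed behavior). */
static void toy_send_start(struct toy_import *imp, struct toy_request *req)
{
	req->on_list = true;
	atomic_fetch_add(&imp->inflight, 1);
}

/*
 * Error path after a failed send. With fixed == false the request is
 * unlinked but the counter is left alone (the old behavior); with
 * fixed == true the counter is decremented and a waiter is woken
 * when it reaches zero (the behavior added by this patch).
 */
static void toy_send_failed(struct toy_import *imp, struct toy_request *req,
			    bool fixed)
{
	if (req->on_list) {
		req->on_list = false;
		/* atomic_fetch_sub() returns the old value: 1 means now 0 */
		if (fixed && atomic_fetch_sub(&imp->inflight, 1) == 1)
			imp->waiter_woken = true;
	}
}

/* Run one failed send and report how many requests are still counted. */
int toy_inflight_after_failure(bool fixed)
{
	struct toy_import imp;
	struct toy_request req = { .on_list = false };

	atomic_init(&imp.inflight, 0);
	imp.waiter_woken = false;

	toy_send_start(&imp, &req);
	toy_send_failed(&imp, &req, fixed);
	return atomic_load(&imp.inflight);
}
```

Calling toy_inflight_after_failure(true) leaves the counter at zero, while toy_inflight_after_failure(false) leaves it at one. The model ignores locking (the real code holds imp->imp_lock around the list update) because only the counter arithmetic is of interest here.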