LU-7434 ptlrpc: lost bulk leads to a hang 53/19953/4
author Vitaly Fertman <vitaly.fertman@seagate.com>
Tue, 1 Mar 2016 23:46:31 +0000 (02:46 +0300)
committer Oleg Drokin <oleg.drokin@intel.com>
Thu, 16 Jun 2016 22:16:12 +0000 (22:16 +0000)
commit ac5044566b97c7f6881bed817c2ed9752a0c6d63
tree df16c3bbac40fb3025e5fbe01e21f4ac7342c268
parent 5042d4c8ecad287276e04c52c6a1fee9c9b597a9
LU-7434 ptlrpc: lost bulk leads to a hang

This is a combination of two commits. The original commit from:
http://review.whamcloud.com/17221
and a fix to that commit from:
http://review.whamcloud.com/19758

Description of the original patch:
When request_out_callback() and reply_in_callback() run in reverse
order, the RPC enters the UNREGISTERING state, which waits for both
the RPC and bulk MD unlinks, whereas only the RPC MD unlink has been
called so far. If the bulk is lost, the RPC hangs: even expired_set
does not check for the UNREGISTERING state.

The same applies to a write if the server returns an error.

This phase is ambiguous, so split it into UNREG_RPC and UNREG_BULK.
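
The split can be illustrated with a minimal sketch. It is not the
patch itself: the sketch_* names, the struct fields and the transition
rules are hypothetical and only mirror the description above. The
point shown is that with two phases a waiter can tell which unlink is
still outstanding, which a single combined phase cannot express.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical phases for illustration; not the Lustre definitions. */
enum sketch_phase {
	SKETCH_UNREG_RPC,	/* waiting for the RPC MD unlink */
	SKETCH_UNREG_BULK,	/* waiting for the bulk MD unlink */
	SKETCH_DONE,
};

struct sketch_req {
	enum sketch_phase phase;
	bool rpc_md_linked;	/* RPC MD not yet unlinked */
	bool bulk_md_linked;	/* bulk MD not yet unlinked */
};

/*
 * Leave the current phase once the unlink it waits for has completed.
 * Returns true if the phase changed.
 */
static bool sketch_advance(struct sketch_req *req)
{
	assert(req);

	switch (req->phase) {
	case SKETCH_UNREG_RPC:
		if (req->rpc_md_linked)
			return false;
		req->phase = req->bulk_md_linked ?
			     SKETCH_UNREG_BULK : SKETCH_DONE;
		return true;
	case SKETCH_UNREG_BULK:
		if (req->bulk_md_linked)
			return false;
		req->phase = SKETCH_DONE;
		return true;
	default:
		return false;
	}
}

/*
 * With separate phases, an expiry check can see that the bulk unlink
 * is the one still pending.
 */
static bool sketch_needs_bulk_unlink(const struct sketch_req *req)
{
	return req->phase == SKETCH_UNREG_BULK && req->bulk_md_linked;
}
```

In this model a request whose RPC MD is already unlinked but whose
bulk MD is still linked moves to SKETCH_UNREG_BULK and stays there
until the bulk unlink completes, so an expiry path has a distinct
state to test for.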

The fix to the original commit was first pushed against LU-8062.
That fix is described as follows:
LU-8062 test: fix fail_val in recovery-small/115b

The fail_loc OBD_FAIL_OST_ENOSPC is caught in two places:
1. ofd_statfs() - where it is checked against the fail_val;
if fail_val matches the OST ID, it is caught here and
hence the test fails.
2. tgt_brw_write() - the actual place where the fail_loc should
be caught.

The patch makes the fail_loc be caught at the appropriate place
by setting the fail_val to $OSTCOUNT. Even when the fail_loc is
set while ofd_statfs() runs, it is not caught there and is
therefore caught in tgt_brw_write().
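
A minimal sketch of the reasoning, under stated assumptions: the
sketch_* names and the constant are hypothetical, the write-path check
is assumed to look at fail_loc only, and OST IDs are assumed to run
from 0 to OSTCOUNT-1 so that no OST has an ID equal to OSTCOUNT. None
of this is taken from the Lustre sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for OBD_FAIL_OST_ENOSPC; the value is arbitrary. */
#define SKETCH_FAIL_ENOSPC 0x1u

struct sketch_fail {
	unsigned int loc;	/* models fail_loc */
	unsigned int val;	/* models fail_val */
};

/* Site 1, modelled on the statfs check: fires only if fail_val == OST ID. */
static bool sketch_statfs_hit(const struct sketch_fail *f, unsigned int ost_id)
{
	return f->loc == SKETCH_FAIL_ENOSPC && f->val == ost_id;
}

/* Site 2, modelled on the write path: assumed to check fail_loc alone. */
static bool sketch_write_hit(const struct sketch_fail *f)
{
	return f->loc == SKETCH_FAIL_ENOSPC;
}
```

Under these assumptions, setting val to the OST count means
sketch_statfs_hit() is false for every valid OST ID while
sketch_write_hit() is still true, which is the behaviour the test fix
relies on.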

Please note the author of the original patch was Vitaly Fertman
<vitaly.fertman@seagate.com>, and the author of the test fix was
Bhagyesh Dudhediya <bhagyesh.dudhediya@seagate.com>

Test-Parameters: testlist=recovery-small,recovery-small,recovery-small,recovery-small,recovery-small,recovery-small
Signed-off-by: Vitaly Fertman <vitaly.fertman@seagate.com>
Signed-off-by: Bhagyesh Dudhediya <bhagyesh.dudhediya@seagate.com>
Signed-off-by: Chris Horn <hornc@cray.com>
Seagate-bug-id:  MRP-2953, MRP-3206, MRP-3150
Reviewed-by: Andriy Skulysh <andriy.skulysh@seagate.com>
Reviewed-by: Alexey Leonidovich Lyashkov <alexey.lyashkov@seagate.com>
Tested-by: Elena V. Gryaznova <elena.gryaznova@seagate.com>
Change-Id: I17319d40881c41f247c102aafc3a1b0db82d0b4a
Reviewed-on: http://review.whamcloud.com/19953
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Ann Koehler <amk@cray.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
lustre/include/lustre_net.h
lustre/include/obd_support.h
lustre/ptlrpc/client.c
lustre/ptlrpc/import.c
lustre/ptlrpc/niobuf.c
lustre/target/tgt_handler.c
lustre/tests/conf-sanity.sh
lustre/tests/recovery-small.sh
lustre/tests/test-framework.sh