LU-14027 ldlm: Do not hang if recovery restarted during lock replay 24/41224/2
author: Oleg Drokin <green@whamcloud.com>
Wed, 14 Oct 2020 03:55:02 +0000 (23:55 -0400)
committer: Oleg Drokin <green@whamcloud.com>
Thu, 4 Mar 2021 08:36:43 +0000 (08:36 +0000)
commit: 5fa7c8f24e71187a0c3ac70a04a8b566de5a76f3
tree: 935bda2ad4609ace7d4c42d5bffacb1160ed0496
parent: 2bcc166b0a660afab62d96ede496f42c31ada94b

LU-13600 introduced lock replay rate-limiting logic, but it did not
take into account that if a disconnection happens in the REPLAY_LOCKS
phase, locks that have not been sent yet get stuck in the sending
queue, so the replay locks thread hangs with imp_replay_inflight
elevated above zero.

The direct consequence is that the recovery state machine never
advances from REPLAY to REPLAY_LOCKS state while imp_replay_inflight
is non-zero.

Adjust __ldlm_replay_locks() to check if the import state changed
before attempting to send any more requests.
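
The idea can be illustrated with a minimal userspace sketch. This is
not the actual patch: the struct, the enum and all function names below
are hypothetical stand-ins, and the only things taken from the message
are the REPLAY and REPLAY_LOCKS states and the imp_replay_inflight
counter. It assumes the replay thread holds one inflight reference for
as long as it runs.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for the import states. */
enum imp_state_model { IMP_REPLAY, IMP_REPLAY_LOCKS, IMP_DISCON };

struct import_model {
	enum imp_state_model state;
	int replay_inflight;  /* models imp_replay_inflight */
	int sent;             /* lock replays actually sent */
	int disconnect_after; /* disconnect once this many sends happened;
			       * negative means never */
};

/* Pretend to send one lock replay; may simulate a disconnection. */
static void send_one_lock_model(struct import_model *imp)
{
	imp->sent++;
	if (imp->disconnect_after >= 0 && imp->sent >= imp->disconnect_after)
		imp->state = IMP_DISCON;
}

/* Walk the queue of nr_locks locks; return how many were skipped. */
int replay_locks_model(struct import_model *imp, int nr_locks)
{
	int skipped = 0;
	int i;

	/* Assumption: the replay thread pins the counter while running. */
	imp->replay_inflight++;

	for (i = 0; i < nr_locks; i++) {
		/* Re-check the import state before every send. If recovery
		 * was restarted, drop the queued lock instead of waiting
		 * to send it. */
		if (imp->state != IMP_REPLAY_LOCKS) {
			skipped++;
			continue;
		}
		send_one_lock_model(imp);
	}

	/* The thread always finishes, so the counter returns to zero. */
	imp->replay_inflight--;
	assert(imp->replay_inflight >= 0);
	return skipped;
}

/* REPLAY may advance to REPLAY_LOCKS only with nothing in flight. */
int can_advance_to_replay_locks(const struct import_model *imp)
{
	return imp->state == IMP_REPLAY && imp->replay_inflight == 0;
}
```

In this model, a queue of five locks with a disconnection after the
second send skips the remaining three locks and leaves the counter at
zero, so a later pass through REPLAY is free to advance.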

Add a testcase.

Lustre-change: https://review.whamcloud.com/40238
Lustre-commit: 7ca495ec67f474e10352077fc40123e4818b8e69

Change-Id: Idbaf5461f33d1884088269d67d01071c7e1bf8a5
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Fixes: 3b613a442b ("LU-13600 ptlrpc: limit rate of lock replays")
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Etienne AUJAMES <eaujames@ddn.com>
Fixes: 6b6d9c0911 ("LU-13600 ptlrpc: limit rate of lock replays")
Reviewed-on: https://review.whamcloud.com/41224
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
lustre/include/obd_support.h
lustre/ldlm/ldlm_lib.c
lustre/ldlm/ldlm_request.c
lustre/tests/replay-single.sh