LU-14027 ldlm: Do not hang if recovery restarted during lock replay 38/40238/4
author Oleg Drokin <green@whamcloud.com>
Wed, 14 Oct 2020 03:55:02 +0000 (23:55 -0400)
committer Oleg Drokin <green@whamcloud.com>
Thu, 19 Nov 2020 15:11:14 +0000 (15:11 +0000)
commit 7ca495ec67f474e10352077fc40123e4818b8e69
tree 490908fca4e0a17f483a6c1afbf1e7befd3e0236
parent 5b74e0466edbfbbf9f336171de1adc5e583e9475
LU-14027 ldlm: Do not hang if recovery restarted during lock replay

LU-13600 introduced lock rate-limiting logic, but it did not take
into account that if a disconnection happens during the REPLAY_LOCKS
phase, locks that have not yet been sent get stuck in the sending
queue, so the replay locks thread hangs with imp_replay_inflight
elevated above zero.

The direct consequence is that the recovery state machine never
advances from REPLAY to REPLAY_LOCKS status while imp_replay_inflight
is nonzero.

Adjust __ldlm_replay_locks() to check if the import state changed
before attempting to send any more requests.
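
The idea behind the check can be illustrated with a small standalone
model. This is only a sketch and not the actual patch: the toy_* names,
the struct layout and the simulated disconnect are all hypothetical,
and only the replay_inflight counter is meant to mirror the role of
imp_replay_inflight. The loop re-checks the import state before each
send and stops as soon as the state is no longer REPLAY_LOCKS, so the
in-flight count can drop back to zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the import states. */
enum toy_state { TOY_REPLAY, TOY_REPLAY_LOCKS, TOY_DISCON };

struct toy_import {
	enum toy_state state;
	int replay_inflight;   /* plays the role of imp_replay_inflight */
	int sent;              /* replay requests actually sent */
	int disconnect_after;  /* simulate a disconnect after N sends; 0 = never */
};

/* Hypothetical send hook: counts the send and may flip the state. */
void toy_send_one(struct toy_import *imp)
{
	imp->sent++;
	if (imp->disconnect_after > 0 && imp->sent == imp->disconnect_after)
		imp->state = TOY_DISCON;
}

/*
 * Replay up to nlocks locks. Before every send, check that the import
 * is still in the REPLAY_LOCKS state; if it is not, stop instead of
 * queueing more requests. Returns the number of locks left unsent.
 */
int toy_replay_locks(struct toy_import *imp, int nlocks)
{
	int skipped = 0;

	imp->replay_inflight++;  /* the replay thread holds one count */
	for (int i = 0; i < nlocks; i++) {
		if (imp->state != TOY_REPLAY_LOCKS) {
			skipped = nlocks - i;
			break;
		}
		toy_send_one(imp);
	}
	imp->replay_inflight--;  /* released on every exit path */
	assert(imp->replay_inflight >= 0);
	return skipped;
}

/* In this model, recovery may re-enter REPLAY_LOCKS only when idle. */
bool toy_can_enter_replay_locks(const struct toy_import *imp)
{
	return imp->replay_inflight == 0;
}
```

In this model, calling toy_replay_locks() on an import that starts in
TOY_REPLAY_LOCKS with disconnect_after set to 3 and ten locks to replay
sends three requests, reports seven as unsent, and leaves
replay_inflight at zero, so a restarted recovery is free to advance.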

Add a testcase.

Change-Id: Idbaf5461f33d1884088269d67d01071c7e1bf8a5
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Fixes: 3b613a442b ("LU-13600 ptlrpc: limit rate of lock replays")
Reviewed-on: https://review.whamcloud.com/40238
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
lustre/include/obd_support.h
lustre/ldlm/ldlm_lib.c
lustre/ldlm/ldlm_request.c
lustre/tests/replay-single.sh