From: Vitaly Fertman
Date: Tue, 13 Jul 2021 16:07:14 +0000 (+0300)
Subject: LU-14847 ptlrpc: two replay lock threads
X-Git-Tag: 2.14.55~74
X-Git-Url: https://git.whamcloud.com/?p=fs%2Flustre-release.git;a=commitdiff_plain;h=refs%2Fchanges%2F94%2F44294%2F5

LU-14847 ptlrpc: two replay lock threads

Two replay lock threads conflict with each other, which leads to:
ASSERTION( atomic_read(&imp->imp_replay_inflight) == 1 )

replay_lock_interpret() calls ptlrpc_connect_import() on error, so one
thread of execution starts from the connect reply interpret.
replay_lock_interpret() also wakes up ldlm_lock_replay_thread(), which
calls ptlrpc_import_recovery_state_machine().

Both threads may reach ldlm_replay_locks() at the same time on the next
round; both increment imp_replay_inflight, and the second one asserts.

The problem was introduced by LU-13600, which added
ldlm_lock_replay_thread() with the
ptlrpc_import_recovery_state_machine() call.

HPE-bug-id: LUS-10147
Fixes: 3b613a442b ("LU-13600 ptlrpc: limit rate of lock replays")
Signed-off-by: Vitaly Fertman
Change-Id: Ia9aafb631e3ba5f850504cc58b4826acec2813bd
Reviewed-by: Andriy Skulysh
Reviewed-by: Alexander Zarochentsev
Reviewed-on: https://es-gerrit.dev.cray.com/158931
Tested-by: Jenkins Build User
Reviewed-on: https://review.whamcloud.com/44294
Reviewed-by: Andreas Dilger
Reviewed-by: Mike Pershin
Tested-by: jenkins
Tested-by: Maloo
Reviewed-by: Oleg Drokin
---

diff --git a/lustre/ldlm/ldlm_request.c b/lustre/ldlm/ldlm_request.c
index 961b2f2..f3c75c7 100644
--- a/lustre/ldlm/ldlm_request.c
+++ b/lustre/ldlm/ldlm_request.c
@@ -2566,7 +2566,8 @@ int __ldlm_replay_locks(struct obd_import *imp, bool rate_limit)
 
 	ENTRY;
 
-	LASSERT(atomic_read(&imp->imp_replay_inflight) == 1);
+	while (atomic_read(&imp->imp_replay_inflight) != 1)
+		cond_resched();
 
 	/* don't replay locks if import failed recovery */
 	if (imp->imp_vbr_failed)
@@ -2621,9 +2622,12 @@ int ldlm_replay_locks(struct obd_import *imp)
 	struct task_struct *task;
 	int rc = 0;
 
-	class_import_get(imp);
 	/* ensure this doesn't fall to 0 before all have been queued */
-	atomic_inc(&imp->imp_replay_inflight);
+	if (atomic_inc_return(&imp->imp_replay_inflight) > 1) {
+		atomic_dec(&imp->imp_replay_inflight);
+		return 0;
+	}
+	class_import_get(imp);
 
 	task = kthread_run(ldlm_lock_replay_thread, imp, "ldlm_lock_replay");
 	if (IS_ERR(task)) {
diff --git a/lustre/obdclass/obd_config.c b/lustre/obdclass/obd_config.c
index d9fe07b..ae052f2 100644
--- a/lustre/obdclass/obd_config.c
+++ b/lustre/obdclass/obd_config.c
@@ -915,8 +915,8 @@ struct obd_device *class_incref(struct obd_device *obd,
 {
 	lu_ref_add_atomic(&obd->obd_reference, scope, source);
 	atomic_inc(&obd->obd_refcount);
-	CDEBUG(D_INFO, "incref %s (%p) now %d\n", obd->obd_name, obd,
-	       atomic_read(&obd->obd_refcount));
+	CDEBUG(D_INFO, "incref %s (%p) now %d - %s\n", obd->obd_name, obd,
+	       atomic_read(&obd->obd_refcount), scope);
 
 	return obd;
 }
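
Illustration (not part of the patch): the guard added to
ldlm_replay_locks() can be modelled in user space as the rough sketch
below. It assumes C11 <stdatomic.h> and pthreads as stand-ins for the
kernel's atomic_t and kthreads. The names replay_inflight,
replays_started, try_start_replay(), finish_replay(), racer() and
run_race() are hypothetical and do not exist in Lustre.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_RACERS 16

/* Hypothetical stand-in for imp->imp_replay_inflight. */
static atomic_int replay_inflight;
/* Hypothetical counter: how many callers actually started a replay. */
static atomic_int replays_started;

/*
 * Models the guard added to ldlm_replay_locks(). The kernel's
 * atomic_inc_return() yields the new value; atomic_fetch_add() yields
 * the old one, so 1 is added before comparing.
 * Returns 1 if this caller may start the replay, 0 if one is in flight.
 */
static int try_start_replay(void)
{
	if (atomic_fetch_add(&replay_inflight, 1) + 1 > 1) {
		/* Another caller holds the slot: undo our increment, back off. */
		atomic_fetch_sub(&replay_inflight, 1);
		return 0;
	}
	atomic_fetch_add(&replays_started, 1);
	return 1;
}

/* Models the replay thread dropping its in-flight count when done. */
static void finish_replay(void)
{
	int old = atomic_fetch_sub(&replay_inflight, 1);

	assert(old >= 1);
}

static void *racer(void *arg)
{
	(void)arg;
	try_start_replay();
	return NULL;
}

/* Races up to MAX_RACERS callers; returns how many started a replay. */
static int run_race(int nthreads)
{
	pthread_t tids[MAX_RACERS];
	int i, created = 0;

	if (nthreads > MAX_RACERS)
		nthreads = MAX_RACERS;
	for (i = 0; i < nthreads; i++)
		if (pthread_create(&tids[created], NULL, racer, NULL) == 0)
			created++;
	for (i = 0; i < created; i++)
		pthread_join(tids[i], NULL);
	return atomic_load(&replays_started);
}
```

In this model, however many callers race, only the one that moves the
counter from 0 to 1 proceeds. The others restore the counter and return,
which corresponds to the early "return 0" in the patch. A later call can
start a new replay once finish_replay() has run.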