From: Bruno Faccini
Date: Fri, 3 Feb 2017 16:55:37 +0000 (+0100)
Subject: LU-9075 mdt: avoid race causing mdt_coordinator_cb() err msgs
X-Git-Tag: 2.9.57~79
X-Git-Url: https://git.whamcloud.com/?p=fs%2Flustre-release.git;a=commitdiff_plain;h=d1535dc90b01770e56a0c79c7bb1e7c9cd8f1c6a

LU-9075 mdt: avoid race causing mdt_coordinator_cb() err msgs

This patch mainly moves the mdt_agent_record_update() call before
mdt_cdt_remove_request() in mdt_hsm_update_request_state(), to avoid
the frequent pair of error messages
"(mdt_coordinator.c:1473:mdt_hsm_update_request_state()) ... Cannot find running request for cookie ..."
and
"(mdt_coordinator.c:339:mdt_coordinator_cb()) ... cannot cleanup timed out request ...".
These messages most likely concern active requests that have completed
and have therefore already been removed from memory in
mdt_hsm_update_request_state() (by mdt_cdt_remove_request(), in the
context of an MDT thread handling the CT's MDS_HSM_PROGRESS requests),
while the corresponding action LLOG record update is still stuck in
mdt_agent_record_update(), waiting for the CDT to give back
cdt_llog_lock.

Other related but minor changes: use arr_req_change instead of
arr_req_create to determine more accurately whether a request exceeds
the timeout, and change the main debug message in
mdt_hsm_update_request_state() to reflect whether the action LLOG
record update will occur.

Signed-off-by: Bruno Faccini
Change-Id: I043813f1ff11a7e9e99c534fa8560a35e2c52543
Reviewed-on: https://review.whamcloud.com/25243
Tested-by: Jenkins
Tested-by: Maloo
Reviewed-by: Henri Doreau
Reviewed-by: Quentin Bouget
Reviewed-by: Oleg Drokin
---

diff --git a/lustre/mdt/mdt_coordinator.c b/lustre/mdt/mdt_coordinator.c
index c563284..445bf7d 100644
--- a/lustre/mdt/mdt_coordinator.c
+++ b/lustre/mdt/mdt_coordinator.c
@@ -284,7 +284,7 @@ static int mdt_coordinator_cb(const struct lu_env *env,
 		 */
 		car = mdt_cdt_find_request(cdt, larr->arr_hai.hai_cookie, NULL);
 		if (car == NULL) {
-			last = larr->arr_req_create;
+			last = larr->arr_req_change;
 		} else {
 			last = car->car_req_update;
 			mdt_cdt_put_request(car);
@@ -1427,15 +1427,14 @@ int mdt_hsm_update_request_state(struct mdt_thread_info *mti,
 
 		rc = hsm_cdt_request_completed(mti, pgs, car, &status);
 
-		/* remove request from memory list */
-		mdt_cdt_remove_request(cdt, pgs->hpk_cookie);
-
-		CDEBUG(D_HSM, "Updating record: fid="DFID" cookie=%#llx"
-		       " action=%s status=%s\n", PFID(&pgs->hpk_fid),
-		       pgs->hpk_cookie,
+		CDEBUG(D_HSM, "%s record: fid="DFID" cookie=%#llx action=%s "
+		       "status=%s\n",
+		       update_record ? "Updating" : "Not updating",
+		       PFID(&pgs->hpk_fid), pgs->hpk_cookie,
 		       hsm_copytool_action2name(car->car_hai->hai_action),
 		       agent_req_status2name(status));
 
+		/* update record first (LU-9075) */
 		if (update_record) {
 			int rc1;
 
@@ -1451,6 +1450,10 @@ int mdt_hsm_update_request_state(struct mdt_thread_info *mti,
 					 pgs->hpk_cookie);
 			rc = (rc != 0 ? rc : rc1);
 		}
+
+		/* then remove request from memory list (LU-9075) */
+		mdt_cdt_remove_request(cdt, pgs->hpk_cookie);
+
 		/* ct has completed a request, so a slot is available, wakeup
 		 * cdt to find new work */
 		mdt_hsm_cdt_wakeup(mdt);
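
Illustration (not part of the patch): the ordering problem described in
the commit message can be sketched with a small single-threaded model.
Everything below is hypothetical; the sim_* names, the two-value status
enum and the struct are invented for the example and are not Lustre
code. The model ignores cdt_llog_lock and the timeout check. It places
one coordinator scan in the window between the two completion steps and
counts how often the scan sees a record that is still STARTED while no
matching in-memory request exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the real state. */
enum sim_rec_status { SIM_STARTED, SIM_SUCCEED };

struct sim_state {
	enum sim_rec_status llog_status; /* persistent action record */
	bool in_memory;                  /* request still in the CDT list */
	int spurious_errors;             /* scans that hit the bad combination */
};

/*
 * Models the coordinator scan: a STARTED record with no in-memory
 * request looks like an orphan and is reported as an error.
 */
static void sim_coordinator_scan(struct sim_state *s)
{
	if (s->llog_status == SIM_STARTED && !s->in_memory) {
		s->spurious_errors++;
		printf("scan: record STARTED but no running request\n");
	}
}

static void sim_update_record(struct sim_state *s)
{
	s->llog_status = SIM_SUCCEED;
}

static void sim_remove_request(struct sim_state *s)
{
	s->in_memory = false;
}

/*
 * Complete one request with a scan interleaved between the two steps.
 * record_first == true models the order introduced by this patch.
 * Returns the number of spurious errors the scan reported.
 */
static int sim_complete(bool record_first)
{
	struct sim_state s = { SIM_STARTED, true, 0 };

	if (record_first) {
		sim_update_record(&s);
		sim_coordinator_scan(&s);
		sim_remove_request(&s);
	} else {
		sim_remove_request(&s);
		sim_coordinator_scan(&s);
		sim_update_record(&s);
	}

	/* Both orders end in the same final state. */
	assert(s.llog_status == SIM_SUCCEED && !s.in_memory);
	return s.spurious_errors;
}
```

Calling sim_complete(false) models the old order (remove, then update)
and sim_complete(true) the new one. Both reach the same final state;
only the intermediate state visible to the scan differs, which is why
reordering the two calls is enough in this simplified model.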