From: Aurelien Degremont
Date: Wed, 23 Sep 2020 19:20:08 +0000 (+0000)
Subject: LU-13984 ptlrpc: throttle RPC resend if network error
X-Git-Tag: 2.13.57~101
X-Git-Url: https://git.whamcloud.com/?p=fs%2Flustre-release.git;a=commitdiff_plain;h=4103527c1c9b38cb60c95a8f0ace2da1d246c3fc;ds=sidebyside

LU-13984 ptlrpc: throttle RPC resend if network error

When sending a callback AST to a non-responding client, the server
retries endlessly until the client is eventually evicted. When using
ksocklnd, the server retries after each AST timeout until the socket
is closed, after sock_timeout seconds; from then on each retry fails
immediately with -110 (-ETIMEDOUT), as no socket can be established.
The thread spins on retrying and failing until the client is finally
evicted. This causes high thread CPU usage and possible resource
denial.

To work around this, this patch skips the callback resend when:
- the request is flagged with both a network error and a timeout
- the last try was less than 1 second ago

In the worst case, the retry happens after a timeout based on
req->rq_deadline. If there is nothing else to handle, the thread
sleeps during that time, removing the CPU overhead.

Signed-off-by: Aurelien Degremont
Change-Id: Ie5028761c978b26e833fd0a5d30d313addf57984
Reviewed-on: https://review.whamcloud.com/40020
Reviewed-by: Andreas Dilger
Tested-by: jenkins
Tested-by: Maloo
Reviewed-by: Alexander Boyko
Reviewed-by: Oleg Drokin
---

diff --git a/lustre/ptlrpc/client.c b/lustre/ptlrpc/client.c
index a902a5b..368732a 100644
--- a/lustre/ptlrpc/client.c
+++ b/lustre/ptlrpc/client.c
@@ -2001,6 +2001,27 @@ int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
 				GOTO(interpret, req->rq_status);
 			}
 
+			/* don't resend too fast in case of network
+			 * errors.
+			 */
+			if (ktime_get_real_seconds() < (req->rq_sent + 1)
+			    && req->rq_net_err && req->rq_timedout) {
+
+				DEBUG_REQ(D_INFO, req,
+					  "throttle request");
+				/* Don't try to resend RPC right away
+				 * as it is likely it will fail again
+				 * and ptlrpc_check_set() will be
+				 * called again, keeping this thread
+				 * busy. Instead, wait for the next
+				 * timeout. Flag it as resend to
+				 * ensure we don't wait too long.
+				 */
+				req->rq_resend = 1;
+				spin_unlock(&imp->imp_lock);
+				continue;
+			}
+
 			list_move_tail(&req->rq_list, &imp->imp_sending_list);
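
For illustration, below is a minimal standalone sketch of the same
throttling decision, outside of Lustre. The struct and its fields are
hypothetical stand-ins for ptlrpc_request's rq_sent, rq_net_err,
rq_timedout and rq_resend; only the one-second check and the
flag-and-defer behavior mirror the patch.

	#include <stdbool.h>
	#include <stdio.h>
	#include <time.h>

	/* Hypothetical stand-in for the relevant ptlrpc_request fields. */
	struct fake_request {
		time_t	sent;		/* when the last attempt went out */
		bool	net_err;	/* last attempt hit a network error */
		bool	timedout;	/* last attempt also timed out */
		bool	resend;		/* flagged for resend at next timeout */
	};

	/*
	 * Return true when the request should NOT be resent immediately:
	 * the previous attempt failed with both a network error and a
	 * timeout, and it was sent less than one second ago. As in the
	 * patch, the request is flagged for resend so the next timeout
	 * picks it up, instead of retrying in a tight loop.
	 */
	static bool throttle_resend(struct fake_request *req)
	{
		if (time(NULL) < req->sent + 1 &&
		    req->net_err && req->timedout) {
			req->resend = true;
			return true;
		}
		return false;
	}

	int main(void)
	{
		struct fake_request req = {
			.sent = time(NULL),	/* just sent */
			.net_err = true,
			.timedout = true,
		};

		if (throttle_resend(&req))
			printf("throttled: wait for next timeout (resend=%d)\n",
			       req.resend);
		else
			printf("resend now\n");

		return 0;
	}

The point of setting the resend flag rather than retrying inline is
that the next attempt is then driven by the request's deadline, so an
otherwise idle thread sleeps instead of spinning on a dead connection.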