LU-1239 ldlm: cascading client reconnects
author Vitaly Fertman <vitaly_fertman@xyratex.com>
Tue, 26 Jun 2012 11:37:47 +0000 (15:37 +0400)
committer Oleg Drokin <green@whamcloud.com>
Thu, 5 Jul 2012 18:26:42 +0000 (14:26 -0400)
commit e976ee72602477e54d26693aaeb84709ea5fd38f
tree 208eb2e54973695c82840b66ebdd679ec84fb330
parent 7358bc93e63ca3eff1cde4f73d0e1d81896b1eaa

The following scenario is possible:
- the MDS is overloaded with enqueues; they consume all the threads on
  the MDS_REQUEST portal and wait for a lock that a client holds;
- that client tries to reconnect, but the MDS is out of threads and the
  reconnection fails;
- other clients are waiting for their enqueue completions; they ping the
  MDS to check whether it is still alive, but although the ping is an
  HP-rpc (high-priority RPC), no thread is reserved for it. Thus, the
  other clients time out as well.

Ensure that each service which handles HP-rpcs has an extra thread
reserved for them, and make MDS_CONNECT and OST_CONNECT HP-rpcs.
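
The reservation idea can be illustrated with a small stand-alone model. This
is only a sketch under assumptions, not the code from this patch: the
struct svc_pool type and the svc_may_handle(), svc_start() and svc_flood()
helpers are hypothetical names invented for the example, and the policy
shown (a normal request is refused the last idle thread when the service
has an HP handler) is one plausible way to keep a thread available.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a service thread pool; not Lustre's structures. */
struct svc_pool {
	int  threads_total;	/* threads started for the service */
	int  threads_busy;	/* threads currently handling a request */
	bool handles_hp;	/* service has a high-priority handler */
};

/* May an idle thread pick up a request of the given priority? */
static bool svc_may_handle(const struct svc_pool *p, bool is_hp)
{
	int idle = p->threads_total - p->threads_busy;

	if (idle <= 0)
		return false;
	/* HP requests, and services without HP handling, may use any thread. */
	if (is_hp || !p->handles_hp)
		return true;
	/* Normal request: leave the last idle thread for HP requests. */
	return idle > 1;
}

/* Try to start handling one request; returns true if a thread took it. */
static bool svc_start(struct svc_pool *p, bool is_hp)
{
	if (!svc_may_handle(p, is_hp))
		return false;
	p->threads_busy++;
	assert(p->threads_busy <= p->threads_total);
	return true;
}

/*
 * Model a burst of normal requests (e.g. enqueues) that never complete
 * because they wait for a client-held lock. Returns how many were accepted.
 */
static int svc_flood(struct svc_pool *p, int nreq)
{
	int accepted = 0;

	for (int i = 0; i < nreq; i++)
		if (svc_start(p, false))
			accepted++;
	return accepted;
}
```

In this model, a pool of four threads with handles_hp set accepts only
three blocked enqueues, so a later connect or ping sent as an HP-rpc still
finds a free thread instead of timing out.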

Change-Id: I295aec6a2d2fb614d4b5f037068a3dcdda1a8b09
Signed-off-by: Vitaly Fertman <vitaly_fertman@xyratex.com>
Reviewed-by: Alexey Lyashkov <alexey_lyashkov@xyratex.com>
Reviewed-by: Andrew Perepechko <Andrew_Perepechko@xyratex.com>
Xyratex-bug-id: MRP-455
Reviewed-on: http://review.whamcloud.com/2355
Tested-by: Hudson
Tested-by: Maloo <whamcloud.maloo@gmail.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
lustre/include/lustre_net.h
lustre/ldlm/ldlm_lockd.c
lustre/mdt/mdt_handler.c
lustre/ost/ost_handler.c
lustre/ptlrpc/events.c
lustre/ptlrpc/service.c