LU-7578 gnilnd: Add module parameter reg_fail_timeout
author Chuck Fossen <chuckf@cray.com>
Mon, 1 Feb 2016 23:46:00 +0000 (18:46 -0500)
committer Oleg Drokin <oleg.drokin@intel.com>
Fri, 5 Feb 2016 14:57:01 +0000 (14:57 +0000)
commit 5b787cb7a375372c7a4f3c405d38137a7a867677
tree 2bd400d00d228a86f71290eba59217f212d6d862
parent a62050bbcf70831f3c16b5c61a04816c1296909b
During network outages on very large machines, it is possible to use
up all of GART space with connections that are in purgatory, waiting
to be freed when we finally make a new connection.
This change adds a timeout parameter, reg_fail_timeout, so that when
memory registration for fma blocks keeps failing for that period of
time, we can bring the node down rather than leave it stuck in a
state of being up but unusable.
This can only happen on service nodes, as they can potentially have
tens of thousands of connections.
A recommended setting for reg_fail_timeout is 60 to 300 seconds.
The default setting for reg_fail_timeout is -1 (disabled).
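
The timeout behavior described above can be illustrated with a minimal
user-space sketch. This is not the code from this change: the struct
reg_fail_tracker and the helper reg_fail_check() are hypothetical names
invented for illustration, and only the parameter name reg_fail_timeout
and its default of -1 come from the commit message. The sketch assumes
that the timer starts at the first failed registration and resets on
any successful one.

```c
#include <stdbool.h>

/* Stand-in for the module parameter: seconds, -1 means disabled. */
static int reg_fail_timeout = -1;

/* Hypothetical bookkeeping for one run of consecutive failures. */
struct reg_fail_tracker {
	bool failing;     /* true while registrations keep failing */
	long first_fail;  /* time (seconds) of the first failure in the run */
};

/*
 * Record the outcome of one registration attempt made at time 'now'.
 * Returns true when failures have persisted for at least
 * reg_fail_timeout seconds, i.e. when the caller should bring the
 * node down. A successful registration ends the current run.
 */
static bool reg_fail_check(struct reg_fail_tracker *t, bool reg_ok, long now)
{
	if (reg_ok) {
		t->failing = false;
		return false;
	}

	if (!t->failing) {
		/* First failure of a new run: start the clock. */
		t->failing = true;
		t->first_fail = now;
	}

	if (reg_fail_timeout < 0)
		return false;	/* feature disabled */

	return (now - t->first_fail) >= reg_fail_timeout;
}
```

With reg_fail_timeout set to 60, a run of failures starting at t=100
would trip the check at t=160, while a success at any point before that
would restart the clock on the next failure.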

To test, set fail_loc 0xf002, which fails memory registrations, and
verify that we BUG once the configured timeout has elapsed.
Also verify that transient registration failures within the timeout
period do not cause a BUG.

Signed-off-by: Chris Horn <hornc@cray.com>
Signed-off-by: Chuck Fossen <chuckf@cray.com>
Change-Id: I214b5e5a297c547f3c4675fcc263e5dd8aaed24f
Reviewed-on: http://review.whamcloud.com/17664
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Tested-by: James Simmons <uja.ornl@yahoo.com>
Tested-by: Jenkins
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
lnet/klnds/gnilnd/gnilnd.h
lnet/klnds/gnilnd/gnilnd_conn.c
lnet/klnds/gnilnd/gnilnd_modparams.c