LU-5568 lnet: fix kernel crash when network failed to start 18/11718/11
author Wang Shilong <wshilong@ddn.com>
Mon, 1 Sep 2014 20:44:38 +0000 (16:44 -0400)
committer Oleg Drokin <oleg.drokin@intel.com>
Thu, 30 Oct 2014 02:59:19 +0000 (02:59 +0000)
commit 8fab48a8ef0bad6961c2ca1e2959726252ba43ac
tree 3ced8063c7c1f0ac42cc02fdbc45bb47e20b78d2
parent 0a18a6a2b2c78a8079131848418a4dc8ebb594d1
LU-5568 lnet: fix kernel crash when network failed to start

When Lustre modules are loaded without a proper network configuration,
the kernel always hits the following panic:

LNetError: 105-4: Error -100 starting up LNI tcp
LNetError: 2145:0:(api-ni.c:823:lnet_unprepare())
 ASSERTION( list_empty(&the_lnet.ln_nis) ) failed:
LNetError: 2145:0:(api-ni.c:823:lnet_unprepare()) LBUG
Pid: 2145, comm: modprobe
Call Trace:
[<ffffffffa044f853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
[<ffffffffa044fdf5>] lbug_with_loc+0x45/0xc0 [libcfs]
[<ffffffffa04f3267>] lnet_unprepare+0x297/0x340 [lnet]
[<ffffffffa04f3b5c>] LNetNIInit+0x25c/0x3e0 [lnet]
[<ffffffff81061bc6>] ? put_online_cpus+0x56/0x80
[<ffffffffa0983000>] ? init_module+0x0/0x1000 [ptlrpc]
[<ffffffffa081310c>] ptlrpc_ni_init+0x2c/0x1a0 [ptlrpc]
[<ffffffffa0983000>] ? init_module+0x0/0x1000 [ptlrpc]
[<ffffffffa0813291>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
[<ffffffffa0983000>] ? init_module+0x0/0x1000 [ptlrpc]
[<ffffffffa09831c4>] init_module+0x1c4/0x1000 [ptlrpc]
[<ffffffff810020e2>] do_one_initcall+0xe2/0x190
[<ffffffff810ca7fb>] load_module+0x129b/0x1a90
[<ffffffff812da590>] ? ddebug_dyndbg_module_param_cb+0x0/0x60
[<ffffffff810c7133>] ? copy_module_from_fd.isra.43+0x53/0x150
[<ffffffff810cb1a6>] SyS_finit_module+0xa6/0xd0
[<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
...

This happens because lnet_startup_lndnis() may add items to
@the_lnet.ln_nis and @the_lnet.ln_nis_cpt before it fails. The
lnet_startup_lndnis() failure path did not clean up these lists,
which triggered the assertion in lnet_unprepare().
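
The failure mode can be modeled in a few lines of C. This is a simplified
sketch, not the Lustre code: struct toy_net, toy_startup_nis_buggy() and
toy_list_is_empty() are hypothetical names, and a plain counter stands in
for the kernel list @the_lnet.ln_nis.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the_lnet; a counter replaces the ln_nis list. */
struct toy_net {
	int nis_on_list;
};

/*
 * Model of a startup loop without cleanup: NIs are added one by one, and
 * an error on NI number fail_at returns immediately, leaving the NIs that
 * were already added on the list.
 */
static int toy_startup_nis_buggy(struct toy_net *net, int total, int fail_at)
{
	int i;

	assert(net != NULL);	/* sketch-only sanity check */

	for (i = 0; i < total; i++) {
		if (i == fail_at)
			return -100;	/* same error value as in the log above */
		net->nis_on_list++;
	}
	return 0;
}

/* Model of the condition that lnet_unprepare() asserts on. */
static bool toy_list_is_empty(const struct toy_net *net)
{
	return net->nis_on_list == 0;
}
```

In this model, a failure on the third of three NIs leaves two entries
behind, so the emptiness check that corresponds to the assertion in
lnet_unprepare() no longer holds.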

Fix this problem by:
1) moving lnet_shutdown_lndnis() back into lnet_startup_lndnis() so
that lnet_startup_lndnis() cleans up after itself;

2) moving the code in lnet_startup_lndnis() that starts a single
NI into a new function called lnet_startup_lndni();

3) making lnet_dyn_add_ni() call lnet_startup_lndni() instead of
lnet_startup_lndnis().
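
The shape of the fix can be sketched in the same spirit. This is an
illustration under stated assumptions, not the patch itself: every toy_*
name is hypothetical, a counter stands in for the NI lists, and the real
functions take different arguments and do far more work.

```c
#include <assert.h>

/* Hypothetical stand-in for the_lnet; a counter replaces the NI lists. */
struct toy_net {
	int nis_on_list;
};

/* Starts a single NI (the role given to lnet_startup_lndni()). */
static int toy_startup_ni(struct toy_net *net, int should_fail)
{
	if (should_fail)
		return -100;	/* same error value as in the log above */
	net->nis_on_list++;
	return 0;
}

/* Removes every started NI (the role of lnet_shutdown_lndnis()). */
static void toy_shutdown_nis(struct toy_net *net)
{
	net->nis_on_list = 0;
}

/* Starts all NIs; on any failure it cleans up before returning. */
static int toy_startup_nis(struct toy_net *net, int total, int fail_at)
{
	int i, rc;

	for (i = 0; i < total; i++) {
		rc = toy_startup_ni(net, i == fail_at);
		if (rc != 0) {
			toy_shutdown_nis(net);	/* startup cleans up after itself */
			return rc;
		}
	}
	return 0;
}

/* Dynamic add goes through the single-NI helper, not the full startup. */
static int toy_dyn_add_ni(struct toy_net *net, int should_fail)
{
	return toy_startup_ni(net, should_fail);
}

/* Mirrors the emptiness assertion in lnet_unprepare(). */
static void toy_unprepare(struct toy_net *net)
{
	assert(net->nis_on_list == 0);
}
```

With cleanup inside the startup function, a failed toy_startup_nis() can be
followed by toy_unprepare() without tripping the assertion.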

This patch also fixes the problem addressed in LU-5734, since the two
are closely related.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Change-Id: I1082361626881e798fca49981fe92b4082769ecf
Reviewed-on: http://review.whamcloud.com/11718
Tested-by: Jenkins
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Liang Zhen <liang.zhen@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
lnet/lnet/api-ni.c
lnet/lnet/config.c