LU-7210 lnet: Change connect peer failed cleanup order 04/17004/2
author Doug Oucharek <doug.s.oucharek@intel.com>
Sat, 31 Oct 2015 01:11:31 +0000 (18:11 -0700)
committer Oleg Drokin <oleg.drokin@intel.com>
Fri, 8 Jan 2016 13:49:07 +0000 (13:49 +0000)
commit 53737bf2055cddabf53140dce08d0e1009aab483
tree a2347dbbb2f32b76554dcf869d257deeef977e4d
parent 3efb7683679ab2d18b4d2b256acd462596324d9c
LU-7210 lnet: Change connect peer failed cleanup order

A race condition has been found where connd, while cleaning up
failed connections, drops the peer ref counter to zero while the
connecting counter is still > 0.

One possible race is when we are retrying a connection by
calling kiblnd_connect_peer() which itself fails and decrements
the peer ref counter and gets swapped out before it can decrement
the connecting counter.  connd swaps in and cleans up the
connection where it sees a peer ref counter of 1 and a connecting
counter of 1.  This will trigger the assert seen in LU-7210 when
it decrements the peer counter.

The solution: be sure to decrement the connecting counter
before decrementing the peer counter in the peer connect
failure path.
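
The ordering can be illustrated with a minimal, single-threaded sketch.
It is not the o2iblnd code: struct sketch_peer and the sketch_*
functions are hypothetical names invented for illustration, and the
check that would be an assert in the real code is modeled as a false
return value so the bad interleaving can be observed without aborting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a peer; field names are illustrative only. */
struct sketch_peer {
	int  refcount;    /* references currently held on the peer */
	int  connecting;  /* connection attempts still in flight */
	bool destroyed;   /* set once the last reference is dropped */
};

/*
 * Drop one reference. When the last reference goes away, nothing may
 * still be connecting. Returns false where the real code would assert.
 */
static bool sketch_peer_decref(struct sketch_peer *peer)
{
	assert(peer->refcount > 0);
	if (--peer->refcount == 0) {
		if (peer->connecting != 0)
			return false;	/* invariant violated */
		peer->destroyed = true;
	}
	return true;
}

/*
 * Connect-failure path in the order described above: clear the
 * connecting counter first, then drop the reference.
 */
static bool sketch_connect_failed(struct sketch_peer *peer)
{
	assert(peer->connecting > 0);
	peer->connecting--;			/* 1: no longer connecting */
	return sketch_peer_decref(peer);	/* 2: then drop our ref */
}
```

With refcount = 2 and connecting = 1, the old order corresponds to
calling sketch_peer_decref() once (the failing connect path), then
again on behalf of connd before connecting has been decremented; the
second call reports the violation. With sketch_connect_failed() run
first, the later decref by connd finds connecting == 0 and the peer is
torn down cleanly.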

Signed-off-by: Doug Oucharek <doug.s.oucharek@intel.com>
Change-Id: I2d6ddeae80ac72492a4323a730e3e61c876ebb36
Reviewed-on: http://review.whamcloud.com/17004
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
lnet/klnds/o2iblnd/o2iblnd_cb.c