From 53737bf2055cddabf53140dce08d0e1009aab483 Mon Sep 17 00:00:00 2001
From: Doug Oucharek
Date: Fri, 30 Oct 2015 18:11:31 -0700
Subject: [PATCH] LU-7210 lnet: Change connect peer failed cleanup order

A race condition has been found where connd is cleaning up failed
connections, the peer ref counter goes to zero, but we still have
a connecting counter > 0.

One possible race is when we are retrying a connection by calling
kiblnd_connect_peer(), which itself fails, decrements the peer ref
counter, and gets swapped out before it can decrement the connecting
counter. connd swaps in and cleans up the connection, where it sees
a peer ref counter of 1 and a connecting counter of 1. This will
trigger the assert seen in LU-7210 when it decrements the peer
counter.

The solution: be sure to decrement the connecting counter before
decrementing the peer counter in the peer connect failure path.

Signed-off-by: Doug Oucharek
Change-Id: I2d6ddeae80ac72492a4323a730e3e61c876ebb36
Reviewed-on: http://review.whamcloud.com/17004
Tested-by: Jenkins
Tested-by: Maloo
Reviewed-by: James Simmons
Reviewed-by: Amir Shehata
Reviewed-by: Oleg Drokin
---
 lnet/klnds/o2iblnd/o2iblnd_cb.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/lnet/klnds/o2iblnd/o2iblnd_cb.c b/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 8ef29f3..9567d0e 100644
--- a/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -1290,13 +1290,15 @@ kiblnd_connect_peer (kib_peer_t *peer)
 		libcfs_nid2str(peer->ibp_nid), dev->ibd_ifname,
 		&dev->ibd_ifip, cmid->device->name);
 
-        return;
+	return;
 
 failed2:
-        kiblnd_peer_decref(peer);               /* cmid's ref */
-        rdma_destroy_id(cmid);
+	kiblnd_peer_connect_failed(peer, 1, rc);
+	kiblnd_peer_decref(peer);               /* cmid's ref */
+	rdma_destroy_id(cmid);
+	return;
 failed:
-        kiblnd_peer_connect_failed(peer, 1, rc);
+	kiblnd_peer_connect_failed(peer, 1, rc);
 }
 
 void
-- 
1.8.3.1
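
The ordering argument in the commit message can be illustrated with a small
single-threaded model. This is only a sketch, not the o2iblnd code: the
struct and the toy_* helpers are hypothetical, the starting reference count
of 2 is an assumption, and the real kiblnd_peer_connect_failed() does more
than decrement a counter. The model replays one fixed interleaving, in which
connd drops its reference between the two steps of the failure path, and
reports whether the "no references left implies nothing still connecting"
invariant held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for kib_peer_t: only the two counters the
 * commit message discusses. */
struct toy_peer {
	int refcount;   /* models the peer ref counter */
	int connecting; /* models the connecting counter */
};

/* Drop one reference. Returns false if the last reference went away
 * while a connection attempt was still being counted, which is the
 * condition the assert described in the commit message guards. */
static bool toy_peer_decref(struct toy_peer *p)
{
	p->refcount--;
	if (p->refcount == 0)
		return p->connecting == 0;
	return true;
}

/* Models only the counter decrement of the connect-failed path. */
static void toy_connect_failed(struct toy_peer *p)
{
	p->connecting--;
}

/* Replay the interleaving where connd's decref runs between the two
 * steps of the failure path. counter_first selects the patched order. */
static bool toy_worst_case(bool counter_first)
{
	/* Assumed starting state: two references, one attempt in flight. */
	struct toy_peer p = { .refcount = 2, .connecting = 1 };
	bool ok = true;

	if (counter_first) {
		toy_connect_failed(&p);         /* failure path, step 1 */
		ok = toy_peer_decref(&p) && ok; /* connd: 2 -> 1 */
		ok = toy_peer_decref(&p) && ok; /* failure path, step 2: 1 -> 0 */
	} else {
		ok = toy_peer_decref(&p) && ok; /* failure path, step 1: 2 -> 1 */
		ok = toy_peer_decref(&p) && ok; /* connd: 1 -> 0, still connecting */
		toy_connect_failed(&p);         /* failure path, step 2: too late */
	}
	return ok;
}
```

Calling toy_worst_case(false) models the pre-patch order and returns false,
while toy_worst_case(true) models the patched order and returns true. The
model says nothing about locking or about other interleavings in the real
driver.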