b=23076 fix for o2iblnd reconnect to retry one more time
author Maxim Patlasov <maxim.patlasov@sun.com>
Wed, 23 Jun 2010 15:01:19 +0000 (17:01 +0200)
committer Johann Lombardi <johann@sun.com>
Wed, 23 Jun 2010 15:01:19 +0000 (17:01 +0200)
i=isaac

With peer health detection, o2iblnd makes only one attempt to reconnect,
which is not enough with nodes running Lustre 1.6 because of the protocol
version mismatch.
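
For illustration only, the retry condition before and after this change can be modelled as two standalone predicates. This is a simplified sketch, not the o2iblnd code: struct peer_state and both function names are hypothetical, and each field merely stands in for the corresponding kib_peer_t member or list check seen in the diff.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the peer fields that
 * kiblnd_reconnect() consults; the real peer structure is larger. */
struct peer_state {
        bool tx_queue_empty;  /* models list_empty(&peer->ibp_tx_queue) */
        int  version;         /* models peer->ibp_version */
        int  connecting;      /* models peer->ibp_connecting */
        int  accepting;       /* models peer->ibp_accepting */
};

/* Condition before the patch: retry only if transmits are still queued
 * and this is the only connection attempt in progress. */
static bool should_retry_old(const struct peer_state *p)
{
        return !p->tx_queue_empty &&
               p->connecting == 1 &&
               p->accepting == 0;
}

/* Condition after the patch: a version mismatch also justifies a retry,
 * even when nothing is queued for transmission. */
static bool should_retry_new(const struct peer_state *p, int version)
{
        return (!p->tx_queue_empty || p->version != version) &&
               p->connecting == 1 &&
               p->accepting == 0;
}
```

With an empty queue, connecting == 1, accepting == 0 and a version argument that differs from the stored one, should_retry_old() returns false while should_retry_new() returns true. That appears to be the case the patch comment describes, where the connection attempt was started by kiblnd_query() rather than by a queued transmit.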

lnet/klnds/o2iblnd/o2iblnd_cb.c

index 638ffc5..b6cd310 100644
@@ -2348,8 +2348,12 @@ kiblnd_reconnect (kib_conn_t *conn, int version,
         write_lock_irqsave(&kiblnd_data.kib_global_lock, flags);
 
         /* retry connection if it's still needed and no other connection
-         * attempts (active or passive) are in progress */
-        if (!list_empty(&peer->ibp_tx_queue) &&
+         * attempts (active or passive) are in progress
+         * NB: reconnect is still needed even when ibp_tx_queue is
+         * empty if ibp_version != version because reconnect may be
+         * initiated by kiblnd_query() */
+        if ((!list_empty(&peer->ibp_tx_queue) ||
+             peer->ibp_version != version) &&
             peer->ibp_connecting == 1 &&
             peer->ibp_accepting == 0) {
                 retry = 1;