From 9bbd6eea1e237d9c0a734d88f4ec14812be6e236 Mon Sep 17 00:00:00 2001 From: Linda Bebernes Date: Thu, 31 Oct 2013 14:14:30 -0700 Subject: [PATCH] LUDOC-50 LNET: Added section about LNET Peer Health to Chapter 26 "Lustre Tuning" Added section 26.3.7 "LNET Peer Health". Added ID to Section 36.2.2 describing keepalive information so that I could create a link from LNET peer health content to that information. Signed-off-by: Linda Bebernes Change-Id: If1cd51125df8c01424b1a11bbf1aeff5959b1f40 Reviewed-on: http://review.whamcloud.com/8127 Tested-by: Hudson Reviewed-by: Doug Oucharek Reviewed-by: Richard Henwood --- ConfigurationFilesModuleParameters.xml | 163 +++++++++++++++++++++++---------- LustreTuning.xml | 65 ++++++++++++- 2 files changed, 178 insertions(+), 50 deletions(-) diff --git a/ConfigurationFilesModuleParameters.xml b/ConfigurationFilesModuleParameters.xml index 0c68ebe..1093a14 100644 --- a/ConfigurationFilesModuleParameters.xml +++ b/ConfigurationFilesModuleParameters.xml @@ -267,12 +267,27 @@ forwarding ("") rnet_htable_size is an integer that indicates how many remote networks the internal LNet hash table is configured to handle. rnet_htable_size is used for optimizing the hash table size and does not put a limit on how many remote networks you can have. The default hash table size when this parameter is not specified is: 128. -
- <indexterm><primary>configuring</primary><secondary>network</secondary><tertiary>SOCKLND</tertiary></indexterm> -<literal>SOCKLND</literal> Kernel TCP/IP LND - The SOCKLND kernel TCP/IP LND (socklnd) is connection-based and uses the acceptor to establish communications via sockets with its peers. - It supports multiple instances and load balances dynamically over multiple interfaces. If no interfaces are specified by the ip2nets or networks module parameter, all non-loopback IP interfaces are used. The address-within-network is determined by the address of the first IP interface an instance of the socklnd encounters. - Consider a node on the 'edge' of an InfiniBand network, with a low-bandwidth management Ethernet (eth0), IP over IB configured (ipoib0), and a pair of GigE NICs (eth1,eth2) providing off-cluster connectivity. This node should be configured with 'networks=vib,tcp(eth1,eth2)' to ensure that the socklnd ignores the management Ethernet and IPoIB. +
+ <indexterm> + <primary>configuring</primary> + <secondary>network</secondary> + <tertiary>SOCKLND</tertiary> + </indexterm> + <literal>SOCKLND</literal> Kernel TCP/IP LND + The SOCKLND kernel TCP/IP LND (socklnd) is + connection-based and uses the acceptor to establish communications via sockets with its + peers. + It supports multiple instances and load balances dynamically over multiple interfaces. + If no interfaces are specified by the ip2nets or networks module + parameter, all non-loopback IP interfaces are used. The address-within-network is determined + by the address of the first IP interface an instance of the socklnd + encounters. + Consider a node on the 'edge' of an InfiniBand network, with a low-bandwidth + management Ethernet (eth0), IP over IB configured + (ipoib0), and a pair of GigE NICs + (eth1,eth2) providing off-cluster connectivity. This + node should be configured with 'networks=vib,tcp(eth1,eth2)' to + ensure that the socklnd ignores the management Ethernet and IPoIB. @@ -290,17 +305,22 @@ forwarding ("") - timeout - (50,W) + + timeout + + (50,W) - Time (in seconds) that communications may be stalled before the LND completes them with failure. + Time (in seconds) that communications may be stalled before the LND completes + them with failure. - nconnds - (4) + + nconnds + + (4) Sets the number of connection daemons. - min_reconnectms - (1000,W) + + min_reconnectms + + (1000,W) - Minimum connection retry interval (in milliseconds). After a failed connection attempt, this is the time that must elapse before the first retry. As connections attempts fail, this time is doubled on each successive retry up to a maximum of 'max_reconnectms'. + Minimum connection retry interval (in milliseconds). After a failed connection + attempt, this is the time that must elapse before the first retry. As connection + attempts fail, this time is doubled on each successive retry up to a maximum of + 'max_reconnectms'.
- max_reconnectms - (6000,W) + + max_reconnectms + + (6000,W) Maximum connection retry interval (in milliseconds). @@ -326,27 +353,38 @@ forwarding ("") - eager_ack - (0 on linux, - 1 on darwin,W) + + eager_ack + + (0 on linux, + + 1 on darwin,W) - Boolean that determines whether the socklnd should attempt to flush sends on message boundaries. + Boolean that determines whether the socklnd should attempt + to flush sends on message boundaries. - typed_conns - (1,Wc) + + typed_conns + + (1,Wc) - Boolean that determines whether the socklnd should use different sockets for different types of messages. When clear, all communication with a particular peer takes place on the same socket. Otherwise, separate sockets are used for bulk sends, bulk receives and everything else. + Boolean that determines whether the socklnd should use + different sockets for different types of messages. When clear, all communication + with a particular peer takes place on the same socket. Otherwise, separate sockets + are used for bulk sends, bulk receives and everything else. - min_bulk - (1024,W) + + min_bulk + + (1024,W) Determines when a message is considered "bulk". @@ -354,69 +392,98 @@ forwarding ("") - tx_buffer_size, rx_buffer_size - (8388608,Wc) + + tx_buffer_size, rx_buffer_size + + (8388608,Wc) - Socket buffer sizes. Setting this option to zero (0), allows the system to auto-tune buffer sizes. + Socket buffer sizes. Setting this option to zero (0) allows the system to + auto-tune buffer sizes. - Be very careful changing this value as improper sizing can harm performance. + Be very careful changing this value as improper sizing can harm + performance. - nagle - (0,Wc) + + nagle + + (0,Wc) - Boolean that determines if nagle should be enabled. It should never be set in production systems. + Boolean that determines if nagle should be enabled. It + should never be set in production systems.
- keepalive_idle - (30,Wc) + + keepalive_idle + + (30,Wc) - Time (in seconds) that a socket can remain idle before a keepalive probe is sent. Setting this value to zero (0) disables keepalives. + Time (in seconds) that a socket can remain idle before a keepalive probe is + sent. Setting this value to zero (0) disables keepalives. - keepalive_intvl - (2,Wc) + + keepalive_intvl + + (2,Wc) - Time (in seconds) to repeat unanswered keepalive probes. Setting this value to zero (0) disables keepalives. + Time (in seconds) to repeat unanswered keepalive probes. Setting this value to + zero (0) disables keepalives. - keepalive_count - (10,Wc) + + keepalive_count + + (10,Wc) - Number of unanswered keepalive probes before pronouncing socket (hence peer) death. + Number of unanswered keepalive probes before pronouncing socket (hence peer) + death. - enable_irq_affinity - (0,Wc) + + enable_irq_affinity + + (0,Wc) - Boolean that determines whether to enable IRQ affinity. The default is zero (0). - When set, socklnd attempts to maximize performance by handling device interrupts and data movement for particular (hardware) interfaces on particular CPUs. This option is not available on all platforms. This option requires an SMP system to exist and produces best performance with multiple NICs. Systems with multiple CPUs and a single NIC may see increase in the performance with this parameter disabled. + Boolean that determines whether to enable IRQ affinity. The default is zero + (0). + When set, socklnd attempts to maximize performance by + handling device interrupts and data movement for particular (hardware) interfaces + on particular CPUs. This option is not available on all platforms. This option + requires an SMP system and produces best performance with multiple NICs. + Systems with multiple CPUs and a single NIC may see an increase in performance + with this parameter disabled.
- zc_min_frag - (2048,W) + + zc_min_frag + + (2048,W) - Determines the minimum message fragment that should be considered for zero-copy sends. Increasing it above the platform's PAGE_SIZE disables all zero copy sends. This option is not available on all platforms. + Determines the minimum message fragment that should be considered for + zero-copy sends. Increasing it above the platform's PAGE_SIZE + disables all zero copy sends. This option is not available on all + platforms. diff --git a/LustreTuning.xml b/LustreTuning.xml index 79ab848..40d7fbd 100644 --- a/LustreTuning.xml +++ b/LustreTuning.xml @@ -1,5 +1,4 @@ - - + Lustre Tuning This chapter contains information about tuning Lustre for better performance and includes the following sections: @@ -246,6 +245,68 @@
+
+ LNET Peer Health + Two options are available to help determine peer health: + + peer_timeout - The timeout (in seconds) before an aliveness + query is sent to a peer. For example, if peer_timeout is set to + 180, an aliveness query is sent to the peer every 180 seconds. + This feature only takes effect if the node is configured as an LNET router. + In a routed environment, the peer_timeout feature should always + be on (set to a value in seconds) on routers. If the router checker has been enabled, + the feature should be turned off by setting it to 0 on clients and servers. + For a non-routed scenario, enabling the peer_timeout option + provides health information, such as whether a peer is alive. For example, a + client is able to determine if an MGS or OST is up when it sends a message to it. If a + response is received, the peer is alive; otherwise, the request + times out. + In general, peer_timeout should be set to no less than the LND + timeout setting. For more information about LND timeouts, see . + When the o2iblnd (IB) driver is used, + peer_timeout should be at least twice the value of the + ko2iblnd keepalive option. For more information about keepalive + options, see . + + + avoid_asym_router_failure - When set to 1, the router checker + running on a client or a server periodically pings all the routers corresponding to + the NIDs identified in the routes parameter setting on the node to determine the + status of each router interface. The default setting is 1. (For more information about + the LNET routes parameter, see .) + A router is considered down if any of its NIDs are down. For example, router X has + three NIDs: Xnid1, Xnid2, and + Xnid3. A client is connected to the router via + Xnid1. The client has the router checker enabled. The router checker + periodically sends a ping to the router via Xnid1. The router + responds to the ping with the status of each of its NIDs.
In this case, it responds + with Xnid1=up, Xnid2=up, + Xnid3=down. If avoid_asym_router_failure==1, + the router is considered down if any of its NIDs are down, so router X is considered + down and will not be used for routing messages. If + avoid_asym_router_failure==0, router X will continue to be used + for routing messages. + + + The following router checker parameters must be set to the maximum value of the + corresponding setting for this option on any client or server: + + dead_router_check_interval + + + + live_router_check_interval + + + router_ping_timeout + + + For example, the dead_router_check_interval parameter on any router + must be set to the maximum dead_router_check_interval value + configured on any client or server. +
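The parameters discussed in this section are standard lnet module parameters, so the new section could usefully close with a worked configuration. The following sketch shows one way they might be set in a modprobe configuration file; the values are illustrative assumptions only, not recommendations from this patch, and should be derived from the LND timeout rules described above:

```shell
# Illustrative /etc/modprobe.d/lustre.conf entries -- example values only.

# On an LNET router: enable peer health, sending an aliveness
# query to each peer every 180 seconds.
options lnet peer_timeout=180

# On clients and servers that rely on the router checker instead,
# turn peer_timeout off and let the router checker track router state:
#options lnet peer_timeout=0 avoid_asym_router_failure=1

# Router checker intervals (in seconds). On a router, each value must be
# at least the maximum of the corresponding setting on any client or server:
#options lnet dead_router_check_interval=60 live_router_check_interval=60 router_ping_timeout=50
```

When the o2iblnd (IB) driver is used, the peer_timeout value chosen here should also be at least twice the ko2iblnd keepalive option, per the guidance above.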
<indexterm><primary>tuning</primary><secondary>libcfs</secondary></indexterm>libcfs Tuning -- 1.8.3.1