   1 <?xml version='1.0' encoding='UTF-8'?><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="lnetmr" condition='l2A'>
   2   <title xml:id="lnetmr.title">LNet Software Multi-Rail</title>
   3   <para>This chapter describes LNet Software Multi-Rail configuration and
   4   administration.</para>
   5   <itemizedlist>
   6     <listitem>
   7       <para><xref linkend="dbdoclet.mroverview"/></para>
   8       <para><xref linkend="dbdoclet.mrconfiguring"/></para>
   9       <para><xref linkend="dbdoclet.mrrouting"/></para>
  10       <para><xref linkend="mrrouting.health"/></para>
  11       <para><xref linkend="dbdoclet.mrhealth"/></para>
  12     </listitem>
  13   </itemizedlist>
  14   <section xml:id="dbdoclet.mroverview">
  15     <title><indexterm><primary>MR</primary><secondary>overview</secondary>
  16     </indexterm>Multi-Rail Overview</title>
  17     <para>In computer networking, multi-rail is an arrangement in which a
  18     node uses two or more network interfaces to a single network in order to
  19     increase throughput.  Multi-rail also covers the case where a node has
  20     one or more interfaces to multiple, possibly different kinds of networks,
  21     such as Ethernet, InfiniBand, and Intel® Omni-Path. For Lustre clients,
  22     multi-rail generally presents the combined network capabilities as a single
  23     LNet network.  Peer nodes that are multi-rail capable are established during
  24     configuration, as are user-defined interface-selection policies.</para>
  25     <para>The following link contains a detailed high-level design for the
  26     feature:
  27     <link xl:href="http://wiki.lustre.org/images/b/bb/Multi-Rail_High-Level_Design_20150119.pdf">
  28     Multi-Rail High-Level Design</link></para>
  29   </section>
  30   <section xml:id="dbdoclet.mrconfiguring">
  31     <title><indexterm><primary>MR</primary><secondary>configuring</secondary>
  32     </indexterm>Configuring Multi-Rail</title>
  33     <para>Every node using multi-rail networking needs to be properly
  34     configured.  Multi-rail uses <literal>lnetctl</literal> and the LNet
  35     Configuration Library for configuration.  Configuring multi-rail for a
  36     given node involves two tasks:</para>
  37     <orderedlist>
  38       <listitem><para>Configuring multiple network interfaces present on the
  39       local node.</para></listitem>
  40       <listitem><para>Adding remote peers that are multi-rail capable, that
  41       is, peers connected to one or more common networks with at least two
  42       interfaces.</para></listitem>
  43     </orderedlist>
  44     <para>This section is a supplement to
  45       <xref linkend="lnet_config.lnetaddshowdelete" /> and contains further
  46       examples for Multi-Rail configurations.</para>
  47     <para>For information on the dynamic peer discovery feature added in
  48       Lustre Release 2.11.0, see
  49       <xref linkend="lnet_config.dynamic_discovery" />.</para>
  50     <section xml:id="dbdoclet.addinterfaces">
  51       <title><indexterm><primary>MR</primary>
  52       <secondary>multipleinterfaces</secondary>
  53       </indexterm>Configure Multiple Interfaces on the Local Node</title>
  54       <para>Example <literal>lnetctl net add</literal> command with multiple
  55       interfaces in a Multi-Rail configuration:</para>
  56       <screen>lnetctl net add --net tcp --if eth0,eth1</screen>
  57       <para>Example of YAML net show:</para>
  58       <screen>lnetctl net show -v
  59 net:
  60     - net type: lo
  61       local NI(s):
  62         - nid: 0@lo
  63           status: up
  64           statistics:
  65               send_count: 0
  66               recv_count: 0
  67               drop_count: 0
  68           tunables:
  69               peer_timeout: 0
  70               peer_credits: 0
  71               peer_buffer_credits: 0
  72               credits: 0
  73           lnd tunables:
  74           tcp bonding: 0
  75           dev cpt: 0
  76           CPT: "[0]"
  77     - net type: tcp
  78       local NI(s):
  79         - nid: 192.168.122.10@tcp
  80           status: up
  81           interfaces:
  82               0: eth0
  83           statistics:
  84               send_count: 0
  85               recv_count: 0
  86               drop_count: 0
  87           tunables:
  88               peer_timeout: 180
  89               peer_credits: 8
  90               peer_buffer_credits: 0
  91               credits: 256
  92           lnd tunables:
  93           tcp bonding: 0
  94           dev cpt: -1
  95           CPT: "[0]"
  96         - nid: 192.168.122.11@tcp
  97           status: up
  98           interfaces:
  99               0: eth1
 100           statistics:
 101               send_count: 0
 102               recv_count: 0
 103               drop_count: 0
 104           tunables:
 105               peer_timeout: 180
 106               peer_credits: 8
 107               peer_buffer_credits: 0
 108               credits: 256
 109           lnd tunables:
 110           tcp bonding: 0
 111           dev cpt: -1
 112           CPT: "[0]"</screen>
 113     </section>
 114     <section xml:id="dbdoclet.deleteinterfaces">
 115       <title><indexterm><primary>MR</primary>
 116         <secondary>deleteinterfaces</secondary>
 117         </indexterm>Deleting Network Interfaces</title>
 118       <para>Example delete with <literal>lnetctl net del</literal>:</para>
 119       <para>Assuming the network configuration is as shown above with
 120       <literal>lnetctl net show -v</literal> in the previous section, we can
 121       delete an interface from a net with the following command:</para>
 122       <screen>lnetctl net del --net tcp --if eth0</screen>
 123       <para>The resultant net information would look like:</para>
 124       <screen>lnetctl net show -v
 125 net:
 126     - net type: lo
 127       local NI(s):
 128         - nid: 0@lo
 129           status: up
 130           statistics:
 131               send_count: 0
 132               recv_count: 0
 133               drop_count: 0
 134           tunables:
 135               peer_timeout: 0
 136               peer_credits: 0
 137               peer_buffer_credits: 0
 138               credits: 0
 139           lnd tunables:
 140           tcp bonding: 0
 141           dev cpt: 0
 142           CPT: "[0,1,2,3]"</screen>
 143       <para>The syntax of a YAML file to perform a delete would be:</para>
 144       <screen>- net type: tcp
 145    local NI(s):
 146      - nid: 192.168.122.10@tcp
 147        interfaces:
 148            0: eth0</screen>
 149     </section>
 150     <section xml:id="dbdoclet.addremotepeers">
 151       <title><indexterm><primary>MR</primary>
 152         <secondary>addremotepeers</secondary>
 153         </indexterm>Adding Remote Peers that are Multi-Rail Capable</title>
 154       <para>The following example <literal>lnetctl peer add</literal>
 155       command adds a peer with two NIDs, with
 156         <literal>192.168.122.30@tcp</literal> being the primary NID:</para>
 157       <screen>lnetctl peer add --prim_nid 192.168.122.30@tcp --nid 192.168.122.30@tcp,192.168.122.31@tcp
 158       </screen>
 159       <para>The resulting <literal>lnetctl peer show</literal> would be:
 160         <screen>lnetctl peer show -v
 161 peer:
 162     - primary nid: 192.168.122.30@tcp
 163       Multi-Rail: True
 164       peer ni:
 165         - nid: 192.168.122.30@tcp
 166           state: NA
 167           max_ni_tx_credits: 8
 168           available_tx_credits: 8
 169           min_tx_credits: 7
 170           tx_q_num_of_buf: 0
 171           available_rtr_credits: 8
 172           min_rtr_credits: 8
 173           refcount: 1
 174           statistics:
 175               send_count: 2
 176               recv_count: 2
 177               drop_count: 0
 178         - nid: 192.168.122.31@tcp
 179           state: NA
 180           max_ni_tx_credits: 8
 181           available_tx_credits: 8
 182           min_tx_credits: 7
 183           tx_q_num_of_buf: 0
 184           available_rtr_credits: 8
 185           min_rtr_credits: 8
 186           refcount: 1
 187           statistics:
 188               send_count: 1
 189               recv_count: 1
 190               drop_count: 0</screen>
 191       </para>
 192       <para>The following is an example YAML file for adding a peer:</para>
 193       <screen>addPeer.yaml
 194 peer:
 195     - primary nid: 192.168.122.30@tcp
 196       Multi-Rail: True
 197       peer ni:
 198         - nid: 192.168.122.31@tcp</screen>
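      <para>A peer file in this layout can also be generated by a script. The
      following Python sketch is illustrative only: the helper name
      <literal>peer_yaml</literal> is hypothetical and is not part of any Lustre
      tool. It simply renders the same fields shown in
      <literal>addPeer.yaml</literal> above.</para>

```python
def peer_yaml(primary_nid, peer_nids):
    """Render a peer block in the layout of addPeer.yaml."""
    lines = [
        "peer:",
        "    - primary nid: {}".format(primary_nid),
        "      Multi-Rail: True",
        "      peer ni:",
    ]
    # One "- nid:" entry per additional NID of the peer.
    lines.extend("        - nid: {}".format(nid) for nid in peer_nids)
    return "\n".join(lines) + "\n"


print(peer_yaml("192.168.122.30@tcp", ["192.168.122.31@tcp"]), end="")
```

      <para>The printed text matches the example file above and could be saved
      to a file for later use.</para>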
 199     </section>
 200     <section xml:id="dbdoclet.deleteremotepeers">
 201       <title><indexterm><primary>MR</primary>
 202         <secondary>deleteremotepeers</secondary>
 203         </indexterm>Deleting Remote Peers</title>
 204       <para>Example of deleting a single nid of a peer (192.168.122.31@tcp):
 205       </para>
 206       <screen>lnetctl peer del --prim_nid 192.168.122.30@tcp --nid 192.168.122.31@tcp</screen>
 207       <para>Example of deleting the entire peer:</para>
 208       <screen>lnetctl peer del --prim_nid 192.168.122.30@tcp</screen>
 209       <para>Example of deleting a peer via YAML:</para>
 210       <screen>Assuming the following peer configuration:
 211 peer:
 212     - primary nid: 192.168.122.30@tcp
 213       Multi-Rail: True
 214       peer ni:
 215         - nid: 192.168.122.30@tcp
 216           state: NA
 217         - nid: 192.168.122.31@tcp
 218           state: NA
 219         - nid: 192.168.122.32@tcp
 220           state: NA
 221
 222 You can delete 192.168.122.32@tcp as follows:
 223
 224 delPeer.yaml
 225 peer:
 226     - primary nid: 192.168.122.30@tcp
 227       Multi-Rail: True
 228       peer ni:
 229         - nid: 192.168.122.32@tcp
 230
 231 % lnetctl import --del &lt; delPeer.yaml</screen>
 232     </section>
 233   </section>
 234   <section xml:id="dbdoclet.mrrouting">
 235     <title><indexterm><primary>MR</primary>
 236       <secondary>mrrouting</secondary>
 237       </indexterm>Notes on routing with Multi-Rail</title>
 238     <para>This section details how to configure Multi-Rail with the routing
 239     feature before the <xref linkend="mrrouting.health" /> feature landed in
 240     Lustre 2.13. Routing code has always monitored the state of each route in
 241     order to avoid using unavailable routes.</para>
 242     <para>This section describes how you can configure multiple interfaces on
 243     the same gateway node but as different routes. This uses the existing route
 244     monitoring algorithm to guard against interfaces going down.  With the
 245     <xref linkend="mrrouting.health" /> feature introduced in Lustre 2.13, the
 246     new algorithm uses the <xref linkend="dbdoclet.mrhealth" /> feature to
 247     monitor the different interfaces of the gateway and always ensures that the
 248     healthiest interface is used. Therefore, the configuration described in this
 249     section applies to releases prior to Lustre 2.13.  It still works in
 250     Lustre 2.13, but is no longer required for the reason given above.
 251     </para>
 252     <section xml:id="dbdoclet.mrroutingex">
 253       <title><indexterm><primary>MR</primary>
 254         <secondary>mrrouting</secondary>
 255         <tertiary>routingex</tertiary>
 256         </indexterm>Multi-Rail Cluster Example</title>
 257       <para>The example below outlines a simple system where all the Lustre
 258       nodes are MR capable.  Each node in the cluster has two interfaces.</para>
 259       <figure xml:id="lnetmultirail.fig.routingdiagram">
 260         <title>Routing Configuration with Multi-Rail</title>
 261         <mediaobject>
 262           <imageobject>
 263             <imagedata scalefit="1" width="100%"
 264             fileref="./figures/MR_RoutingConfig.png" />
 265           </imageobject>
 266           <textobject>
 267             <phrase>Routing Configuration with Multi-Rail</phrase>
 268           </textobject>
 269         </mediaobject>
 270       </figure>
 271       <para>The routers can aggregate the interfaces on each side of the network
 272       by configuring them on the appropriate network.</para>
 273       <para>An example configuration:</para>
 274       <screen>Routers
 275 lnetctl net add --net o2ib0 --if ib0,ib1
 276 lnetctl net add --net o2ib1 --if ib2,ib3
 277 lnetctl peer add --nid &lt;peer1-nidA&gt;@o2ib,&lt;peer1-nidB&gt;@o2ib,...
 278 lnetctl peer add --nid &lt;peer2-nidA&gt;@o2ib1,&lt;peer2-nidB&gt;@o2ib1,...
 279 lnetctl set routing 1
 280
 281 Clients
 282 lnetctl net add --net o2ib0 --if ib0,ib1
 283 lnetctl route add --net o2ib1 --gateway &lt;rtrX-nidA&gt;@o2ib
 284 lnetctl peer add --nid &lt;rtrX-nidA&gt;@o2ib,&lt;rtrX-nidB&gt;@o2ib
 285
 286 Servers
 287 lnetctl net add --net o2ib1 --if ib0,ib1
 288 lnetctl route add --net o2ib0 --gateway &lt;rtrX-nidA&gt;@o2ib1
 289 lnetctl peer add --nid &lt;rtrX-nidA&gt;@o2ib1,&lt;rtrX-nidB&gt;@o2ib1</screen>
 290       <para>In the above configuration the clients and the servers are
 291       configured with only one route entry per router. This works because the
 292       routers are MR capable. Because the routers are added as peers with
 293       multiple interfaces on the clients and the servers, the MR algorithm
 294       ensures that both interfaces of a router are used when sending to it.
 295       </para>
 296       <para>However, as of the Lustre 2.10 release, LNet Resiliency is still
 297       under development and a single interface failure will still cause the
 298       entire router to go down.</para>
 299     </section>
 300     <section xml:id="dbdoclet.mrroutingresiliency">
 301       <title><indexterm><primary>MR</primary>
 302         <secondary>mrrouting</secondary>
 303         <tertiary>routingresiliency</tertiary>
 304         </indexterm>Utilizing Router Resiliency</title>
 305       <para>Currently, LNet provides a mechanism to monitor each route entry.
 306       LNet pings each gateway identified in the route entry at a regular,
 307       configurable interval to ensure that it is alive. If sending over a
 308       specific route fails or if the router pinger determines that the gateway
 309       is down, then the route is marked as down and is not used. It is
 310       subsequently pinged at regular, configurable intervals to determine when
 311       it becomes alive again.</para>
 312       <para>This mechanism can be combined with the MR feature in Lustre 2.10 to
 313       add this router resiliency feature to the configuration.</para>
 314       <screen>Routers
 315 lnetctl net add --net o2ib0 --if ib0,ib1
 316 lnetctl net add --net o2ib1 --if ib2,ib3
 317 lnetctl peer add --nid &lt;peer1-nidA&gt;@o2ib,&lt;peer1-nidB&gt;@o2ib,...
 318 lnetctl peer add --nid &lt;peer2-nidA&gt;@o2ib1,&lt;peer2-nidB&gt;@o2ib1,...
 319 lnetctl set routing 1
 320
 321 Clients
 322 lnetctl net add --net o2ib0 --if ib0,ib1
 323 lnetctl route add --net o2ib1 --gateway &lt;rtrX-nidA&gt;@o2ib
 324 lnetctl route add --net o2ib1 --gateway &lt;rtrX-nidB&gt;@o2ib
 325
 326 Servers
 327 lnetctl net add --net o2ib1 --if ib0,ib1
 328 lnetctl route add --net o2ib0 --gateway &lt;rtrX-nidA&gt;@o2ib1
 329 lnetctl route add --net o2ib0 --gateway &lt;rtrX-nidB&gt;@o2ib1</screen>
 330       <para>There are a few things to note in the above configuration:</para>
 331       <orderedlist>
 332         <listitem>
 333           <para>The clients and the servers are now configured with two
 334           routes; each route's gateway is one of the interfaces of the
 335           router.  The clients and servers will view each interface of the
 336           same router as a separate gateway and will monitor them as
 337           described above.</para>
 338         </listitem>
 339         <listitem>
 340           <para>The clients and the servers are not configured to view the
 341           routers as MR capable. This is important because we want to deal
 342           with each interface as a separate peer and not as different
 343           interfaces of the same peer.</para>
 344         </listitem>
 345         <listitem>
 346           <para>The routers are configured to view the peers as MR capable.
 347           This is an oddity in the configuration, but is currently required
 348           in order to allow the routers to balance the traffic load evenly
 349           across their interfaces.</para>
 350         </listitem>
 351       </orderedlist>
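      <para>The per-interface route layout above can be produced mechanically.
      The following Python sketch is illustrative only: the helper name
      <literal>per_interface_routes</literal> and the example NIDs are
      hypothetical. It only builds <literal>lnetctl route add</literal> command
      strings of the form shown above, one per gateway interface, and does not
      run them.</para>

```python
def per_interface_routes(remote_net, gateway_nids):
    """Return one 'lnetctl route add' command per gateway interface NID."""
    return [
        "lnetctl route add --net {} --gateway {}".format(remote_net, nid)
        for nid in gateway_nids
    ]


# Hypothetical NIDs for the two o2ib0-side interfaces of one router.
client_cmds = per_interface_routes("o2ib1", ["10.0.0.1@o2ib", "10.0.0.2@o2ib"])
for cmd in client_cmds:
    print(cmd)
```

      <para>Because each interface becomes its own route entry, the existing
      route monitoring treats each one as a separate gateway.</para>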
 352     </section>
 353     <section xml:id="dbdoclet.mrroutingmixed">
 354       <title><indexterm><primary>MR</primary>
 355         <secondary>mrrouting</secondary>
 356         <tertiary>routingmixed</tertiary>
 357       </indexterm>Mixed Multi-Rail/Non-Multi-Rail Cluster</title>
 358       <para>The above principles can be applied to a mixed MR/non-MR cluster.
 359       For example, the same configuration shown above can be applied if the
 360       clients and the servers are non-MR while the routers are MR capable.
 361       This appears to be a common cluster upgrade scenario.</para>
 362     </section>
 363   </section>
 364   <section xml:id="mrrouting.health" condition="l2D">
 365     <title><indexterm><primary>MR</primary>
 366       <secondary>mrroutinghealth</secondary>
 367       </indexterm>Multi-Rail Routing with LNet Health</title>
 368     <para>This section details how routing and pertinent module parameters can
 369     be configured beginning with Lustre 2.13.</para>
 370     <para>Multi-Rail with Dynamic Discovery allows LNet to discover and use all
 371     configured interfaces of a node. It references a node via its primary NID.
 372     Multi-Rail routing carries forward this concept to the routing
 373     infrastructure.  The following changes are brought in with the Lustre 2.13
 374     release:</para>
 375     <orderedlist>
 376       <listitem><para>Configuring a different route per gateway interface is no
 377       longer needed. One route per gateway should be configured. Gateway
 378       interfaces are used according to the Multi-Rail selection criteria.</para>
 379       </listitem>
 380       <listitem><para>Routing now relies on <xref linkend="dbdoclet.mrhealth" />
 381       to keep track of the route aliveness.</para></listitem>
 382       <listitem><para>Router interfaces are monitored via LNet Health.
 383       If an interface fails, other interfaces will be used.</para></listitem>
 384       <listitem><para>Routing uses LNet discovery to discover gateways at
 385       regular intervals.</para></listitem>
 386       <listitem><para>A gateway pushes its list of interfaces upon the discovery
 387       of any changes in its interfaces' state.</para></listitem>
 388     </orderedlist>
 389     <section xml:id="mrrouting.health_config">
 390       <title><indexterm><primary>MR</primary>
 391         <secondary>mrrouting</secondary>
 392         <tertiary>routinghealth_config</tertiary>
 393         </indexterm>Configuration</title>
 394       <section xml:id="mrrouting.health_config.routes">
 395       <title>Configuring Routes</title>
 396       <para>A gateway can have multiple interfaces on the same or different
 397       networks. The peers using the gateway can reach it on one or
 398       more of its interfaces. Multi-Rail routing takes care of managing which
 399       interface to use.</para>
 400       <screen>lnetctl route add --net &lt;remote network&gt; --gateway &lt;NID for the gateway&gt;
 401                   --hops &lt;number of hops&gt; --priority &lt;route priority&gt;</screen>
 402       </section>
 403       <section xml:id="mrrouting.health_config.modparams">
 404         <title>Configuring Module Parameters</title>
 405         <table frame="all" xml:id="mrrouting.health_config.tab1">
 406         <title>Configuring Module Parameters</title>
 407         <tgroup cols="2">
 408           <colspec colname="c1" colwidth="1*" />
 409           <colspec colname="c2" colwidth="2*" />
 410           <thead>
 411             <row>
 412               <entry>
 413                 <para>
 414                   <emphasis role="bold">Module Parameter</emphasis>
 415                 </para>
 416               </entry>
 417               <entry>
 418                 <para>
 419                   <emphasis role="bold">Usage</emphasis>
 420                 </para>
 421               </entry>
 422             </row>
 423           </thead>
 424           <tbody>
 425             <row>
 426               <entry>
 427                 <para><literal>check_routers_before_use</literal></para>
 428               </entry>
 429               <entry>
 430                 <para>Defaults to <literal>0</literal>. If set to
 431                 <literal>1</literal>, all routers must be up before the system
 432                 can proceed.</para>
 433               </entry>
 434             </row>
 435             <row>
 436               <entry>
 437                 <para><literal>avoid_asym_router_failure</literal></para>
 438               </entry>
 439               <entry>
 440                 <para>Defaults to <literal>1</literal>. If set to
 441                 <literal>1</literal>, a route is considered up if and only if
 442                 the gateway has at least one healthy interface on both the
 443                 local and the remote network of the route.</para>
 444               </entry>
 445             </row>
 446             <row>
 447               <entry>
 448                 <para><literal>alive_router_check_interval</literal></para>
 449               </entry>
 450               <entry>
 451                 <para>Defaults to <literal>60</literal> seconds. The gateways
 452                 will be discovered every
 453                 <literal>alive_router_check_interval</literal> seconds. If the gateway
 454                 can be reached on multiple networks, the interval per network is
 455                 <literal>alive_router_check_interval</literal> / number of
 456                 networks.</para>
 457               </entry>
 458             </row>
 459             <row>
 460               <entry>
 461                 <para><literal>router_ping_timeout</literal></para>
 462               </entry>
 463               <entry>
 464                 <para>Defaults to <literal>50</literal> seconds. A gateway sets
 465                 its interface down if it has not received any traffic for
 466                 <literal>router_ping_timeout + alive_router_check_interval</literal>
 467                 seconds.
 468                 </para>
 469               </entry>
 470             </row>
 471             <row>
 472               <entry>
 473                 <para><literal>router_sensitivity_percentage</literal></para>
 474               </entry>
 475               <entry>
 476                 <para>Defaults to <literal>100</literal>. This parameter defines
 477                 how sensitive a gateway interface is to failure. If set to 100
 478                 then any gateway interface failure will contribute to all routes
 479                 using it going down. The lower the value the more tolerant to
 480                 failures the system becomes.</para>
 481               </entry>
 482             </row>
 483           </tbody>
 484         </tgroup>
 485         </table>
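        <para>The two timing rules in the table can be worked through
        numerically. The following Python sketch is illustrative only: the
        function names are hypothetical, and it just applies the arithmetic
        stated in the table to the default values.</para>

```python
def per_network_check_interval(alive_router_check_interval, num_networks):
    """Discovery interval per network, in seconds, for one gateway."""
    if num_networks >= 1:
        return alive_router_check_interval / num_networks
    raise ValueError("a gateway must be reachable on at least one network")


def interface_down_timeout(router_ping_timeout, alive_router_check_interval):
    """Seconds without traffic before a gateway sets its interface down."""
    return router_ping_timeout + alive_router_check_interval


# Defaults from the table: 60 s check interval, 50 s ping timeout.
print(per_network_check_interval(60, 2))  # 30.0
print(interface_down_timeout(50, 60))     # 110
```

        <para>With the defaults, a gateway reachable on two networks is
        discovered on each network every 30 seconds, and an interface is set
        down after 110 seconds without traffic.</para>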
 486       </section>
 487     </section>
 488     <section xml:id="mrrouting.health_routerhealth">
 489       <title><indexterm><primary>MR</primary>
 490         <secondary>mrrouting</secondary>
 491         <tertiary>routinghealth_routerhealth</tertiary>
 492         </indexterm>Router Health</title>
 493       <para>The routing infrastructure now relies on LNet Health to keep track
 494       of interface health. Each gateway interface has a health value
 495       associated with it. If a send to one of these interfaces fails, then the
 496       interface's health value is decremented and the interface is placed on a
 497       recovery queue. The unhealthy interface is then pinged every
 498       <literal>lnet_recovery_interval</literal>. This value defaults to
 499       <literal>1</literal> second.</para>
 500       <para>If the peer receives a message from the gateway, then it immediately
 501       assumes that the gateway's interface is up and resets its health value to
 502       maximum. This is needed to ensure we start using the gateways immediately
 503       instead of holding off until the interface is back to full health.</para>
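      <para>The behavior described above can be modeled in a few lines. The
      following Python sketch is a simplified model, not LNet code: the class
      and function names are hypothetical, the decrement of 100 per failure is
      an assumption, and removing the interface from the recovery queue when a
      message arrives is likewise assumed. Only the maximum value of 1000 is
      taken from this chapter.</para>

```python
from collections import deque

LNET_MAX_HEALTH_VALUE = 1000  # initial and maximum health value
SENSITIVITY = 100             # assumed decrement per failed send


class GatewayInterface:
    def __init__(self, nid):
        self.nid = nid
        self.health = LNET_MAX_HEALTH_VALUE


recovery_queue = deque()  # interfaces waiting to be pinged


def on_send_failure(iface):
    """Decrement health (never below 0) and queue the interface for recovery."""
    iface.health = max(0, iface.health - SENSITIVITY)
    if iface not in recovery_queue:
        recovery_queue.append(iface)


def on_message_received(iface):
    """Any message from the gateway restores the interface to full health."""
    iface.health = LNET_MAX_HEALTH_VALUE
    if iface in recovery_queue:
        recovery_queue.remove(iface)


gw = GatewayInterface("10.0.0.1@o2ib")  # hypothetical NID
on_send_failure(gw)
print(gw.health, len(recovery_queue))  # 900 1
on_message_received(gw)
print(gw.health, len(recovery_queue))  # 1000 0
```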
 504     </section>
 505     <section xml:id="mrrouting.health_discovery">
 506       <title><indexterm><primary>MR</primary>
 507         <secondary>mrrouting</secondary>
 508         <tertiary>routinghealth_discovery</tertiary>
 509         </indexterm>Discovery</title>
 510       <para>LNet Discovery is used in place of pinging the peers. This serves
 511       two purposes:</para>
 512       <orderedlist>
 513         <listitem><para>The discovery communication infrastructure does not need
 514         to be duplicated for the routing feature.</para></listitem>
 515         <listitem><para>It allows propagation of the gateway's interface state
 516         changes to the peers using the gateway.</para></listitem>
 517       </orderedlist>
 518       <para>For (2), if an interface changes state from <literal>UP</literal> to
 519       <literal>DOWN</literal> or vice versa, then a discovery
 520       <literal>PUSH</literal> is sent to all the peers which can be reached.
 521       This allows peers to adapt to changes more quickly.</para>
 522       <para>Discovery is designed to be backwards compatible. The discovery
 523       protocol is composed of a <literal>GET</literal> and a
 524       <literal>PUT</literal>. The <literal>GET</literal> requests interface
 525       information from the peer; this is a basic LNet ping. The peer responds
 526       with its interface information and a feature bit. If the peer is
 527       multi-rail capable and discovery is turned on, then the node will
 528       <literal>PUSH</literal> its interface information. As a result both peers
 529       will be aware of each other's interfaces.</para>
 530       <para>This information is then used by the peers to decide, based on the
 531       interface state provided by the gateway, whether the route is alive or
 532       not.</para>
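      <para>The exchange described above can be sketched as a small model. This
      Python example is illustrative only and does not reflect the wire
      protocol: the <literal>Node</literal> class, its methods, and the NIDs are
      hypothetical, and the feature bit is modeled as a single flag that is set
      when the peer is multi-rail capable and has discovery enabled.</para>

```python
class Node:
    def __init__(self, name, nids, multi_rail=True, discovery=True):
        self.name = name
        self.nids = list(nids)
        self.multi_rail = multi_rail
        self.discovery = discovery
        self.known_peers = {}  # peer name -> list of NIDs learned so far

    def handle_get(self):
        """Reply to a GET (ping) with interface info and a feature flag."""
        return list(self.nids), self.multi_rail and self.discovery

    def handle_push(self, sender):
        """Record the interface list pushed by the sender."""
        self.known_peers[sender.name] = list(sender.nids)


def discover(node, peer):
    """GET the peer's interfaces, then PUSH ours if the peer supports it."""
    peer_nids, peer_supports_push = peer.handle_get()
    node.known_peers[peer.name] = peer_nids
    if peer_supports_push and node.discovery:
        peer.handle_push(node)


client = Node("client", ["10.0.0.10@o2ib", "10.0.0.11@o2ib"])
gateway = Node("gateway", ["10.0.0.1@o2ib", "10.0.0.2@o2ib"])
legacy = Node("legacy", ["10.0.0.20@o2ib"], multi_rail=False)

discover(client, gateway)
discover(client, legacy)
print(gateway.known_peers)  # the gateway learned the client's interfaces
print(legacy.known_peers)   # empty: no PUSH is sent to a non-MR peer
```

      <para>In this model the non-multi-rail peer still answers the
      <literal>GET</literal>, which is what keeps the exchange backwards
      compatible.</para>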
 533     </section>
 534     <section xml:id="mrrouting.health_aliveness">
 535       <title><indexterm><primary>MR</primary>
 536         <secondary>mrrouting</secondary>
 537         <tertiary>routinghealth_aliveness</tertiary>
 538         </indexterm>Route Aliveness Criteria</title>
 539       <para>A route is considered alive if the following conditions hold:</para>
 540       <orderedlist>
 541         <listitem><para>The gateway can be reached on the local net via at least
 542         one path.</para></listitem>
 543         <listitem><para>If <literal>avoid_asym_router_failure</literal> is
 544         enabled then the remote network defined in the route must have at least
 545         one healthy interface on the gateway.</para></listitem>
 546       </orderedlist>
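      <para>The two conditions above can be expressed as a single predicate.
      The following Python sketch is illustrative only: the function and
      argument names are hypothetical, and the inputs are plain counts rather
      than real LNet state.</para>

```python
def route_is_alive(local_paths_up, remote_net_healthy_ifaces,
                   avoid_asym_router_failure=True):
    """Apply the two aliveness conditions to one route.

    local_paths_up: number of usable paths to the gateway on the local net.
    remote_net_healthy_ifaces: healthy gateway interfaces on the route's
        remote network.
    """
    if local_paths_up == 0:
        return False
    if avoid_asym_router_failure and remote_net_healthy_ifaces == 0:
        return False
    return True


print(route_is_alive(2, 0))                                   # False
print(route_is_alive(2, 0, avoid_asym_router_failure=False))  # True
print(route_is_alive(0, 2))                                   # False
```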
 547     </section>
 548   </section>
 549   <section xml:id="dbdoclet.mrhealth" condition="l2C">
 550     <title><indexterm><primary>MR</primary><secondary>health</secondary>
 551     </indexterm>LNet Health</title>
 552     <para>LNet Multi-Rail has implemented the ability for multiple interfaces
 553     to be used on the same LNet network or across multiple LNet networks.  The
 554     LNet Health feature adds the ability to maintain a health value for each
 555     local and remote interface. This allows the Multi-Rail algorithm to
 556     consider the health of the interface before selecting it for sending.
 557     The feature also adds the ability to resend messages across different
 558     interfaces when interface or network failures are detected. This allows
 559     LNet to mitigate communication failures before passing the failures to
 560     upper layers for further error handling. To accomplish this, LNet Health
 561     monitors the status of the send and receive operations and uses this
 562     status to increment the interface's health value in case of success and
 563     decrement it in case of failure.</para>
 564     <section xml:id="dbdoclet.mrhealthvalue">
 565       <title><indexterm><primary>MR</primary>
 566         <secondary>mrhealth</secondary>
 567         <tertiary>value</tertiary>
 568       </indexterm>Health Value</title>
 569       <para>The initial health value of a local or remote interface is set to
 570       <literal>LNET_MAX_HEALTH_VALUE</literal>, currently set to be
 571       <literal>1000</literal>.  The value itself is arbitrary and is meant to
 572       allow for health granularity, as opposed to having a simple boolean state.
 573       The granularity allows the Multi-Rail algorithm to select the interface
 574       that has the highest likelihood of sending or receiving a message.</para>
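      <para>As a rough illustration of why a graded value is more useful than
      a boolean, the following Python sketch picks the interface with the
      highest health value. It is a simplification: the function name and the
      example health values are hypothetical, and the real selection algorithm
      may weigh other criteria in addition to health.</para>

```python
LNET_MAX_HEALTH_VALUE = 1000


def pick_healthiest(health_by_nid):
    """Return the NID with the highest health value.

    Ties go to the first NID in insertion order.
    """
    return max(health_by_nid, key=health_by_nid.get)


# Hypothetical local interfaces: the second has seen recent failures.
health = {
    "192.168.122.10@tcp": LNET_MAX_HEALTH_VALUE,
    "192.168.122.11@tcp": 700,
}
print(pick_healthiest(health))  # 192.168.122.10@tcp
```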
 575     </section>
 576     <section xml:id="dbdoclet.mrhealthfailuretypes">
 577       <title><indexterm><primary>MR</primary>
 578         <secondary>mrhealth</secondary>
 579         <tertiary>failuretypes</tertiary>
 580       </indexterm>Failure Types and Behavior</title>
 581       <para>LNet health behavior depends on the type of failure detected:</para>
 582       <informaltable frame="all">
 583         <tgroup cols="2">
 584         <colspec colname="c1" colwidth="50*"/>
 585         <colspec colname="c2" colwidth="50*"/>
 586         <thead>
 587           <row>
 588             <entry>
 589               <para><emphasis role="bold">Failure Type</emphasis></para>
 590             </entry>
 591             <entry>
 592               <para><emphasis role="bold">Behavior</emphasis></para>
 593             </entry>
 594           </row>
 595         </thead>
 596         <tbody>
 597           <row>
 598             <entry>
 599               <para><literal>localresend</literal></para>
 600             </entry>
 601             <entry>
 602               <para>A local failure has occurred, such as no route found or an
 603               address resolution error. These failures could be temporary,
 604               therefore LNet will attempt to resend the message. LNet will
 605               decrement the health value of the local interface and will
 606               select it less often if there are multiple available interfaces.
 607               </para>
 608             </entry>
 609           </row>
 610           <row>
 611             <entry>
 612               <para><literal>localno-resend</literal></para>
 613             </entry>
 614             <entry>
 615               <para>A local non-recoverable error occurred in the system, such
 616               as an out-of-memory error. In these cases LNet will not attempt to
 617               resend the message. LNet will decrement the health value of the
 618               local interface and will select it less often if there are
 619               multiple available interfaces.
 620               </para>
 621             </entry>
 622           </row>
 623           <row>
 624             <entry>
 625               <para><literal>remoteno-resend</literal></para>
 626             </entry>
 627             <entry>
 628               <para>If LNet successfully sends a message, but the message does
 629               not complete or an expected reply is not received, then it is
 630               classified as a remote error. LNet will not attempt to resend the
 631               message to avoid duplicate messages on the remote end. LNet will
 632               decrement the health value of the remote interface and will
 633               select it less often if there are multiple available interfaces.
 634               </para>
 635             </entry>
 636           </row>
 637           <row>
 638             <entry>
              <para><literal>remote resend</literal></para>
 640             </entry>
 641             <entry>
              <para>There is a set of failures for which LNet can be reasonably
              sure that the message was dropped before reaching the remote end.
              In this case, LNet will attempt to resend the message. LNet will
 645               decrement the health value of the remote interface and will
 646               select it less often if there are multiple available interfaces.
 647               </para>
 648             </entry>
 649           </row>
 650         </tbody></tgroup>
 651       </informaltable>
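      <para>The four failure classes in the table above can be summarized as a
      small lookup structure. The following Python fragment is an illustrative
      sketch only: the names <literal>FailureAction</literal>,
      <literal>FAILURE_CLASSES</literal> and <literal>describe</literal> are
      hypothetical and do not correspond to any LNet API. It records only what
      the table states, namely whether LNet resends the message and which
      interface has its health value decremented.</para>
      <programlisting language="python"><![CDATA[from typing import NamedTuple


class FailureAction(NamedTuple):
    """What the table says LNet does for one failure class (illustrative)."""
    resend: bool      # True if LNet attempts to resend the message
    penalized: str    # interface whose health value is decremented


# Hypothetical lookup table mirroring the four rows of the table.
FAILURE_CLASSES = {
    "local resend": FailureAction(resend=True, penalized="local"),
    "local no-resend": FailureAction(resend=False, penalized="local"),
    "remote no-resend": FailureAction(resend=False, penalized="remote"),
    "remote resend": FailureAction(resend=True, penalized="remote"),
}


def describe(failure_class):
    """Return a one-line summary of the handling for a failure class."""
    action = FAILURE_CLASSES[failure_class]
    verb = "resend" if action.resend else "do not resend"
    return f"{failure_class}: {verb}; decrement {action.penalized} health"


if __name__ == "__main__":
    for name in FAILURE_CLASSES:
        print(describe(name))
]]></programlisting>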
 652     </section>
 653     <section xml:id="dbdoclet.mrhealthinterface">
 654       <title><indexterm><primary>MR</primary>
 655         <secondary>mrhealth</secondary>
 656         <tertiary>interface</tertiary>
 657       </indexterm>User Interface</title>
 658       <para>LNet Health is turned off by default. There are multiple module
 659       parameters available to control the LNet Health feature.</para>
      <para>All the module parameters are implemented in sysfs and are located
      in <literal>/sys/module/lnet/parameters/</literal>. They can be set
      directly by echoing a value into them, or by using
      <literal>lnetctl</literal>.</para>
 663       <informaltable frame="all">
 664         <tgroup cols="2">
 665         <colspec colname="c1" colwidth="50*"/>
 666         <colspec colname="c2" colwidth="50*"/>
 667         <thead>
 668           <row>
 669             <entry>
 670               <para><emphasis role="bold">Parameter</emphasis></para>
 671             </entry>
 672             <entry>
 673               <para><emphasis role="bold">Description</emphasis></para>
 674             </entry>
 675           </row>
 676         </thead>
 677         <tbody>
 678           <row>
 679             <entry>
 680               <para><literal>lnet_health_sensitivity</literal></para>
 681             </entry>
 682             <entry>
 683               <para>When LNet detects a failure on a particular interface it
 684               will decrement its Health Value by
 685               <literal>lnet_health_sensitivity</literal>. The greater the value,
 686               the longer it takes for that interface to become healthy again.
 687               The default value of <literal>lnet_health_sensitivity</literal>
 688               is set to 0, which means the health value will not be decremented.
              In essence, the health feature is turned off.</para>
              <para>The sensitivity value can be set greater than 0. Each
              failure subtracts the sensitivity from the Health Value, which
              is 1000 for a fully healthy interface. A
              <literal>lnet_health_sensitivity</literal> of 100 would mean that
              10 consecutive message failures or a steady-state failure rate
              over 1% would degrade the interface Health Value until it is
              disabled, while a lower failure rate would steer traffic away from
              the interface but it would continue to be available.  When a
              failure occurs on an interface, its Health Value is
              decremented and the interface is flagged for recovery.</para>
 698               <screen>lnetctl set health_sensitivity: sensitivity to failure
 699       0 - turn off health evaluation
 700       &gt;0 - sensitivity value not more than 1000</screen>
 701             </entry>
 702           </row>
 703           <row>
 704             <entry>
 705               <para><literal>lnet_recovery_interval</literal></para>
 706             </entry>
 707             <entry>
 708               <para>When LNet detects a failure on a local or remote interface
 709               it will place that interface on a recovery queue. There is a
 710               recovery queue for local interfaces and another for remote
 711               interfaces. The interfaces on the recovery queues will be LNet
 712               PINGed every <literal>lnet_recovery_interval</literal>. This value
 713               defaults to <literal>1</literal> second. On every successful PING
 714               the health value of the interface pinged will be incremented by
 715               <literal>1</literal>.</para>
 716               <para>Having this value configurable allows system administrators
 717               to control the amount of control traffic on the network.</para>
 718               <screen>lnetctl set recovery_interval: interval to ping unhealthy interfaces
 719       &gt;0 - timeout in seconds</screen>
 720             </entry>
 721           </row>
 722           <row>
 723             <entry>
 724               <para><literal>lnet_transaction_timeout</literal></para>
 725             </entry>
 726             <entry>
 727               <para>This timeout is somewhat of an overloaded value. It carries
 728               the following functionality:</para>
 729               <itemizedlist>
 730                 <listitem>
                  <para>A message is abandoned if it has not been sent
                  successfully by the time the
                  <literal>lnet_transaction_timeout</literal> expires, even if
                  the <literal>lnet_retry_count</literal> has not been
                  reached.</para>
                </listitem>
                <listitem>
                  <para>A GET, or a PUT which expects an ACK, expires if a
                  REPLY or an ACK, respectively, is not received within the
                  <literal>lnet_transaction_timeout</literal>.</para>
 739                 </listitem>
 740               </itemizedlist>
 741               <para>This value defaults to 30 seconds.</para>
 742               <screen>lnetctl set transaction_timeout: Message/Response timeout
 743       &gt;0 - timeout in seconds</screen>
              <note><para>The LND timeout is a fraction of the
              <literal>lnet_transaction_timeout</literal>, as described under
              <literal>lnet_lnd_timeout</literal> below.</para>
              <para>This means that in networks where very large delays are
              expected, it will be necessary to increase this value
              accordingly.</para></note>
 750             </entry>
 751           </row>
 752           <row>
 753             <entry>
 754               <para><literal>lnet_retry_count</literal></para>
 755             </entry>
 756             <entry>
              <para>When LNet detects a failure that it deems appropriate for
              resending a message, it checks whether the message has already
              reached the maximum <literal>lnet_retry_count</literal>. If it
              has, and the message still was not sent successfully, a failure
              event is passed up to the layer that initiated the send.</para>
 762               <para>Since the message retry interval
 763               (<literal>lnet_lnd_timeout</literal>) is computed from
 764               <literal>lnet_transaction_timeout / lnet_retry_count</literal>,
 765               the <literal>lnet_retry_count</literal> should be kept low enough
 766               that the retry interval is not shorter than the round-trip message
 767               delay in the network.  A <literal>lnet_retry_count</literal> of 5
 768               is reasonable for the default
 769               <literal>lnet_transaction_timeout</literal> of 50 seconds.</para>
 770               <screen>lnetctl set retry_count: number of retries
 771       0 - turn off retries
 772       &gt;0 - number of retries, cannot be more than <literal>lnet_transaction_timeout</literal></screen>
 773             </entry>
 774           </row>
 775           <row>
 776             <entry>
 777               <para><literal>lnet_lnd_timeout</literal></para>
 778             </entry>
 779             <entry>
              <para>This is not a configurable parameter. It is derived from
              two configurable parameters:
              <literal>lnet_transaction_timeout</literal> and
              <literal>lnet_retry_count</literal>.</para>
              <screen>lnet_lnd_timeout = lnet_transaction_timeout / lnet_retry_count</screen>
              <para>As such, there is a restriction that
              <literal>lnet_transaction_timeout &gt;= lnet_retry_count</literal>.
              </para>
 789               <para>The core assumption here is that in a healthy network,
 790               sending and receiving LNet messages should not have large delays.
 791               There could be large delays with RPC messages and their responses,
 792               but that's handled at the PtlRPC layer.</para>
 793             </entry>
 794           </row>
 795         </tbody>
 796         </tgroup>
 797       </informaltable>
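      <para>The relationship between
      <literal>lnet_transaction_timeout</literal>,
      <literal>lnet_retry_count</literal> and the derived
      <literal>lnet_lnd_timeout</literal> can be checked with a few lines of
      Python before changing the settings on a live system. This is a minimal
      sketch under stated assumptions, not the LNet implementation: the
      function name <literal>lnd_timeout</literal> is hypothetical, plain
      division is assumed (the kernel may round differently), and the case of
      <literal>lnet_retry_count</literal> set to 0 is left out because the
      table gives no formula for it.</para>
      <programlisting language="python"><![CDATA[def lnd_timeout(transaction_timeout, retry_count):
    """Retry interval in seconds: lnet_transaction_timeout / lnet_retry_count.

    Only retry_count values of 1 or more are handled; the table gives no
    formula for retry_count = 0 (retries turned off).
    """
    if transaction_timeout <= 0:
        raise ValueError("transaction_timeout must be greater than 0")
    if retry_count < 1:
        raise ValueError("this sketch only covers retry_count >= 1")
    if retry_count > transaction_timeout:
        # Restriction from the table: lnet_transaction_timeout >= retry_count.
        raise ValueError("retry_count cannot exceed transaction_timeout")
    # Plain division is an assumption; the kernel may round differently.
    return transaction_timeout / retry_count


if __name__ == "__main__":
    print(lnd_timeout(50, 5))
    print(lnd_timeout(10, 3))
]]></programlisting>
      <para>For example, a transaction timeout of 50 seconds with a retry count
      of 5 gives a retry interval of 10 seconds. That interval should not be
      shorter than the round-trip message delay in the network.</para>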
 798     </section>
 799     <section xml:id="dbdoclet.mrhealthdisplay">
 800       <title><indexterm><primary>MR</primary>
 801         <secondary>mrhealth</secondary>
 802         <tertiary>display</tertiary>
 803       </indexterm>Displaying Information</title>
 804       <section xml:id="dbdoclet.mrhealthdisplayhealth">
 805         <title>Showing LNet Health Configuration Settings</title>
 806         <para><literal>lnetctl</literal> can be used to show all the LNet health
 807         configuration settings using the <literal>lnetctl global show</literal>
 808         command.</para>
 809         <screen>#&gt; lnetctl global show
 810       global:
 811       numa_range: 0
 812       max_intf: 200
 813       discovery: 1
 814       retry_count: 3
 815       transaction_timeout: 10
 816       health_sensitivity: 100
 817       recovery_interval: 1</screen>
 818       </section>
 819       <section xml:id="dbdoclet.mrhealthdisplaystats">
 820         <title>Showing LNet Health Statistics</title>
        <para>LNet Health statistics are shown at higher verbosity
        settings.  To show the local interface health statistics:</para>
 823         <screen>lnetctl net show -v 3</screen>
 824         <para>To show the remote interface health statistics:</para>
 825         <screen>lnetctl peer show -v 3</screen>
 826         <para>Sample output:</para>
 827         <screen>#&gt; lnetctl net show -v 3
 828       net:
 829       - net type: tcp
 830         local NI(s):
 831            - nid: 192.168.122.108@tcp
 832              status: up
 833              interfaces:
 834                  0: eth2
 835              statistics:
 836                  send_count: 304
 837                  recv_count: 284
 838                  drop_count: 0
 839              sent_stats:
 840                  put: 176
 841                  get: 138
 842                  reply: 0
 843                  ack: 0
 844                  hello: 0
 845              received_stats:
 846                  put: 145
 847                  get: 137
 848                  reply: 0
 849                  ack: 2
 850                  hello: 0
 851              dropped_stats:
 852                  put: 10
 853                  get: 0
 854                  reply: 0
 855                  ack: 0
 856                  hello: 0
 857              health stats:
 858                  health value: 1000
 859                  interrupts: 0
 860                  dropped: 10
 861                  aborted: 0
 862                  no route: 0
 863                  timeouts: 0
 864                  error: 0
 865              tunables:
 866                  peer_timeout: 180
 867                  peer_credits: 8
 868                  peer_buffer_credits: 0
 869                  credits: 256
 870              dev cpt: -1
 871              tcp bonding: 0
             CPT: &quot;[0]&quot;</screen>
 874         <para>There is a new YAML block, <literal>health stats</literal>, which
 875         displays the health statistics for each local or remote network
 876         interface.</para>
        <para>The global statistics output also includes the global health
        statistics, as shown below:</para>
 879         <screen>#&gt; lnetctl stats show
 880         statistics:
 881             msgs_alloc: 0
 882             msgs_max: 33
 883             rst_alloc: 0
 884             errors: 0
 885             send_count: 901
 886             resend_count: 4
 887             response_timeout_count: 0
 888             local_interrupt_count: 0
 889             local_dropped_count: 10
 890             local_aborted_count: 0
 891             local_no_route_count: 0
 892             local_timeout_count: 0
 893             local_error_count: 0
 894             remote_dropped_count: 0
 895             remote_error_count: 0
 896             remote_timeout_count: 0
 897             network_timeout_count: 0
 898             recv_count: 851
 899             route_count: 0
 900             drop_count: 10
 901             send_length: 425791628
 902             recv_length: 69852
 903             route_length: 0
 904             drop_length: 0</screen>
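        <para>When monitoring many nodes it can be convenient to pull the
        health-related counters out of this output with a script. The following
        Python sketch uses only the standard library and reads a shortened copy
        of the sample output above. The function names and the choice of
        prefixes (<literal>local_</literal>, <literal>remote_</literal>,
        <literal>network_</literal>) are assumptions made for illustration. The
        real output is YAML, so a YAML parser would be a more robust choice.
        </para>
        <programlisting language="python"><![CDATA[# Shortened copy of the "lnetctl stats show" sample output.
SAMPLE = """\
statistics:
    send_count: 901
    resend_count: 4
    response_timeout_count: 0
    local_dropped_count: 10
    remote_timeout_count: 0
    network_timeout_count: 0
    drop_count: 10
"""

# Assumption: counters with these prefixes are the health-related ones.
HEALTH_PREFIXES = ("local_", "remote_", "network_")


def parse_stats(text):
    """Collect 'key: integer' lines into a dict, skipping everything else."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[key] = int(value)
    return stats


def nonzero_health_counters(stats):
    """Return only the health-related counters that are not zero."""
    return {
        key: value
        for key, value in stats.items()
        if key.startswith(HEALTH_PREFIXES) and value != 0
    }


if __name__ == "__main__":
    print(nonzero_health_counters(parse_stats(SAMPLE)))
]]></programlisting>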
 905       </section>
 906     </section>
 907     <section xml:id="dbdoclet.mrhealthinitialsetup">
 908       <title><indexterm><primary>MR</primary>
 909         <secondary>mrhealth</secondary>
 910         <tertiary>initialsetup</tertiary>
 911       </indexterm>Initial Settings Recommendations</title>
 912       <para>LNet Health is off by default. This means that
 913       <literal>lnet_health_sensitivity</literal> and
 914       <literal>lnet_retry_count</literal> are set to <literal>0</literal>.
 915       </para>
 916       <para>Setting <literal>lnet_health_sensitivity</literal> to
 917       <literal>0</literal> will not decrement the health of the interface on
 918       failure and will not change the interface selection behavior. Furthermore,
 919       the failed interfaces will not be placed on the recovery queues. In
      essence, this turns off the LNet Health feature.</para>
 921       <para>The LNet Health settings will need to be tuned for each cluster.
 922       However, the base configuration would be as follows:</para>
 923       <screen>#&gt; lnetctl global show
 924     global:
 925         numa_range: 0
 926         max_intf: 200
 927         discovery: 1
 928         retry_count: 3
 929         transaction_timeout: 10
 930         health_sensitivity: 100
 931         recovery_interval: 1</screen>
      <para>This setting will allow a maximum of three retries for failed
      messages within the 10 second transaction timeout.</para>
      <para>If there is a failure on the interface, the health value will be
      decremented by 100 and the interface will be LNet PINGed every 1 second.
 936       </para>
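      <para>The effect of these settings on an interface's health value can be
      estimated with a short calculation. The Python sketch below is a
      simplified model, not the LNet implementation. It assumes that each
      failure subtracts <literal>lnet_health_sensitivity</literal> from the
      health value, that each successful recovery PING adds 1, that every
      recovery PING succeeds, and that the maximum health value is 1000 (the
      value shown in the sample output earlier in this chapter). The function
      names are hypothetical.</para>
      <programlisting language="python"><![CDATA[# Assumed maximum, taken from "health value: 1000" in the sample output.
MAX_HEALTH = 1000


def apply_failures(health, failures, sensitivity):
    """Subtract sensitivity once per failure, never going below 0."""
    for _ in range(failures):
        health = max(0, health - sensitivity)
    return health


def seconds_to_recover(health, recovery_interval=1, max_health=MAX_HEALTH):
    """Time to return to max_health if every recovery PING succeeds.

    Each successful PING adds 1 to the health value, and PINGs are sent
    every recovery_interval seconds.
    """
    return (max_health - health) * recovery_interval


if __name__ == "__main__":
    after_one = apply_failures(MAX_HEALTH, failures=1, sensitivity=100)
    print(after_one, seconds_to_recover(after_one))
    after_ten = apply_failures(MAX_HEALTH, failures=10, sensitivity=100)
    print(after_ten, seconds_to_recover(after_ten))
]]></programlisting>
      <para>Under these assumptions, a single failure with a sensitivity of 100
      lowers the health value from 1000 to 900, and 100 successful PINGs at
      1 second intervals restore it. Ten consecutive failures bring the value
      to 0.</para>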
 937     </section>
 938   </section>
 939 </chapter>