LNetMultiRail.xml

   1 <?xml version='1.0' encoding='UTF-8'?>
   2 <chapter xmlns="http://docbook.org/ns/docbook"
   3  xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
   4  xml:id="lnetmr" condition='l2A'>
   5   <title xml:id="lnetmr.title">LNet Software Multi-Rail</title>
   6   <para>This chapter describes LNet Software Multi-Rail configuration and
   7   administration.</para>
   8   <itemizedlist>
   9     <listitem>
  10       <para><xref linkend="mroverview"/></para>
  11       <para><xref linkend="mrconfiguring"/></para>
  12       <para><xref linkend="mrrouting"/></para>
  13       <para><xref linkend="mrrouting.health"/></para>
  14       <para><xref linkend="mrhealth"/></para>
  15     </listitem>
  16   </itemizedlist>
  17   <section xml:id="mroverview">
  18     <title><indexterm><primary>MR</primary><secondary>overview</secondary>
  19     </indexterm>Multi-Rail Overview</title>
  20     <para>In computer networking, multi-rail is an arrangement in which two or
  21     more network interfaces to a single network on a computer node are employed,
  22     to achieve increased throughput.  Multi-rail can also be where a node has
  23     one or more interfaces to multiple, even different kinds of networks, such
  24     as Ethernet, Infiniband, and Intel® Omni-Path. For Lustre clients,
  25     multi-rail generally presents the combined network capabilities as a single
  26     LNet network.  Peer nodes that are multi-rail capable are established during
  27     configuration, as are user-defined interface-section policies.</para>
  28     <para>The following link contains a detailed high-level design for the
  29     feature:
  30     <link xl:href="https://wiki.lustre.org/images/b/bb/Multi-Rail_High-Level_Design_20150119.pdf">
  31     Multi-Rail High-Level Design</link></para>
  32   </section>
  33   <section xml:id="mrconfiguring">
  34     <title><indexterm><primary>MR</primary><secondary>configuring</secondary>
  35     </indexterm>Configuring Multi-Rail</title>
  36     <para>Every node using multi-rail networking needs to be properly
  37     configured.  Multi-rail uses <literal>lnetctl</literal> and the LNet
  38     Configuration Library for configuration.  Configuring multi-rail for a
  39     given node involves two tasks:</para>
  40     <orderedlist>
  41       <listitem><para>Configuring multiple network interfaces present on the
  42       local node.</para></listitem>
  43       <listitem><para>Adding remote peers that are multi-rail capable (are
  44       connected to one or more common networks with at least two interfaces).
  45       </para></listitem>
  46     </orderedlist>
  47     <para>This section is a supplement to
  48       <xref linkend="lnet_config.lnetaddshowdelete" /> and contains further
  49       examples for Multi-Rail configurations.</para>
  50     <para>For information on the dynamic peer discovery feature added in
  51       Lustre Release 2.11.0, see
  52       <xref linkend="lnet_config.dynamic_discovery" />.</para>
  53     <section xml:id="addinterfaces">
  54       <title><indexterm><primary>MR</primary>
  55       <secondary>multipleinterfaces</secondary>
  56       </indexterm>Configure Multiple Interfaces on the Local Node</title>
  57       <para>Example <literal>lnetctl add</literal> command with multiple
  58       interfaces in a Multi-Rail configuration:</para>
  59       <screen>lnetctl net add --net tcp --if eth0,eth1</screen>
  60       <para>Example of YAML net show:</para>
  61       <screen>lnetctl net show -v
  62 net:
  63     - net type: lo
  64       local NI(s):
  65         - nid: 0@lo
  66           status: up
  67           statistics:
  68               send_count: 0
  69               recv_count: 0
  70               drop_count: 0
  71           tunables:
  72               peer_timeout: 0
  73               peer_credits: 0
  74               peer_buffer_credits: 0
  75               credits: 0
  76           lnd tunables:
  77           tcp bonding: 0
  78           dev cpt: 0
  79           CPT: "[0]"
  80     - net type: tcp
  81       local NI(s):
  82         - nid: 192.168.122.10@tcp
  83           status: up
  84           interfaces:
  85               0: eth0
  86           statistics:
  87               send_count: 0
  88               recv_count: 0
  89               drop_count: 0
  90           tunables:
  91               peer_timeout: 180
  92               peer_credits: 8
  93               peer_buffer_credits: 0
  94               credits: 256
  95           lnd tunables:
  96           tcp bonding: 0
  97           dev cpt: -1
  98           CPT: "[0]"
  99         - nid: 192.168.122.11@tcp
 100           status: up
 101           interfaces:
 102               0: eth1
 103           statistics:
 104               send_count: 0
 105               recv_count: 0
 106               drop_count: 0
 107           tunables:
 108               peer_timeout: 180
 109               peer_credits: 8
 110               peer_buffer_credits: 0
 111               credits: 256
 112           lnd tunables:
 113           tcp bonding: 0
 114           dev cpt: -1
 115           CPT: "[0]"</screen>
 116     </section>
 117     <section xml:id="deleteinterfaces">
 118       <title><indexterm><primary>MR</primary>
 119         <secondary>deleteinterfaces</secondary>
 120         </indexterm>Deleting Network Interfaces</title>
 121       <para>Example delete with <literal>lnetctl net del</literal>:</para>
 122       <para>Assuming the network configuration is as shown above with the
 123       <literal>lnetctl net show -v</literal> in the previous section, we can
 124       delete a net with following command:</para>
 125       <screen>lnetctl net del --net tcp --if eth0</screen>
 126       <para>The resultant net information would look like:</para>
 127       <screen>lnetctl net show -v
 128 net:
 129     - net type: lo
 130       local NI(s):
 131         - nid: 0@lo
 132           status: up
 133           statistics:
 134               send_count: 0
 135               recv_count: 0
 136               drop_count: 0
 137           tunables:
 138               peer_timeout: 0
 139               peer_credits: 0
 140               peer_buffer_credits: 0
 141               credits: 0
 142           lnd tunables:
 143           tcp bonding: 0
 144           dev cpt: 0
 145           CPT: "[0,1,2,3]"</screen>
 146       <para>The syntax of a YAML file to perform a delete would be:</para>
 147       <screen>- net type: tcp
 148    local NI(s):
 149      - nid: 192.168.122.10@tcp
 150        interfaces:
 151            0: eth0</screen>
 152     </section>
 153     <section xml:id="addremotepeers">
 154       <title><indexterm><primary>MR</primary>
 155         <secondary>addremotepeers</secondary>
 156         </indexterm>Adding Remote Peers that are Multi-Rail Capable</title>
 157       <para>The following example <literal>lnetctl peer add</literal>
 158       command adds a peer with 2 nids, with
 159         <literal>192.168.122.30@tcp</literal> being the primary nid:</para>
 160       <screen>lnetctl peer add --prim_nid 192.168.122.30@tcp --nid 192.168.122.30@tcp,192.168.122.31@tcp
 161       </screen>
 162       <para>The resulting <literal>lnetctl peer show</literal> would be:
 163         <screen>lnetctl peer show -v
 164 peer:
 165     - primary nid: 192.168.122.30@tcp
 166       Multi-Rail: True
 167       peer ni:
 168         - nid: 192.168.122.30@tcp
 169           state: NA
 170           max_ni_tx_credits: 8
 171           available_tx_credits: 8
 172           min_tx_credits: 7
 173           tx_q_num_of_buf: 0
 174           available_rtr_credits: 8
 175           min_rtr_credits: 8
 176           refcount: 1
 177           statistics:
 178               send_count: 2
 179               recv_count: 2
 180               drop_count: 0
 181         - nid: 192.168.122.31@tcp
 182           state: NA
 183           max_ni_tx_credits: 8
 184           available_tx_credits: 8
 185           min_tx_credits: 7
 186           tx_q_num_of_buf: 0
 187           available_rtr_credits: 8
 188           min_rtr_credits: 8
 189           refcount: 1
 190           statistics:
 191               send_count: 1
 192               recv_count: 1
 193               drop_count: 0</screen>
 194       </para>
 195       <para>The following is an example YAML file for adding a peer:</para>
 196       <screen>addPeer.yaml
 197 peer:
 198     - primary nid: 192.168.122.30@tcp
 199       Multi-Rail: True
 200       peer ni:
 201         - nid: 192.168.122.31@tcp</screen>
 202     </section>
 203     <section xml:id="deleteremotepeers">
 204       <title><indexterm><primary>MR</primary>
 205         <secondary>deleteremotepeers</secondary>
 206         </indexterm>Deleting Remote Peers</title>
 207       <para>Example of deleting a single nid of a peer (192.168.122.31@tcp):
 208       </para>
 209       <screen>lnetctl peer del --prim_nid 192.168.122.30@tcp --nid 192.168.122.31@tcp</screen>
 210       <para>Example of deleting the entire peer:</para>
 211       <screen>lnetctl peer del --prim_nid 192.168.122.30@tcp</screen>
 212       <para>Example of deleting a peer via YAML:</para>
 213       <screen>Assuming the following peer configuration:
 214 peer:
 215     - primary nid: 192.168.122.30@tcp
 216       Multi-Rail: True
 217       peer ni:
 218         - nid: 192.168.122.30@tcp
 219           state: NA
 220         - nid: 192.168.122.31@tcp
 221           state: NA
 222         - nid: 192.168.122.32@tcp
 223           state: NA
 224
 225 You can delete 192.168.122.32@tcp as follows:
 226
 227 delPeer.yaml
 228 peer:
 229     - primary nid: 192.168.122.30@tcp
 230       Multi-Rail: True
 231       peer ni:
 232         - nid: 192.168.122.32@tcp
 233
 234 % lnetctl import --del &lt; delPeer.yaml</screen>
 235     </section>
 236   </section>
 237   <section xml:id="mrrouting">
 238     <title><indexterm><primary>MR</primary>
 239       <secondary>mrrouting</secondary>
 240       </indexterm>Notes on routing with Multi-Rail</title>
 241     <para>This section details how to configure Multi-Rail with the routing
 242     feature before the <xref linkend="mrrouting.health" /> feature landed in
 243     Lustre 2.13. Routing code has always monitored the state of the route, in
 244     order to avoid using unavailable ones.</para>
 245     <para>This section describes how you can configure multiple interfaces on
 246     the same gateway node but as different routes. This uses the existing route
 247     monitoring algorithm to guard against interfaces going down.  With the
 248     <xref linkend="mrrouting.health" /> feature introduced in Lustre 2.13, the
 249     new algorithm uses the <xref linkend="mrhealth" /> feature to
 250     monitor the different interfaces of the gateway and always ensures that the
 251     healthiest interface is used. Therefore, the configuration described in this
 252     section applies to releases prior to Lustre 2.13.  It will still work in
 253     2.13 as well, however it is not required due to the reason mentioned above.
 254     </para>
 255     <section xml:id="mrroutingex">
 256       <title><indexterm><primary>MR</primary>
 257         <secondary>mrrouting</secondary>
 258         <tertiary>routingex</tertiary>
 259         </indexterm>Multi-Rail Cluster Example</title>
 260       <para>The below example outlines a simple system where all the Lustre
 261       nodes are MR capable.  Each node in the cluster has two interfaces.</para>
 262       <figure xml:id="lnetmultirail.fig.routingdiagram">
 263         <title>Routing Configuration with Multi-Rail</title>
 264         <mediaobject>
 265           <imageobject>
 266             <imagedata scalefit="1" width="100%"
 267             fileref="./figures/MR_RoutingConfig.png" />
 268           </imageobject>
 269           <textobject>
 270             <phrase>Routing Configuration with Multi-Rail</phrase>
 271           </textobject>
 272         </mediaobject>
 273       </figure>
 274       <para>The routers can aggregate the interfaces on each side of the network
 275       by configuring them on the appropriate network.</para>
 276       <para>An example configuration:</para>
 277       <screen>Routers
 278 lnetctl net add --net o2ib0 --if ib0,ib1
 279 lnetctl net add --net o2ib1 --if ib2,ib3
 280 lnetctl peer add --nid &lt;peer1-nidA&gt;@o2ib,&lt;peer1-nidB&gt;@o2ib,...
 281 lnetctl peer add --nid &lt;peer2-nidA&gt;@o2ib1,&lt;peer2-nidB>&gt;@o2ib1,...
 282 lnetctl set routing 1
 283
 284 Clients
 285 lnetctl net add --net o2ib0 --if ib0,ib1
 286 lnetctl route add --net o2ib1 --gateway &lt;rtrX-nidA&gt;@o2ib
 287 lnetctl peer add --nid &lt;rtrX-nidA&gt;@o2ib,&lt;rtrX-nidB&gt;@o2ib
 288
 289 Servers
 290 lnetctl net add --net o2ib1 --if ib0,ib1
 291 lnetctl route add --net o2ib0 --gateway &lt;rtrX-nidA&gt;@o2ib1
 292 lnetctl peer add --nid &lt;rtrX-nidA&gt;@o2ib1,&lt;rtrX-nidB&gt;@o2ib1</screen>
 293       <para>In the above configuration the clients and the servers are
 294       configured with only one route entry per router. This works because the
 295       routers are MR capable. By adding the routers as peers with multiple
 296       interfaces to the clients and the servers, when sending to the router the
 297       MR algorithm will ensure that bot interfaces of the routers are used.
 298       </para>
 299       <para>However, as of the Lustre 2.10 release LNet Resiliency is still
 300       under development and single interface failure will still cause the entire
 301       router to go down.</para>
 302     </section>
 303     <section xml:id="mrroutingresiliency">
 304       <title><indexterm><primary>MR</primary>
 305         <secondary>mrrouting</secondary>
 306         <tertiary>routingresiliency</tertiary>
 307         </indexterm>Utilizing Router Resiliency</title>
 308       <para>Currently, LNet provides a mechanism to monitor each route entry.
 309       LNet pings each gateway identified in the route entry on regular,
 310       configurable interval to ensure that it is alive. If sending over a
 311       specific route fails or if the router pinger determines that the gateway
 312       is down, then the route is marked as down and is not used. It is
 313       subsequently pinged on regular, configurable intervals to determine when
 314       it becomes alive again.</para>
 315       <para>This mechanism can be combined with the MR feature in Lustre 2.10 to
 316       add this router resiliency feature to the configuration.</para>
 317       <screen>Routers
 318 lnetctl net add --net o2ib0 --if ib0,ib1
 319 lnetctl net add --net o2ib1 --if ib2,ib3
 320 lnetctl peer add --nid &lt;peer1-nidA&gt;@o2ib,&lt;peer1-nidB&gt;@o2ib,...
 321 lnetctl peer add --nid &lt;peer2-nidA&gt;@o2ib1,&lt;peer2-nidB&gt;@o2ib1,...
 322 lnetctl set routing 1
 323
 324 Clients
 325 lnetctl net add --net o2ib0 --if ib0,ib1
 326 lnetctl route add --net o2ib1 --gateway &lt;rtrX-nidA&gt;@o2ib
 327 lnetctl route add --net o2ib1 --gateway &lt;rtrX-nidB&gt;@o2ib
 328
 329 Servers
 330 lnetctl net add --net o2ib1 --if ib0,ib1
 331 lnetctl route add --net o2ib0 --gateway &lt;rtrX-nidA&gt;@o2ib1
 332 lnetctl route add --net o2ib0 --gateway &lt;rtrX-nidB&gt;@o2ib1</screen>
 333       <para>There are a few things to note in the above configuration:</para>
 334       <orderedlist>
 335         <listitem>
 336           <para>The clients and the servers are now configured with two
 337           routes, each route's gateway is one of the interfaces of the
 338           route.  The clients and servers will view each interface of the
 339           same router as a separate gateway and will monitor them as
 340           described above.</para>
 341         </listitem>
 342         <listitem>
 343           <para>The clients and the servers are not configured to view the
 344           routers as MR capable. This is important because we want to deal
 345           with each interface as a separate peers and not different
 346           interfaces of the same peer.</para>
 347         </listitem>
 348         <listitem>
 349           <para>The routers are configured to view the peers as MR capable.
 350           This is an oddity in the configuration, but is currently required
 351           in order to allow the routers to load balance the traffic load
 352           across its interfaces evenly.</para>
 353         </listitem>
 354       </orderedlist>
 355     </section>
 356     <section xml:id="mrroutingmixed">
 357       <title><indexterm><primary>MR</primary>
 358         <secondary>mrrouting</secondary>
 359         <tertiary>routingmixed</tertiary>
 360       </indexterm>Mixed Multi-Rail/Non-Multi-Rail Cluster</title>
 361       <para>The above principles can be applied to mixed MR/Non-MR cluster.
 362       For example, the same configuration shown above can be applied if the
 363       clients and the servers are non-MR while the routers are MR capable.
 364       This appears to be a common cluster upgrade scenario.</para>
 365     </section>
 366   </section>
 367   <section xml:id="mrrouting.health" condition="l2D">
 368     <title><indexterm><primary>MR</primary>
 369       <secondary>mrroutinghealth</secondary>
 370       </indexterm>Multi-Rail Routing with LNet Health</title>
 371     <para>This section details how routing and pertinent module parameters can
 372     be configured beginning with Lustre 2.13.</para>
 373     <para>Multi-Rail with Dynamic Discovery allows LNet to discover and use all
 374     configured interfaces of a node. It references a node via it's primary NID.
 375     Multi-Rail routing carries forward this concept to the routing
 376     infrastructure.  The following changes are brought in with the Lustre 2.13
 377     release:</para>
 378     <orderedlist>
 379       <listitem><para>Configuring a different route per gateway interface is no
 380       longer needed. One route per gateway should be configured. Gateway
 381       interfaces are used according to the Multi-Rail selection criteria.</para>
 382       </listitem>
 383       <listitem><para>Routing now relies on <xref linkend="mrhealth" />
 384       to keep track of the route aliveness.</para></listitem>
 385       <listitem><para>Router interfaces are monitored via LNet Health.
 386       If an interface fails other interfaces will be used.</para></listitem>
 387       <listitem><para>Routing uses LNet discovery to discover gateways on
 388       regular intervals.</para></listitem>
 389       <listitem><para>A gateway pushes its list of interfaces upon the discovery
 390       of any changes in its interfaces' state.</para></listitem>
 391     </orderedlist>
 392     <section xml:id="mrrouting.health_config">
 393       <title><indexterm><primary>MR</primary>
 394         <secondary>mrrouting</secondary>
 395         <tertiary>routinghealth_config</tertiary>
 396         </indexterm>Configuration</title>
 397       <section xml:id="mrrouting.health_config.routes">
 398       <title>Configuring Routes</title>
 399       <para>A gateway can have multiple interfaces on the same or different
 400       networks. The peers using the gateway can reach it on one or
 401       more of its interfaces. Multi-Rail routing takes care of managing which
 402       interface to use.</para>
 403       <screen>lnetctl route add --net &lt;remote network&gt;
 404       --gateway &lt;NID for the gateway&gt;
 405       --hop &lt;number of hops&gt; --priority &lt;route priority&gt;
 406       </screen>
 407       </section>
 408       <section xml:id="mrrouting.health_config.modparams">
 409         <title>Configuring Module Parameters</title>
 410         <table frame="all" xml:id="mrrouting.health_config.tab1">
 411         <title>Configuring Module Parameters</title>
 412         <tgroup cols="2">
 413           <colspec colname="c1" colwidth="1*" />
 414           <colspec colname="c2" colwidth="2*" />
 415           <thead>
 416             <row>
 417               <entry>
 418                 <para>
 419                   <emphasis role="bold">Module Parameter</emphasis>
 420                 </para>
 421               </entry>
 422               <entry>
 423                 <para>
 424                   <emphasis role="bold">Usage</emphasis>
 425                 </para>
 426               </entry>
 427             </row>
 428           </thead>
 429           <tbody>
 430             <row>
 431               <entry>
 432                 <para><literal>check_routers_before_use</literal></para>
 433               </entry>
 434               <entry>
 435                 <para>Defaults to <literal>0</literal>. If set to
 436                 <literal>1</literal> all routers must be up before the system
 437                 can proceed.</para>
 438               </entry>
 439             </row>
 440             <row>
 441               <entry>
 442                 <para><literal>avoid_asym_router_failure</literal></para>
 443               </entry>
 444               <entry>
 445                 <para>Defaults to <literal>1</literal>. If set to
 446                 <literal>1</literal> single-hop routes have an additional
 447                 requirement to be considered up. The requirement is that the
 448                 gateway of the route must have at least one healthy network
 449                 interface connected directly to the remote net of the route. In
 450                 this context single-hop routes are routes that are given
 451                 <literal>hop=1</literal> explicitly when created, or routes for
 452                 which lnet can infer that they have only one hop.
 453                 Otherwise the route is not single-hop and this parameter has no
 454                 effect.</para>
 455               </entry>
 456             </row>
 457             <row>
 458               <entry>
 459                 <para><literal>alive_router_check_interval</literal></para>
 460               </entry>
 461               <entry>
 462                 <para>Defaults to <literal>60</literal> seconds. The gateways
 463                 will be discovered ever
 464                 <literal>alive_router_check_interval</literal>. If the gateway
 465                 can be reached on multiple networks, the interval per network is
 466                 <literal>alive_router_check_interval</literal> / number of
 467                 networks.</para>
 468               </entry>
 469             </row>
 470             <row>
 471               <entry>
 472                 <para><literal>router_ping_timeout</literal></para>
 473               </entry>
 474               <entry>
 475                 <para>Defaults to <literal>50</literal> seconds. A gateway sets
 476                 its interface down if it has not received any traffic for
 477                 <literal>router_ping_timeout + alive_router_check_interval
 478                 </literal>
 479                 </para>
 480               </entry>
 481             </row>
 482             <row>
 483               <entry>
 484                 <para><literal>router_sensitivity_percentage</literal></para>
 485               </entry>
 486               <entry>
 487                 <para>Defaults to <literal>100</literal>. This parameter defines
 488                 how sensitive a gateway interface is to failure. If set to 100
 489                 then any gateway interface failure will contribute to all routes
 490                 using it going down. The lower the value the more tolerant to
 491                 failures the system becomes.</para>
 492               </entry>
 493             </row>
 494           </tbody>
 495         </tgroup>
 496         </table>
 497       </section>
 498     </section>
 499     <section xml:id="mrrouting.health_routerhealth">
 500       <title><indexterm><primary>MR</primary>
 501         <secondary>mrrouting</secondary>
 502         <tertiary>routinghealth_routerhealth</tertiary>
 503         </indexterm>Router Health</title>
 504       <para>The routing infrastructure now relies on LNet Health to keep track
 505       of interface health. Each gateway interface has a health value
 506       associated with it. If a send fails to one of these interfaces, then the
 507       interface's health value is decremented and placed on a recovery queue.
 508       The unhealthy interface is then pinged every
 509       <literal>lnet_recovery_interval</literal>. This value defaults to
 510       <literal>1</literal> second.</para>
 511       <para>If the peer receives a message from the gateway, then it immediately
 512       assumes that the gateway's interface is up and resets its health value to
 513       maximum. This is needed to ensure we start using the gateways immediately
 514       instead of holding off until the interface is back to full health.</para>
 515     </section>
 516     <section xml:id="mrrouting.health_discovery">
 517       <title><indexterm><primary>MR</primary>
 518         <secondary>mrrouting</secondary>
 519         <tertiary>routinghealth_discovery</tertiary>
 520         </indexterm>Discovery</title>
 521       <para>LNet Discovery is used in place of pinging the peers. This serves
 522       two purposes:</para>
 523       <orderedlist>
 524         <listitem><para>The discovery communication infrastructure does not need
 525         to be duplicated for the routing feature.</para></listitem>
 526         <listitem><para>It allows propagation of the gateway's interface state
 527         changes to the peers using the gateway.</para></listitem>
 528       </orderedlist>
 529       <para>For (2), if an interface changes state from <literal>UP</literal> to
 530       <literal>DOWN</literal> or vice versa, then a discovery
 531       <literal>PUSH</literal> is sent to all the peers which can be reached.
 532       This allows peers to adapt to changes quicker.</para>
 533       <para>Discovery is designed to be backwards compatible. The discovery
 534       protocol is composed of a <literal>GET</literal> and a
 535       <literal>PUT</literal>. The <literal>GET</literal> requests interface
 536       information from the peer, this is a basic lnet ping. The peer responds
 537       with its interface information and a feature bit. If the peer is
 538       multi-rail capable and discovery is turned on, then the node will
 539       <literal>PUSH</literal> its interface information. As a result both peers
 540       will be aware of each other's interfaces.</para>
 541       <para>This information is then used by the peers to decide, based on the
 542       interface state provided by the gateway, whether the route is alive or
 543       not.</para>
 544     </section>
 545     <section xml:id="mrrouting.health_aliveness">
 546       <title><indexterm><primary>MR</primary>
 547         <secondary>mrrouting</secondary>
 548         <tertiary>routinghealth_aliveness</tertiary>
 549         </indexterm>Route Aliveness Criteria</title>
 550       <para>A route is considered alive if the following conditions hold:</para>
 551       <orderedlist>
 552         <listitem><para>The gateway can be reached on the local net via at least
 553         one path.</para></listitem>
 554         <listitem><para> For a single-hop route, if
 555         <literal>avoid_asym_router_failure</literal> is
 556         enabled then the remote network defined in the route must have at least
 557         one healthy interface on the gateway.</para></listitem>
 558       </orderedlist>
 559     </section>
 560   </section>
 561   <section xml:id="mrhealth" condition="l2C">
 562     <title><indexterm><primary>MR</primary><secondary>health</secondary>
 563     </indexterm>LNet Health</title>
 564     <para>LNet Multi-Rail has implemented the ability for multiple interfaces
 565     to be used on the same LNet network or across multiple LNet networks.  The
 566     LNet Health feature adds the ability to maintain a health value for each
 567     local and remote interface. This allows the Multi-Rail algorithm to
 568     consider the health of the interface before selecting it for sending.
 569     The feature also adds the ability to resend messages across different
 570     interfaces when interface or network failures are detected. This allows
 571     LNet to mitigate communication failures before passing the failures to
 572     upper layers for further error handling. To accomplish this, LNet Health
 573     monitors the status of the send and receive operations and uses this
 574     status to increment the interface's health value in case of success and
 575     decrement it in case of failure.</para>
 576     <section xml:id="mrhealthvalue">
 577       <title><indexterm><primary>MR</primary>
 578         <secondary>mrhealth</secondary>
 579         <tertiary>value</tertiary>
 580       </indexterm>Health Value</title>
 581       <para>The initial health value of a local or remote interface is set to
 582       <literal>LNET_MAX_HEALTH_VALUE</literal>, currently set to be
 583       <literal>1000</literal>.  The value itself is arbitrary and is meant to
 584       allow for health granularity, as opposed to having a simple boolean state.
 585       The granularity allows the Multi-Rail algorithm to select the interface
 586       that has the highest likelihood of sending or receiving a message.</para>
 587     </section>
 588     <section xml:id="mrhealthfailuretypes">
 589       <title><indexterm><primary>MR</primary>
 590         <secondary>mrhealth</secondary>
 591         <tertiary>failuretypes</tertiary>
 592       </indexterm>Failure Types and Behavior</title>
 593       <para>LNet health behavior depends on the type of failure detected:</para>
 594       <informaltable frame="all">
 595         <tgroup cols="2">
 596         <colspec colname="c1" colwidth="50*"/>
 597         <colspec colname="c2" colwidth="50*"/>
 598         <thead>
 599           <row>
 600             <entry>
 601               <para><emphasis role="bold">Failure Type</emphasis></para>
 602             </entry>
 603             <entry>
 604               <para><emphasis role="bold">Behavior</emphasis></para>
 605             </entry>
 606           </row>
 607         </thead>
 608         <tbody>
 609           <row>
 610             <entry>
 611               <para><literal>localresend</literal></para>
 612             </entry>
 613             <entry>
 614               <para>A local failure has occurred, such as no route found or an
 615               address resolution error. These failures could be temporary,
 616               therefore LNet will attempt to resend the message. LNet will
 617               decrement the health value of the local interface and will
 618               select it less often if there are multiple available interfaces.
 619               </para>
 620             </entry>
 621           </row>
 622           <row>
 623             <entry>
 624               <para><literal>localno-resend</literal></para>
 625             </entry>
 626             <entry>
 627               <para>A local non-recoverable error occurred in the system, such
 628               as out of memory error. In these cases LNet will not attempt to
 629               resend the message. LNet will decrement the health value of the
 630               local interface and will select it less often if there are
 631               multiple available interfaces.
 632               </para>
 633             </entry>
 634           </row>
 635           <row>
 636             <entry>
 637               <para><literal>remoteno-resend</literal></para>
 638             </entry>
 639             <entry>
 640               <para>If LNet successfully sends a message, but the message does
 641               not complete or an expected reply is not received, then it is
 642               classified as a remote error. LNet will not attempt to resend the
 643               message to avoid duplicate messages on the remote end. LNet will
 644               decrement the health value of the remote interface and will
 645               select it less often if there are multiple available interfaces.
 646               </para>
 647             </entry>
 648           </row>
 649           <row>
 650             <entry>
 651               <para><literal>remoteresend</literal></para>
 652             </entry>
 653             <entry>
 654               <para>There are a set of failures where we can be reasonably sure
 655               that the message was dropped before getting to the remote end. In
 656               this case, LNet will attempt to resend the message. LNet will
 657               decrement the health value of the remote interface and will
 658               select it less often if there are multiple available interfaces.
 659               </para>
 660             </entry>
 661           </row>
 662         </tbody></tgroup>
 663       </informaltable>
 664     </section>
 665     <section xml:id="mrhealthinterface">
 666       <title><indexterm><primary>MR</primary>
 667         <secondary>mrhealth</secondary>
 668         <tertiary>interface</tertiary>
 669       </indexterm>User Interface</title>
 670       <para>LNet Health is turned on by default. There are multiple module
 671       parameters available to control the LNet Health feature.</para>
 672       <para>All the module parameters are implemented in sysfs and are located
 673       in /sys/module/lnet/parameters/. They can be set directly by echoing a
 674       value into them as well as from lnetctl.</para>
 675       <informaltable frame="all">
 676         <tgroup cols="2">
 677         <colspec colname="c1" colwidth="50*"/>
 678         <colspec colname="c2" colwidth="50*"/>
 679         <thead>
 680           <row>
 681             <entry>
 682               <para><emphasis role="bold">Parameter</emphasis></para>
 683             </entry>
 684             <entry>
 685               <para><emphasis role="bold">Description</emphasis></para>
 686             </entry>
 687           </row>
 688         </thead>
 689         <tbody>
 690           <row>
 691             <entry>
 692               <para><literal>lnet_health_sensitivity</literal></para>
 693             </entry>
 694             <entry>
 695               <para>When LNet detects a failure on a particular interface it
 696               will decrement its Health Value by
 697               <literal>lnet_health_sensitivity</literal>. The greater the value,
 698               the longer it takes for that interface to become healthy again.
 699               The default value of <literal>lnet_health_sensitivity</literal>
 700               is set to 100. To disable LNet health, the value can be set to 0.
 701               </para>
 702               <para>An <literal>lnet_health_sensitivity</literal> of 100 means
 703               that 10 consecutive message failures or a steady-state failure
 704               rate over 1% would degrade the interface Health Value until it is
 705               disabled, while a lower failure rate would steer traffic away from
 706               the interface but it would continue to be available.  When a
 707               failure occurs on an interface then its Health Value is
 708               decremented and the interface is flagged for recovery.</para>
 709               <screen>lnetctl set health_sensitivity: sensitivity to failure
 710       0 - turn off health evaluation
 711       &gt;0 - sensitivity value not more than 1000</screen>
 712             </entry>
 713           </row>
 714           <row>
 715             <entry>
 716               <para><literal>lnet_recovery_interval</literal></para>
 717             </entry>
 718             <entry>
 719               <para>When LNet detects a failure on a local or remote interface
 720               it will place that interface on a recovery queue. There is a
 721               recovery queue for local interfaces and another for remote
 722               interfaces. The interfaces on the recovery queues will be LNet
 723               PINGed every <literal>lnet_recovery_interval</literal>. This value
 724               defaults to <literal>1</literal> second. On every successful PING
 725               the health value of the interface pinged will be incremented by
 726               <literal>1</literal>.</para>
 727               <para>Having this value configurable allows system administrators
 728               to control the amount of control traffic on the network.</para>
 729               <screen>lnetctl set recovery_interval: interval to ping unhealthy interfaces
 730       &gt;0 - timeout in seconds</screen>
 731             </entry>
 732           </row>
 733           <row>
 734             <entry>
 735               <para><literal>lnet_transaction_timeout</literal></para>
 736             </entry>
 737             <entry>
 738               <para>This timeout is somewhat of an overloaded value. It carries
 739               the following functionality:</para>
 740               <itemizedlist>
 741                 <listitem>
 742                   <para>A message is abandoned if it is not sent successfully
 743                   when the lnet_transaction_timeout expires and the retry_count
 744                   is not reached.</para>
 745                 </listitem>
 746                 <listitem>
 747                   <para>A GET or a PUT which expects an ACK expires if a REPLY
 748                   or an ACK respectively, is not received within the
 749                   <literal>lnet_transaction_timeout</literal>.</para>
 750                 </listitem>
 751               </itemizedlist>
 752               <para>This value defaults to 30 seconds.</para>
 753               <screen>lnetctl set transaction_timeout: Message/Response timeout
 754       &gt;0 - timeout in seconds</screen>
 755               <note><para>The LND timeout will now be a fraction of the
 756               <literal>lnet_transaction_timeout</literal> as described in the
 757               next section.</para>
 758               <para>This means that in networks where very large delays are
 759               expected then it will be necessary to increase this value
 760               accordingly.</para></note>
 761             </entry>
 762           </row>
 763           <row>
 764             <entry>
 765               <para><literal>lnet_retry_count</literal></para>
 766             </entry>
 767             <entry>
 768               <para>When LNet detects a failure which it deems appropriate for
 769               re-sending a message it will check if a message has passed the
 770               maximum retry_count specified. After which if a message wasn't
 771               sent successfully a failure event will be passed up to the layer
 772               which initiated message sending. The default value is 2.</para>
 773               <para>Since the message retry interval
 774               (<literal>lnet_lnd_timeout</literal>) is computed from
 775               <literal>lnet_transaction_timeout / lnet_retry_count</literal>,
 776               the <literal>lnet_retry_count</literal> should be kept low enough
 777               that the retry interval is not shorter than the round-trip message
 778               delay in the network.  A <literal>lnet_retry_count</literal> of 5
 779               is reasonable for the default
 780               <literal>lnet_transaction_timeout</literal> of 50 seconds.</para>
 781               <screen>lnetctl set retry_count: number of retries
 782       0 - turn off retries
 783       &gt;0 - number of retries, cannot be more than <literal>lnet_transaction_timeout</literal></screen>
 784             </entry>
 785           </row>
 786           <row>
 787             <entry>
 788               <para><literal>lnet_lnd_timeout</literal></para>
 789             </entry>
 790             <entry>
 791               <para>This is not a configurable parameter. But it is derived from
 792               two configurable parameters:
 793               <literal>lnet_transaction_timeout</literal> and
 794               <literal>retry_count</literal>.</para>
 795               <screen>lnet_lnd_timeout = (lnet_transaction_timeout-1) / (retry_count+1)
 796               </screen>
 797               <para>As such there is a restriction that
 798               <literal>lnet_transaction_timeout &gt;= retry_count</literal>
 799               </para>
 800               <para>The core assumption here is that in a healthy network,
 801               sending and receiving LNet messages should not have large delays.
 802               There could be large delays with RPC messages and their responses,
 803               but that's handled at the PtlRPC layer.</para>
 804             </entry>
 805           </row>
 806         </tbody>
 807         </tgroup>
 808       </informaltable>
 809     </section>
 810     <section xml:id="mrhealthdisplay">
 811       <title><indexterm><primary>MR</primary>
 812         <secondary>mrhealth</secondary>
 813         <tertiary>display</tertiary>
 814       </indexterm>Displaying Information</title>
 815       <section xml:id="mrhealthdisplayhealth">
 816         <title>Showing LNet Health Configuration Settings</title>
 817         <para><literal>lnetctl</literal> can be used to show all the LNet health
 818         configuration settings using the <literal>lnetctl global show</literal>
 819         command.</para>
 820         <screen>#&gt; lnetctl global show
 821       global:
 822       numa_range: 0
 823       max_intf: 200
 824       discovery: 1
 825       retry_count: 3
 826       transaction_timeout: 10
 827       health_sensitivity: 100
 828       recovery_interval: 1</screen>
 829       </section>
 830       <section xml:id="mrhealthdisplaystats">
 831         <title>Showing LNet Health Statistics</title>
 832         <para>LNet Health statistics are shown under a higher verbosity
 833         settings.  To show the local interface health statistics:</para>
 834         <screen>lnetctl net show -v 3</screen>
 835         <para>To show the remote interface health statistics:</para>
 836         <screen>lnetctl peer show -v 3</screen>
 837         <para>Sample output:</para>
 838         <screen>#&gt; lnetctl net show -v 3
 839       net:
 840       - net type: tcp
 841         local NI(s):
 842            - nid: 192.168.122.108@tcp
 843              status: up
 844              interfaces:
 845                  0: eth2
 846              statistics:
 847                  send_count: 304
 848                  recv_count: 284
 849                  drop_count: 0
 850              sent_stats:
 851                  put: 176
 852                  get: 138
 853                  reply: 0
 854                  ack: 0
 855                  hello: 0
 856              received_stats:
 857                  put: 145
 858                  get: 137
 859                  reply: 0
 860                  ack: 2
 861                  hello: 0
 862              dropped_stats:
 863                  put: 10
 864                  get: 0
 865                  reply: 0
 866                  ack: 0
 867                  hello: 0
 868              health stats:
 869                  health value: 1000
 870                  interrupts: 0
 871                  dropped: 10
 872                  aborted: 0
 873                  no route: 0
 874                  timeouts: 0
 875                  error: 0
 876              tunables:
 877                  peer_timeout: 180
 878                  peer_credits: 8
 879                  peer_buffer_credits: 0
 880                  credits: 256
 881              dev cpt: -1
 882              tcp bonding: 0
 883              CPT: "[0]"
 884       CPT: &quot;[0]&quot;</screen>
 885         <para>There is a new YAML block, <literal>health stats</literal>, which
 886         displays the health statistics for each local or remote network
 887         interface.</para>
 888         <para>Global statistics also dump the global health statistics as shown
 889         below:</para>
 890         <screen>#&gt; lnetctl stats show
 891         statistics:
 892             msgs_alloc: 0
 893             msgs_max: 33
 894             rst_alloc: 0
 895             errors: 0
 896             send_count: 901
 897             resend_count: 4
 898             response_timeout_count: 0
 899             local_interrupt_count: 0
 900             local_dropped_count: 10
 901             local_aborted_count: 0
 902             local_no_route_count: 0
 903             local_timeout_count: 0
 904             local_error_count: 0
 905             remote_dropped_count: 0
 906             remote_error_count: 0
 907             remote_timeout_count: 0
 908             network_timeout_count: 0
 909             recv_count: 851
 910             route_count: 0
 911             drop_count: 10
 912             send_length: 425791628
 913             recv_length: 69852
 914             route_length: 0
 915             drop_length: 0</screen>
 916       </section>
 917     </section>
 918     <section xml:id="mrhealthinitialsetup">
 919       <title><indexterm><primary>MR</primary>
 920         <secondary>mrhealth</secondary>
 921         <tertiary>initialsetup</tertiary>
 922       </indexterm>Initial Settings Recommendations</title>
 923       <para>LNet Health is off by default. This means that
 924       <literal>lnet_health_sensitivity</literal> and
 925       <literal>lnet_retry_count</literal> are set to <literal>0</literal>.
 926       </para>
 927       <para>Setting <literal>lnet_health_sensitivity</literal> to
 928       <literal>0</literal> will not decrement the health of the interface on
 929       failure and will not change the interface selection behavior. Furthermore,
 930       the failed interfaces will not be placed on the recovery queues. In
 931       essence, turning off the LNet Health feature.</para>
 932       <para>The LNet Health settings will need to be tuned for each cluster.
 933       However, the base configuration would be as follows:</para>
 934       <screen>#&gt; lnetctl global show
 935     global:
 936         numa_range: 0
 937         max_intf: 200
 938         discovery: 1
 939         retry_count: 3
 940         transaction_timeout: 10
 941         health_sensitivity: 100
 942         recovery_interval: 1</screen>
 943       <para>This setting will allow a maximum of two retries for failed messages
 944       within the 5 second transaction timeout.</para>
 945       <para>If there is a failure on the interface the health value will be
 946       decremented by 1 and the interface will be LNet PINGed every 1 second.
 947       </para>
 948     </section>
 949   </section>
 950 </chapter>
 951 <!--
 952   vim:expandtab:shiftwidth=2:tabstop=8:
 953   -->