1 <?xml version='1.0' encoding='UTF-8'?><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="lnetmr" condition='l2A'>
2 <title xml:id="lnetmr.title">LNet Software Multi-Rail</title>
3 <para>This chapter describes LNet Software Multi-Rail configuration and
7 <para><xref linkend="dbdoclet.mroverview"/></para>
8 <para><xref linkend="dbdoclet.mrconfiguring"/></para>
9 <para><xref linkend="dbdoclet.mrrouting"/></para>
10 <para><xref linkend="dbdoclet.mrhealth"/></para>
13 <section xml:id="dbdoclet.mroverview">
14 <title><indexterm><primary>MR</primary><secondary>overview</secondary>
15 </indexterm>Multi-Rail Overview</title>
16 <para>In computer networking, multi-rail is an arrangement in which two or
17 more network interfaces to a single network on a computer node are employed
18 to achieve increased throughput. Multi-rail can also refer to a node with
19 one or more interfaces to multiple, even different kinds of, networks, such
20 as Ethernet, InfiniBand, and Intel® Omni-Path. For Lustre clients,
21 multi-rail generally presents the combined network capabilities as a single
22 LNet network. Peer nodes that are multi-rail capable are established during
23 configuration, as are user-defined interface selection policies.</para>
24 <para>The following link contains a detailed high-level design for the
26 <link xl:href="http://wiki.lustre.org/images/b/bb/Multi-Rail_High-Level_Design_20150119.pdf">
27 Multi-Rail High-Level Design</link>.</para>
29 <section xml:id="dbdoclet.mrconfiguring">
30 <title><indexterm><primary>MR</primary><secondary>configuring</secondary>
31 </indexterm>Configuring Multi-Rail</title>
32 <para>Every node using multi-rail networking needs to be properly
33 configured. Multi-rail uses <literal>lnetctl</literal> and the LNet
34 Configuration Library for configuration. Configuring multi-rail for a
35 given node involves two tasks:</para>
37 <listitem><para>Configuring multiple network interfaces present on the
38 local node.</para></listitem>
39 <listitem><para>Adding remote peers that are multi-rail capable (that is,
40 peers connected to one or more common networks with at least two interfaces).
43 <para>This section is a supplement to
44 <xref linkend="lnet_config.lnetaddshowdelete" /> and contains further
45 examples for Multi-Rail configurations.</para>
46 <para>For information on the dynamic peer discovery feature added in
47 Lustre Release 2.11.0, see
48 <xref linkend="lnet_config.dynamic_discovery" />.</para>
49 <section xml:id="dbdoclet.addinterfaces">
50 <title><indexterm><primary>MR</primary>
51 <secondary>multipleinterfaces</secondary>
52 </indexterm>Configure Multiple Interfaces on the Local Node</title>
53 <para>Example <literal>lnetctl net add</literal> command with multiple
54 interfaces in a Multi-Rail configuration:</para>
55 <screen>lnetctl net add --net tcp --if eth0,eth1</screen>
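<para>Interfaces can also be added to the same network one at a time. The
following is a sketch that assumes the <literal>tcp</literal> network from the
example above; behavior may vary by release, so verify the result with
<literal>lnetctl net show</literal>:</para>
<screen>lnetctl net add --net tcp --if eth0
lnetctl net add --net tcp --if eth1</screen>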
56 <para>Example <literal>lnetctl net show</literal> YAML output:</para>
57 <screen>lnetctl net show -v
70 peer_buffer_credits: 0
78 - nid: 192.168.122.10@tcp
89 peer_buffer_credits: 0
95 - nid: 192.168.122.11@tcp
106 peer_buffer_credits: 0
113 <section xml:id="dbdoclet.deleteinterfaces">
114 <title><indexterm><primary>MR</primary>
115 <secondary>deleteinterfaces</secondary>
116 </indexterm>Deleting Network Interfaces</title>
117 <para>Example delete with <literal>lnetctl net del</literal>:</para>
118 <para>Assuming the network configuration is as shown by the
119 <literal>lnetctl net show -v</literal> output in the previous section, we can
120 delete a net with the following command:</para>
121 <screen>lnetctl net del --net tcp --if eth0</screen>
122 <para>The resulting net information would look like:</para>
123 <screen>lnetctl net show -v
136 peer_buffer_credits: 0
141 CPT: "[0,1,2,3]"</screen>
142 <para>The syntax of a YAML file to perform a delete would be:</para>
143 <screen>- net type: tcp
145 - nid: 192.168.122.10@tcp
149 <section xml:id="dbdoclet.addremotepeers">
150 <title><indexterm><primary>MR</primary>
151 <secondary>addremotepeers</secondary>
152 </indexterm>Adding Remote Peers that are Multi-Rail Capable</title>
153 <para>The following example <literal>lnetctl peer add</literal>
154 command adds a peer with two nids, with
155 <literal>192.168.122.30@tcp</literal> being the primary nid:</para>
156 <screen>lnetctl peer add --prim_nid 192.168.122.30@tcp --nid 192.168.122.30@tcp,192.168.122.31@tcp
158 <para>The resulting <literal>lnetctl peer show</literal> would be:
159 <screen>lnetctl peer show -v
161 - primary nid: 192.168.122.30@tcp
164 - nid: 192.168.122.30@tcp
167 available_tx_credits: 8
170 available_rtr_credits: 8
177 - nid: 192.168.122.31@tcp
180 available_tx_credits: 8
183 available_rtr_credits: 8
189 drop_count: 0</screen>
191 <para>The following is an example YAML file for adding a peer:</para>
194 - primary nid: 192.168.122.30@tcp
197 - nid: 192.168.122.31@tcp</screen>
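<para>Such a YAML file (saved here under the hypothetical name
<literal>addPeer.yaml</literal>) can be applied with
<literal>lnetctl import</literal>:</para>
<screen>lnetctl import < addPeer.yaml</screen>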
199 <section xml:id="dbdoclet.deleteremotepeers">
200 <title><indexterm><primary>MR</primary>
201 <secondary>deleteremotepeers</secondary>
202 </indexterm>Deleting Remote Peers</title>
203 <para>Example of deleting a single nid of a peer (192.168.122.31@tcp):
205 <screen>lnetctl peer del --prim_nid 192.168.122.30@tcp --nid 192.168.122.31@tcp</screen>
206 <para>Example of deleting the entire peer:</para>
207 <screen>lnetctl peer del --prim_nid 192.168.122.30@tcp</screen>
208 <para>Example of deleting a peer via YAML:</para>
209 <screen>Assuming the following peer configuration:
211 - primary nid: 192.168.122.30@tcp
214 - nid: 192.168.122.30@tcp
216 - nid: 192.168.122.31@tcp
218 - nid: 192.168.122.32@tcp
221 You can delete 192.168.122.32@tcp as follows:
225 - primary nid: 192.168.122.30@tcp
228 - nid: 192.168.122.32@tcp
230 % lnetctl import --del < delPeer.yaml</screen>
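<para>After the import completes, the remaining peer configuration can be
confirmed with <literal>lnetctl peer show</literal>; the deleted nid should no
longer appear under the peer:</para>
<screen>lnetctl peer show</screen>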
233 <section xml:id="dbdoclet.mrrouting">
234 <title><indexterm><primary>MR</primary>
235 <secondary>mrrouting</secondary>
236 </indexterm>Notes on routing with Multi-Rail</title>
237 <para>Multi-Rail configuration can be applied on the router to aggregate
238 the performance of its interfaces.</para>
239 <section xml:id="dbdoclet.mrroutingex">
240 <title><indexterm><primary>MR</primary>
241 <secondary>mrrouting</secondary>
242 <tertiary>routingex</tertiary>
243 </indexterm>Multi-Rail Cluster Example</title>
244 <para>The example below outlines a simple system where all the Lustre
245 nodes are MR capable. Each node in the cluster has two interfaces.</para>
246 <figure xml:id="lnetmultirail.fig.routingdiagram">
247 <title>Routing Configuration with Multi-Rail</title>
250 <imagedata scalefit="1" width="100%"
251 fileref="./figures/MR_RoutingConfig.png" />
254 <phrase>Routing Configuration with Multi-Rail</phrase>
258 <para>The routers can aggregate the interfaces on each side of the network
259 by configuring them on the appropriate network.</para>
260 <para>An example configuration:</para>
262 lnetctl net add --net o2ib0 --if ib0,ib1
263 lnetctl net add --net o2ib1 --if ib2,ib3
264 lnetctl peer add --nid <peer1-nidA>@o2ib,<peer1-nidB>@o2ib,...
265 lnetctl peer add --nid <peer2-nidA>@o2ib1,<peer2-nidB>@o2ib1,...
266 lnetctl set routing 1
269 lnetctl net add --net o2ib0 --if ib0,ib1
270 lnetctl route add --net o2ib1 --gateway <rtrX-nidA>@o2ib
271 lnetctl peer add --nid <rtrX-nidA>@o2ib,<rtrX-nidB>@o2ib
274 lnetctl net add --net o2ib1 --if ib0,ib1
275 lnetctl route add --net o2ib0 --gateway <rtrX-nidA>@o2ib1
276 lnetctl peer add --nid <rtrX-nidA>@o2ib1,<rtrX-nidB>@o2ib1</screen>
277 <para>In the above configuration the clients and the servers are
278 configured with only one route entry per router. This works because the
279 routers are MR capable. Because the routers are added to the clients and
280 the servers as peers with multiple interfaces, when sending to the router
281 the MR algorithm will ensure that both interfaces of the routers are used.
283 <para>However, as of the Lustre 2.10 release, LNet Resiliency is still
284 under development and a single interface failure will still cause the
285 entire router to go down.</para>
287 <section xml:id="dbdoclet.mrroutingresiliency">
288 <title><indexterm><primary>MR</primary>
289 <secondary>mrrouting</secondary>
290 <tertiary>routingresiliency</tertiary>
291 </indexterm>Utilizing Router Resiliency</title>
292 <para>Currently, LNet provides a mechanism to monitor each route entry.
293 LNet pings each gateway identified in the route entry at regular,
294 configurable intervals to ensure that it is alive. If sending over a
295 specific route fails, or if the router pinger determines that the gateway
296 is down, then the route is marked as down and is not used. It is
297 subsequently pinged at regular, configurable intervals to determine when
298 it becomes alive again.</para>
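<para>The state of each route entry can be inspected with
<literal>lnetctl route show</literal>. The output below is an illustrative
sketch; the exact fields shown can vary by release:</para>
<screen>lnetctl route show -v
route:
    - net: o2ib1
      gateway: <rtrX-nidA>@o2ib
      state: up</screen>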
299 <para>This mechanism can be combined with the MR feature in Lustre 2.10 to
300 add this router resiliency feature to the configuration.</para>
302 lnetctl net add --net o2ib0 --if ib0,ib1
303 lnetctl net add --net o2ib1 --if ib2,ib3
304 lnetctl peer add --nid <peer1-nidA>@o2ib,<peer1-nidB>@o2ib,...
305 lnetctl peer add --nid <peer2-nidA>@o2ib1,<peer2-nidB>@o2ib1,...
306 lnetctl set routing 1
309 lnetctl net add --net o2ib0 --if ib0,ib1
310 lnetctl route add --net o2ib1 --gateway <rtrX-nidA>@o2ib
311 lnetctl route add --net o2ib1 --gateway <rtrX-nidB>@o2ib
314 lnetctl net add --net o2ib1 --if ib0,ib1
315 lnetctl route add --net o2ib0 --gateway <rtrX-nidA>@o2ib1
316 lnetctl route add --net o2ib0 --gateway <rtrX-nidB>@o2ib1</screen>
317 <para>There are a few things to note in the above configuration:</para>
320 <para>The clients and the servers are now configured with two
321 routes; each route's gateway is one of the interfaces of the
322 router. The clients and servers will view each interface of the
323 same router as a separate gateway and will monitor them as
324 described above.</para>
327 <para>The clients and the servers are not configured to view the
328 routers as MR capable. This is important because we want to deal
329 with each interface as a separate peer and not as different
330 interfaces of the same peer.</para>
333 <para>The routers are configured to view the peers as MR capable.
334 This is an oddity in the configuration, but it is currently required
335 in order to allow the routers to balance the traffic load
336 evenly across their interfaces.</para>
340 <section xml:id="dbdoclet.mrroutingmixed">
341 <title><indexterm><primary>MR</primary>
342 <secondary>mrrouting</secondary>
343 <tertiary>routingmixed</tertiary>
344 </indexterm>Mixed Multi-Rail/Non-Multi-Rail Cluster</title>
345 <para>The above principles can be applied to a mixed MR/non-MR cluster.
346 For example, the same configuration shown above can be applied if the
347 clients and the servers are non-MR while the routers are MR capable.
348 This appears to be a common cluster upgrade scenario.</para>
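<para>As a sketch of this scenario, a non-MR client would configure a single
interface and still add each router interface as a separate gateway, mirroring
the resiliency configuration shown above:</para>
<screen>lnetctl net add --net o2ib0 --if ib0
lnetctl route add --net o2ib1 --gateway <rtrX-nidA>@o2ib
lnetctl route add --net o2ib1 --gateway <rtrX-nidB>@o2ib</screen>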
351 <section xml:id="dbdoclet.mrhealth" condition="l2C">
352 <title><indexterm><primary>MR</primary><secondary>health</secondary>
353 </indexterm>LNet Health</title>
354 <para>LNet Multi-Rail provides the ability for multiple interfaces
355 to be used on the same LNet network or across multiple LNet networks. The
356 LNet Health feature adds the ability to maintain a health value for each
357 local and remote interface. This allows the Multi-Rail algorithm to
358 consider the health of the interface before selecting it for sending.
359 The feature also adds the ability to resend messages across different
360 interfaces when interface or network failures are detected. This allows
361 LNet to mitigate communication failures before passing the failures to
362 upper layers for further error handling. To accomplish this, LNet Health
363 monitors the status of the send and receive operations and uses this
364 status to increment the interface's health value in case of success and
365 decrement it in case of failure.</para>
366 <section xml:id="dbdoclet.mrhealthvalue">
367 <title><indexterm><primary>MR</primary>
368 <secondary>mrhealth</secondary>
369 <tertiary>value</tertiary>
370 </indexterm>Health Value</title>
371 <para>The initial health value of a local or remote interface is set to
372 <literal>LNET_MAX_HEALTH_VALUE</literal>, currently defined as
373 <literal>1000</literal>. The value itself is arbitrary and is meant to
374 allow for health granularity, as opposed to having a simple boolean state.
375 The granularity allows the Multi-Rail algorithm to select the interface
376 that has the highest likelihood of sending or receiving a message.</para>
378 <section xml:id="dbdoclet.mrhealthfailuretypes">
379 <title><indexterm><primary>MR</primary>
380 <secondary>mrhealth</secondary>
381 <tertiary>failuretypes</tertiary>
382 </indexterm>Failure Types and Behavior</title>
383 <para>LNet health behavior depends on the type of failure detected:</para>
384 <informaltable frame="all">
386 <colspec colname="c1" colwidth="50*"/>
387 <colspec colname="c2" colwidth="50*"/>
391 <para><emphasis role="bold">Failure Type</emphasis></para>
394 <para><emphasis role="bold">Behavior</emphasis></para>
401 <para><literal>local resend</literal></para>
404 <para>A local failure has occurred, such as no route found or an
405 address resolution error. These failures could be temporary,
406 therefore LNet will attempt to resend the message. LNet will
407 decrement the health value of the local interface and will
408 select it less often if there are multiple available interfaces.
414 <para><literal>local no-resend</literal></para>
417 <para>A local non-recoverable error occurred in the system, such
418 as an out-of-memory error. In these cases LNet will not attempt to
419 resend the message. LNet will decrement the health value of the
420 local interface and will select it less often if there are
421 multiple available interfaces.
427 <para><literal>remote no-resend</literal></para>
430 <para>If LNet successfully sends a message, but the message does
431 not complete or an expected reply is not received, then it is
432 classified as a remote error. LNet will not attempt to resend the
433 message to avoid duplicate messages on the remote end. LNet will
434 decrement the health value of the remote interface and will
435 select it less often if there are multiple available interfaces.
441 <para><literal>remote resend</literal></para>
444 <para>There are a set of failures where we can be reasonably sure
445 that the message was dropped before getting to the remote end. In
446 this case, LNet will attempt to resend the message. LNet will
447 decrement the health value of the remote interface and will
448 select it less often if there are multiple available interfaces.
455 <section xml:id="dbdoclet.mrhealthinterface">
456 <title><indexterm><primary>MR</primary>
457 <secondary>mrhealth</secondary>
458 <tertiary>interface</tertiary>
459 </indexterm>User Interface</title>
460 <para>LNet Health is turned off by default. There are multiple module
461 parameters available to control the LNet Health feature.</para>
462 <para>All the module parameters are implemented in sysfs and are located
463 in <literal>/sys/module/lnet/parameters/</literal>. They can be set directly
464 by echoing a value into them, as well as through <literal>lnetctl</literal>.
465 <informaltable frame="all">
467 <colspec colname="c1" colwidth="50*"/>
468 <colspec colname="c2" colwidth="50*"/>
472 <para><emphasis role="bold">Parameter</emphasis></para>
475 <para><emphasis role="bold">Description</emphasis></para>
482 <para><literal>lnet_health_sensitivity</literal></para>
485 <para>When LNet detects a failure on a particular interface it
486 will decrement its Health Value by
487 <literal>lnet_health_sensitivity</literal>. The greater the value,
488 the longer it takes for that interface to become healthy again.
489 The default value of <literal>lnet_health_sensitivity</literal>
490 is set to 0, which means the health value will not be decremented.
491 In essence, the health feature is turned off.</para>
492 <para>The sensitivity value can be set greater than 0. A
493 <literal>lnet_health_sensitivity</literal> of 100 would mean that
494 10 consecutive message failures or a steady-state failure rate
495 over 1% would degrade the interface Health Value until it is
496 disabled, while a lower failure rate would steer traffic away from
497 the interface but it would continue to be available. When a
498 failure occurs on an interface then its Health Value is
499 decremented and the interface is flagged for recovery.</para>
500 <screen>lnetctl set health_sensitivity: sensitivity to failure
501 0 - turn off health evaluation
502 >0 - sensitivity value not more than 1000</screen>
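<para>For example, to enable health evaluation with a sensitivity of 100 (an
illustrative value; tune it for your cluster):</para>
<screen>lnetctl set health_sensitivity 100</screen>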
507 <para><literal>lnet_recovery_interval</literal></para>
510 <para>When LNet detects a failure on a local or remote interface
511 it will place that interface on a recovery queue. There is a
512 recovery queue for local interfaces and another for remote
513 interfaces. The interfaces on the recovery queues will be LNet
514 PINGed every <literal>lnet_recovery_interval</literal>. This value
515 defaults to <literal>1</literal> second. On every successful PING
516 the health value of the interface pinged will be incremented by
517 <literal>1</literal>.</para>
518 <para>Having this value configurable allows system administrators
519 to control the amount of control traffic on the network.</para>
520 <screen>lnetctl set recovery_interval: interval to ping unhealthy interfaces
521 >0 - timeout in seconds</screen>
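<para>For example, to ping unhealthy interfaces every 5 seconds instead of the
default 1 second (an illustrative value that reduces recovery traffic on the
network):</para>
<screen>lnetctl set recovery_interval 5</screen>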
526 <para><literal>lnet_transaction_timeout</literal></para>
529 <para>This timeout is somewhat of an overloaded value. It carries
530 the following functionality:</para>
533 <para>A message is abandoned if it is not sent successfully when
534 the <literal>lnet_transaction_timeout</literal> expires and the <literal>retry_count</literal>
535 is not reached.</para>
538 <para>A GET, or a PUT which expects an ACK, expires if a REPLY
539 or an ACK, respectively, is not received within the
540 <literal>lnet_transaction_timeout</literal>.</para>
543 <para>This value defaults to 30 seconds.</para>
544 <screen>lnetctl set transaction_timeout: Message/Response timeout
545 >0 - timeout in seconds</screen>
546 <note><para>The LND timeout will now be a fraction of the
547 <literal>lnet_transaction_timeout</literal> as described in the
549 <para>This means that in networks where very large delays are
550 expected, it will be necessary to increase this value
551 accordingly.</para></note>
556 <para><literal>lnet_retry_count</literal></para>
559 <para>When LNet detects a failure which it deems appropriate for
560 re-sending a message, it will check whether the message has exceeded
561 the maximum <literal>retry_count</literal> specified. If it has, and
562 the message was still not sent successfully, a failure event will be
563 passed up to the layer which initiated the message sending.</para>
564 <para>Since the message retry interval
565 (<literal>lnet_lnd_timeout</literal>) is computed from
566 <literal>lnet_transaction_timeout / lnet_retry_count</literal>,
567 the <literal>lnet_retry_count</literal> should be kept low enough
568 that the retry interval is not shorter than the round-trip message
569 delay in the network. A <literal>lnet_retry_count</literal> of 5
570 is reasonable for the default
571 <literal>lnet_transaction_timeout</literal> of 50 seconds.</para>
572 <screen>lnetctl set retry_count: number of retries
574 >0 - number of retries, cannot be more than <literal>lnet_transaction_timeout</literal></screen>
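<para>For example, the transaction timeout and retry count can be set
together (illustrative values consistent with the defaults described
above):</para>
<screen>lnetctl set transaction_timeout 50
lnetctl set retry_count 5</screen>
<para>With these values, the derived retry interval
(<literal>lnet_lnd_timeout</literal>) is 50 / 5 = 10 seconds.</para>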
579 <para><literal>lnet_lnd_timeout</literal></para>
582 <para>This is not a directly configurable parameter; it is derived
583 from two configurable parameters:
584 <literal>lnet_transaction_timeout</literal> and
585 <literal>retry_count</literal>.</para>
586 <screen>lnet_lnd_timeout = lnet_transaction_timeout / retry_count
588 <para>As such, there is a restriction that
589 <literal>lnet_transaction_timeout >= retry_count</literal>
591 <para>The core assumption here is that in a healthy network,
592 sending and receiving LNet messages should not have large delays.
593 There could be large delays with RPC messages and their responses,
594 but that's handled at the PtlRPC layer.</para>
601 <section xml:id="dbdoclet.mrhealthdisplay">
602 <title><indexterm><primary>MR</primary>
603 <secondary>mrhealth</secondary>
604 <tertiary>display</tertiary>
605 </indexterm>Displaying Information</title>
606 <section xml:id="dbdoclet.mrhealthdisplayhealth">
607 <title>Showing LNet Health Configuration Settings</title>
608 <para><literal>lnetctl</literal> can be used to show all the LNet health
609 configuration settings using the <literal>lnetctl global show</literal>
611 <screen>#> lnetctl global show
617 transaction_timeout: 10
618 health_sensitivity: 100
619 recovery_interval: 1</screen>
621 <section xml:id="dbdoclet.mrhealthdisplaystats">
622 <title>Showing LNet Health Statistics</title>
623 <para>LNet Health statistics are shown under higher verbosity
624 settings. To show the local interface health statistics:</para>
625 <screen>lnetctl net show -v 3</screen>
626 <para>To show the remote interface health statistics:</para>
627 <screen>lnetctl peer show -v 3</screen>
628 <para>Sample output:</para>
629 <screen>#> lnetctl net show -v 3
633 - nid: 192.168.122.108@tcp
670 peer_buffer_credits: 0
675 CPT: "[0]"</screen>
676 <para>There is a new YAML block, <literal>health stats</literal>, which
677 displays the health statistics for each local or remote network
679 <para>Global statistics also dump the global health statistics as shown
681 <screen>#> lnetctl stats show
689 response_timeout_count: 0
690 local_interrupt_count: 0
691 local_dropped_count: 10
692 local_aborted_count: 0
693 local_no_route_count: 0
694 local_timeout_count: 0
696 remote_dropped_count: 0
697 remote_error_count: 0
698 remote_timeout_count: 0
699 network_timeout_count: 0
703 send_length: 425791628
706 drop_length: 0</screen>
709 <section xml:id="dbdoclet.mrhealthinitialsetup">
710 <title><indexterm><primary>MR</primary>
711 <secondary>mrhealth</secondary>
712 <tertiary>initialsetup</tertiary>
713 </indexterm>Initial Settings Recommendations</title>
714 <para>LNet Health is off by default. This means that
715 <literal>lnet_health_sensitivity</literal> and
716 <literal>lnet_retry_count</literal> are set to <literal>0</literal>.
718 <para>Setting <literal>lnet_health_sensitivity</literal> to
719 <literal>0</literal> means the health of an interface is not decremented
720 on failure and the interface selection behavior is unchanged. Furthermore,
721 failed interfaces will not be placed on the recovery queues. In
722 essence, this turns off the LNet Health feature.</para>
723 <para>The LNet Health settings will need to be tuned for each cluster.
724 However, the base configuration would be as follows:</para>
725 <screen>#> lnetctl global show
731 transaction_timeout: 10
732 health_sensitivity: 100
733 recovery_interval: 1</screen>
734 <para>This setting will allow a maximum of two retries for failed messages
735 within the 10 second transaction timeout.</para>
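<para>Assuming the values shown above, the base configuration could be applied
with the following commands (verify the recommended defaults for your release
first):</para>
<screen>lnetctl set health_sensitivity 100
lnetctl set recovery_interval 1
lnetctl set transaction_timeout 10</screen>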
736 <para>If there is a failure on the interface, the health value will be
737 decremented by 1 and the interface will be LNet PINGed every 1 second.