+ <listitem>
+ <para>The clients and the servers are now configured with two
+ routes. Each route's gateway is one of the interfaces of the
+ router. The clients and servers will view each interface of the
+ same router as a separate gateway and will monitor them as
+ described above.</para>
+ </listitem>
+ <listitem>
+ <para>The clients and the servers are not configured to view the
+ routers as MR capable. This is important because we want to deal
+ with each interface as a separate peer and not as different
+ interfaces of the same peer.</para>
+ </listitem>
+ <listitem>
+ <para>The routers are configured to view the peers as MR capable.
+ This is an oddity in the configuration, but is currently required
+ in order to allow the routers to load balance traffic evenly
+ across their interfaces.</para>
+ </listitem>
+ </orderedlist>
+ </section>
+ <section xml:id="dbdoclet.mrroutingmixed">
+ <title><indexterm><primary>MR</primary>
+ <secondary>mrrouting</secondary>
+ <tertiary>routingmixed</tertiary>
+ </indexterm>Mixed Multi-Rail/Non-Multi-Rail Cluster</title>
+ <para>The above principles can be applied to a mixed MR/non-MR cluster.
+ For example, the same configuration shown above can be applied if the
+ clients and the servers are non-MR while the routers are MR capable.
+ This appears to be a common cluster upgrade scenario.</para>
+ </section>
+ </section>
+ <section xml:id="dbdoclet.mrhealth" condition="l2C">
+ <title><indexterm><primary>MR</primary><secondary>health</secondary>
+ </indexterm>LNet Health</title>
+ <para>LNet Multi-Rail has implemented the ability for multiple interfaces
+ to be used on the same LNet network or across multiple LNet networks. The
+ LNet Health feature adds the ability to maintain a health value for each
+ local and remote interface. This allows the Multi-Rail algorithm to
+ consider the health of the interface before selecting it for sending.
+ The feature also adds the ability to resend messages across different
+ interfaces when interface or network failures are detected. This allows
+ LNet to mitigate communication failures before passing the failures to
+ upper layers for further error handling. To accomplish this, LNet Health
+ monitors the status of the send and receive operations and uses this
+ status to increment the interface's health value in case of success and
+ decrement it in case of failure.</para>
+ <section xml:id="dbdoclet.mrhealthvalue">
+ <title><indexterm><primary>MR</primary>
+ <secondary>mrhealth</secondary>
+ <tertiary>value</tertiary>
+ </indexterm>Health Value</title>
+ <para>The initial health value of a local or remote interface is set to
+ <literal>LNET_MAX_HEALTH_VALUE</literal>, currently set to be
+ <literal>1000</literal>. The value itself is arbitrary and is meant to
+ allow for health granularity, as opposed to having a simple boolean state.
+ The granularity allows the Multi-Rail algorithm to select the interface
+ that has the highest likelihood of sending or receiving a message.</para>
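+ <para>As a rough illustration of why a graded value is more useful than
+ a boolean state, the following Python sketch ranks interfaces by health
+ value. It is not LNet code: the <literal>Interface</literal> class, the
+ <literal>pick_interface</literal> helper, and the NID strings are
+ hypothetical names chosen for this example, and the sketch assumes that
+ health is the only selection criterion, which is a simplification.</para>

```python
from dataclasses import dataclass

LNET_MAX_HEALTH_VALUE = 1000  # initial health value given in the text


@dataclass
class Interface:
    """Hypothetical stand-in for a local or remote interface."""
    nid: str
    health: int = LNET_MAX_HEALTH_VALUE


def pick_interface(interfaces):
    """Return the interface with the highest health value.

    Assumption: health is the only criterion. Ties are broken by list
    order, because max() returns the first maximal element.
    """
    return max(interfaces, key=lambda ni: ni.health)


nis = [Interface("10.0.0.1@tcp"), Interface("10.0.0.2@tcp")]
nis[0].health -= 100  # one failure, assuming a sensitivity of 100
chosen = pick_interface(nis)
print(chosen.nid, chosen.health)
```

+ <para>With a boolean state the first interface would simply be marked
+ bad after a single failure. With a graded value it is only ranked lower
+ and remains a candidate.</para>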
+ </section>
+ <section xml:id="dbdoclet.mrhealthfailuretypes">
+ <title><indexterm><primary>MR</primary>
+ <secondary>mrhealth</secondary>
+ <tertiary>failuretypes</tertiary>
+ </indexterm>Failure Types and Behavior</title>
+ <para>LNet health behavior depends on the type of failure detected:</para>
+ <informaltable frame="all">
+ <tgroup cols="2">
+ <colspec colname="c1" colwidth="50*"/>
+ <colspec colname="c2" colwidth="50*"/>
+ <thead>
+ <row>
+ <entry>
+ <para><emphasis role="bold">Failure Type</emphasis></para>
+ </entry>
+ <entry>
+ <para><emphasis role="bold">Behavior</emphasis></para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para><literal>localresend</literal></para>
+ </entry>
+ <entry>
+ <para>A local failure has occurred, such as no route found or an
+ address resolution error. These failures could be temporary,
+ therefore LNet will attempt to resend the message. LNet will
+ decrement the health value of the local interface and will
+ select it less often if there are multiple available interfaces.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><literal>localno-resend</literal></para>
+ </entry>
+ <entry>
+ <para>A local non-recoverable error occurred in the system, such
+ as an out-of-memory error. In these cases LNet will not attempt to
+ resend the message. LNet will decrement the health value of the
+ local interface and will select it less often if there are
+ multiple available interfaces.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><literal>remoteno-resend</literal></para>
+ </entry>
+ <entry>
+ <para>If LNet successfully sends a message, but the message does
+ not complete or an expected reply is not received, then it is
+ classified as a remote error. LNet will not attempt to resend the
+ message to avoid duplicate messages on the remote end. LNet will
+ decrement the health value of the remote interface and will
+ select it less often if there are multiple available interfaces.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><literal>remoteresend</literal></para>
+ </entry>
+ <entry>
+ <para>There is a set of failures where we can be reasonably sure
+ that the message was dropped before getting to the remote end. In
+ this case, LNet will attempt to resend the message. LNet will
+ decrement the health value of the remote interface and will
+ select it less often if there are multiple available interfaces.
+ </para>
+ </entry>
+ </row>
+ </tbody></tgroup>
+ </informaltable>
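+ <para>The table above can be summarized as a small lookup, sketched
+ here in Python. The <literal>FAILURE_BEHAVIOR</literal> mapping and the
+ <literal>handle_failure</literal> helper are hypothetical and exist only
+ for illustration. Clamping the health value at zero is an assumption
+ of this sketch, not something stated above.</para>

```python
# Failure type: (side whose health value is decremented, resend?)
FAILURE_BEHAVIOR = {
    "localresend": ("local", True),
    "localno-resend": ("local", False),
    "remoteno-resend": ("remote", False),
    "remoteresend": ("remote", True),
}


def handle_failure(failure_type, health, sensitivity):
    """Apply one failure to a {'local': int, 'remote': int} health dict.

    Returns True if the message should be resent.
    """
    side, resend = FAILURE_BEHAVIOR[failure_type]
    # Assumption: the health value never goes below zero.
    health[side] = max(0, health[side] - sensitivity)
    return resend


health = {"local": 1000, "remote": 1000}
resend = handle_failure("remoteno-resend", health, sensitivity=100)
print(resend, health)
```

+ <para>In every case the health value of the affected side is
+ decremented. The failure type only decides which side is penalized and
+ whether a resend is attempted.</para>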
+ </section>
+ <section xml:id="dbdoclet.mrhealthinterface">
+ <title><indexterm><primary>MR</primary>
+ <secondary>mrhealth</secondary>
+ <tertiary>interface</tertiary>
+ </indexterm>User Interface</title>
+ <para>LNet Health is turned off by default. There are multiple module
+ parameters available to control the LNet Health feature.</para>
+ <para>All the module parameters are implemented in sysfs and are located
+ in <literal>/sys/module/lnet/parameters/</literal>. They can be set
+ directly by echoing a value into them or by using
+ <literal>lnetctl</literal>.</para>
+ <informaltable frame="all">
+ <tgroup cols="2">
+ <colspec colname="c1" colwidth="50*"/>
+ <colspec colname="c2" colwidth="50*"/>
+ <thead>
+ <row>
+ <entry>
+ <para><emphasis role="bold">Parameter</emphasis></para>
+ </entry>
+ <entry>
+ <para><emphasis role="bold">Description</emphasis></para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para><literal>lnet_health_sensitivity</literal></para>
+ </entry>
+ <entry>
+ <para>When LNet detects a failure on a particular interface it
+ will decrement its Health Value by
+ <literal>lnet_health_sensitivity</literal>. The greater the value,
+ the longer it takes for that interface to become healthy again.
+ The default value of <literal>lnet_health_sensitivity</literal>
+ is set to 0, which means the health value will not be decremented.
+ In essence, the health feature is turned off.</para>
+ <para>The sensitivity value can be set greater than 0. A
+ <literal>lnet_health_sensitivity</literal> of 100 would mean that
+ 10 consecutive message failures or a steady-state failure rate
+ over 1% would degrade the interface Health Value until it is
+ disabled, while a lower failure rate would steer traffic away from
+ the interface but it would continue to be available. When a
+ failure occurs on an interface then its Health Value is
+ decremented and the interface is flagged for recovery.</para>
+ <screen>lnetctl set health_sensitivity: sensitivity to failure
+ 0 - turn off health evaluation
+ >0 - sensitivity value not more than 1000</screen>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><literal>lnet_recovery_interval</literal></para>
+ </entry>
+ <entry>
+ <para>When LNet detects a failure on a local or remote interface
+ it will place that interface on a recovery queue. There is a
+ recovery queue for local interfaces and another for remote
+ interfaces. The interfaces on the recovery queues will be LNet
+ PINGed every <literal>lnet_recovery_interval</literal>. This value
+ defaults to <literal>1</literal> second. On every successful PING
+ the health value of the interface pinged will be incremented by
+ <literal>1</literal>.</para>
+ <para>Having this value configurable allows system administrators
+ to control the amount of control traffic on the network.</para>
+ <screen>lnetctl set recovery_interval: interval to ping unhealthy interfaces
+ >0 - timeout in seconds</screen>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><literal>lnet_transaction_timeout</literal></para>
+ </entry>
+ <entry>
+ <para>This timeout is somewhat of an overloaded value. It carries
+ the following functionality:</para>
+ <itemizedlist>
+ <listitem>
+ <para>A message is abandoned if it has not been sent successfully
+ when the <literal>lnet_transaction_timeout</literal> expires, even
+ if the <literal>retry_count</literal> has not been reached.</para>
+ </listitem>
+ <listitem>
+ <para>A GET, or a PUT that expects an ACK, expires if the REPLY
+ or the ACK, respectively, is not received within the
+ <literal>lnet_transaction_timeout</literal>.</para>
+ </listitem>
+ </itemizedlist>
+ <para>This value defaults to 30 seconds.</para>
+ <screen>lnetctl set transaction_timeout: Message/Response timeout
+ >0 - timeout in seconds</screen>
+ <note><para>The LND timeout will now be a fraction of the
+ <literal>lnet_transaction_timeout</literal> as described in the
+ next section.</para>
+ <para>This means that in networks where very large delays are
+ expected, it will be necessary to increase this value
+ accordingly.</para></note>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><literal>lnet_retry_count</literal></para>
+ </entry>
+ <entry>
+ <para>When LNet detects a failure that it deems appropriate for
+ resending a message, it checks whether the message has already
+ reached the maximum <literal>retry_count</literal> specified. If it
+ has, and the message still was not sent successfully, a failure
+ event is passed up to the layer that initiated the send.</para>
+ <para>Since the message retry interval
+ (<literal>lnet_lnd_timeout</literal>) is computed from
+ <literal>lnet_transaction_timeout / lnet_retry_count</literal>,
+ the <literal>lnet_retry_count</literal> should be kept low enough
+ that the retry interval is not shorter than the round-trip message
+ delay in the network. A <literal>lnet_retry_count</literal> of 5
+ is reasonable for a
+ <literal>lnet_transaction_timeout</literal> of 50 seconds, which
+ gives a retry interval of 10 seconds.</para>
+ <screen>lnetctl set retry_count: number of retries
+ 0 - turn off retries
+ >0 - number of retries, cannot be more than <literal>lnet_transaction_timeout</literal></screen>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><literal>lnet_lnd_timeout</literal></para>
+ </entry>
+ <entry>
+ <para>This is not a configurable parameter. It is derived from
+ two configurable parameters:
+ <literal>lnet_transaction_timeout</literal> and
+ <literal>retry_count</literal>.</para>
+ <screen>lnet_lnd_timeout = lnet_transaction_timeout / retry_count
+ </screen>
+ <para>As such there is a restriction that
+ <literal>lnet_transaction_timeout >= retry_count</literal>
+ </para>
+ <para>The core assumption here is that in a healthy network,
+ sending and receiving LNet messages should not have large delays.
+ There could be large delays with RPC messages and their responses,
+ but that's handled at the PtlRPC layer.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
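+ <para>The arithmetic behind the parameters above can be worked through
+ with a short Python sketch. The function names are hypothetical. The
+ sketch assumes that an interface starts at the maximum health value of
+ 1000, that every recovery ping succeeds, and that no new failures
+ arrive during recovery. It uses plain division for the LND timeout and
+ does not model the case where retries are turned off.</para>

```python
import math

LNET_MAX_HEALTH_VALUE = 1000


def failures_until_zero(sensitivity):
    """Consecutive failures that take a fully healthy interface to 0."""
    if sensitivity == 0:
        return None  # health evaluation is off; the value never drops
    return math.ceil(LNET_MAX_HEALTH_VALUE / sensitivity)


def recovery_seconds(sensitivity, recovery_interval):
    """Seconds needed to undo one failure at +1 health per ping."""
    return sensitivity * recovery_interval


def lnd_timeout(transaction_timeout, retry_count):
    """lnet_lnd_timeout = lnet_transaction_timeout / retry_count."""
    if retry_count == 0:
        raise ValueError("retries are off; not modeled by this sketch")
    if retry_count > transaction_timeout:
        raise ValueError("transaction_timeout must be >= retry_count")
    return transaction_timeout / retry_count


print(failures_until_zero(100))   # 10 consecutive failures
print(recovery_seconds(100, 1))   # 100 seconds to undo one failure
print(lnd_timeout(50, 5))         # 10.0 seconds between retries
```

+ <para>The second figure shows why a larger
+ <literal>lnet_health_sensitivity</literal> lengthens recovery: each
+ failure costs that many successful pings to repair.</para>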
+ </section>
+ <section xml:id="dbdoclet.mrhealthdisplay">
+ <title><indexterm><primary>MR</primary>
+ <secondary>mrhealth</secondary>
+ <tertiary>display</tertiary>
+ </indexterm>Displaying Information</title>
+ <section xml:id="dbdoclet.mrhealthdisplayhealth">
+ <title>Showing LNet Health Configuration Settings</title>
+ <para><literal>lnetctl</literal> can be used to show all the LNet health
+ configuration settings using the <literal>lnetctl global show</literal>
+ command.</para>
+ <screen>#> lnetctl global show
+ global:
+ numa_range: 0
+ max_intf: 200
+ discovery: 1
+ retry_count: 3
+ transaction_timeout: 10
+ health_sensitivity: 100
+ recovery_interval: 1</screen>
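+ <para>Because the output consists of simple
+ <literal>key: value</literal> lines, a script can read the settings
+ back and check them. The Python sketch below parses the sample output
+ shown above from a string. The <literal>parse_global_show</literal>
+ helper is a hypothetical name, and the parser assumes that every
+ setting is an integer on its own line, which holds for this sample but
+ is not guaranteed for other output.</para>

```python
SAMPLE = """\
global:
numa_range: 0
max_intf: 200
discovery: 1
retry_count: 3
transaction_timeout: 10
health_sensitivity: 100
recovery_interval: 1
"""


def parse_global_show(text):
    """Collect 'key: value' lines into a dict of integers.

    Lines without a value, such as the 'global:' header, are skipped.
    Leading whitespace is ignored, so indented output also parses.
    """
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value:
            settings[key.strip()] = int(value)
    return settings


settings = parse_global_show(SAMPLE)
# Restriction from the parameter table: transaction_timeout >= retry_count
ok = settings["transaction_timeout"] >= settings["retry_count"]
print(settings["health_sensitivity"], ok)
```

+ <para>In practice the string would come from running
+ <literal>lnetctl global show</literal> on the node being checked.</para>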