From 1c600ce8d3a2dec1134376536470736e35a855cf Mon Sep 17 00:00:00 2001 From: Joseph Gmitter Date: Thu, 24 Oct 2019 22:33:01 -0400 Subject: [PATCH] LUDOC-441 lnet: Add Multi-Rail Routing Documentation This patch adds the feature documentation for the LNet Health based routing work landed in LU-11297. Signed-off-by: Joseph Gmitter Change-Id: Id41cf9b16b142a0e6fb797b560a3a553714ff1fd Reviewed-on: https://review.whamcloud.com/36573 Tested-by: jenkins --- LNetMultiRail.xml | 202 +++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 200 insertions(+), 2 deletions(-) diff --git a/LNetMultiRail.xml b/LNetMultiRail.xml index 1cfbda5..a46a9bf 100644 --- a/LNetMultiRail.xml +++ b/LNetMultiRail.xml @@ -7,6 +7,7 @@ + @@ -234,8 +235,20 @@ peer: <indexterm><primary>MR</primary> <secondary>mrrouting</secondary> </indexterm>Notes on routing with Multi-Rail - Multi-Rail configuration can be applied on the Router to aggregate - the interfaces performance. + This section details how to configure Multi-Rail with the routing + feature before the Multi-Rail Routing with LNet Health feature landed in + Lustre 2.13. Routing code has always monitored the state of routes in + order to avoid using unavailable ones. + This section describes how you can configure multiple interfaces on + the same gateway node as different routes. This uses the existing route + monitoring algorithm to guard against interfaces going down. With the + Multi-Rail Routing with LNet Health feature introduced in Lustre 2.13, the + new algorithm uses the LNet Health feature to + monitor the different interfaces of the gateway and always ensures that the + healthiest interface is used. Therefore, the configuration described in this + section applies to releases prior to Lustre 2.13. It still works in + 2.13; however, it is no longer required, for the reason given above. +
<indexterm><primary>MR</primary> <secondary>mrrouting</secondary> @@ -348,6 +361,191 @@ lnetctl route add --net o2ib0 --gateway <rtrX-nidB>@o2ib1</screen> This appears to be a common cluster upgrade scenario.</para> </section> </section> + <section xml:id="mrrouting.health" condition="l2D"> + <title><indexterm><primary>MR</primary> + <secondary>mrroutinghealth</secondary> + </indexterm>Multi-Rail Routing with LNet Health + This section details how routing and pertinent module parameters can + be configured beginning with Lustre 2.13. + Multi-Rail with Dynamic Discovery allows LNet to discover and use all + configured interfaces of a node. It references a node via its primary NID. + Multi-Rail routing carries forward this concept to the routing + infrastructure. The following changes are brought in with the Lustre 2.13 + release: + + Configuring a different route per gateway interface is no + longer needed. One route per gateway should be configured. Gateway + interfaces are used according to the Multi-Rail selection criteria. + + Routing now relies on LNet Health + to keep track of route aliveness. + Router interfaces are monitored via LNet Health. + If an interface fails, other interfaces will be used. + Routing uses LNet discovery to discover gateways at + regular intervals. + A gateway pushes its list of interfaces upon the discovery + of any changes in its interfaces' state. + +
+ <indexterm><primary>MR</primary> + <secondary>mrrouting</secondary> + <tertiary>routinghealth_config</tertiary> + </indexterm>Configuration +
+ Configuring Routes + A gateway can have multiple interfaces on the same or different + networks. The peers using the gateway can reach it on one or + more of its interfaces. Multi-Rail routing takes care of managing which + interface to use. + lnetctl route add --net <remote network> --gateway <NID for the gateway> + --hops <number of hops> --priority <route priority> +
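As an illustration of the one-route-per-gateway rule, the following minimal Python sketch assembles the argument list for the lnetctl route add command shown above. Only the --net, --gateway, --hops and --priority options from the text are used. The helper name route_add_command and the network name and NID in the usage note are hypothetical, and the sketch only builds the command without running it.

```python
def route_add_command(net, gateway, hops=None, priority=None):
    """Build the argument list for one 'lnetctl route add' call.

    net      -- the remote network reached through the gateway
    gateway  -- a NID for the gateway (one route per gateway is enough)
    hops     -- optional number of hops
    priority -- optional route priority
    """
    cmd = ["lnetctl", "route", "add", "--net", net, "--gateway", gateway]
    if hops is not None:
        cmd += ["--hops", str(hops)]
    if priority is not None:
        cmd += ["--priority", str(priority)]
    return cmd
```

For example, route_add_command("o2ib1", "10.10.10.1@tcp", hops=1) returns a list that could be passed to subprocess.run on a node where lnetctl is installed. Both values are placeholders, not taken from a real configuration.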
+
+ Configuring Module Parameters + + Configuring Module Parameters + + + + + + + + Module Parameter + + + + + Usage + + + + + + + + check_routers_before_use + + + Defaults to 0. If set to + 1, all routers must be up before the system + can proceed. + + + + + avoid_asym_router_failure + + + Defaults to 1. If set to + 1, a route will be considered up if and only + if the gateway has at least one healthy interface on both the local and + the remote network. + + + + + alive_router_check_interval + + + Defaults to 60 seconds. The gateways + will be discovered every + alive_router_check_interval. If the gateway + can be reached on multiple networks, the interval per network is + alive_router_check_interval / number of + networks. + + + + + router_ping_timeout + + + Defaults to 50 seconds. A gateway sets + its interface down if it has not received any traffic for + router_ping_timeout + alive_router_check_interval. + + + + + + + router_sensitivity_percentage + + + Defaults to 100. This parameter defines + how sensitive a gateway interface is to failure. If set to 100, + then any gateway interface failure will contribute to all routes + using it going down. The lower the value, the more tolerant to + failures the system becomes. + + + + +
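The two timing rules in the table can be worked through with a small Python sketch. It only restates the arithmetic given for alive_router_check_interval and router_ping_timeout. The function names are hypothetical and are not part of LNet.

```python
ALIVE_ROUTER_CHECK_INTERVAL = 60  # seconds, default from the table
ROUTER_PING_TIMEOUT = 50          # seconds, default from the table


def per_network_check_interval(alive_router_check_interval, num_networks):
    """Discovery interval per network for a gateway reachable on several networks."""
    if num_networks < 1:
        raise ValueError("a gateway must be reachable on at least one network")
    return alive_router_check_interval / num_networks


def interface_down_after(router_ping_timeout, alive_router_check_interval):
    """Seconds without traffic after which a gateway sets its interface down."""
    return router_ping_timeout + alive_router_check_interval
```

With the default values, a gateway reachable on two networks is discovered every 60 / 2 = 30 seconds per network, and a gateway interface is set down after 50 + 60 = 110 seconds without traffic.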
+
+
+
+ <indexterm><primary>MR</primary> + <secondary>mrrouting</secondary> + <tertiary>routinghealth_routerhealth</tertiary> + </indexterm>Router Health + The routing infrastructure now relies on LNet Health to keep track + of interface health. Each gateway interface has a health value + associated with it. If a send to one of these interfaces fails, then the + interface's health value is decremented and the interface is placed on a recovery queue. + The unhealthy interface is then pinged every + lnet_recovery_interval. This value defaults to + 1 second. + If the peer receives a message from the gateway, then it immediately + assumes that the gateway's interface is up and resets its health value to + maximum. This is needed to ensure we start using the gateways immediately + instead of holding off until the interface is back to full health. +
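The health bookkeeping described in this section can be pictured with the following minimal Python sketch. It is a model of the behaviour only and is not LNet code. The class names, the maximum health value and the size of the decrement are assumptions chosen for illustration, and the periodic recovery ping is left out.

```python
from dataclasses import dataclass, field

MAX_HEALTH = 1000       # assumption: illustrative maximum health value
HEALTH_DECREMENT = 100  # assumption: illustrative penalty per failed send


@dataclass
class GatewayInterface:
    nid: str
    health: int = MAX_HEALTH


@dataclass
class HealthTracker:
    # Interfaces waiting to be pinged every lnet_recovery_interval.
    recovery_queue: list = field(default_factory=list)

    def send_failed(self, iface):
        """A send to this interface failed: lower its health and queue it."""
        iface.health = max(0, iface.health - HEALTH_DECREMENT)
        if iface not in self.recovery_queue:
            self.recovery_queue.append(iface)

    def message_received(self, iface):
        """Any message from the gateway restores the interface to full health."""
        iface.health = MAX_HEALTH
        if iface in self.recovery_queue:
            self.recovery_queue.remove(iface)
```

The point of message_received is the shortcut described above: the interface does not have to climb back to full health step by step before it is used again.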
+
+ <indexterm><primary>MR</primary> + <secondary>mrrouting</secondary> + <tertiary>routinghealth_discovery</tertiary> + </indexterm>Discovery + LNet Discovery is used in place of pinging the peers. This serves + two purposes: + + The discovery communication infrastructure does not need + to be duplicated for the routing feature. + It allows propagation of the gateway's interface state + changes to the peers using the gateway. + + For (2), if an interface changes state from UP to + DOWN or vice versa, then a discovery + PUSH is sent to all the peers which can be reached. + This allows peers to adapt to changes more quickly. + Discovery is designed to be backwards compatible. The discovery + protocol is composed of a GET and a + PUT. The GET requests interface + information from the peer; this is a basic LNet ping. The peer responds + with its interface information and a feature bit. If the peer is + multi-rail capable and discovery is turned on, then the node will + PUSH its interface information. As a result both peers + will be aware of each other's interfaces. + This information is then used by the peers to decide, based on the + interface state provided by the gateway, whether the route is alive or + not. +
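The GET and PUSH exchange described above can be sketched as a small Python model. This is an illustration under stated assumptions and not the LNet wire protocol. The Node class, its field names and the representation of the feature bit as a single boolean are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    interfaces: list
    multi_rail: bool = True         # assumption: the feature bit as one boolean
    discovery_enabled: bool = True
    known_peers: dict = field(default_factory=dict)

    def handle_get(self):
        """Answer a GET with the interface list and the feature bit."""
        return list(self.interfaces), self.multi_rail

    def handle_push(self, sender, interfaces):
        """Record the interface list pushed by another node."""
        self.known_peers[sender] = list(interfaces)


def discover(node, peer):
    """node discovers peer: a GET first, then a PUSH if the peer supports it."""
    interfaces, peer_is_multi_rail = peer.handle_get()
    node.known_peers[peer.name] = interfaces
    if peer_is_multi_rail and node.discovery_enabled:
        peer.handle_push(node.name, node.interfaces)
```

A peer that does not set the feature bit still answers the GET, which is how the exchange stays backwards compatible. In that case only the discovering node learns the other side's interfaces.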
+
+ <indexterm><primary>MR</primary> + <secondary>mrrouting</secondary> + <tertiary>routinghealth_aliveness</tertiary> + </indexterm>Route Aliveness Criteria + A route is considered alive if the following conditions hold: + + The gateway can be reached on the local net via at least + one path. + If avoid_asym_router_failure is + enabled then the remote network defined in the route must have at least + one healthy interface on the gateway. + +
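The two conditions above can be written as a single predicate. The following Python sketch is a direct restatement of the criteria and is not the LNet implementation. The function name and the representation of each path or interface as a boolean health flag are assumptions made for illustration.

```python
def route_is_alive(local_paths_healthy,
                   remote_net_interfaces_healthy,
                   avoid_asym_router_failure=True):
    """Apply the route aliveness criteria.

    local_paths_healthy           -- one flag per path to the gateway on the
                                     local net
    remote_net_interfaces_healthy -- one flag per gateway interface on the
                                     remote network defined in the route
    """
    # Condition 1: the gateway is reachable on the local net via some path.
    if not any(local_paths_healthy):
        return False
    # Condition 2: checked only when avoid_asym_router_failure is enabled.
    if avoid_asym_router_failure and not any(remote_net_interfaces_healthy):
        return False
    return True
```

With avoid_asym_router_failure disabled, a gateway whose remote-side interfaces are all down is still reported as alive as long as it can be reached locally.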
+
<indexterm><primary>MR</primary><secondary>health</secondary> </indexterm>LNet Health -- 1.8.3.1