From 9bbd6eea1e237d9c0a734d88f4ec14812be6e236 Mon Sep 17 00:00:00 2001 From: Linda Bebernes Date: Thu, 31 Oct 2013 14:14:30 -0700 Subject: [PATCH] LUDOC-50 LNET: Added section about LNET Peer Health to Chapter 26 "Lustre Tuning" Added section 26.3.7 "LNET Peer Health". Added ID to Section 36.2.2 describing keepalive information so that I could create a link from LNET peer health content to that information. Signed-off-by: Linda Bebernes Change-Id: If1cd51125df8c01424b1a11bbf1aeff5959b1f40 Reviewed-on: http://review.whamcloud.com/8127 Tested-by: Hudson Reviewed-by: Doug Oucharek Reviewed-by: Richard Henwood --- ConfigurationFilesModuleParameters.xml | 163 +++++++++++++++++++++++---------- LustreTuning.xml | 65 ++++++++++++- 2 files changed, 178 insertions(+), 50 deletions(-) diff --git a/ConfigurationFilesModuleParameters.xml b/ConfigurationFilesModuleParameters.xml index 0c68ebe..1093a14 100644 --- a/ConfigurationFilesModuleParameters.xml +++ b/ConfigurationFilesModuleParameters.xml @@ -267,12 +267,27 @@ forwarding ("") rnet_htable_size is an integer that indicates how many remote networks the internal LNet hash table is configured to handle. rnet_htable_size is used for optimizing the hash table size and does not put a limit on how many remote networks you can have. The default hash table size when this parameter is not specified is: 128. -
- <indexterm><primary>configuring</primary><secondary>network</secondary><tertiary>SOCKLND</tertiary></indexterm> -<literal>SOCKLND</literal> Kernel TCP/IP LND - The SOCKLND kernel TCP/IP LND (socklnd) is connection-based and uses the acceptor to establish communications via sockets with its peers. - It supports multiple instances and load balances dynamically over multiple interfaces. If no interfaces are specified by the ip2nets or networks module parameter, all non-loopback IP interfaces are used. The address-within-network is determined by the address of the first IP interface an instance of the socklnd encounters. - Consider a node on the 'edge' of an InfiniBand network, with a low-bandwidth management Ethernet (eth0), IP over IB configured (ipoib0), and a pair of GigE NICs (eth1,eth2) providing off-cluster connectivity. This node should be configured with 'networks=vib,tcp(eth1,eth2)' to ensure that the socklnd ignores the management Ethernet and IPoIB. +
+ <indexterm> + <primary>configuring</primary> + <secondary>network</secondary> + <tertiary>SOCKLND</tertiary> + </indexterm> + <literal>SOCKLND</literal> Kernel TCP/IP LND + The SOCKLND kernel TCP/IP LND (socklnd) is + connection-based and uses the acceptor to establish communications via sockets with its + peers. + It supports multiple instances and load balances dynamically over multiple interfaces. + If no interfaces are specified by the ip2nets or networks module + parameter, all non-loopback IP interfaces are used. The address-within-network is determined + by the address of the first IP interface an instance of the socklnd + encounters. + Consider a node on the 'edge' of an InfiniBand network, with a low-bandwidth + management Ethernet (eth0), IP over IB configured + (ipoib0), and a pair of GigE NICs + (eth1,eth2) providing off-cluster connectivity. This + node should be configured with 'networks=vib,tcp(eth1,eth2)' to + ensure that the socklnd ignores the management Ethernet and IPoIB. @@ -290,17 +305,22 @@ forwarding ("") - timeout - (50,W) + + timeout + + (50,W) - Time (in seconds) that communications may be stalled before the LND completes them with failure. + Time (in seconds) that communications may be stalled before the LND completes + them with failure. - nconnds - (4) + + nconnds + + (4) Sets the number of connection daemons. - min_reconnectms - (1000,W) + + min_reconnectms + + (1000,W) - Minimum connection retry interval (in milliseconds). After a failed connection attempt, this is the time that must elapse before the first retry. As connections attempts fail, this time is doubled on each successive retry up to a maximum of 'max_reconnectms'. + Minimum connection retry interval (in milliseconds). After a failed connection + attempt, this is the time that must elapse before the first retry. As connection + attempts fail, this time is doubled on each successive retry up to a maximum of + 'max_reconnectms'.
- max_reconnectms - (6000,W) + + max_reconnectms + + (6000,W) Maximum connection retry interval (in milliseconds). @@ -326,27 +353,38 @@ forwarding ("") - eager_ack - (0 on linux, - 1 on darwin,W) + + eager_ack + + (0 on linux, + + 1 on darwin,W) - Boolean that determines whether the socklnd should attempt to flush sends on message boundaries. + Boolean that determines whether the socklnd should attempt + to flush sends on message boundaries. - typed_conns - (1,Wc) + + typed_conns + + (1,Wc) - Boolean that determines whether the socklnd should use different sockets for different types of messages. When clear, all communication with a particular peer takes place on the same socket. Otherwise, separate sockets are used for bulk sends, bulk receives and everything else. + Boolean that determines whether the socklnd should use + different sockets for different types of messages. When clear, all communication + with a particular peer takes place on the same socket. Otherwise, separate sockets + are used for bulk sends, bulk receives and everything else. - min_bulk - (1024,W) + + min_bulk + + (1024,W) Determines when a message is considered "bulk". @@ -354,69 +392,98 @@ forwarding ("") - tx_buffer_size, rx_buffer_size - (8388608,Wc) + + tx_buffer_size, rx_buffer_size + + (8388608,Wc) - Socket buffer sizes. Setting this option to zero (0), allows the system to auto-tune buffer sizes. + Socket buffer sizes. Setting this option to zero (0) allows the system to + auto-tune buffer sizes. - Be very careful changing this value as improper sizing can harm performance. + Be very careful changing this value as improper sizing can harm + performance. - nagle - (0,Wc) + + nagle + + (0,Wc) - Boolean that determines if nagle should be enabled. It should never be set in production systems. + Boolean that determines if nagle should be enabled. It + should never be set in production systems.
- keepalive_idle - (30,Wc) + + keepalive_idle + + (30,Wc) - Time (in seconds) that a socket can remain idle before a keepalive probe is sent. Setting this value to zero (0) disables keepalives. + Time (in seconds) that a socket can remain idle before a keepalive probe is + sent. Setting this value to zero (0) disables keepalives. - keepalive_intvl - (2,Wc) + + keepalive_intvl + + (2,Wc) - Time (in seconds) to repeat unanswered keepalive probes. Setting this value to zero (0) disables keepalives. + Time (in seconds) to repeat unanswered keepalive probes. Setting this value to + zero (0) disables keepalives. - keepalive_count - (10,Wc) + + keepalive_count + + (10,Wc) - Number of unanswered keepalive probes before pronouncing socket (hence peer) death. + Number of unanswered keepalive probes before pronouncing socket (hence peer) + death. - enable_irq_affinity - (0,Wc) + + enable_irq_affinity + + (0,Wc) - Boolean that determines whether to enable IRQ affinity. The default is zero (0). - When set, socklnd attempts to maximize performance by handling device interrupts and data movement for particular (hardware) interfaces on particular CPUs. This option is not available on all platforms. This option requires an SMP system to exist and produces best performance with multiple NICs. Systems with multiple CPUs and a single NIC may see increase in the performance with this parameter disabled. + Boolean that determines whether to enable IRQ affinity. The default is zero + (0). + When set, socklnd attempts to maximize performance by + handling device interrupts and data movement for particular (hardware) interfaces + on particular CPUs. This option is not available on all platforms. This option + requires an SMP system and produces best performance with multiple NICs. + Systems with multiple CPUs and a single NIC may see an increase in performance + with this parameter disabled.
- zc_min_frag - (2048,W) + + zc_min_frag + + (2048,W) - Determines the minimum message fragment that should be considered for zero-copy sends. Increasing it above the platform's PAGE_SIZE disables all zero copy sends. This option is not available on all platforms. + Determines the minimum message fragment that should be considered for + zero-copy sends. Increasing it above the platform's PAGE_SIZE + disables all zero copy sends. This option is not available on all + platforms. diff --git a/LustreTuning.xml b/LustreTuning.xml index 79ab848..40d7fbd 100644 --- a/LustreTuning.xml +++ b/LustreTuning.xml @@ -1,5 +1,4 @@ - - + Lustre Tuning This chapter contains information about tuning Lustre for better performance and includes the following sections: @@ -246,6 +245,68 @@
+
+ LNET Peer Health + Two options are available to help determine peer health: + + peer_timeout - The timeout (in seconds) before an aliveness + query is sent to a peer. For example, if peer_timeout is set to + 180, an aliveness query is sent to the peer every 180 seconds. + This feature only takes effect if the node is configured as an LNET router. + In a routed environment, the peer_timeout feature should always + be on (set to a value in seconds) on routers. If the router checker has been enabled, + the feature should be turned off by setting it to 0 on clients and servers. + For a non-routed scenario, enabling the peer_timeout option + provides health information, such as whether a peer is alive. For example, a + client is able to determine if an MGS or OST is up when it sends a message to it. If a + response is received, the peer is alive; otherwise, the request + times out. + In general, peer_timeout should be set to no less than the LND + timeout setting. For more information about LND timeouts, see . + When the o2iblnd (IB) driver is used, + peer_timeout should be at least twice the value of the + ko2iblnd keepalive option. For more information about keepalive + options, see . + + + avoid_asym_router_failure - When set to 1, the router checker + running on a client or a server periodically pings all the routers corresponding to + the NIDs identified in the routes parameter setting on the node to determine the + status of each router interface. The default setting is 1. (For more information about + the LNET routes parameter, see .) + A router is considered down if any of its NIDs are down. For example, router X has + three NIDs: Xnid1, Xnid2, and + Xnid3. A client is connected to the router via + Xnid1. The client has the router checker enabled. The router checker + periodically sends a ping to the router via Xnid1. The router + responds to the ping with the status of each of its NIDs.
In this case, it responds + with Xnid1=up, Xnid2=up, + Xnid3=down. If avoid_asym_router_failure==1, + the router is considered down if any of its NIDs are down, so router X is considered + down and will not be used for routing messages. If + avoid_asym_router_failure==0, router X will continue to be used + for routing messages. + + + The following router checker parameters must be set to the maximum value of the + corresponding setting for this option on any client or server: + + dead_router_check_interval + + + + live_router_check_interval + + + router_ping_timeout + + + For example, the dead_router_check_interval parameter on any router + must be set to the maximum dead_router_check_interval value + configured on any client or server. +
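The parameters discussed in this section are standard lnet module parameters, so the new section could usefully close with a worked configuration. The following sketch shows one way they might be set in a modprobe configuration file; the values are illustrative assumptions only, not recommendations from this patch, and should be derived from the LND timeout rules described above:

```shell
# Illustrative /etc/modprobe.d/lustre.conf entries -- example values only.

# On an LNET router: enable peer health, sending an aliveness
# query to each peer every 180 seconds.
options lnet peer_timeout=180

# On clients and servers that rely on the router checker instead,
# turn peer_timeout off and let the router checker track router state:
#options lnet peer_timeout=0 avoid_asym_router_failure=1

# Router checker intervals (in seconds). On a router, each value must be
# at least the maximum of the corresponding setting on any client or server:
#options lnet dead_router_check_interval=60 live_router_check_interval=60 router_ping_timeout=50
```

When the o2iblnd (IB) driver is used, the peer_timeout value chosen here should also be at least twice the ko2iblnd keepalive option, per the guidance above.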
<indexterm><primary>tuning</primary><secondary>libcfs</secondary></indexterm>libcfs Tuning -- 1.8.3.1