From: Richard Henwood
Date: Tue, 30 Oct 2012 19:58:08 +0000 (-0500)
Subject: LUDOC-97 tuning: SMP node affinity parameters are now documented.
X-Git-Tag: 2.3.0^0
X-Git-Url: https://git.whamcloud.com/?a=commitdiff_plain;h=9aa9c23a6a77d71b6bc7133dace5b9446660fd95;p=doc%2Fmanual.git

LUDOC-97 tuning: SMP node affinity parameters are now documented.

SMP Node Affinity arrived with Lustre 2.3 (LU-56). The default values for SMP machines are judged to be suitable for most workloads. If an administrator does want to tinker, this change records the available parameters. Entries added to the glossary for NUMA and Node Affinity.

Signed-off-by: Richard Henwood
Signed-off-by: Liang Zhen
Change-Id: Ib5dd25431046d2842cbde77f0b6fede47a790da7
Reviewed-on: http://review.whamcloud.com/4413
Tested-by: Hudson
Reviewed-by: Doug Oucharek
Reviewed-by: Liang Zhen
Reviewed-by: Cliff White
---
diff --git a/Glossary.xml b/Glossary.xml
index f1c670a..b2f92d4 100644
--- a/Glossary.xml
+++ b/Glossary.xml
@@ -443,6 +443,20 @@ A subset of the LNET RPC module that implements a library for sending large network requests, moving buffers with RDMA. + + Node Affinity + + + Node Affinity describes the property of a multi-threaded application to behave sensibly on multiple cores. Without this property, an operating system scheduler may move application threads across processors in a sub-optimal way that significantly reduces the overall performance of the application. + + + + NUMA + + + Non-Uniform Memory Access describes a multiprocessing architecture in which the time taken to access a given region of memory differs depending on its location relative to a given processor. Typically, machines with multiple sockets are NUMA architectures. + + O
diff --git a/LustreTuning.xml b/LustreTuning.xml
index fa7e290..f37d5b0 100644
--- a/LustreTuning.xml
+++ b/LustreTuning.xml
@@ -7,9 +7,18 @@ + + + + + + + + + @@ -55,15 +64,25 @@ If there are too many threads, the latency for individual I/O requests can become very high and should be avoided. Set the desired maximum thread count permanently using the method described above. -
+
<indexterm><primary>tuning</primary><secondary>OSS threads</secondary></indexterm>Specifying the OSS Service Thread Count The oss_num_threads parameter enables the number of OST service threads to be specified at module load time on the OSS nodes: options ost oss_num_threads={N} After startup, the minimum and maximum number of OSS service threads can be set via the {service}.thread_{min,max,started} tunable. To change the tunable at runtime, run: lctl {get,set}_param {service}.thread_{min,max,started} - For details, see . + Lustre 2.3 introduced binding service threads to CPU partitions. This works in a similar fashion to the binding of threads on the MDS. MDS thread tuning is covered in . + + + oss_cpts=[EXPRESSION] binds the default OSS service on the CPTs defined by [EXPRESSION]. + + + oss_io_cpts=[EXPRESSION] binds the OSS I/O service on the CPTs defined by [EXPRESSION]. + + + + For further details, see .
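For example, the following module configuration line binds the default OSS service to the first two CPTs and the OSS I/O service to the remaining two CPTs on a four-CPT server (a sketch only; the thread count and CPT numbers are illustrative and should be chosen to match the actual hardware):
options ost oss_num_threads=64 oss_cpts=[0,1] oss_io_cpts=[2,3]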
-
+
<indexterm><primary>tuning</primary><secondary>MDS threads</secondary></indexterm>Specifying the MDS Service Thread Count The mds_num_threads parameter enables the number of MDS service threads to be specified at module load time on the MDS node: options mds mds_num_threads={N} @@ -74,13 +93,37 @@ The OSS and MDS automatically start new service threads dynamically, in response to server load, within a factor of 4. The default value is calculated the same way as before. Setting the _mu_threads module parameter disables automatic thread creation behavior. + Lustre 2.3 introduced new parameters to provide more control to administrators. + + + mds_rdpg_num_threads controls the number of threads that provide the read page service. The read page service handles file close and readdir operations. + + + mds_attr_num_threads controls the number of threads that provide the setattr service to 1.8 clients. + + + Default values for the thread counts are automatically selected. The values are chosen to best exploit the number of CPUs present in the system and to provide the best overall performance for typical workloads.
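For example, all three thread counts can be set at module load time on one configuration line (a sketch only; the values are illustrative and the options syntax follows the mds example shown above):
options mds mds_num_threads=32 mds_rdpg_num_threads=16 mds_attr_num_threads=8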
+
<indexterm><primary>tuning</primary><secondary>MDS binding</secondary></indexterm>Binding MDS Service Threads to CPU Partitions + With the introduction of Node Affinity () in Lustre 2.3, MDS threads can be bound to particular CPU Partitions (CPTs). Default values for the bindings are selected automatically to provide good overall performance for a given CPU count. However, an administrator can deviate from these settings if they choose. + + + mds_num_cpts=[EXPRESSION] binds the default MDS service threads to the CPTs defined by EXPRESSION. For example, mds_num_cpts=[0-3] will bind the MDS service threads to CPT[0,1,2,3]. + + + mds_rdpg_num_cpts=[EXPRESSION] binds the read page service threads to the CPTs defined by EXPRESSION. The read page service handles file close and readdir requests. For example, mds_rdpg_num_cpts=[4] will bind the read page threads to CPT4. + + + mds_attr_num_cpts=[EXPRESSION] binds the setattr service threads to the CPTs defined by EXPRESSION. + + +
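As a sketch, the three bindings above could be combined on a single configuration line (the CPT numbers are illustrative only, and the mds module options syntax is assumed to match the examples used elsewhere in this chapter):
options mds mds_num_cpts=[0-3] mds_rdpg_num_cpts=[4] mds_attr_num_cpts=[5]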
<indexterm><primary>LNET</primary><secondary>tuning</secondary> - </indexterm><indexterm><primary>tuning</primary><secondary>LNET</secondary></indexterm> - Tuning LNET Parameters + tuningLNETTuning LNET Parameters This section describes LNET tunables that may be necessary on some systems to improve performance. To test the performance of your Lustre network, see Chapter 23: Testing Lustre Network Performance (LNET Self-Test).
Transmit and Receive Buffer Size @@ -98,6 +141,75 @@ options ksocklnd enable_irq_affinity=0 By default, this parameter is off. As always, you should test to compare the performance impact of changing this parameter.
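To make the setting persistent, the same option line can be placed in a Lustre module configuration file, for example (the file path is illustrative and distribution-dependent):
# /etc/modprobe.d/lustre.conf (example path)
options ksocklnd enable_irq_affinity=0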
+
<indexterm><primary>tuning</primary><secondary>Network interface binding</secondary></indexterm>Binding Network Interfaces to CPU Partitions + Lustre 2.3 and beyond provide enhanced network interface control. The enhancement means that an administrator can bind an interface to one or more CPU partitions. Bindings are specified as options to the lnet module. For more information on specifying module options, see . + For example, o2ib0(ib0)[0,1] will ensure that all messages for o2ib0 are handled by LND threads executing on CPT0 and CPT1. An additional example might be: tcp1(eth0)[0]. Messages for tcp1 are handled by threads on CPT0. +
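As a sketch (the network and interface names are illustrative), both bindings above could be expressed in a single networks option on the lnet module:
options lnet networks="o2ib0(ib0)[0,1],tcp1(eth0)[0]"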
+
<indexterm><primary>tuning</primary><secondary>Network interface credits</secondary></indexterm>Network Interface Credits + Network interface (NI) credits are shared across all CPU partitions (CPTs). For example, if a machine has 4 CPTs and the NI credit count is 512, then each partition has 128 credits. If a large number of CPTs exist on the system, LNet will check and validate the NI credit count for each CPT to ensure each CPT has a workable number of credits. For example, if a machine has 16 CPTs and the NI credit count is set to 256, then each partition has only 16 credits. Sixteen NI credits is low and could negatively impact performance. As a result, LNet will automatically adjust the value to 8*peer_credits (peer_credits is 8 by default), so each partition still has 64 credits. + An administrator can modify the NI credit count for ksocklnd or ko2iblnd. For example: + ksocklnd credits=256 + applies 256 credits to TCP connections. Applying 256 credits to IB connections can be achieved with: + ko2iblnd credits=256 + In Lustre 2.3 and beyond, LNet may revalidate the NI credit count, so the administrator's setting may not persist. +
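Expressed as module configuration lines (a sketch using the credit values from the example above), the settings would be:
options ksocklnd credits=256
options ko2iblnd credits=256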
+
<indexterm><primary>tuning</primary><secondary>router buffers</secondary></indexterm>Router Buffers + Router buffers are shared by all CPU partitions. For a machine with a large number of CPTs, the number of router buffers may need to be specified manually for best performance. A low number of router buffers risks starving the CPU partitions of resources. + The default setting for router buffers will typically perform well. LNet automatically sets a default value to reduce the likelihood of resource starvation. + An administrator may modify the router buffer count using the large_router_buffers parameter. For example: + lnet large_router_buffers=8192 + In Lustre 2.3 and beyond, LNet may revalidate the router buffer setting, so the administrator's setting may not persist. +
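Expressed as a persistent module option (a sketch using the value from the example above):
options lnet large_router_buffers=8192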
+
<indexterm><primary>tuning</primary><secondary>portal round-robin</secondary></indexterm>Portal Round-Robin + Portal round-robin defines the policy LNet applies when delivering events and messages to the upper layers. The upper layers are the ptlrpc service and LNet self-test. + If portal round-robin is disabled, LNet will deliver messages to CPTs based on a hash of the source NID. Hence, all messages from a specific peer will be handled by the same CPT. This can reduce data traffic between CPUs. However, for some workloads, this behavior may result in poorly balanced load across CPUs. + If portal round-robin is enabled, LNet will round-robin incoming events across all CPTs. This may balance load better across CPUs but can incur cross-CPU overhead. + The current policy can be changed by an administrator with echo <VALUE> > /proc/sys/lnet/portal_rotor. There are four options for <VALUE>: + + + OFF + Disable portal round-robin on all incoming requests. + + + ON + Enable portal round-robin on all incoming requests. + + + RR_RT + Enable portal round-robin only for routed messages. + + + HASH_RT + Routed messages will be delivered to the upper layer by a hash of the source NID (instead of the NID of the router). This is the default value. + + +
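For example, to check the current policy and then enable round-robin only for routed messages, using the /proc interface described above:
cat /proc/sys/lnet/portal_rotor
echo RR_RT > /proc/sys/lnet/portal_rotor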
+
+
<indexterm><primary>tuning</primary><secondary>libcfs</secondary></indexterm>libcfs Tuning +By default, Lustre will automatically generate CPU partitions (CPTs) based on the number of CPUs in the system. The number of CPTs will be 1 if the number of online CPUs is less than five. + The number of CPTs can be set explicitly on the libcfs module using cpu_npartitions=NUMBER. The value of cpu_npartitions must be an integer between 1 and the number of online CPUs. + Setting cpu_npartitions to 1 will disable most of the SMP node affinity functionality.
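For example, to create four partitions explicitly (the value is illustrative and should reflect the node's CPU topology):
options libcfs cpu_npartitions=4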
CPU Partition String Patterns + CPU partitions can be described using string pattern notation. For example: + + + cpu_pattern="0[0,2,4,6] 1[1,3,5,7]" + Create two CPTs: CPT0 contains CPUs [0,2,4,6] and CPT1 contains CPUs [1,3,5,7]. + + cpu_pattern="N 0[0-3] 1[4-7]" + Create two CPTs: CPT0 contains all CPUs in NUMA nodes [0-3] and CPT1 contains all CPUs in NUMA nodes [4-7]. + + + The current CPU partition configuration can be read from /proc/sys/lnet/cpu_partitions
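As a sketch, the first pattern above could be passed as a libcfs module option and the resulting layout then inspected through the /proc file mentioned above:
options libcfs cpu_pattern="0[0,2,4,6] 1[1,3,5,7]"
cat /proc/sys/lnet/cpu_partitions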
+
+
<indexterm><primary>tuning</primary><secondary>LND tuning</secondary></indexterm>LND Tuning + LND tuning allows the number of threads per CPU partition to be specified. An administrator can set the thread count for both ko2iblnd and ksocklnd using the nscheds parameter. This adjusts the number of threads for each partition, not the overall number of threads on the LND. + Lustre 2.3 greatly decreased the default number of threads for ko2iblnd and ksocklnd on high core-count machines. The current default values are set automatically and are chosen to work well across a number of typical scenarios.
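For example, to run three threads per CPU partition for each LND (a sketch; the value is illustrative):
options ksocklnd nscheds=3
options ko2iblnd nscheds=3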
<indexterm><primary>tuning</primary><secondary>lockless I/O</secondary></indexterm>Lockless I/O Tunables