LustreTuning.xml

   1 <?xml version='1.0' encoding='UTF-8'?><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="lustretuning">
   2   <title xml:id="lustretuning.title">Tuning a Lustre File System</title>
   3   <para>This chapter contains information about tuning a Lustre file system for better performance
   4     and includes the following sections:</para>
   5   <itemizedlist>
   6     <listitem>
   7       <para><xref linkend="dbdoclet.50438272_55226"/></para>
   8     </listitem>
   9     <listitem>
  10       <para><xref linkend="dbdoclet.mdstuning"/></para>
  11     </listitem>
  12     <listitem>
  13       <para><xref linkend="dbdoclet.50438272_73839"/></para>
  14     </listitem>
  15     <listitem>
  16       <para><xref linkend="dbdoclet.libcfstuning"/></para>
  17     </listitem>
  18     <listitem>
  19       <para><xref linkend="dbdoclet.lndtuning"/></para>
  20     </listitem>
  21     <listitem>
  22       <para><xref linkend="dbdoclet.nrstuning"/></para>
  23     </listitem>
  24     <listitem>
  25       <para><xref linkend="dbdoclet.50438272_25884"/></para>
  26     </listitem>
  27     <listitem>
  28       <para><xref linkend="dbdoclet.50438272_80545"/></para>
  29     </listitem>
  30     <listitem>
  31       <para><xref linkend="dbdoclet.50438272_45406"/></para>
  32     </listitem>
  33   </itemizedlist>
  34   <note>
  35     <para>Many options in the Lustre software are set by means of kernel module parameters. These
  36       parameters are contained in the <literal>/etc/modprobe.d/lustre.conf</literal> file.</para>
  37   </note>
  38   <section xml:id="dbdoclet.50438272_55226">
  39       <title>
  40           <indexterm><primary>tuning</primary></indexterm>
  41 <indexterm><primary>tuning</primary><secondary>service threads</secondary></indexterm>
  42           Optimizing the Number of Service Threads</title>
  43     <para>An OSS can have a minimum of two service threads and a maximum of 512 service threads. The
  44       number of service threads is a function of how much RAM and how many CPUs are on each OSS node
  45       (1 thread / 128MB * num_cpus). If the load on the OSS node is high, new service threads will
  46       be started in order to process more requests concurrently, up to 4x the initial number of
  47       threads (subject to the maximum of 512). For a 2GB 2-CPU system, the default thread count is
  48       32 and the maximum thread count is 128.</para>
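    <para>As a worked illustration of the sizing rule above (the 8 GB, 4-CPU node is a hypothetical example, not a recommendation):</para>
    <screen>2 GB RAM, 2 CPUs:  (2048 MB / 128 MB) * 2 =  32 initial threads;  4 *  32 =  128 maximum
8 GB RAM, 4 CPUs:  (8192 MB / 128 MB) * 4 = 256 initial threads;  4 * 256 = 1024, capped at 512</screen>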
  49     <para>Increasing the size of the thread pool may help when:</para>
  50     <itemizedlist>
  51       <listitem>
  52         <para>Several OSTs are exported from a single OSS</para>
  53       </listitem>
  54       <listitem>
  55         <para>Back-end storage is running synchronously</para>
  56       </listitem>
  57       <listitem>
  58         <para>I/O completions take excessive time due to slow storage</para>
  59       </listitem>
  60     </itemizedlist>
  61     <para>Decreasing the size of the thread pool may help if:</para>
  62     <itemizedlist>
  63       <listitem>
  64         <para>Clients are overwhelming the storage capacity</para>
  65       </listitem>
  66       <listitem>
  67         <para>There are lots of &quot;slow I/O&quot; or similar messages</para>
  68       </listitem>
  69     </itemizedlist>
    <para>Increasing the number of I/O threads allows the kernel and storage to aggregate many writes together for more efficient disk I/O. The OSS thread pool is shared, and each thread allocates approximately 1.5 MB (maximum RPC size + 0.5 MB) for internal I/O buffers.</para>
  71     <para>It is very important to consider memory consumption when increasing the thread pool size. Drives are only able to sustain a certain amount of parallel I/O activity before performance is degraded, due to the high number of seeks and the OST threads just waiting for I/O. In this situation, it may be advisable to decrease the load by decreasing the number of OST threads.</para>
    <para>Determining the optimum number of OST threads is a process of trial and error, and varies for each particular configuration. Variables include the number of OSTs on each OSS, number and speed of disks, RAID configuration, and available RAM. You may want to start with a number of OST threads equal to the number of actual disk spindles on the node. If you use RAID, subtract any dead spindles not used for actual data (e.g., 1 of N spindles for RAID5, 2 of N spindles for RAID6), and monitor the performance of clients during usual workloads. If performance is unsatisfactory, increase the thread count and measure again, repeating until performance either degrades or reaches a satisfactory level.</para>
  73     <note>
      <para>If there are too many threads, the latency for individual I/O requests can become very high; this should be avoided. Set the desired maximum thread count permanently using the <literal>oss_num_threads</literal> module parameter described below.</para>
  75     </note>
  76     <section>
  77       <title><indexterm><primary>tuning</primary><secondary>OSS threads</secondary></indexterm>Specifying the OSS Service Thread Count</title>
  78       <para>The <literal>oss_num_threads</literal> parameter enables the number of OST service threads to be specified at module load time on the OSS nodes:</para>
  79       <screen>options ost oss_num_threads={N}</screen>
      <para>After startup, the minimum and maximum number of OSS threads can be set via the <literal>{service}.threads_{min,max,started}</literal> tunables. To read or change a tunable at runtime, run:</para>
      <para><screen>lctl {get,set}_param {service}.threads_{min,max,started}</screen></para>
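      <para>For example, assuming the OSS I/O service is exposed as <literal>ost.OSS.ost_io</literal> (the service name used in the NRS example later in this chapter) and that the installed release spells the tunables <literal>threads_min</literal>, <literal>threads_max</literal> and <literal>threads_started</literal>, a session might look like the following sketch. The value 256 is purely illustrative.</para>
      <screen># Show how many ost_io threads are running and the current limits
lctl get_param ost.OSS.ost_io.threads_started
lctl get_param ost.OSS.ost_io.threads_min ost.OSS.ost_io.threads_max

# Raise the upper limit at runtime (this does not persist across a module reload)
lctl set_param ost.OSS.ost_io.threads_max=256</screen>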
      <para>Lustre software release 2.3 introduced binding of service threads to CPU partitions
        (CPTs). This works in a similar fashion to the binding of threads on the MDS. MDS thread
        binding is covered in <xref linkend="dbdoclet.mdsbinding"/>.</para>
  85     <itemizedlist>
  86       <listitem>
  87         <para><literal>oss_cpts=[EXPRESSION]</literal> binds the default OSS service on CPTs defined by <literal>[EXPRESSION]</literal>.</para>
  88       </listitem>
  89       <listitem>
  90         <para><literal>oss_io_cpts=[EXPRESSION]</literal> binds the IO OSS service on CPTs defined by <literal>[EXPRESSION]</literal>.</para>
  91       </listitem>
  92     </itemizedlist>
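      <para>A minimal sketch of how these options might appear in <literal>/etc/modprobe.d/lustre.conf</literal>, assuming the same bracketed CPT expression syntax shown for the MDS binding parameters and a node with at least four CPTs. The CPT ranges are hypothetical.</para>
      <screen>options ost oss_cpts=[0-1] oss_io_cpts=[2-3]</screen>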
  93
  94       <para>For further details, see <xref linkend="dbdoclet.50438271_87260"/>.</para>
  95     </section>
  96     <section xml:id="dbdoclet.mdstuning">
  97       <title><indexterm><primary>tuning</primary><secondary>MDS threads</secondary></indexterm>Specifying the MDS Service Thread Count</title>
  98       <para>The <literal>mds_num_threads</literal> parameter enables the number of MDS service threads to be specified at module load time on the MDS node:</para>
  99       <screen>options mds mds_num_threads={N}</screen>
      <para>After startup, the minimum and maximum number of MDS threads can be set via the <literal>{service}.threads_{min,max,started}</literal> tunables. To read or change a tunable at runtime, run:</para>
      <para><screen>lctl {get,set}_param {service}.threads_{min,max,started}</screen></para>
 102       <para>For details, see <xref linkend="dbdoclet.50438271_87260"/>.</para>
 103       <para>At this time, no testing has been done to determine the optimal number of MDS threads. The default value varies, based on server size, up to a maximum of 32. The maximum number of threads (<literal>MDS_MAX_THREADS</literal>) is 512.</para>
 104       <note>
        <para>The OSS and MDS start new service threads dynamically in response to server load, up to a factor of 4 above the initial count. The default value is calculated the same way as before. Setting a <literal>*_num_threads</literal> module parameter disables automatic thread creation.</para>
 106       </note>
 107         <para>Lustre software release 2.3 introduced new parameters to provide more control to
 108         administrators.</para>
 109             <itemizedlist>
 110         <listitem>
 111           <para><literal>mds_rdpg_num_threads</literal> controls the number of threads in providing
 112             the read page service. The read page service handles file close and readdir
 113             operations.</para>
 114         </listitem>
 115         <listitem>
 116           <para><literal>mds_attr_num_threads</literal> controls the number of threads in providing
 117             the setattr service to clients running Lustre software release 1.8.</para>
 118         </listitem>
 119       </itemizedlist>
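        <para>As an illustrative sketch only, following the <literal>options mds</literal> form used above (the module that owns these parameters may differ between releases, and the thread counts here are hypothetical):</para>
        <screen>options mds mds_rdpg_num_threads=64 mds_attr_num_threads=16</screen>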
 120         <note><para>Default values for the thread counts are automatically selected. The values are chosen to best exploit the number of CPUs present in the system and to provide best overall performance for typical workloads.</para></note>
 121     </section>
 122   </section>
 123     <section xml:id="dbdoclet.mdsbinding" condition='l23'>
 124       <title><indexterm><primary>tuning</primary><secondary>MDS binding</secondary></indexterm>Binding MDS Service Thread to CPU Partitions</title>
        <para>With the introduction of Node Affinity (<xref linkend="nodeaffdef"/>) in Lustre software
      release 2.3, MDS threads can be bound to particular CPU partitions (CPTs). Default values for
      bindings are selected automatically to provide good overall performance for a given CPU count.
      However, an administrator can deviate from these settings if they choose.</para>
 129             <itemizedlist>
 130               <listitem>
                <para><literal>mds_num_cpts=[EXPRESSION]</literal> binds the default MDS service threads to CPTs defined by <literal>EXPRESSION</literal>. For example, <literal>mds_num_cpts=[0-3]</literal> will bind the MDS service threads to <literal>CPT[0,1,2,3]</literal>.</para>
              </listitem>
              <listitem>
                <para><literal>mds_rdpg_num_cpts=[EXPRESSION]</literal> binds the read page service threads to CPTs defined by <literal>EXPRESSION</literal>. The read page service handles file close and readdir requests. For example, <literal>mds_rdpg_num_cpts=[4]</literal> will bind the read page threads to <literal>CPT4</literal>.</para>
 135               </listitem>
 136               <listitem>
 137                 <para><literal>mds_attr_num_cpts=[EXPRESSION]</literal> binds the setattr service threads to CPTs defined by <literal>EXPRESSION</literal>.</para>
 138               </listitem>
 139             </itemizedlist>
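        <para>Written as module options in <literal>/etc/modprobe.d/lustre.conf</literal>, the bindings described above might look like the following sketch. It assumes the parameters are accepted by the <literal>mds</literal> module, as in <xref linkend="dbdoclet.mdstuning"/>, and the CPT numbers are illustrative.</para>
        <screen>options mds mds_num_cpts=[0-3] mds_rdpg_num_cpts=[4]</screen>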
 140   </section>
 141   <section xml:id="dbdoclet.50438272_73839">
 142       <title>
 143       <indexterm>
 144         <primary>LNET</primary>
 145         <secondary>tuning</secondary>
 146       </indexterm><indexterm>
 147         <primary>tuning</primary>
 148         <secondary>LNET</secondary>
 149       </indexterm>Tuning LNET Parameters</title>
 150     <para>This section describes LNET tunables, the use of which may be necessary on some systems to
 151       improve performance. To test the performance of your Lustre network, see <xref linkend='lnetselftest'/>.</para>
 152     <section remap="h3">
 153       <title>Transmit and Receive Buffer Size</title>
 154       <para>The kernel allocates buffers for sending and receiving messages on a network.</para>
 155       <para><literal>ksocklnd</literal> has separate parameters for the transmit and receive buffers.</para>
      <screen>options ksocklnd tx_buffer_size=0 rx_buffer_size=0
</screen>
 158       <para>If these parameters are left at the default value (0), the system automatically tunes the transmit and receive buffer size. In almost every case, this default produces the best performance. Do not attempt to tune these parameters unless you are a network expert.</para>
 159     </section>
 160     <section remap="h3">
 161       <title>Hardware Interrupts (<literal>enable_irq_affinity</literal>)</title>
 162       <para>The hardware interrupts that are generated by network adapters may be handled by any CPU in the system. In some cases, we would like network traffic to remain local to a single CPU to help keep the processor cache warm and minimize the impact of context switches. This is helpful when an SMP system has more than one network interface and ideal when the number of interfaces equals the number of CPUs. To enable the <literal>enable_irq_affinity</literal> parameter, enter:</para>
 163       <screen>options ksocklnd enable_irq_affinity=1</screen>
 164       <para>In other cases, if you have an SMP platform with a single fast interface such as 10 Gb
 165         Ethernet and more than two CPUs, you may see performance improve by turning this parameter
 166         off.</para>
 167       <screen>options ksocklnd enable_irq_affinity=0</screen>
 168       <para>By default, this parameter is off. As always, you should test the performance to compare the impact of changing this parameter.</para>
 169     </section>
 170         <section condition='l23'><title><indexterm><primary>tuning</primary><secondary>Network interface binding</secondary></indexterm>Binding Network Interface Against CPU Partitions</title>
        <para>Lustre software release 2.3 and beyond provide enhanced network interface control. The
        enhancement means that an administrator can bind an interface to one or more CPU partitions.
        Bindings are specified as options to the LNET modules. For more information on specifying
        module options, see <xref linkend="dbdoclet.50438293_15350"/>.</para>
 175         <para>For example, <literal>o2ib0(ib0)[0,1]</literal> will ensure that all messages
 176         for <literal>o2ib0</literal> will be handled by LND threads executing on
 177           <literal>CPT0</literal> and <literal>CPT1</literal>. An additional example might be:
 178           <literal>tcp1(eth0)[0]</literal>. Messages for <literal>tcp1</literal> are handled by
 179         threads on <literal>CPT0</literal>.</para>
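        <para>Both examples could be combined in a single <literal>networks</literal> module option, sketched here for <literal>/etc/modprobe.d/lustre.conf</literal>. The interface names and CPT numbers are taken from the examples above and would need to match the actual node.</para>
        <screen>options lnet networks="o2ib0(ib0)[0,1],tcp1(eth0)[0]"</screen>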
 180     </section>
 181         <section><title><indexterm><primary>tuning</primary><secondary>Network interface credits</secondary></indexterm>Network Interface Credits</title>
 182       <para>Network interface (NI) credits are shared across all CPU partitions (CPT). For example,
 183         if a machine has four CPTs and the number of NI credits is 512, then each partition has 128
 184         credits. If a large number of CPTs exist on the system, LNET checks and validates the NI
 185         credits for each CPT to ensure each CPT has a workable number of credits. For example, if a
 186         machine has 16 CPTs and the number of NI credits is 256, then each partition only has 16
 187         credits. 16 NI credits is low and could negatively impact performance. As a result, LNET
 188         automatically adjusts the credits to 8*<literal>peer_credits</literal>
 189           (<literal>peer_credits</literal> is 8 by default), so each partition has 64
 190         credits.</para>
 191       <para>Increasing the number of <literal>credits</literal>/<literal>peer_credits</literal> can
 192         improve the performance of high latency networks (at the cost of consuming more memory) by
 193         enabling LNET to send more inflight messages to a specific network/peer and keep the
 194         pipeline saturated.</para>
      <para>An administrator can modify the NI credit count using the <literal>ksocklnd</literal> or
          <literal>ko2iblnd</literal> module options. In the example below, 256 credits are applied to TCP
        connections.</para>
      <screen>options ksocklnd credits=256</screen>
      <para>Applying 256 credits to IB connections can be achieved with:</para>
      <screen>options ko2iblnd credits=256</screen>
 201       <note condition="l23">
 202         <para>In Lustre software release 2.3 and beyond, LNET may revalidate the NI credits, so the
 203           administrator's request may not persist.</para>
 204       </note>
 205         </section>
 206         <section><title><indexterm><primary>tuning</primary><secondary>router buffers</secondary></indexterm>Router Buffers</title>
 207       <para>When a node is set up as an LNET router, three pools of buffers are allocated: tiny,
 208         small and large. These pools are allocated per CPU partition and are used to buffer messages
 209         that arrive at the router to be forwarded to the next hop. The three different buffer sizes
 210         accommodate different size messages. </para>
      <para>If a message arrives that can fit in a tiny buffer, then a tiny buffer is used. If a
        message doesn’t fit in a tiny buffer but fits in a small buffer, then a small buffer is
        used. Finally, if a message does not fit in either a tiny buffer or a small buffer, a large
        buffer is used.</para>
 215       <para>Router buffers are shared by all CPU partitions. For a machine with a large number of
 216         CPTs, the router buffer number may need to be specified manually for best performance. A low
 217         number of router buffers risks starving the CPU partitions of resources.</para>
 218       <itemizedlist>
 219         <listitem>
 220           <para><literal>tiny_router_buffers</literal>: Zero payload buffers used for signals and
 221             acknowledgements.</para>
 222         </listitem>
 223         <listitem>
 224           <para><literal>small_router_buffers</literal>: 4 KB payload buffers for small
 225             messages</para>
 226         </listitem>
 227         <listitem>
 228           <para><literal>large_router_buffers</literal>: 1 MB maximum payload buffers, corresponding
 229             to the recommended RPC size of 1 MB.</para>
 230         </listitem>
 231       </itemizedlist>
      <para>The default setting for router buffers typically results in acceptable performance. LNET
        automatically sets a default value to reduce the likelihood of resource starvation. The
        number of router buffers in each pool can be modified as shown in the example below. In this
        example, the number of large buffers is set using the
        <literal>large_router_buffers</literal> parameter.</para>
      <screen>options lnet large_router_buffers=8192</screen>
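      <para>All three pools can be set on one line. In the sketch below, the counts are hypothetical placeholders, not recommendations:</para>
      <screen>options lnet tiny_router_buffers=2048 small_router_buffers=16384 large_router_buffers=8192</screen>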
 238       <note condition="l23">
 239         <para>In Lustre software release 2.3 and beyond, LNET may revalidate the router buffer
 240           setting, so the administrator's request may not persist.</para>
 241       </note>
 242         </section>
 243         <section><title><indexterm><primary>tuning</primary><secondary>portal round-robin</secondary></indexterm>Portal Round-Robin</title>
        <para>Portal round-robin defines the policy LNET applies to deliver events and messages to the
        upper layers. The upper layers are PTLRPC services or LNET selftest.</para>
        <para>If portal round-robin is disabled, LNET will deliver messages to CPTs based on a hash of the
        source NID. Hence, all messages from a specific peer will be handled by the same CPT. This
        can reduce data traffic between CPUs. However, for some workloads, this behavior may result
        in poorly balanced load across the CPUs.</para>
        <para>If portal round-robin is enabled, LNET will round-robin incoming events across all CPTs. This
        may balance load better across the CPUs but can incur cross-CPU overhead.</para>
 252         <para>The current policy can be changed by an administrator with <literal>echo <replaceable>value</replaceable> &gt; /proc/sys/lnet/portal_rotor</literal>. There are four options for <literal><replaceable>value</replaceable></literal>:</para>
 253     <itemizedlist>
 254       <listitem>
 255         <para><literal>OFF</literal></para>
 256                 <para>Disable portal round-robin on all incoming requests.</para>
 257       </listitem>
 258       <listitem>
 259         <para><literal>ON</literal></para>
 260                 <para>Enable portal round-robin on all incoming requests.</para>
 261       </listitem>
 262       <listitem>
 263         <para><literal>RR_RT</literal></para>
 264                 <para>Enable portal round-robin only for routed messages.</para>
 265       </listitem>
 266       <listitem>
 267         <para><literal>HASH_RT</literal></para>
                <para>Routed messages will be delivered to the upper layer by hash of the source NID (instead of the NID of the router). This is the default value.</para>
 269       </listitem>
 270     </itemizedlist>
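        <para>For example, to inspect the current policy and then enable round-robin for routed messages only (a sketch; run as root on the node being tuned):</para>
        <screen>cat /proc/sys/lnet/portal_rotor
echo RR_RT &gt; /proc/sys/lnet/portal_rotor</screen>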
 271
 272     </section>
 273     <section>
 274       <title>LNET Peer Health</title>
 275       <para>Two options are available to help determine peer health:<itemizedlist>
 276           <listitem>
 277             <para><literal>peer_timeout</literal> - The timeout (in seconds) before an aliveness
 278               query is sent to a peer. For example, if <literal>peer_timeout</literal> is set to
 279                 <literal>180sec</literal>, an aliveness query is sent to the peer every 180 seconds.
 280               This feature only takes effect if the node is configured as an LNET router.</para>
 281             <para>In a routed environment, the <literal>peer_timeout</literal> feature should always
 282               be on (set to a value in seconds) on routers. If the router checker has been enabled,
 283               the feature should be turned off by setting it to 0 on clients and servers.</para>
 284             <para>For a non-routed scenario, enabling the <literal>peer_timeout</literal> option
 285               provides health information such as whether a peer is alive or not. For example, a
 286               client is able to determine if an MGS or OST is up when it sends it a message. If a
 287               response is received, the peer is alive; otherwise a timeout occurs when the request
 288               is made.</para>
 289             <para>In general, <literal>peer_timeout</literal> should be set to no less than the LND
 290               timeout setting. For more information about LND timeouts, see <xref
 291                 xmlns:xlink="http://www.w3.org/1999/xlink" linkend="section_c24_nt5_dl"/>.</para>
            <para>When the <literal>o2iblnd</literal> (IB) driver is used,
                <literal>peer_timeout</literal> should be at least twice the value of the
                <literal>ko2iblnd</literal> keepalive option. For more information about keepalive
              options, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
                linkend="section_ngq_qhy_zl"/>.</para>
 297           </listitem>
 298           <listitem>
            <para><literal>avoid_asym_router_failure</literal> - When set to 1, the router checker
              running on the client or a server periodically pings all the routers corresponding to
              the NIDs identified in the routes parameter setting on the node to determine the
              status of each router interface. The default setting is 1. (For more information about
              the LNET routes parameter, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
                linkend="dbdoclet.50438216_71227"/>.)</para>
 305             <para>A router is considered down if any of its NIDs are down. For example, router X has
 306               three NIDs: <literal>Xnid1</literal>, <literal>Xnid2</literal>, and
 307                 <literal>Xnid3</literal>. A client is connected to the router via
 308                 <literal>Xnid1</literal>. The client has router checker enabled. The router checker
 309               periodically sends a ping to the router via <literal>Xnid1</literal>. The router
 310               responds to the ping with the status of each of its NIDs. In this case, it responds
 311               with <literal>Xnid1=up</literal>, <literal>Xnid2=up</literal>,
 312                 <literal>Xnid3=down</literal>. If <literal>avoid_asym_router_failure==1</literal>,
 313               the router is considered down if any of its NIDs are down, so router X is considered
 314               down and will not be used for routing messages. If
 315                 <literal>avoid_asym_router_failure==0</literal>, router X will continue to be used
 316               for routing messages.</para>
 317           </listitem>
 318         </itemizedlist></para>
      <para>On routers, each of the following router checker parameters must be set to the maximum
        value of the corresponding setting on any client or server:<itemizedlist>
 321           <listitem>
 322             <para><literal>dead_router_check_interval</literal></para>
 323           </listitem>
 324           <listitem>
 325             <para>
 326               <literal>live_router_check_interval</literal></para>
 327           </listitem>
 328           <listitem>
 329             <para><literal>router_ping_timeout</literal></para>
 330           </listitem>
 331         </itemizedlist></para>
      <para>For example, the <literal>dead_router_check_interval</literal> parameter on any router
        must be set to the largest <literal>dead_router_check_interval</literal> value configured on
        any client or server.</para>
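      <para>As a hypothetical illustration, suppose clients set <literal>dead_router_check_interval</literal> to 60 seconds and servers set it to 120 seconds. Routers would then use 120. A sketch of the corresponding <literal>/etc/modprobe.d/lustre.conf</literal> lines follows, assuming the parameter is an option of the <literal>lnet</literal> module; all values are chosen only for illustration.</para>
      <screen># Clients
options lnet dead_router_check_interval=60

# Servers
options lnet dead_router_check_interval=120

# Routers: the maximum of the client and server values
options lnet dead_router_check_interval=120</screen>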
 334     </section>
 335   </section>
 336   <section xml:id="dbdoclet.libcfstuning">
 337       <title><indexterm><primary>tuning</primary><secondary>libcfs</secondary></indexterm>libcfs Tuning</title>
 338 <para>By default, the Lustre software will automatically generate CPU partitions (CPT) based on the
 339       number of CPUs in the system. The CPT number will be 1 if the online CPU number is less than
 340       five.</para>
 341         <para>The CPT number can be explicitly set on the libcfs module using <literal>cpu_npartitions=NUMBER</literal>. The value of <literal>cpu_npartitions</literal> must be an integer between 1 and the number of online CPUs.</para>
 342 <tip><para>Setting CPT to 1 will disable most of the SMP Node Affinity functionality.</para></tip>
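        <para>For example, a node with at least four online CPUs could request four CPTs with a line such as the following in <literal>/etc/modprobe.d/lustre.conf</literal>. The value 4 is only an illustration.</para>
        <screen>options libcfs cpu_npartitions=4</screen>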
 343         <section>
 344                 <title>CPU Partition String Patterns</title>
 345         <para>CPU partitions can be described using string pattern notation. For example:</para>
 346     <itemizedlist>
 347       <listitem>
        <para><literal>cpu_pattern="0[0,2,4,6] 1[1,3,5,7]"</literal></para>
                <para>Create two CPTs: CPT0 contains CPU[0,2,4,6] and CPT1 contains CPU[1,3,5,7].</para>
      </listitem>
      <listitem> <para><literal>cpu_pattern="N 0[0-3] 1[4-7]"</literal></para>
                <para>Create two CPTs: CPT0 contains all CPUs in NUMA node[0-3] and CPT1 contains all CPUs in NUMA node[4-7].</para>
 353       </listitem>
 354     </itemizedlist>
        <para>The current configuration of the CPU partitions can be read from
          <literal>/proc/sys/lnet/cpu_partitions</literal>.</para>
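        <para>A sketch of applying the first pattern above, assuming <literal>cpu_pattern</literal> is passed as an option to the <literal>libcfs</literal> module in the same way as <literal>cpu_npartitions</literal> and that the node has eight CPUs numbered 0 to 7:</para>
        <screen># In /etc/modprobe.d/lustre.conf
options libcfs cpu_pattern="0[0,2,4,6] 1[1,3,5,7]"

# After the modules are reloaded, check the resulting layout
cat /proc/sys/lnet/cpu_partitions</screen>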
 357         </section>
 358   </section>
 359   <section xml:id="dbdoclet.lndtuning">
 360       <title><indexterm><primary>tuning</primary><secondary>LND tuning</secondary></indexterm>LND Tuning</title>
 361       <para>LND tuning allows the number of threads per CPU partition to be specified. An administrator can set the threads for both <literal>ko2iblnd</literal> and <literal>ksocklnd</literal> using the <literal>nscheds</literal> parameter. This adjusts the number of threads for each partition, not the overall number of threads on the LND.</para>
 362                 <note><para>Lustre software release 2.3 has greatly decreased the default number of threads for
 363           <literal>ko2iblnd</literal> and <literal>ksocklnd</literal> on high-core count machines.
 364         The current default values are automatically set and are chosen to work well across a number
 365         of typical scenarios.</para></note>
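      <para>For example, the following sketch requests four scheduler threads per CPU partition for each LND. The value 4 is hypothetical. Because the setting is per partition, a node with four CPTs would run roughly 4 x 4 = 16 scheduler threads for each LND.</para>
      <screen>options ko2iblnd nscheds=4
options ksocklnd nscheds=4</screen>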
 366   </section>
 367   <section xml:id="dbdoclet.nrstuning" condition='l24'>
 368     <title><indexterm><primary>tuning</primary><secondary>Network Request Scheduler (NRS) Tuning</secondary></indexterm>Network Request Scheduler (NRS) Tuning</title>
 369       <para>The Network Request Scheduler (NRS) allows the administrator to influence the order in which RPCs are handled at servers, on a per-PTLRPC service basis, by providing different policies that can be activated and tuned in order to influence the RPC ordering. The aim of this is to provide for better performance, and possibly discrete performance characteristics using future policies.</para>
 370       <para>The NRS policy state of a PTLRPC service can be read and set via the <literal>{service}.nrs_policies</literal> tunable. To read a PTLRPC service's NRS policy state, run:</para>
 371       <screen>lctl get_param {service}.nrs_policies</screen>
 372       <para>For example, to read the NRS policy state of the <literal>ost_io</literal> service,
 373       run:</para>
      <screen>$ lctl get_param ost.OSS.ost_io.nrs_policies
ost.OSS.ost_io.nrs_policies=

regular_requests:
  - name: fifo
    state: started
    fallback: yes
    queued: 0
    active: 0

  - name: crrn
    state: stopped
    fallback: no
    queued: 0
    active: 0

  - name: orr
    state: stopped
    fallback: no
    queued: 0
    active: 0

  - name: trr
    state: started
    fallback: no
    queued: 2420
    active: 268

high_priority_requests:
  - name: fifo
    state: started
    fallback: yes
    queued: 0
    active: 0

  - name: crrn
    state: stopped
    fallback: no
    queued: 0
    active: 0

  - name: orr
    state: stopped
    fallback: no
    queued: 0
    active: 0

  - name: trr
    state: stopped
    fallback: no
    queued: 0
    active: 0
      </screen>
 427       <para>NRS policy state is shown in either one or two sections, depending on the PTLRPC service being queried. The first section is named <literal>regular_requests</literal> and is available for all PTLRPC services, optionally followed by a second section which is named <literal>high_priority_requests</literal>. This is because some PTLRPC services are able to treat some types of RPCs as higher priority ones, such that they are handled by the server with higher priority compared to other, regular RPC traffic. For PTLRPC services that do not support high-priority RPCs, you will only see the <literal>regular_requests</literal> section.</para>
 428       <para>There is a separate instance of each NRS policy on each PTLRPC service for handling regular and high-priority RPCs (if the service supports high-priority RPCs). For each policy instance, the following fields are shown:</para>
 429       <informaltable frame="all">
 430         <tgroup cols="2">
 431           <colspec colname="c1" colwidth="50*"/>
 432           <colspec colname="c2" colwidth="50*"/>
 433           <thead>
 434             <row>
 435               <entry>
 436                 <para><emphasis role="bold">Field</emphasis></para>
 437               </entry>
 438               <entry>
 439                 <para><emphasis role="bold">Description</emphasis></para>
 440               </entry>
 441             </row>
 442           </thead>
 443           <tbody>
 444             <row>
 445               <entry>
 446                 <para> <literal> name </literal></para>
 447               </entry>
 448               <entry>
 449                 <para>The name of the policy.</para>
 450               </entry>
 451             </row>
 452             <row>
 453               <entry>
 454                 <para> <literal> state </literal></para>
 455               </entry>
 456               <entry>
 457                       <para>The state of the policy; this can be any of <literal>invalid, stopping, stopped, starting, started</literal>. A fully enabled policy is in the <literal> started</literal> state.</para>
 458               </entry>
 459             </row>
 460             <row>
 461               <entry>
 462                 <para> <literal> fallback </literal></para>
 463               </entry>
 464               <entry>
                      <para>Whether the policy is acting as a fallback policy or not. A fallback policy is used to handle RPCs that other enabled policies fail to handle or do not support. The possible values are <literal>no, yes</literal>. Currently, only the FIFO policy can act as a fallback policy.</para>
 466               </entry>
 467             </row>
 468             <row>
 469               <entry>
 470                 <para> <literal> queued </literal></para>
 471               </entry>
 472               <entry>
 473                 <para>The number of RPCs that the policy has waiting to be serviced.</para>
 474               </entry>
 475             </row>
 476             <row>
 477               <entry>
 478                 <para> <literal> active </literal></para>
 479               </entry>
 480               <entry>
 481                 <para>The number of RPCs that the policy is currently handling.</para>
 482               </entry>
 483             </row>
 484           </tbody>
 485         </tgroup>
 486       </informaltable>
 487       <para>To enable an NRS policy on a PTLRPC service run:</para>
 488       <screen>lctl set_param {service}.nrs_policies=<replaceable>policy_name</replaceable></screen>
      <para>This will enable the policy <replaceable>policy_name</replaceable> for both regular and high-priority RPCs (if the PTLRPC service supports high-priority RPCs) on the given service. For example, to enable the CRR-N NRS policy for the <literal>ldlm_cbd</literal> service, run:</para>
 490       <screen>$ lctl set_param ldlm.services.ldlm_cbd.nrs_policies=crrn
 491 ldlm.services.ldlm_cbd.nrs_policies=crrn
 492       </screen>
 493       <para>For PTLRPC services that support high-priority RPCs, you can also supply an optional <replaceable>reg|hp</replaceable> token, in order to enable an NRS policy for handling only regular or high-priority RPCs on a given PTLRPC service, by running:</para>
 494       <screen>lctl set_param {service}.nrs_policies="<replaceable>policy_name</replaceable> <replaceable>reg|hp</replaceable>"</screen>
 495       <para>For example, to enable the TRR policy for handling only regular, but not high-priority
 496       RPCs on the <literal>ost_io</literal> service, run:</para>
 497       <screen>$ lctl set_param ost.OSS.ost_io.nrs_policies="trr reg"
 498 ost.OSS.ost_io.nrs_policies="trr reg"
 499       </screen>
 500       <note>
        <para>When enabling an NRS policy, the policy name must be given in lowercase characters; otherwise, the operation will fail with an error message.</para>
 502       </note>
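      <para>As a minimal sketch of the rules above, the following shell function prints (rather than runs) the <literal>lctl set_param</literal> command for enabling a policy. It converts the policy name to lowercase first and accepts the optional <literal>reg</literal> or <literal>hp</literal> token. The function name <literal>nrs_enable_cmd</literal> is hypothetical and is not part of the Lustre software:</para>

```shell
# Hypothetical helper (not part of Lustre): print the lctl command that would
# enable an NRS policy. Usage: nrs_enable_cmd SERVICE POLICY [reg|hp]
nrs_enable_cmd() {
    # Policy names must be lowercase, so normalize the caller's spelling.
    policy=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    case "${3:-}" in
        "")     value=$policy ;;            # both regular and high-priority RPCs
        reg|hp) value="$policy $3" ;;       # only one class of RPCs
        *)      echo "error: token must be reg or hp" >/dev/stderr
                return 1 ;;
    esac
    printf 'lctl set_param %s.nrs_policies="%s"\n' "$1" "$value"
}
```

      <para>For instance, <literal>nrs_enable_cmd ost.OSS.ost_io TRR reg</literal> prints the same command as the TRR example above. Printing the command instead of executing it lets you review it before running it on a server.</para>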
 503     <section>
 504       <title><indexterm>
 505           <primary>tuning</primary>
 506           <secondary>Network Request Scheduler (NRS) Tuning</secondary>
 507           <tertiary>first in, first out (FIFO) policy</tertiary>
 508         </indexterm>First In, First Out (FIFO) policy</title>
 509       <para>The first in, first out (FIFO) policy handles RPCs in a service in the same order as
 510         they arrive from the LNET layer, so no special processing takes place to modify the RPC
 511         handling stream. FIFO is the default policy for all types of RPCs on all PTLRPC services,
 512         and is always enabled irrespective of the state of other policies, so that it can be used as
 513         a backup policy, in case a more elaborate policy that has been enabled fails to handle an
 514         RPC, or does not support handling a given type of RPC.</para>
      <para>The FIFO policy has no tunables that adjust its behavior.</para>
 516     </section>
 517     <section>
 518       <title><indexterm>
 519           <primary>tuning</primary>
 520           <secondary>Network Request Scheduler (NRS) Tuning</secondary>
 521           <tertiary>client round-robin over NIDs (CRR-N) policy</tertiary>
 522         </indexterm>Client Round-Robin over NIDs (CRR-N) policy</title>
 523       <para>The client round-robin over NIDs (CRR-N) policy performs batched round-robin scheduling
 524         of all types of RPCs, with each batch consisting of RPCs originating from the same client
 525         node, as identified by its NID. CRR-N aims to provide for better resource utilization across
 526         the cluster, and to help shorten completion times of jobs in some cases, by distributing
 527         available bandwidth more evenly across all clients.</para>
 528       <para>The CRR-N policy can be enabled on all types of PTLRPC services, and has the following
 529         tunable that can be used to adjust its behavior:</para>
 530       <itemizedlist>
 531         <listitem>
 532           <para><literal>{service}.nrs_crrn_quantum</literal></para>
 533           <para>The <literal>{service}.nrs_crrn_quantum</literal> tunable determines the maximum allowed size of each batch of RPCs; the unit of measure is in number of RPCs. To read the maximum allowed batch size of a CRR-N policy, run:</para>
 534           <screen>lctl get_param {service}.nrs_crrn_quantum</screen>
 535           <para>For example, to read the maximum allowed batch size of a CRR-N policy on the ost_io service, run:</para>
 536           <screen>$ lctl get_param ost.OSS.ost_io.nrs_crrn_quantum
 537 ost.OSS.ost_io.nrs_crrn_quantum=reg_quantum:16
 538 hp_quantum:8
 539           </screen>
 540           <para>You can see that there is a separate maximum allowed batch size value for regular (<literal>reg_quantum</literal>) and high-priority (<literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports high-priority RPCs).</para>
 541           <para>To set the maximum allowed batch size of a CRR-N policy on a given service, run:</para>
 542           <screen>lctl set_param {service}.nrs_crrn_quantum=<replaceable>1-65535</replaceable></screen>
          <para>This will set the maximum allowed batch size on a given service, for both regular and high-priority RPCs (if the PTLRPC service supports high-priority RPCs), to the indicated value.</para>
 544           <para>For example, to set the maximum allowed batch size on the ldlm_canceld service to 16 RPCs, run:</para>
 545           <screen>$ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
 546 ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
 547           </screen>
 548           <para>For PTLRPC services that support high-priority RPCs, you can also specify a different maximum allowed batch size for regular and high-priority RPCs, by running:</para>
          <screen>$ lctl set_param {service}.nrs_crrn_quantum="<replaceable>reg_quantum|hp_quantum</replaceable>:<replaceable>1-65535</replaceable>"</screen>
          <para>For example, to set the maximum allowed batch size for high-priority RPCs on the <literal>ldlm_canceld</literal> service to 32, run:</para>
 551           <screen>$ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum="hp_quantum:32"
 552 ldlm.services.ldlm_canceld.nrs_crrn_quantum=hp_quantum:32
 553           </screen>
 554           <para>By using the last method, you can also set the maximum regular and high-priority RPC batch sizes to different values, in a single command invocation.</para>
 555         </listitem>
 556       </itemizedlist>
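      <para>The following sketch checks a batch size against the 1-65535 range given above before printing the corresponding <literal>lctl set_param</literal> command. The function names <literal>valid_quantum</literal> and <literal>crrn_quantum_cmd</literal> are hypothetical helpers, not part of the Lustre software, and the command is printed rather than executed:</para>

```shell
# Hypothetical helper: succeed only if $1 is an integer in the range 1-65535.
valid_quantum() {
    case "$1" in
        ""|*[!0-9]*) return 1 ;;        # empty, or contains a non-digit
    esac
    [ "${#1}" -le 5 ] || return 1       # more than 5 digits is out of range
    if [ "$1" -ge 1 ]; then
        [ "$1" -le 65535 ]
    else
        return 1
    fi
}

# Hypothetical helper: print the command that sets the CRR-N batch size.
# Usage: crrn_quantum_cmd SERVICE VALUE [reg_quantum|hp_quantum]
crrn_quantum_cmd() {
    if ! valid_quantum "$2"; then
        echo "error: quantum must be an integer from 1 to 65535" >/dev/stderr
        return 1
    fi
    case "${3:-}" in
        "") printf 'lctl set_param %s.nrs_crrn_quantum=%s\n' "$1" "$2" ;;
        reg_quantum|hp_quantum)
            printf 'lctl set_param %s.nrs_crrn_quantum="%s:%s"\n' "$1" "$3" "$2" ;;
        *)  echo "error: expected reg_quantum or hp_quantum" >/dev/stderr
            return 1 ;;
    esac
}
```

      <para>For instance, <literal>crrn_quantum_cmd ldlm.services.ldlm_canceld 32 hp_quantum</literal> prints the command used in the high-priority example above.</para>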
 557     </section>
 558     <section>
 559       <title><indexterm>
 560           <primary>tuning</primary>
 561           <secondary>Network Request Scheduler (NRS) Tuning</secondary>
 562           <tertiary>object-based round-robin (ORR) policy</tertiary>
 563         </indexterm>Object-based Round-Robin (ORR) policy</title>
      <para>The object-based round-robin (ORR) policy performs batched round-robin scheduling of
        bulk read/write (brw) RPCs, with each batch consisting of RPCs that pertain to the same
        backend file system object, as identified by its OST FID.</para>
 567       <para>The ORR policy is only available for use on the ost_io service. The RPC batches it forms can potentially consist of mixed bulk read and bulk write RPCs. The RPCs in each batch are ordered in an ascending manner, based on either the file offsets, or the physical disk offsets of each RPC (only applicable to bulk read RPCs).</para>
 568       <para>The aim of the ORR policy is to provide for increased bulk read throughput in some cases, by ordering bulk read RPCs (and potentially bulk write RPCs), and thus minimizing costly disk seek operations. Performance may also benefit from any resulting improvement in resource utilization, or by taking advantage of better locality of reference between RPCs.</para>
      <para>The ORR policy has the following tunables that can be used to adjust its behavior:</para>
 570       <itemizedlist>
 571         <listitem>
 572           <para><literal>ost.OSS.ost_io.nrs_orr_quantum</literal></para>
 573           <para>The <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable determines the maximum allowed size of each batch of RPCs; the unit of measure is in number of RPCs. To read the maximum allowed batch size of the ORR policy, run:</para>
 574           <screen>$ lctl get_param ost.OSS.ost_io.nrs_orr_quantum
 575 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:256
 576 hp_quantum:16
 577           </screen>
 578           <para>You can see that there is a separate maximum allowed batch size value for regular (<literal>reg_quantum</literal>) and high-priority (<literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports high-priority RPCs).</para>
 579           <para>To set the maximum allowed batch size for the ORR policy, run:</para>
 580           <screen>$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=<replaceable>1-65535</replaceable></screen>
 581           <para>This will set the maximum allowed batch size for both regular and high-priority RPCs, to the indicated value.</para>
 582           <para>You can also specify a different maximum allowed batch size for regular and high-priority RPCs, by running:</para>
 583           <screen>$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=<replaceable>reg_quantum|hp_quantum</replaceable>:<replaceable>1-65535</replaceable></screen>
 584           <para>For example, to set the maximum allowed batch size for regular RPCs to 128, run:</para>
 585           <screen>$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
 586 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
 587           </screen>
 588           <para>By using the last method, you can also set the maximum regular and high-priority RPC batch sizes to different values, in a single command invocation.</para>
 589         </listitem>
 590         <listitem>
 591           <para><literal>ost.OSS.ost_io.nrs_orr_offset_type</literal></para>
 592           <para>The <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable determines whether the ORR policy orders RPCs within each batch based on logical file offsets or physical disk offsets. To read the offset type value for the ORR policy, run:</para>
 593           <screen>$ lctl get_param ost.OSS.ost_io.nrs_orr_offset_type
 594 ost.OSS.ost_io.nrs_orr_offset_type=reg_offset_type:physical
 595 hp_offset_type:logical
 596           </screen>
 597           <para>You can see that there is a separate offset type value for regular (<literal>reg_offset_type</literal>) and high-priority (<literal>hp_offset_type</literal>) RPCs.</para>
 598           <para>To set the ordering type for the ORR policy, run:</para>
 599           <screen>$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=<replaceable>physical|logical</replaceable></screen>
 600           <para>This will set the offset type for both regular and high-priority RPCs, to the indicated value.</para>
 601           <para>You can also specify a different offset type for regular and high-priority RPCs, by running:</para>
 602           <screen>$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=<replaceable>reg_offset_type|hp_offset_type</replaceable>:<replaceable>physical|logical</replaceable></screen>
 603           <para>For example, to set the offset type for high-priority RPCs to physical disk offsets, run:</para>
 604           <screen>$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
 605 ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical</screen>
 606           <para>By using the last method, you can also set offset type for regular and high-priority RPCs to different values, in a single command invocation.</para>
          <note><para>Irrespective of the value of this tunable, only logical offsets can be, and are, used for ordering bulk write RPCs.</para></note>
 608         </listitem>
 609         <listitem>
 610           <para><literal>ost.OSS.ost_io.nrs_orr_supported</literal></para>
          <para>The <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable determines the type of RPCs that the ORR policy will handle. To read the types of RPCs supported by the ORR policy, run:</para>
 612           <screen>$ lctl get_param ost.OSS.ost_io.nrs_orr_supported
 613 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads
hp_supported:reads_and_writes
 615           </screen>
 616           <para>You can see that there is a separate supported 'RPC types' value for regular (<literal>reg_supported</literal>) and high-priority (<literal>hp_supported</literal>) RPCs.</para>
 617           <para>To set the supported RPC types for the ORR policy, run:</para>
 618           <screen>$ lctl set_param ost.OSS.ost_io.nrs_orr_supported=<replaceable>reads|writes|reads_and_writes</replaceable></screen>
 619           <para>This will set the supported RPC types for both regular and high-priority RPCs, to the indicated value.</para>
 620           <para>You can also specify a different supported 'RPC types' value for regular and high-priority RPCs, by running:</para>
 621           <screen>$ lctl set_param ost.OSS.ost_io.nrs_orr_supported=<replaceable>reg_supported|hp_supported</replaceable>:<replaceable>reads|writes|reads_and_writes</replaceable></screen>
 622           <para>For example, to set the supported RPC types to bulk read and bulk write RPCs for regular requests, run:</para>
 623           <screen>$ lctl set_param ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
 624 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
 625           </screen>
          <para>By using the last method, you can also set the supported RPC types for regular and high-priority RPCs to different values, in a single command invocation.</para>
 627         </listitem>
 628       </itemizedlist>
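      <para>Because each ORR tunable reports its regular and high-priority values on separate lines, a script may need to extract one of them. The following is a minimal sketch that assumes the <literal>key:value</literal> output layout shown in the examples above. The function name <literal>nrs_field</literal> is hypothetical, and the sample text is copied from the <literal>nrs_orr_quantum</literal> example rather than read from a live system:</para>

```shell
# Hypothetical helper: read lctl get_param output on stdin and print the
# value that follows "KEY:" (for example reg_quantum or hp_offset_type).
nrs_field() {
    sed -n "s/^.*$1:\([^[:space:]]*\).*$/\1/p"
}

# Sample output copied from the nrs_orr_quantum example in this section.
sample='ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:256
hp_quantum:16'

printf '%s\n' "$sample" | nrs_field reg_quantum   # prints 256
printf '%s\n' "$sample" | nrs_field hp_quantum    # prints 16

# On an OSS, the same filter could be fed from lctl directly:
#   lctl get_param ost.OSS.ost_io.nrs_orr_quantum | nrs_field hp_quantum
```

      <para>Pass the full key name (for example <literal>hp_quantum</literal>, not <literal>quantum</literal>) so that only the intended line matches.</para>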
 629     </section>
 630     <section>
 631       <title><indexterm>
 632           <primary>tuning</primary>
 633           <secondary>Network Request Scheduler (NRS) Tuning</secondary>
          <tertiary>target-based round-robin (TRR) policy</tertiary>
 635         </indexterm>Target-based Round-Robin (TRR) policy</title>
 636       <para>The target-based round-robin (TRR) policy performs batched round-robin scheduling of brw
 637         RPCs, with each batch consisting of RPCs that pertain to the same OST, as identified by its
 638         OST index.</para>
      <para>The TRR policy is identical to the object-based round-robin (ORR) policy, apart from
        using the brw RPC's target OST index instead of the backend file system object's OST FID
        to determine the RPC scheduling order. The goals of TRR are effectively the same as for ORR,
        and it uses the following tunables to adjust its behavior:</para>
 643       <itemizedlist>
 644         <listitem>
 645           <para><literal>ost.OSS.ost_io.nrs_trr_quantum</literal></para>
 646           <para>The purpose of this tunable is exactly the same as for the <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable for the ORR policy, and you can use it in exactly the same way.</para>
 647         </listitem>
 648         <listitem>
 649           <para><literal>ost.OSS.ost_io.nrs_trr_offset_type</literal></para>
 650           <para>The purpose of this tunable is exactly the same as for the <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable for the ORR policy, and you can use it in exactly the same way.</para>
 651         </listitem>
 652         <listitem>
 653           <para><literal>ost.OSS.ost_io.nrs_trr_supported</literal></para>
          <para>The purpose of this tunable is exactly the same as for the <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable for the ORR policy, and you can use it in exactly the same way.</para>
 655         </listitem>
 656       </itemizedlist>
 657     </section>
 658   </section>
 659   <section xml:id="dbdoclet.50438272_25884">
 660       <title><indexterm><primary>tuning</primary><secondary>lockless I/O</secondary></indexterm>Lockless I/O Tunables</title>
    <para>The lockless I/O tunable feature allows servers to ask clients to do lockless I/O (liblustre-style, where the server does the locking) on contended files.</para>
 662     <para>The lockless I/O patch introduces these tunables:</para>
 663     <itemizedlist>
 664       <listitem>
 665         <para><emphasis role="bold">OST-side:</emphasis></para>
 666         <screen>/proc/fs/lustre/ldlm/namespaces/filter-lustre-*
 667 </screen>
        <para><literal>contended_locks</literal> - If the number of lock conflicts found while scanning the granted and waiting queues exceeds the value of <literal>contended_locks</literal>, the resource is considered to be contended.</para>
        <para><literal>contention_seconds</literal> - The resource remains in the contended state for the number of seconds set in this parameter.</para>
        <para><literal>max_nolock_bytes</literal> - Server-side locking is used only for requests smaller than the size set in the <literal>max_nolock_bytes</literal> parameter. Setting this tunable to zero (0) disables server-side locking for read/write requests.</para>
 671       </listitem>
 672       <listitem>
 673         <para><emphasis role="bold">Client-side:</emphasis></para>
 674         <screen>/proc/fs/lustre/llite/lustre-*</screen>
 675         <para><literal>contention_seconds</literal> - <literal>llite</literal> inode remembers its contended state for the time specified in this parameter.</para>
 676       </listitem>
 677       <listitem>
 678         <para><emphasis role="bold">Client-side statistics:</emphasis></para>
 679         <para>The <literal>/proc/fs/lustre/llite/lustre-*/stats</literal> file has new rows for lockless I/O statistics.</para>
        <para><literal>lockless_read_bytes</literal> and <literal>lockless_write_bytes</literal> - These rows count the total bytes read or written without the client acquiring locks. The client makes its own decision based on the request size, and does not communicate with the server if the request size is smaller than <literal>min_nolock_size</literal>.</para>
 681       </listitem>
 682     </itemizedlist>
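    <para>As a rough sketch, the OST-side tunables listed above could be inspected with a small shell loop. This assumes that each tunable appears as a readable file of the same name inside the namespace directory. The function name <literal>show_lockless_tunables</literal> is hypothetical and is not part of the Lustre software:</para>

```shell
# Hypothetical helper: print name=value for each lockless I/O tunable that
# exists as a readable file in the directory given as $1.
show_lockless_tunables() {
    dir=$1
    for name in contended_locks contention_seconds max_nolock_bytes; do
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
}

# On an OSS, loop over the namespaces matched by the path from the text:
#   for ns in /proc/fs/lustre/ldlm/namespaces/filter-lustre-*; do
#       echo "$ns"
#       show_lockless_tunables "$ns"
#   done
```

    <para>Tunables that are missing or unreadable in a given directory are skipped silently, so the same function can be pointed at a client-side <literal>llite</literal> directory to show only <literal>contention_seconds</literal>.</para>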
 683   </section>
 684   <section xml:id="dbdoclet.50438272_80545">
 685     <title><indexterm>
 686         <primary>tuning</primary>
 687         <secondary>for small files</secondary>
 688       </indexterm>Improving Lustre File System Performance When Working with Small Files</title>
    <para>An environment where an application writes small file chunks from many clients to a single
      file results in poor I/O performance. To improve the performance of the Lustre file system
      with small files:</para>
 692     <itemizedlist>
 693       <listitem>
        <para>Have the application aggregate writes to some degree before submitting them to the Lustre
          file system. By default, the Lustre software enforces POSIX coherency semantics, which
          results in lock ping-pong between client nodes if they are all writing to the same file at
          the same time.</para>
 698       </listitem>
 699       <listitem>
        <para>Have the application do 4 KB <literal>O_DIRECT</literal> I/O to the file and disable locking on the output file. This avoids partial-page I/O submissions and, by disabling locking, avoids contention between clients.</para>
 701       </listitem>
 702       <listitem>
 703         <para>Have the application write contiguous data.</para>
 704       </listitem>
 705       <listitem>
        <para>Add more disks or use SSDs for the OSTs. This dramatically improves the IOPS rate. Consider creating larger OSTs rather than many smaller OSTs to reduce overhead (journal, connections, etc.).</para>
 707       </listitem>
 708       <listitem>
        <para>Use RAID-1+0 OSTs instead of RAID-5/6. RAID-5/6 incurs parity overhead when writing small chunks of data to disk.</para>
 710       </listitem>
 711     </itemizedlist>
 712   </section>
 713   <section xml:id="dbdoclet.50438272_45406">
 714     <title><indexterm><primary>tuning</primary><secondary>write performance</secondary></indexterm>Understanding Why Write Performance is Better Than Read Performance</title>
    <para>Typically, the performance of write operations on a Lustre cluster is better than that of read operations. When doing writes, all clients are sending write RPCs asynchronously. The RPCs are allocated, and written to disk in the order they arrive. In many cases, this allows the back-end storage to aggregate writes efficiently.</para>
    <para>In the case of read operations, the reads from clients may arrive in a different order and require a lot of seeking to be read from the disk. This noticeably hampers the read throughput.</para>
    <para>Currently, there is no readahead on the OSTs themselves, though the clients do readahead. If there are lots of clients doing reads, it would not be possible to do any readahead in any case because of memory consumption (consider that even a single RPC (1 MB) readahead for 1000 clients would consume 1 GB of RAM).</para>
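    <para>The memory estimate above can be reproduced with shell arithmetic. The variable names below are illustrative only, and the figures are the ones quoted in the text (one 1 MB RPC of readahead for each of 1000 clients):</para>

```shell
# Figures taken from the paragraph above; variable names are illustrative.
clients=1000
readahead_mb_per_client=1    # a single 1 MB RPC of readahead per client

total_mb=$((clients * readahead_mb_per_client))
echo "${total_mb} MB"        # prints "1000 MB", roughly 1 GB of RAM
```

    <para>Scaling either figure, such as more clients or more than one RPC of readahead per client, increases the total proportionally.</para>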
    <para>For file systems that use socklnd (TCP, Ethernet) as the interconnect, there is also additional CPU overhead because the client cannot receive data without copying it from the network buffers. In the write case, the client can send data without the additional data copy. This means that the client is more likely to become CPU-bound during reads than writes.</para>
 719   </section>
 720 </chapter>