+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>LND tuning</secondary>
+ </indexterm>LND Tuning</title>
+ <para>LND tuning allows the number of threads per CPU partition to be
+ specified. An administrator can set the threads for both
+ <literal>ko2iblnd</literal> and
+ <literal>ksocklnd</literal> using the
+ <literal>nscheds</literal> parameter. This adjusts the number of threads for
+ each partition, not the overall number of threads on the LND.</para>
+ <note>
+ <para>The default number of threads for
+ <literal>ko2iblnd</literal> and
+ <literal>ksocklnd</literal> is set automatically and is chosen to
+ work well across a range of typical scenarios, on systems with both
+ high and low core counts.</para>
+ </note>
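+ <para>As LND module parameters, these values are set at module load time.
+ A minimal sketch of a modprobe configuration that sets
+ <literal>nscheds</literal> for both LNDs might look like the following
+ (the file name is illustrative; any file under
+ <literal>/etc/modprobe.d/</literal> works):</para>
+ <screen>
+# /etc/modprobe.d/lustre-lnd.conf (example file name)
+options ko2iblnd nscheds=4
+options ksocklnd nscheds=4
+</screen>
+ <para>The modules must be reloaded (or the node rebooted) for the new
+ values to take effect.</para>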
+ <section>
+ <title>ko2iblnd Tuning</title>
+ <para>The following table outlines the ko2iblnd module parameters to be used
+ for tuning:</para>
+ <informaltable frame="all">
+ <tgroup cols="3">
+ <colspec colname="c1" colwidth="50*" />
+ <colspec colname="c2" colwidth="50*" />
+ <colspec colname="c3" colwidth="50*" />
+ <thead>
+ <row>
+ <entry>
+ <para>
+ <emphasis role="bold">Module Parameter</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Default Value</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Description</emphasis>
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para>
+ <literal>service</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>987</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Service number (within RDMA_PS_TCP).</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>cksum</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Set non-zero to enable message (not RDMA) checksums.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>timeout</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>50</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Timeout in seconds.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>nscheds</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of threads in each scheduler pool (per CPT). A value of
+ zero means the thread count is derived from the number of cores.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>conns_per_peer</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>4 (OmniPath), 1 (Everything else)</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Introduced in 2.10. Number of connections to each peer. Messages
+ are sent round-robin over the connection pool. Provides significant
+ improvement with OmniPath.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>ntx</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>512</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of message descriptors allocated for each pool at
+ startup. Grows at runtime. Shared by all CPTs.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>credits</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>256</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of concurrent sends on network.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_credits</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>8</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of concurrent sends to a single peer. Related to, and
+ limited by, the IB queue size.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_credits_hiw</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>High water mark that determines when to eagerly return
+ credits.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_buffer_credits</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of per-peer router buffer credits.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_timeout</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>180</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of seconds without aliveness news before declaring a peer
+ dead (a value less than or equal to 0 disables this).</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>ipif_name</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>ib0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>IPoIB interface name.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>retry_count</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>5</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of retransmissions when no ACK is received.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>rnr_retry_count</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>6</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of receiver-not-ready (RNR) retransmissions.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>keepalive</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>100</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Idle time in seconds before sending a keepalive.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>ib_mtu</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>IB MTU 256/512/1024/2048/4096.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>concurrent_sends</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Send work-queue sizing. If zero, derived from
+ <literal>map_on_demand</literal> and <literal>peer_credits</literal>.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>map_on_demand</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0 (pre-4.8 Linux) 1 (4.8 Linux onward) 32 (OmniPath)</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of fragments reserved for a connection. If zero, use a
+ global memory region (found to be a security issue). If non-zero, use
+ FMR or FastReg for memory registration. The value must agree between
+ both peers of a connection.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fmr_pool_size</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>512</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Size of the FMR pool on each CPT (>= ntx / 4). Grows at runtime.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fmr_flush_trigger</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>384</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of dirty FMRs that triggers a pool flush.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fmr_cache</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>1</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Non-zero to enable FMR caching.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>dev_failover</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>HCA failover for bonding (0 OFF, 1 ON, other values reserved).
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>require_privileged_port</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Require privileged port when accepting connection.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>use_privileged_port</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>1</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Use privileged port when initiating connection.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>wrq_sge</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>2</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Introduced in 2.10. Number of scatter/gather element groups per
+ work request. Used to deal with fragmentation, which can consume
+ double the number of work requests.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
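+ <para>The current value of any of these parameters can be inspected at
+ runtime through sysfs, provided the <literal>ko2iblnd</literal> module is
+ loaded and the parameter was declared readable. For example, on a system
+ using the defaults shown in the table above:</para>
+ <screen>
+$ cat /sys/module/ko2iblnd/parameters/peer_credits
+8
+$ cat /sys/module/ko2iblnd/parameters/conns_per_peer
+1
+</screen>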
+ </section>
+ </section>
+ <section xml:id="dbdoclet.nrstuning">
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ </indexterm>Network Request Scheduler (NRS) Tuning</title>
+ <para>The Network Request Scheduler (NRS) allows the administrator to
+ influence the order in which RPCs are handled at servers, on a per-PTLRPC
+ service basis, by providing different policies that can be activated and
+ tuned in order to influence the RPC ordering. The aim of this is to provide
+ for better performance, and possibly discrete performance characteristics
+ using future policies.</para>
+ <para>The NRS policy state of a PTLRPC service can be read and set via the
+ <literal>{service}.nrs_policies</literal> tunable. To read a PTLRPC
+ service's NRS policy state, run:</para>
+ <screen>
+lctl get_param {service}.nrs_policies
+</screen>
+ <para>For example, to read the NRS policy state of the
+ <literal>ost_io</literal> service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_policies
+ost.OSS.ost_io.nrs_policies=
+
+regular_requests:
+ - name: fifo
+ state: started
+ fallback: yes
+ queued: 0
+ active: 0
+
+ - name: crrn
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: orr
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: trr
+ state: started
+ fallback: no
+ queued: 2420
+ active: 268
+
+ - name: tbf
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: delay
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+high_priority_requests:
+ - name: fifo
+ state: started
+ fallback: yes
+ queued: 0
+ active: 0
+
+ - name: crrn
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: orr
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: trr
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: tbf
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: delay
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+</screen>
+ <para>NRS policy state is shown in either one or two sections, depending on
+ the PTLRPC service being queried. The first section is named
+ <literal>regular_requests</literal> and is available for all PTLRPC
+ services, optionally followed by a second section which is named
+ <literal>high_priority_requests</literal>. This is because some PTLRPC
+ services are able to treat some types of RPCs as higher priority ones, such
+ that they are handled by the server with higher priority compared to other,
+ regular RPC traffic. For PTLRPC services that do not support high-priority
+ RPCs, you will only see the
+ <literal>regular_requests</literal> section.</para>
+ <para>There is a separate instance of each NRS policy on each PTLRPC
+ service for handling regular and high-priority RPCs (if the service
+ supports high-priority RPCs). For each policy instance, the following
+ fields are shown:</para>
+ <informaltable frame="all">
+ <tgroup cols="2">
+ <colspec colname="c1" colwidth="50*" />
+ <colspec colname="c2" colwidth="50*" />
+ <thead>
+ <row>
+ <entry>
+ <para>
+ <emphasis role="bold">Field</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Description</emphasis>
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para>
+ <literal>name</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>The name of the policy.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>state</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>The state of the policy; this can be any of
+ <literal>invalid, stopping, stopped, starting, started</literal>.
+ A fully enabled policy is in the
+ <literal>started</literal> state.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fallback</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Whether the policy is acting as a fallback policy or not. A
+ fallback policy is used to handle RPCs that other enabled
+ policies fail to handle, or do not support the handling of. The
+ possible values are
+ <literal>no, yes</literal>. Currently, only the FIFO policy can
+ act as a fallback policy.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>queued</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>The number of RPCs that the policy has waiting to be
+ serviced.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>active</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>The number of RPCs that the policy is currently
+ handling.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+ <para>To enable an NRS policy on a PTLRPC service run:</para>
+ <screen>
+lctl set_param {service}.nrs_policies=
+<replaceable>policy_name</replaceable>
+</screen>
+ <para>This will enable the policy
+ <replaceable>policy_name</replaceable> for both regular and high-priority
+ RPCs (if the PTLRPC service supports high-priority RPCs) on the given
+ service. For example, to enable the CRR-N NRS policy for the ldlm_cbd
+ service, run:</para>
+ <screen>
+$ lctl set_param ldlm.services.ldlm_cbd.nrs_policies=crrn
+ldlm.services.ldlm_cbd.nrs_policies=crrn
+
+</screen>
+ <para>For PTLRPC services that support high-priority RPCs, you can also
+ supply an optional
+ <replaceable>reg|hp</replaceable> token, in order to enable an NRS policy
+ for handling only regular or high-priority RPCs on a given PTLRPC service,
+ by running:</para>
+ <screen>
+lctl set_param {service}.nrs_policies="
+<replaceable>policy_name</replaceable>
+<replaceable>reg|hp</replaceable>"
+</screen>
+ <para>For example, to enable the TRR policy for handling only regular, but
+ not high-priority RPCs on the
+ <literal>ost_io</literal> service, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_policies="trr reg"
+ost.OSS.ost_io.nrs_policies="trr reg"
+
+</screen>
+ <note>
+ <para>When enabling an NRS policy, the policy name must be given in
+ lower-case characters, otherwise the operation will fail with an error
+ message.</para>
+ </note>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>first in, first out (FIFO) policy</tertiary>
+ </indexterm>First In, First Out (FIFO) policy</title>
+ <para>The first in, first out (FIFO) policy handles RPCs in a service in
+ the same order as they arrive from the LNet layer, so no special
+ processing takes place to modify the RPC handling stream. FIFO is the
+ default policy for all types of RPCs on all PTLRPC services, and is
+ always enabled irrespective of the state of other policies, so that it
+ can be used as a backup policy, in case a more elaborate policy that has
+ been enabled fails to handle an RPC, or does not support handling a given
+ type of RPC.</para>
+ <para>The FIFO policy has no tunables that adjust its behaviour.</para>
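+ <para>Because FIFO is always available as the fallback, a service can be
+ returned to plain FIFO ordering by enabling it like any other policy. For
+ example, on the <literal>ost_io</literal> service:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_policies="fifo"
+ost.OSS.ost_io.nrs_policies="fifo"
+</screen>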
+ </section>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>client round-robin over NIDs (CRR-N) policy</tertiary>
+ </indexterm>Client Round-Robin over NIDs (CRR-N) policy</title>
+ <para>The client round-robin over NIDs (CRR-N) policy performs batched
+ round-robin scheduling of all types of RPCs, with each batch consisting
+ of RPCs originating from the same client node, as identified by its NID.
+ CRR-N aims to provide for better resource utilization across the cluster,
+ and to help shorten completion times of jobs in some cases, by
+ distributing available bandwidth more evenly across all clients.</para>
+ <para>The CRR-N policy can be enabled on all types of PTLRPC services,
+ and has the following tunable that can be used to adjust its
+ behavior:</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_crrn_quantum</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_crrn_quantum</literal> tunable determines the
+ maximum allowed size of each batch of RPCs; the unit of measure is in
+ number of RPCs. To read the maximum allowed batch size of a CRR-N
+ policy, run:</para>
+ <screen>
+lctl get_param {service}.nrs_crrn_quantum
+</screen>
+ <para>For example, to read the maximum allowed batch size of a CRR-N
+ policy on the ost_io service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_crrn_quantum
+ost.OSS.ost_io.nrs_crrn_quantum=reg_quantum:16
+hp_quantum:8
+
+</screen>
+ <para>You can see that there is a separate maximum allowed batch size
+ value for regular (
+ <literal>reg_quantum</literal>) and high-priority (
+ <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
+ high-priority RPCs).</para>
+ <para>To set the maximum allowed batch size of a CRR-N policy on a
+ given service, run:</para>
+ <screen>
+lctl set_param {service}.nrs_crrn_quantum=
+<replaceable>1-65535</replaceable>
+</screen>
+ <para>This will set the maximum allowed batch size on a given
+ service, for both regular and high-priority RPCs (if the PTLRPC
+ service supports high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the maximum allowed batch size on the
+ ldlm_canceld service to 16 RPCs, run:</para>
+ <screen>
+$ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
+ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
+
+</screen>
+ <para>For PTLRPC services that support high-priority RPCs, you can
+ also specify a different maximum allowed batch size for regular and
+ high-priority RPCs, by running:</para>
+ <screen>
+$ lctl set_param {service}.nrs_crrn_quantum=
+<replaceable>reg_quantum|hp_quantum</replaceable>:
+<replaceable>1-65535</replaceable>
+</screen>
+ <para>For example, to set the maximum allowed batch size on the
+ ldlm_canceld service, for high-priority RPCs to 32, run:</para>
+ <screen>
+$ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum="hp_quantum:32"
+ldlm.services.ldlm_canceld.nrs_crrn_quantum=hp_quantum:32
+
+</screen>
+ <para>By using the last method, you can also set the maximum regular
+ and high-priority RPC batch sizes to different values, in a single
+ command invocation.</para>
+ </listitem>
+ </itemizedlist>
+ </section>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>object-based round-robin (ORR) policy</tertiary>
+ </indexterm>Object-based Round-Robin (ORR) policy</title>
+ <para>The object-based round-robin (ORR) policy performs batched
+ round-robin scheduling of bulk read write (brw) RPCs, with each batch
+ consisting of RPCs that pertain to the same backend-file system object,
+ as identified by its OST FID.</para>
+ <para>The ORR policy is only available for use on the ost_io service. The
+ RPC batches it forms can potentially consist of mixed bulk read and bulk
+ write RPCs. The RPCs in each batch are ordered in an ascending manner,
+ based on either the file offsets, or the physical disk offsets of each
+ RPC (only applicable to bulk read RPCs).</para>
+ <para>The aim of the ORR policy is to provide for increased bulk read
+ throughput in some cases, by ordering bulk read RPCs (and potentially
+ bulk write RPCs), and thus minimizing costly disk seek operations.
+ Performance may also benefit from any resulting improvement in resource
+ utilization, or by taking advantage of better locality of reference
+ between RPCs.</para>
+ <para>The ORR policy has the following tunables that can be used to
+ adjust its behaviour:</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_orr_quantum</literal>
+ </para>
+ <para>The
+ <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable determines
+ the maximum allowed size of each batch of RPCs; the unit of measure
+ is in number of RPCs. To read the maximum allowed batch size of the
+ ORR policy, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_orr_quantum
+ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:256
+hp_quantum:16
+
+</screen>
+ <para>You can see that there is a separate maximum allowed batch size
+ value for regular (
+ <literal>reg_quantum</literal>) and high-priority (
+ <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
+ high-priority RPCs).</para>
+ <para>To set the maximum allowed batch size for the ORR policy,
+ run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
+<replaceable>1-65535</replaceable>
+</screen>
+ <para>This will set the maximum allowed batch size for both regular
+ and high-priority RPCs, to the indicated value.</para>
+ <para>You can also specify a different maximum allowed batch size for
+ regular and high-priority RPCs, by running:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
+<replaceable>reg_quantum|hp_quantum</replaceable>:
+<replaceable>1-65535</replaceable>
+</screen>
+ <para>For example, to set the maximum allowed batch size for regular
+ RPCs to 128, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
+ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
+
+</screen>
+ <para>By using the last method, you can also set the maximum regular
+ and high-priority RPC batch sizes to different values, in a single
+ command invocation.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal>
+ </para>
+ <para>The
+ <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable
+ determines whether the ORR policy orders RPCs within each batch based
+ on logical file offsets or physical disk offsets. To read the offset
+ type value for the ORR policy, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_orr_offset_type
+ost.OSS.ost_io.nrs_orr_offset_type=reg_offset_type:physical
+hp_offset_type:logical
+
+</screen>
+ <para>You can see that there is a separate offset type value for
+ regular (
+ <literal>reg_offset_type</literal>) and high-priority (
+ <literal>hp_offset_type</literal>) RPCs.</para>
+ <para>To set the ordering type for the ORR policy, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
+<replaceable>physical|logical</replaceable>
+</screen>
+ <para>This will set the offset type for both regular and
+ high-priority RPCs, to the indicated value.</para>
+ <para>You can also specify a different offset type for regular and
+ high-priority RPCs, by running:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
+<replaceable>reg_offset_type|hp_offset_type</replaceable>:
+<replaceable>physical|logical</replaceable>
+</screen>
+ <para>For example, to set the offset type for high-priority RPCs to
+ physical disk offsets, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
+ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
+</screen>
+ <para>By using the last method, you can also set offset type for
+ regular and high-priority RPCs to different values, in a single
+ command invocation.</para>
+ <note>
+ <para>Irrespective of the value of this tunable, only logical
+ offsets can be, and are, used for ordering bulk write RPCs.</para>
+ </note>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_orr_supported</literal>
+ </para>
+ <para>The
+ <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable determines
+ the type of RPCs that the ORR policy will handle. To read the types
+ of supported RPCs by the ORR policy, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_orr_supported
+ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads
+hp_supported:reads_and_writes
+
+</screen>
+ <para>You can see that there is a separate supported 'RPC types'
+ value for regular (
+ <literal>reg_supported</literal>) and high-priority (
+ <literal>hp_supported</literal>) RPCs.</para>
+ <para>To set the supported RPC types for the ORR policy, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
+<replaceable>reads|writes|reads_and_writes</replaceable>
+</screen>
+ <para>This will set the supported RPC types for both regular and
+ high-priority RPCs, to the indicated value.</para>
+ <para>You can also specify a different supported 'RPC types' value
+ for regular and high-priority RPCs, by running:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
+<replaceable>reg_supported|hp_supported</replaceable>:
+<replaceable>reads|writes|reads_and_writes</replaceable>
+</screen>
+ <para>For example, to set the supported RPC types to bulk read and
+ bulk write RPCs for regular requests, run:</para>
+ <screen>
+$ lctl set_param
+ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
+ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
+
+</screen>
+ <para>By using the last method, you can also set the supported RPC
+ types for regular and high-priority RPC to different values, in a
+ single command invocation.</para>
+ </listitem>
+ </itemizedlist>
+ </section>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>Target-based round-robin (TRR) policy</tertiary>
+ </indexterm>Target-based Round-Robin (TRR) policy</title>
+ <para>The target-based round-robin (TRR) policy performs batched
+ round-robin scheduling of brw RPCs, with each batch consisting of RPCs
+ that pertain to the same OST, as identified by its OST index.</para>
+ <para>The TRR policy is identical to the object-based round-robin (ORR)
+ policy, apart from using the brw RPC's target OST index instead of the
+ backend-fs object's OST FID, for determining the RPC scheduling order.
+ The goals of TRR are effectively the same as for ORR, and it uses the
+ following tunables to adjust its behaviour:</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_trr_quantum</literal>
+ </para>
+ <para>The purpose of this tunable is exactly the same as for the
+ <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable for the ORR
+ policy, and you can use it in exactly the same way.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_trr_offset_type</literal>
+ </para>
+ <para>The purpose of this tunable is exactly the same as for the
+ <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable for the
+ ORR policy, and you can use it in exactly the same way.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_trr_supported</literal>
+ </para>
+ <para>The purpose of this tunable is exactly the same as for the
+ <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable for the
+ ORR policy, and you can use it in exactly the same way.</para>
+ </listitem>
+ </itemizedlist>
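+ <para>These tunables follow the same syntax as their ORR counterparts. For
+ example, to set the maximum allowed batch size of the TRR policy for
+ regular RPCs to 64 (the value 64 is illustrative), run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_trr_quantum=reg_quantum:64
+ost.OSS.ost_io.nrs_trr_quantum=reg_quantum:64
+</screen>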
+ </section>
+ <section xml:id="dbdoclet.tbftuning" condition='l26'>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>Token Bucket Filter (TBF) policy</tertiary>
+ </indexterm>Token Bucket Filter (TBF) policy</title>
+ <para>The TBF (Token Bucket Filter) is a Lustre NRS policy which enables
+ Lustre services to enforce the RPC rate limit on clients/jobs for QoS
+ (Quality of Service) purposes.</para>
+ <figure>
+ <title>The internal structure of TBF policy</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata scalefit="1" width="50%"
+ fileref="figures/TBF_policy.png" />
+ </imageobject>
+ <textobject>
+ <phrase>The internal structure of TBF policy</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+ <para>When an RPC request arrives, the TBF policy puts it into a waiting
+ queue according to its classification. RPC requests are classified by
+ either the NID or the JobID of the RPC, depending on how TBF is
+ configured. The TBF policy maintains multiple queues in the system, one
+ queue for each category in the classification of RPC requests. Requests
+ wait for tokens in their FIFO queue before being handled, so as to keep
+ the RPC rates under the configured limits.</para>
+ <para>When Lustre services are too busy to handle all of the requests in
+ time, not all of the specified rates of the queues can be satisfied.
+ Nothing bad happens except that some RPC rates are slower than
+ configured. In this case, a queue with a higher rate will have an
+ advantage over queues with lower rates, but none of them will be
+ starved.</para>
+ <para>To manage the RPC rate of queues, the rate of each queue does not
+ need to be set manually. Instead, rules are defined that the TBF policy
+ matches to determine RPC rate limits. All of the defined rules are
+ organized as an ordered list. Whenever a queue is newly created, it goes
+ through the rule list and takes the first matching rule as its rule, so
+ that the queue knows its RPC token rate. A rule can be added to or
+ removed from the list at run time. Whenever the list of rules is
+ changed, the queues update their matched rules.</para>
+ <section remap="h4">
+ <title>Enable TBF policy</title>
+ <para>Command:</para>
+      <screen>lctl set_param ost.OSS.ost_io.nrs_policies="tbf <replaceable>policy</replaceable>"
+ </screen>
+ <para>RPCs can be classified into different types according to their
+ NID, JobID, opcode, or UID/GID. When enabling the TBF policy, you can
+ specify one of these types, or just use "tbf" to enable all of them for
+ fine-grained RPC request classification.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf opcode"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf uid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf gid"</screen>
+ </section>
+ <section remap="h4">
+ <title>Start a TBF rule</title>
+ <para>The TBF rule is defined in the parameter
+ <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>.</para>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>..."
+ </screen>
+ <para>'<replaceable>rule_name</replaceable>' is a string of the TBF
+ policy rule's name and '<replaceable>arguments</replaceable>' is a
+ string to specify the detailed rule according to the different types.
+ </para>
+ <itemizedlist>
+ <para>Next, the different types of TBF policies will be described.</para>
+ <listitem>
+ <para><emphasis role="bold">NID based TBF policy</emphasis></para>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> nid={<replaceable>nidlist</replaceable>} rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>'<replaceable>nidlist</replaceable>' uses the same format
+ as configuring LNET route. '<replaceable>rate</replaceable>' is
+ the (upper limit) RPC rate of the rule.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start other_clients nid={192.168.*.*@tcp} rate=50"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start computes nid={192.168.1.[2-128]@tcp} rate=500"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
+          <para>In this example, RPC requests from compute nodes are
+          processed at up to five times the rate of requests from login
+          nodes. The output of
+          <literal>ost.OSS.ost_io.nrs_tbf_rule</literal> looks like
+          this:</para>
+ <screen>lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+loginnode {192.168.1.1@tcp} 100, ref 0
+computes {192.168.1.[2-128]@tcp} 500, ref 0
+other_clients {192.168.*.*@tcp} 50, ref 0
+default {*} 10000, ref 0
+high_priority_requests:
+CPT 0:
+loginnode {192.168.1.1@tcp} 100, ref 0
+computes {192.168.1.[2-128]@tcp} 500, ref 0
+other_clients {192.168.*.*@tcp} 50, ref 0
+default {*} 10000, ref 0</screen>
+ <para>Also, the rule can be written in <literal>reg</literal> and
+ <literal>hp</literal> formats:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start loginnode nid={192.168.1.1@tcp} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">JobID based TBF policy</emphasis></para>
+ <para>For more details about the JobID, see
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="dbdoclet.jobstats" />.</para>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> jobid={<replaceable>jobid_list</replaceable>} rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>Wildcards are supported in
+ {<replaceable>jobid_list</replaceable>}.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user jobid={iozone.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start dd_user jobid={dd.*} rate=50"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 jobid={*.600} rate=10"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user2 jobid={io*.10* *.500} rate=200"</screen>
+ <para>Also, the rule can be written in <literal>reg</literal> and
+ <literal>hp</literal> formats:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start iozone_user1 jobid={iozone.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start iozone_user1 jobid={iozone.500} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">Opcode based TBF policy</emphasis></para>
+ <para>Command:</para>
+ <screen>$ lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> opcode={<replaceable>opcode_list</replaceable>} rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 opcode={ost_read} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user1 opcode={ost_read ost_write} rate=200"</screen>
+ <para>Also, the rule can be written in <literal>reg</literal> and
+ <literal>hp</literal> formats:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start iozone_user1 opcode={ost_read} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start iozone_user1 opcode={ost_read} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">UID/GID based TBF policy</emphasis></para>
+ <para>Command:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"[reg|hp] start <replaceable>rule_name</replaceable> uid={<replaceable>uid</replaceable>} rate=<replaceable>rate</replaceable>"
+$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"[reg|hp] start <replaceable>rule_name</replaceable> gid={<replaceable>gid</replaceable>} rate=<replaceable>rate</replaceable>"</screen>
+ <para>Example:</para>
+ <para>Limit the rate of RPC requests from UID 500:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name uid={500} rate=100"</screen>
+ <para>Limit the rate of RPC requests from GID 500:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name gid={500} rate=100"</screen>
+ <para>The following rules can be used to control all requests
+ sent to the MDS:</para>
+ <para>Start the TBF UID QoS policy on the MDS:</para>
+ <screen>$ lctl set_param mds.MDS.*.nrs_policies="tbf uid"</screen>
+ <para>Limit the rate of RPC requests from UID 500:</para>
+ <screen>$ lctl set_param mds.MDS.*.nrs_tbf_rule=\
+"start tbf_name uid={500} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">Policy combination</emphasis></para>
+ <para>To support TBF rules with complex condition expressions,
+ the TBF classifier has been extended to classify RPCs at a finer
+ granularity. This feature supports logical conjunction and
+ disjunction operations among conditions of different types.
+ In a rule,
+ "&amp;" represents conjunction (AND) and
+ "," represents disjunction (OR).</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start comp_rule opcode={ost_write}&amp;jobid={dd.0},\
+nid={192.168.1.[1-128]@tcp 0@lo} rate=100"</screen>
+ <para>In this example, RPCs whose <literal>opcode</literal> is
+ ost_write and whose <literal>jobid</literal> is dd.0, or whose
+ <literal>nid</literal> matches
+ {192.168.1.[1-128]@tcp 0@lo}, will be processed at a rate of 100
+ req/sec.
+ The output of <literal>ost.OSS.ost_io.nrs_tbf_rule</literal> looks
+ like:</para>
+ <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+CPT 1:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+high_priority_requests:
+CPT 0:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+CPT 1:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name uid={500}&amp;gid={500} rate=100"</screen>
+ <para>In this example, RPC requests whose UID is 500 and whose
+ GID is 500 will be processed at a rate of 100 req/sec.</para>
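+ <para>The matching logic described above can be sketched as
+ follows. This is an illustrative model only, not the Lustre
+ classifier: the rule is represented directly as data rather than
+ parsed, and NID matching is simplified to shell-style wildcards.
+ </para>

```python
# Illustrative sketch of TBF "policy combination" matching -- NOT the
# Lustre implementation. Assumed semantics: "&" joins conditions into
# a conjunction group, "," separates groups into a disjunction, and an
# RPC matches a rule if any one group matches all of its conditions.
from fnmatch import fnmatch

def group_matches(group, rpc):
    # Conjunction: every condition type in the group must match.
    return all(any(fnmatch(rpc[key], pat) for pat in patterns)
               for key, patterns in group.items())

def rule_matches(groups, rpc):
    # Disjunction: at least one group must match.
    return any(group_matches(g, rpc) for g in groups)

# "opcode={ost_write}&jobid={dd.0},nid={192.168.1.*@tcp}" as data
# (the NID range is simplified to a wildcard for this sketch):
comp_rule = [
    {"opcode": ["ost_write"], "jobid": ["dd.0"]},  # first disjunct
    {"nid": ["192.168.1.*@tcp"]},                  # second disjunct
]

print(rule_matches(comp_rule,
      {"opcode": "ost_write", "jobid": "dd.0", "nid": "10.0.0.1@tcp"}))
# -> True (first conjunction group matches)
print(rule_matches(comp_rule,
      {"opcode": "ost_read", "jobid": "cp.1", "nid": "192.168.1.7@tcp"}))
# -> True (second group matches on NID)
print(rule_matches(comp_rule,
      {"opcode": "ost_write", "jobid": "cp.1", "nid": "10.0.0.1@tcp"}))
# -> False (no group matches in full)
```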
+ </listitem>
+ </itemizedlist>
+ </section>
+ <section remap="h4">
+ <title>Change a TBF rule</title>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"change loginnode rate=200"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg change loginnode rate=200"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp change loginnode rate=200"
+</screen>
+ </section>
+ <section remap="h4">
+ <title>Stop a TBF rule</title>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop
+<replaceable>rule_name</replaceable>"</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"</screen>
+ </section>
+ <section remap="h4">
+ <title>Rule options</title>
+ <para>To support more flexible rule conditions, the following options
+ are added.</para>
+ <itemizedlist>
+ <listitem>
+ <para><emphasis role="bold">Reordering of TBF rules</emphasis></para>
+ <para>By default, a newly started rule takes precedence over
+ existing rules, but the rank of a rule can be set by specifying
+ the '<literal>rank=</literal>' argument when inserting a new rule
+ with the "<literal>start</literal>" command. It can also be
+ changed with the "<literal>change</literal>" command.
+ </para>
+ <para>Command:</para>
+ <screen>lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
+"start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... rank=<replaceable>obj_rule_name</replaceable>"
+lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
+"change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable> rank=<replaceable>obj_rule_name</replaceable>"
+</screen>
+ <para>By specifying an existing rule
+ '<replaceable>obj_rule_name</replaceable>', the new rule
+ '<replaceable>rule_name</replaceable>' is placed in front of
+ '<replaceable>obj_rule_name</replaceable>'.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start computes nid={192.168.1.[2-128]@tcp} rate=500"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 jobid={iozone.500 dd.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user1 opcode={ost_read ost_write} rate=200 rank=computes"</screen>
+ <para>In this example, rule "iozone_user1" is placed in front of
+ rule "computes". The resulting order can be seen with the
+ following command:
+ </para>
+ <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+CPT 1:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+high_priority_requests:
+CPT 0:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+CPT 1:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0</screen>
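+ <para>The first-match ordering shown above can be sketched with a
+ simple model (illustrative only, not the Lustre code; the
+ predicates are hypothetical stand-ins for the rules' conditions):
+ </para>

```python
# Illustrative sketch of TBF rule ordering -- NOT the Lustre code.
# Rules form an ordered list; an RPC is classified by the first rule
# it matches, so "rank=<obj_rule_name>" sets priority by position.

def start_rule(rules, name, pred, rate, rank=None):
    entry = (name, pred, rate)
    if rank is None:
        rules.insert(0, entry)     # a new rule goes in front by default
    else:
        idx = [n for n, _, _ in rules].index(rank)
        rules.insert(idx, entry)   # ...or in front of the named rule

def classify(rules, rpc):
    for name, pred, rate in rules:
        if pred(rpc):
            return name, rate
    return "default", 10000        # fallback rule, lowest priority

rules = []
start_rule(rules, "computes", lambda r: r["nid"].startswith("192.168.1."), 500)
start_rule(rules, "user1", lambda r: r["jobid"] in ("iozone.500", "dd.500"), 100)
start_rule(rules, "iozone_user1",
           lambda r: r["opcode"] in ("ost_read", "ost_write"), 200,
           rank="computes")
print([n for n, _, _ in rules])
# -> ['user1', 'iozone_user1', 'computes']

# An RPC matching several rules is limited by the highest-ranked one:
print(classify(rules, {"nid": "192.168.1.5@tcp", "jobid": "iozone.500",
                       "opcode": "ost_read"}))
# -> ('user1', 100)
```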
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">TBF realtime policies under congestion
+ </emphasis></para>
+ <para>During TBF evaluation, it was found that when the sum of the
+ I/O bandwidth requirements of all classes exceeds the system
+ capacity, classes configured with the same rate limit may receive
+ less bandwidth than their configured share. The reason is that the
+ heavy load on a congested server causes some classes to miss
+ deadlines, so the number of tokens calculated at dequeue time may
+ be larger than 1. In the original implementation, all classes are
+ handled equally and any excess tokens are simply discarded.</para>
+ <para>To address this, a Hard Token Compensation (HTC) strategy
+ has been implemented. A class is given the HTC feature by the rule
+ it matches. The feature means that requests in such a class queue
+ have high real-time requirements and that their bandwidth
+ assignment must be satisfied as well as possible. When a deadline
+ miss happens, the class keeps its deadline unchanged and the time
+ residue (the remainder of the elapsed time divided by 1/r) is
+ compensated to the next round. This ensures that the next idle I/O
+ thread will keep selecting this class to serve until all
+ accumulated excess tokens are consumed or there are no pending
+ requests left in the class queue.</para>
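+ <para>The effect of token compensation can be sketched with a toy
+ token-bucket model (an illustration of the idea only, with
+ made-up numbers; capping the default case at a single token is a
+ simplification, not the actual Lustre mechanics):
+ </para>

```python
# Toy token-bucket model illustrating Hard Token Compensation (HTC).
# NOT the Lustre implementation; the numbers and the one-token cap in
# the default case are simplifications for illustration.

def dispatch_counts(check_times, rate, htc, backlog):
    """Requests dispatched at each service opportunity for a class
    rate-limited to `rate` req/sec with `backlog` queued requests."""
    tokens, last, sent = 0.0, 0.0, []
    for now in check_times:
        tokens += (now - last) * rate    # tokens accrue at `rate`
        last = now
        if not htc:
            tokens = min(tokens, 1.0)    # default: discard excess tokens
        n = min(int(tokens), backlog - sum(sent))
        tokens -= n                      # HTC: residue carries forward
        sent.append(n)
    return sent

# A congested server only reaches this class once per second. At
# rate=5 req/sec the class is entitled to 5 requests per visit:
print(dispatch_counts([1, 2], rate=5, htc=False, backlog=20))  # -> [1, 1]
print(dispatch_counts([1, 2], rate=5, htc=True, backlog=20))   # -> [5, 5]
```

With HTC enabled the class catches up on its entitled rate after each missed deadline; with the default behavior the accumulated tokens are lost and the class falls short of its configured rate.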
+ <para>Command:</para>
+ <para>A new command format is added to enable the realtime feature
+ for a rule:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=\
+"start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... realtime=1"</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start realjob jobid={dd.0} rate=100 realtime=1"</screen>
+ <para>This example rule means that RPC requests whose JobID is
+ dd.0 will be processed at a rate of 100 req/sec with the realtime
+ feature enabled.</para>
+ </listitem>
+ </itemizedlist>
+ </section>
+ </section>
+ <section xml:id="dbdoclet.delaytuning" condition='l2A'>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>Delay policy</tertiary>
+ </indexterm>Delay policy</title>
+ <para>The NRS Delay policy seeks to perturb the timing of request
+ processing at the PtlRPC layer, with the goal of simulating high server
+ load, and finding and exposing timing related problems. When this policy
+ is active, upon arrival of a request the policy will calculate an offset,
+ within a defined, user-configurable range, from the request arrival
+ time, to determine a time after which the request should be handled.
+ The request is then stored in a binary heap (the cfs_binheap
+ implementation), which sorts requests according to their assigned
+ start times. Requests are removed from the binheap for handling
+ once their start time has passed.</para>
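+ <para>The scheduling described above can be sketched as follows (a
+ minimal model, assuming a uniformly random offset and using
+ Python's heapq in place of the kernel's cfs_binheap):
+ </para>

```python
# Minimal sketch of the NRS Delay policy's scheduling -- NOT the
# Lustre implementation. Assumes a uniformly random offset within
# [delay_min, delay_max] and uses Python's heapq as the binheap.
import heapq
import random

class DelayQueue:
    def __init__(self, delay_min=5, delay_max=300):   # defaults in seconds
        self.delay_min, self.delay_max = delay_min, delay_max
        self.heap = []                                # keyed on start time

    def enqueue(self, req, now):
        # Pick a start time within [now+delay_min, now+delay_max].
        start = now + random.uniform(self.delay_min, self.delay_max)
        heapq.heappush(self.heap, (start, req))

    def dequeue_ready(self, now):
        # Requests become eligible once their start time has passed.
        ready = []
        while self.heap and self.heap[0][0] <= now:
            ready.append(heapq.heappop(self.heap)[1])
        return ready

q = DelayQueue(delay_min=1, delay_max=2)
q.enqueue("RPC-1", now=0)
print(q.dequeue_ready(now=0))   # -> [] (still delayed)
print(q.dequeue_ready(now=3))   # -> ['RPC-1'] (past its start time)
```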
+ <para>The Delay policy can be enabled on all types of PtlRPC services,
+ and has the following tunables that can be used to adjust its behavior:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_delay_min</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_delay_min</literal> tunable controls the
+ minimum amount of time, in seconds, that a request will be delayed by
+ this policy. The default is 5 seconds. To read this value run:</para>
+ <screen>
+lctl get_param {service}.nrs_delay_min</screen>
+ <para>For example, to read the minimum delay set on the ost_io
+ service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_delay_min
+ost.OSS.ost_io.nrs_delay_min=reg_delay_min:5
+hp_delay_min:5</screen>
+ <para>To set the minimum delay in RPC processing, run:</para>
+ <screen>
+lctl set_param {service}.nrs_delay_min=<replaceable>0-65535</replaceable></screen>
+ <para>This will set the minimum delay time on a given service, for both
+ regular and high-priority RPCs (if the PtlRPC service supports
+ high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the minimum delay time on the ost_io service
+ to 10, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_min=10
+ost.OSS.ost_io.nrs_delay_min=10</screen>
+ <para>For PtlRPC services that support high-priority RPCs, to set a
+ different minimum delay time for regular and high-priority RPCs, run:
+ </para>
+ <screen>
+lctl set_param {service}.nrs_delay_min=<replaceable>reg_delay_min|hp_delay_min</replaceable>:<replaceable>0-65535</replaceable>
+ </screen>
+ <para>For example, to set the minimum delay time on the ost_io service
+ for high-priority RPCs to 3, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3
+ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3</screen>
+ <para>Note, in all cases the minimum delay time cannot exceed the
+ maximum delay time.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_delay_max</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_delay_max</literal> tunable controls the
+ maximum amount of time, in seconds, that a request will be delayed by
+ this policy. The default is 300 seconds. To read this value run:
+ </para>
+ <screen>lctl get_param {service}.nrs_delay_max</screen>
+ <para>For example, to read the maximum delay set on the ost_io
+ service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_delay_max
+ost.OSS.ost_io.nrs_delay_max=reg_delay_max:300
+hp_delay_max:300</screen>
+ <para>To set the maximum delay in RPC processing, run:</para>
+ <screen>lctl set_param {service}.nrs_delay_max=<replaceable>0-65535</replaceable>
+</screen>
+ <para>This will set the maximum delay time on a given service, for both
+ regular and high-priority RPCs (if the PtlRPC service supports
+ high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the maximum delay time on the ost_io service
+ to 60, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_max=60
+ost.OSS.ost_io.nrs_delay_max=60</screen>
+ <para>For PtlRPC services that support high-priority RPCs, to set a
+ different maximum delay time for regular and high-priority RPCs, run:
+ </para>
+ <screen>lctl set_param {service}.nrs_delay_max=<replaceable>reg_delay_max|hp_delay_max</replaceable>:<replaceable>0-65535</replaceable></screen>
+ <para>For example, to set the maximum delay time on the ost_io service
+ for high-priority RPCs to 30, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30
+ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30</screen>
+ <para>Note, in all cases the maximum delay time cannot be less than the
+ minimum delay time.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_delay_pct</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_delay_pct</literal> tunable controls the
+ percentage of requests that will be delayed by this policy. The
+ default is 100. Note that when a request is not selected for
+ handling by the delay policy due to this variable, the request is
+ handled by whatever fallback policy is defined for that service.
+ If no other fallback policy is defined, the request is handled by
+ the FIFO policy. To read this value run:</para>
+ <screen>lctl get_param {service}.nrs_delay_pct</screen>
+ <para>For example, to read the percentage of requests being delayed on
+ the ost_io service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_delay_pct
+ost.OSS.ost_io.nrs_delay_pct=reg_delay_pct:100
+hp_delay_pct:100</screen>
+ <para>To set the percentage of delayed requests, run:</para>
+ <screen>
+lctl set_param {service}.nrs_delay_pct=<replaceable>0-100</replaceable></screen>
+ <para>This will set the percentage of requests delayed on a given
+ service, for both regular and high-priority RPCs (if the PtlRPC service
+ supports high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the percentage of delayed requests on the
+ ost_io service to 50, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=50
+ost.OSS.ost_io.nrs_delay_pct=50
+</screen>
+ <para>For PtlRPC services that support high-priority RPCs, to set a
+ different delay percentage for regular and high-priority RPCs, run:
+ </para>
+ <screen>lctl set_param {service}.nrs_delay_pct=<replaceable>reg_delay_pct|hp_delay_pct</replaceable>:<replaceable>0-100</replaceable>
+</screen>
+ <para>For example, to set the percentage of delayed requests on the
+ ost_io service for high-priority RPCs to 5, run:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
+ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
+</screen>
+ </listitem>
+ </itemizedlist>
+ </section>