LUDOC-11 osc: document tunable parameters
[doc/manual.git] / LustreTuning.xml
index 76a7f7d..e4970f1 100644
@@ -94,8 +94,8 @@ options ost oss_num_threads={N}
 lctl {get,set}_param {service}.thread_{min,max,started}
 </screen>
       </para>
-         <para condition='l23'>Lustre software release 2.3 introduced binding
-      service threads to CPU partition. This works in a similar fashion to 
+      <para>
+      Binding OSS service threads to CPU partitions works in a similar
+      fashion to 
       binding of threads on MDS. MDS thread tuning is covered in 
       <xref linkend="dbdoclet.mdsbinding" />.</para>
       <itemizedlist>
@@ -125,9 +125,7 @@ lctl {get,set}_param {service}.thread_{min,max,started}
       <literal>mds_num_threads</literal> parameter enables the number of MDS
       service threads to be specified at module load time on the MDS
       node:</para>
-      <screen>
-options mds mds_num_threads={N}
-</screen>
+      <screen>options mds mds_num_threads={N}</screen>
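+      <para>For example, to start 128 MDS service threads at module load
+      time (the value shown is illustrative only), the option could be added
+      to a modprobe configuration file such as
+      <literal>/etc/modprobe.d/lustre.conf</literal>:</para>
+      <screen>options mds mds_num_threads=128</screen>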
       <para>After startup, the minimum and maximum number of MDS thread counts
       can be set via the 
       <literal>{service}.thread_{min,max,started}</literal> tunable. To change
@@ -139,19 +137,20 @@ lctl {get,set}_param {service}.thread_{min,max,started}
       </para>
       <para>For details, see 
       <xref linkend="dbdoclet.50438271_87260" />.</para>
-      <para>At this time, no testing has been done to determine the optimal
-      number of MDS threads. The default value varies, based on server size, up
-      to a maximum of 32. The maximum number of threads (
-      <literal>MDS_MAX_THREADS</literal>) is 512.</para>
+      <para>The number of MDS service threads started depends on system size
+      and the load on the server, and has a default maximum of 64. The
+      maximum potential number of threads (<literal>MDS_MAX_THREADS</literal>)
+      is 1024.</para>
       <note>
-        <para>The OSS and MDS automatically start new service threads
-        dynamically, in response to server load within a factor of 4. The
-        default value is calculated the same way as before. Setting the 
-        <literal>_mu_threads</literal> module parameter disables automatic
-        thread creation behavior.</para>
+        <para>The OSS and MDS start two threads per service per CPT at mount
+       time, and dynamically increase the number of running service threads in
+       response to server load. Setting the <literal>*_num_threads</literal>
+       module parameter starts the specified number of threads for that
+       service immediately and disables automatic thread creation behavior.
+       </para>
       </note>
-      <para>Lustre software release 2.3 introduced new parameters to provide
-      more control to administrators.</para>
+      <para condition='l23'>Lustre software release 2.3 introduced new
+      parameters to provide more control to administrators.</para>
       <itemizedlist>
         <listitem>
           <para>
@@ -166,12 +165,6 @@ lctl {get,set}_param {service}.thread_{min,max,started}
           release 1.8.</para>
         </listitem>
       </itemizedlist>
-      <note>
-        <para>Default values for the thread counts are automatically selected.
-        The values are chosen to best exploit the number of CPUs present in the
-        system and to provide best overall performance for typical
-        workloads.</para>
-      </note>
     </section>
   </section>
   <section xml:id="dbdoclet.mdsbinding" condition='l23'>
@@ -182,10 +175,13 @@ lctl {get,set}_param {service}.thread_{min,max,started}
     </indexterm>Binding MDS Service Thread to CPU Partitions</title>
     <para>With the introduction of Node Affinity (
     <xref linkend="nodeaffdef" />) in Lustre software release 2.3, MDS threads
-    can be bound to particular CPU partitions (CPTs). Default values for
+    can be bound to particular CPU partitions (CPTs) to improve CPU cache
+    usage and memory locality.  Default values for CPT counts and CPU core
     bindings are selected automatically to provide good overall performance for
    a given CPU count. However, an administrator can deviate from these settings
-    if they choose.</para>
+    if they choose.  For details on specifying the mapping of CPU cores to
+    CPTs see <xref linkend="dbdoclet.libcfstuning"/>.
+    </para>
     <itemizedlist>
       <listitem>
         <para>
@@ -224,14 +220,14 @@ options mdt mds_num_cpts=[0]</screen>
   <section xml:id="dbdoclet.50438272_73839">
     <title>
     <indexterm>
-      <primary>LNET</primary>
+      <primary>LNet</primary>
       <secondary>tuning</secondary>
     </indexterm>
     <indexterm>
       <primary>tuning</primary>
-      <secondary>LNET</secondary>
-    </indexterm>Tuning LNET Parameters</title>
-    <para>This section describes LNET tunables, the use of which may be
+      <secondary>LNet</secondary>
+    </indexterm>Tuning LNet Parameters</title>
+    <para>This section describes LNet tunables, the use of which may be
     necessary on some systems to improve performance. To test the performance
     of your Lustre network, see 
     <xref linkend='lnetselftest' />.</para>
@@ -281,7 +277,7 @@ options ksocklnd enable_irq_affinity=0
       <para>Lustre software release 2.3 and beyond provide enhanced network
       interface control. The enhancement means that an administrator can bind
       an interface to one or more CPU partitions. Bindings are specified as
-      options to the LNET modules. For more information on specifying module
+      options to the LNet modules. For more information on specifying module
       options, see 
       <xref linkend="dbdoclet.50438293_15350" /></para>
       <para>For example, 
@@ -302,11 +298,11 @@ options ksocklnd enable_irq_affinity=0
       <para>Network interface (NI) credits are shared across all CPU partitions
       (CPT). For example, if a machine has four CPTs and the number of NI
       credits is 512, then each partition has 128 credits. If a large number of
-      CPTs exist on the system, LNET checks and validates the NI credits for
+      CPTs exist on the system, LNet checks and validates the NI credits for
       each CPT to ensure each CPT has a workable number of credits. For
       example, if a machine has 16 CPTs and the number of NI credits is 256,
       then each partition only has 16 credits. 16 NI credits is low and could
-      negatively impact performance. As a result, LNET automatically adjusts
+      negatively impact performance. As a result, LNet automatically adjusts
       the credits to 8*
       <literal>peer_credits</literal>(
       <literal>peer_credits</literal> is 8 by default), so each partition has 64
@@ -314,7 +310,7 @@ options ksocklnd enable_irq_affinity=0
       <para>Increasing the number of 
       <literal>credits</literal>/
       <literal>peer_credits</literal> can improve the performance of high
-      latency networks (at the cost of consuming more memory) by enabling LNET
+      latency networks (at the cost of consuming more memory) by enabling LNet
       to send more inflight messages to a specific network/peer and keep the
       pipeline saturated.</para>
       <para>An administrator can modify the NI credit count using 
@@ -329,7 +325,7 @@ ksocklnd credits=256
 ko2iblnd credits=256
 </screen>
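+      <para>The <literal>peer_credits</literal> setting is likewise an LND
+      module parameter and can be raised in the same way; the values below
+      are illustrative only, not recommendations:</para>
+      <screen>
+ksocklnd peer_credits=16
+ko2iblnd peer_credits=16
+</screen>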
       <note condition="l23">
-        <para>In Lustre software release 2.3 and beyond, LNET may revalidate
+        <para>In Lustre software release 2.3 and beyond, LNet may revalidate
         the NI credits, so the administrator's request may not persist.</para>
       </note>
     </section>
@@ -339,7 +335,7 @@ ko2iblnd credits=256
         <primary>tuning</primary>
         <secondary>router buffers</secondary>
       </indexterm>Router Buffers</title>
-      <para>When a node is set up as an LNET router, three pools of buffers are
+      <para>When a node is set up as an LNet router, three pools of buffers are
       allocated: tiny, small and large. These pools are allocated per CPU
       partition and are used to buffer messages that arrive at the router to be
       forwarded to the next hop. The three different buffer sizes accommodate
@@ -371,7 +367,7 @@ ko2iblnd credits=256
         </listitem>
       </itemizedlist>
       <para>The default setting for router buffers typically results in
-      acceptable performance. LNET automatically sets a default value to reduce
+      acceptable performance. LNet automatically sets a default value to reduce
       the likelihood of resource starvation. The size of a router buffer can be
       modified as shown in the example below. In this example, the size of the
       large buffer is modified using the 
@@ -380,7 +376,7 @@ ko2iblnd credits=256
 lnet large_router_buffers=8192
 </screen>
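+      <para>The tiny and small buffer pools can be resized in the same
+      manner; the values shown are illustrative only:</para>
+      <screen>
+lnet tiny_router_buffers=2048
+lnet small_router_buffers=16384
+</screen>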
       <note condition="l23">
-        <para>In Lustre software release 2.3 and beyond, LNET may revalidate
+        <para>In Lustre software release 2.3 and beyond, LNet may revalidate
         the router buffer setting, so the administrator's request may not
         persist.</para>
       </note>
@@ -391,15 +387,15 @@ lnet large_router_buffers=8192
         <primary>tuning</primary>
         <secondary>portal round-robin</secondary>
       </indexterm>Portal Round-Robin</title>
-      <para>Portal round-robin defines the policy LNET applies to deliver
+      <para>Portal round-robin defines the policy LNet applies to deliver
      events and messages to the upper layers. The upper layers are PTLRPC
-      service or LNET selftest.</para>
-      <para>If portal round-robin is disabled, LNET will deliver messages to
+      service or LNet selftest.</para>
+      <para>If portal round-robin is disabled, LNet will deliver messages to
       CPTs based on a hash of the source NID. Hence, all messages from a
       specific peer will be handled by the same CPT. This can reduce data
       traffic between CPUs. However, for some workloads, this behavior may
      result in poorly balanced load across the CPUs.</para>
-      <para>If portal round-robin is enabled, LNET will round-robin incoming
+      <para>If portal round-robin is enabled, LNet will round-robin incoming
      events across all CPTs. This may balance load better across the CPUs
      but can incur cross-CPU overhead.</para>
       <para>The current policy can be changed by an administrator with 
@@ -439,7 +435,7 @@ lnet large_router_buffers=8192
       </itemizedlist>
     </section>
     <section>
-      <title>LNET Peer Health</title>
+      <title>LNet Peer Health</title>
       <para>Two options are available to help determine peer health:
       <itemizedlist>
         <listitem>
@@ -449,7 +445,7 @@ lnet large_router_buffers=8192
           <literal>peer_timeout</literal> is set to 
           <literal>180sec</literal>, an aliveness query is sent to the peer
           every 180 seconds. This feature only takes effect if the node is
-          configured as an LNET router.</para>
+          configured as an LNet router.</para>
           <para>In a routed environment, the 
           <literal>peer_timeout</literal> feature should always be on (set to a
           value in seconds) on routers. If the router checker has been enabled,
@@ -482,9 +478,9 @@ lnet large_router_buffers=8192
           all the routers corresponding to the NIDs identified in the routes
           parameter setting on the node to determine the status of each router
           interface. The default setting is 1. (For more information about the
-          LNET routes parameter, see 
+          LNet routes parameter, see 
           <xref xmlns:xlink="http://www.w3.org/1999/xlink"
-          linkend="dbdoclet.50438216_71227" /></para>
+          linkend="lnet_module_routes" /></para>
           <para>A router is considered down if any of its NIDs are down. For
           example, router X has three NIDs: 
           <literal>Xnid1</literal>, 
@@ -529,45 +525,85 @@ lnet large_router_buffers=8192
       be MAX.</para>
     </section>
   </section>
-  <section xml:id="dbdoclet.libcfstuning">
+  <section xml:id="dbdoclet.libcfstuning" condition='l23'>
     <title>
     <indexterm>
       <primary>tuning</primary>
       <secondary>libcfs</secondary>
     </indexterm>libcfs Tuning</title>
+    <para>Lustre software release 2.3 introduced binding service threads via
+    CPU Partition Tables (CPTs). This allows the system administrator to
+    fine-tune on which CPU cores the Lustre service threads are run, for both
+    OSS and MDS services, as well as on the client.
+    </para>
+    <para>CPTs are useful to reserve some cores on the OSS or MDS nodes for
+    system functions such as system monitoring, HA heartbeat, or similar
+    tasks.  On the client it may be useful to restrict Lustre RPC service
+    threads to a small subset of cores so that they do not interfere with
+    computation, or because these cores are directly attached to the network
+    interfaces.
+    </para>
     <para>By default, the Lustre software will automatically generate CPU
-    partitions (CPT) based on the number of CPUs in the system. The CPT number
-    will be 1 if the online CPU number is less than five.</para>
-    <para>The CPT number can be explicitly set on the libcfs module using 
-    <literal>cpu_npartitions=NUMBER</literal>. The value of 
-    <literal>cpu_npartitions</literal> must be an integer between 1 and the
-    number of online CPUs.</para>
+    partitions (CPT) based on the number of CPUs in the system.
+    The CPT count can be explicitly set on the libcfs module using 
+    <literal>cpu_npartitions=<replaceable>NUMBER</replaceable></literal>.
+    The value of <literal>cpu_npartitions</literal> must be an integer between
+    1 and the number of online CPUs.
+    </para>
+    <para condition='l29'>In Lustre 2.9 and later the default is to use
+    one CPT per NUMA node.  In earlier versions of Lustre, by default there
+    was a single CPT if the online CPU core count was four or fewer, and
+    additional CPTs would be created depending on the number of CPU cores,
+    typically with 4-8 cores per CPT.
+    </para>
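+    <para>For example, to force four CPU partitions to be created at module
+    load time (the value is illustrative only), the option could be placed
+    in a modprobe configuration file such as
+    <literal>/etc/modprobe.d/lustre.conf</literal>:</para>
+    <screen>options libcfs cpu_npartitions=4</screen>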
     <tip>
-      <para>Setting CPT to 1 will disable most of the SMP Node Affinity
-      functionality.</para>
+      <para>Setting <literal>cpu_npartitions=1</literal> will disable most
+      of the SMP Node Affinity functionality.</para>
     </tip>
     <section>
       <title>CPU Partition String Patterns</title>
-      <para>CPU partitions can be described using string pattern notation. For
-      example:</para>
+      <para>CPU partitions can be described using string pattern notation.
+      If <literal>cpu_pattern=N</literal> is used, then there will be one
+      CPT for each NUMA node in the system, with each CPT mapping all of
+      the CPU cores for that NUMA node.
+      </para>
+      <para>It is also possible to explicitly specify the mapping between
+      CPU cores and CPTs, for example:</para>
       <itemizedlist>
         <listitem>
           <para>
-            <literal>cpu_pattern="0[0,2,4,6] 1[1,3,5,7]</literal>
+            <literal>cpu_pattern="0[2,4,6] 1[3,5,7]"</literal>
           </para>
-          <para>Create two CPTs, CPT0 contains CPU[0, 2, 4, 6]. CPT1 contains
-          CPU[1,3,5,7].</para>
+          <para>Create two CPTs, CPT0 contains cores 2, 4, and 6, while CPT1
+         contains cores 3, 5, 7.  CPU cores 0 and 1 will not be used by Lustre
+         service threads, and could be used for node services such as
+         system monitoring, HA heartbeat threads, etc.  The binding of
+         non-Lustre services to those CPU cores may be done in userspace
+         using <literal>numactl(8)</literal> or other application-specific
+         methods, but is beyond the scope of this document.</para>
         </listitem>
         <listitem>
           <para>
            <literal>cpu_pattern="N 0[0-3] 1[4-7]"</literal>
           </para>
-          <para>Create two CPTs, CPT0 contains all CPUs in NUMA node[0-3], CPT1
-          contains all CPUs in NUMA node [4-7].</para>
+          <para>Create two CPTs, with CPT0 containing all CPUs in NUMA
+         nodes [0-3], while CPT1 contains all CPUs in NUMA nodes [4-7].</para>
         </listitem>
       </itemizedlist>
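+      <para>A pattern such as the ones above is applied at module load time
+      by passing it as a <literal>libcfs</literal> module option, for
+      example:</para>
+      <screen>options libcfs cpu_pattern="0[2,4,6] 1[3,5,7]"</screen>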
-      <para>The current configuration of the CPU partition can be read from 
-      <literal>/proc/sys/lnet/cpu_partition_table</literal></para>
+      <para>The current configuration of the CPU partition can be read via 
+      <literal>lctl get_param cpu_partition_table</literal>.  For example,
+      a simple 4-core system has a single CPT with all four CPU cores:
+      <screen>$ lctl get_param cpu_partition_table
+cpu_partition_table=0  : 0 1 2 3</screen>
+      while a larger NUMA system with four 12-core CPUs may have four CPTs:
+      <screen>$ lctl get_param cpu_partition_table
+cpu_partition_table=
+0      : 0 1 2 3 4 5 6 7 8 9 10 11
+1      : 12 13 14 15 16 17 18 19 20 21 22 23
+2      : 24 25 26 27 28 29 30 31 32 33 34 35
+3      : 36 37 38 39 40 41 42 43 44 45 46 47
+</screen>
+      </para>
     </section>
   </section>
   <section xml:id="dbdoclet.lndtuning">
@@ -590,6 +626,429 @@ lnet large_router_buffers=8192
       default values are automatically set and are chosen to work well across a
       number of typical scenarios.</para>
     </note>
+    <section>
+       <title>ko2iblnd Tuning</title>
+       <para>The following table outlines the ko2iblnd module parameters to be used
+    for tuning:</para>
+       <informaltable frame="all">
+         <tgroup cols="3">
+           <colspec colname="c1" colwidth="50*" />
+           <colspec colname="c2" colwidth="50*" />
+           <colspec colname="c3" colwidth="50*" />
+           <thead>
+             <row>
+               <entry>
+                 <para>
+                   <emphasis role="bold">Module Parameter</emphasis>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <emphasis role="bold">Default Value</emphasis>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <emphasis role="bold">Description</emphasis>
+                 </para>
+               </entry>
+             </row>
+           </thead>
+           <tbody>
+             <row>
+               <entry>
+                 <para>
+                   <literal>service</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>987</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Service number (within RDMA_PS_TCP).</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>cksum</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>0</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Set non-zero to enable message (not RDMA) checksums.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>timeout</literal>
+                 </para>
+               </entry>
+               <entry>
+               <para>
+                 <literal>50</literal>
+               </para>
+             </entry>
+               <entry>
+                 <para>Timeout in seconds.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>nscheds</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>0</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Number of threads in each scheduler pool (per CPT).  A value
+          of zero means the number is derived from the number of CPU cores.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>conns_per_peer</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>4 (OmniPath), 1 (Everything else)</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Introduced in 2.10. Number of connections to each peer. Messages
+          are sent round-robin over the connection pool.  Provides significant
+          improvement with OmniPath.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>ntx</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>512</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Number of message descriptors allocated for each pool at
+          startup. Grows at runtime. Shared by all CPTs.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>credits</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>256</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Number of concurrent sends on network.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>peer_credits</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>8</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Number of concurrent sends to 1 peer. Related/limited by IB
+          queue size.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>peer_credits_hiw</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>0</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>High water mark at which to eagerly return credits to the peer.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>peer_buffer_credits</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>0</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Number of per-peer router buffer credits.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>peer_timeout</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>180</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Seconds without aliveness news before declaring a peer
+          dead (a value less than or equal to 0 disables this).</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>ipif_name</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>ib0</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>IPoIB interface name.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>retry_count</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>5</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Retransmissions when no ACK received.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>rnr_retry_count</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>6</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>RNR retransmissions.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>keepalive</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>100</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Idle time in seconds before sending a keepalive.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>ib_mtu</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>0</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>IB MTU 256/512/1024/2048/4096.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>concurrent_sends</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>0</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Send work-queue sizing. If zero, derived from
+          <literal>map_on_demand</literal> and <literal>peer_credits</literal>.
+          </para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>map_on_demand</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+            <literal>0 (pre-4.8 Linux) 1 (4.8 Linux onward) 32 (OmniPath)</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Number of fragments reserved for connection.  If zero, use
+          global memory region (found to be a security issue).  If non-zero, use
+          FMR or FastReg for memory registration.  The value needs to agree
+          between both peers of the connection.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>fmr_pool_size</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>512</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Size of FMR pool on each CPT (>= ntx / 4).  Grows at runtime.
+          </para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>fmr_flush_trigger</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>384</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Number of dirty FMRs that triggers a pool flush.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>fmr_cache</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>1</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Non-zero to enable FMR caching.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>dev_failover</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>0</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>HCA failover for bonding (0 OFF, 1 ON, other values reserved).
+          </para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>require_privileged_port</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>0</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Require privileged port when accepting connection.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>use_privileged_port</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>1</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Use privileged port when initiating connection.</para>
+               </entry>
+             </row>
+             <row>
+               <entry>
+                 <para>
+                   <literal>wrq_sge</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>
+                   <literal>2</literal>
+                 </para>
+               </entry>
+               <entry>
+                 <para>Introduced in 2.10. Number of scatter/gather element
+          groups per work request.  Used to deal with fragmentation, which can
+          consume double the number of work requests.</para>
+               </entry>
+             </row>
+           </tbody>
+         </tgroup>
+       </informaltable>
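+       <para>As a sketch of how several of these parameters might be set
+       together at module load time, the following modprobe configuration
+       entry uses illustrative values only and is not a recommendation for
+       any particular fabric:</para>
+       <screen>options ko2iblnd peer_credits=16 credits=512 conns_per_peer=4</screen>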
+    </section>
   </section>
   <section xml:id="dbdoclet.nrstuning" condition='l24'>
     <title>
@@ -640,6 +1099,18 @@ regular_requests:
     queued: 2420
     active: 268
 
+  - name: tbf
+    state: stopped
+    fallback: no
+    queued: 0
+    active: 0
+
+  - name: delay
+    state: stopped
+    fallback: no
+    queued: 0
+    active: 0
+
 high_priority_requests:
   - name: fifo
     state: started
@@ -664,7 +1135,19 @@ high_priority_requests:
     fallback: no
     queued: 0
     active: 0
-      
+
+  - name: tbf
+    state: stopped
+    fallback: no
+    queued: 0
+    active: 0
+
+  - name: delay
+    state: stopped
+    fallback: no
+    queued: 0
+    active: 0
+
 </screen>
     <para>NRS policy state is shown in either one or two sections, depending on
     the PTLRPC service being queried. The first section is named 
@@ -808,7 +1291,7 @@ ost.OSS.ost_io.nrs_policies="trr reg"
         <tertiary>first in, first out (FIFO) policy</tertiary>
       </indexterm>First In, First Out (FIFO) policy</title>
       <para>The first in, first out (FIFO) policy handles RPCs in a service in
-      the same order as they arrive from the LNET layer, so no special
+      the same order as they arrive from the LNet layer, so no special
       processing takes place to modify the RPC handling stream. FIFO is the
       default policy for all types of RPCs on all PTLRPC services, and is
       always enabled irrespective of the state of other policies, so that it
@@ -1102,7 +1585,7 @@ ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
         </listitem>
       </itemizedlist>
     </section>
-    <section condition='l26'>
+    <section xml:id="dbdoclet.tbftuning" condition='l26'>
       <title>
       <indexterm>
         <primary>tuning</primary>
@@ -1145,126 +1628,449 @@ ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
       knows its RPC token rate. A rule can be added to or removed from the list
       at run time. Whenever the list of rules is changed, the queues will
       update their matched rules.</para>
+      <section remap="h4">
+       <title>Enable TBF policy</title>
+       <para>Command:</para>
+       <screen>lctl set_param ost.OSS.ost_io.nrs_policies="tbf &lt;<replaceable>policy</replaceable>&gt;"
+       </screen>
+       <para>RPCs can be classified into different types according to their
+       NID, JobID, opcode, or UID/GID. When enabling the TBF policy, you can
+       specify one of these types, or just use "tbf" to enable all of them
+       for fine-grained RPC classification.</para>
+       <para>Example:</para>
+       <screen>$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf opcode"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf uid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf gid"</screen>
+      </section>
+      <section remap="h4">
+       <title>Start a TBF rule</title>
+       <para>The TBF rule is defined in the parameter
+       <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>.</para>
+       <para>Command:</para>
+       <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>..."
+       </screen>
+       <para>'<replaceable>rule_name</replaceable>' is the name of the TBF
+       rule, and '<replaceable>arguments</replaceable>' is a string that
+       specifies the rule details, which vary according to the rule type.
+       </para>
+       <para>Next, the different types of TBF policies will be described.</para>
+       <itemizedlist>
+         <listitem>
+           <para><emphasis role="bold">NID based TBF policy</emphasis></para>
+           <para>Command:</para>
+            <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> nid={<replaceable>nidlist</replaceable>} rate=<replaceable>rate</replaceable>"
+           </screen>
+            <para>'<replaceable>nidlist</replaceable>' uses the same format
+           as when configuring an LNet route.
+           '<replaceable>rate</replaceable>' is the upper limit on the RPC
+           rate of the rule.</para>
+            <para>Example:</para>
+           <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start other_clients nid={192.168.*.*@tcp} rate=50"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start computes nid={192.168.1.[2-128]@tcp} rate=500"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
+            <para>In this example, the rate of processing RPC requests from
+           compute nodes is at most 5x as fast as those from login nodes.
+           The output of <literal>ost.OSS.ost_io.nrs_tbf_rule</literal> is
+           similar to the following:</para>
+           <screen>lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+loginnode {192.168.1.1@tcp} 100, ref 0
+computes {192.168.1.[2-128]@tcp} 500, ref 0
+other_clients {192.168.*.*@tcp} 50, ref 0
+default {*} 10000, ref 0
+high_priority_requests:
+CPT 0:
+loginnode {192.168.1.1@tcp} 100, ref 0
+computes {192.168.1.[2-128]@tcp} 500, ref 0
+other_clients {192.168.*.*@tcp} 50, ref 0
+default {*} 10000, ref 0</screen>
+            <para>Also, the rule can be written in <literal>reg</literal> and
+           <literal>hp</literal> formats:</para>
+           <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start loginnode nid={192.168.1.1@tcp} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
+         </listitem>
+         <listitem>
+           <para><emphasis role="bold">JobID based TBF policy</emphasis></para>
+            <para>For the JobID, please see
+            <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+            linkend="dbdoclet.jobstats" /> for more details.</para>
+           <para>Command:</para>
+            <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> jobid={<replaceable>jobid_list</replaceable>} rate=<replaceable>rate</replaceable>"
+           </screen>
+           <para>Wildcards are supported in
+           {<replaceable>jobid_list</replaceable>}.</para>
+            <para>Example:</para>
+           <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user jobid={iozone.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start dd_user jobid={dd.*} rate=50"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 jobid={*.600} rate=10"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user2 jobid={io*.10* *.500} rate=200"</screen>
+            <para>Also, the rule can be written in <literal>reg</literal> and
+           <literal>hp</literal> formats:</para>
+           <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start iozone_user1 jobid={iozone.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start iozone_user1 jobid={iozone.500} rate=100"</screen>
+         </listitem>
+         <listitem>
+           <para><emphasis role="bold">Opcode based TBF policy</emphasis></para>
+           <para>Command:</para>
+            <screen>$ lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> opcode={<replaceable>opcode_list</replaceable>} rate=<replaceable>rate</replaceable>"
+           </screen>
+            <para>Example:</para>
+           <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 opcode={ost_read} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user1 opcode={ost_read ost_write} rate=200"</screen>
+            <para>Also, the rule can be written in <literal>reg</literal> and
+           <literal>hp</literal> formats:</para>
+           <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start iozone_user1 opcode={ost_read} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start iozone_user1 opcode={ost_read} rate=100"</screen>
+         </listitem>
+         <listitem>
+      <para><emphasis role="bold">UID/GID based TBF policy</emphasis></para>
+           <para>Command:</para>
+           <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"[reg][hp] start <replaceable>rule_name</replaceable> uid={<replaceable>uid</replaceable>} rate=<replaceable>rate</replaceable>"
+$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"[reg][hp] start <replaceable>rule_name</replaceable> gid={<replaceable>gid</replaceable>} rate=<replaceable>rate</replaceable>"</screen>
+           <para>Example:</para>
+           <para>Limit the rate of RPC requests for UID 500:</para>
+           <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name uid={500} rate=100"</screen>
+           <para>Limit the rate of RPC requests for GID 500:</para>
+           <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name gid={500} rate=100"</screen>
+           <para>Also, the following rules can be used to control all
+           requests to the MDS:</para>
+           <para>Start the TBF uid QoS policy on the MDS:</para>
+           <screen>$ lctl set_param mds.MDS.*.nrs_policies="tbf uid"</screen>
+           <para>Limit the rate of RPC requests for UID 500:</para>
+           <screen>$ lctl set_param mds.MDS.*.nrs_tbf_rule=\
+"start tbf_name uid={500} rate=100"</screen>
+         </listitem>
+         <listitem>
+           <para><emphasis role="bold">Policy combination</emphasis></para>
+           <para>To support TBF rules with complex expressions of conditions,
+           the TBF classifier is extended to classify RPCs in a more
+           fine-grained way. This feature supports logical conjunction and
+           disjunction operations among the different types.
+           In a rule,
+           "&amp;" represents conditional conjunction and
+           "," represents conditional disjunction.</para>
+           <para>Example:</para>
+           <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start comp_rule opcode={ost_write}&amp;jobid={dd.0},\
+nid={192.168.1.[1-128]@tcp 0@lo} rate=100"</screen>
+           <para>In this example, RPCs whose <literal>opcode</literal> is
+           ost_write and <literal>jobid</literal> is dd.0, or whose
+           <literal>nid</literal> satisfies the condition
+           {192.168.1.[1-128]@tcp 0@lo}, will be processed at the rate of 100
+           req/sec.
+           The output of <literal>ost.OSS.ost_io.nrs_tbf_rule</literal> is
+           similar to the following:
+           </para>
+           <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+CPT 1:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+high_priority_requests:
+CPT 0:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+CPT 1:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0</screen>
+           <para>Example:</para>
+           <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name uid={500}&amp;gid={500} rate=100"</screen>
+           <para>In this example, those RPC requests whose uid is 500 and
+           gid is 500 will be processed at the rate of 100 req/sec.</para>
+         </listitem>
+       </itemizedlist>
+      </section>
+      <section remap="h4">
+          <title>Change a TBF rule</title>
+          <para>Command:</para>
+          <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable>"
+          </screen>
+          <para>Example:</para>
+          <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"change loginnode rate=200"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg change loginnode rate=200"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp change loginnode rate=200"
+</screen>
+      </section>
+      <section remap="h4">
+          <title>Stop a TBF rule</title>
+          <para>Command:</para>
+          <screen>lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop
+<replaceable>rule_name</replaceable>"</screen>
+          <para>Example:</para>
+          <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"</screen>
+      </section>
+      <section remap="h4">
+        <title>Rule options</title>
+       <para>To support more flexible rule conditions, the following options
+       are added.</para>
+       <itemizedlist>
+         <listitem>
+           <para><emphasis role="bold">Reordering of TBF rules</emphasis></para>
+           <para>By default, a newly started rule takes precedence over
+           older ones, but by specifying the '<literal>rank=</literal>'
+           argument when inserting a new rule with the
+           "<literal>start</literal>" command, the rank of the rule can be
+           changed. It can also be changed with the
+           "<literal>change</literal>" command.
+           </para>
+           <para>Command:</para>
+           <screen>lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
+"start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... rank=<replaceable>obj_rule_name</replaceable>"
+lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
+"change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable> rank=<replaceable>obj_rule_name</replaceable>"
+</screen>
+           <para>By specifying the existing rule
+           '<replaceable>obj_rule_name</replaceable>', the new rule
+           '<replaceable>rule_name</replaceable>' will be moved to the front of
+           '<replaceable>obj_rule_name</replaceable>'.</para>
+           <para>Example:</para>
+           <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start computes nid={192.168.1.[2-128]@tcp} rate=500"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 jobid={iozone.500 dd.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user1 opcode={ost_read ost_write} rate=200 rank=computes"</screen>
+           <para>In this example, rule "iozone_user1" is added to the front of
+           rule "computes". We can see the order by the following command:
+           </para>
+           <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+CPT 1:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+high_priority_requests:
+CPT 0:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+CPT 1:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0</screen>
+         </listitem>
+         <listitem>
+           <para><emphasis role="bold">TBF realtime policies under congestion
+           </emphasis></para>
+           <para>During TBF evaluation, when the sum of the I/O bandwidth
+           requirements for all classes exceeds the system capacity, classes
+           with the same rate limit do not receive bandwidth as evenly as
+           configured. The reason is that the heavy load on a congested
+           server results in missed deadlines for some classes, so the
+           number of calculated tokens may be larger than 1 during dequeuing.
+           In the original implementation, all classes are handled equally
+           and any excess tokens are simply discarded.</para>
+           <para>Thus, a Hard Token Compensation (HTC) strategy has been
+           implemented. A class can be configured with the HTC feature by the
+           rule it matches. This feature means that requests in such class
+           queues have high real-time requirements and that the bandwidth
+           assignment must be satisfied as well as possible. When deadline
+           misses happen, the class keeps the deadline unchanged and the time
+           residue (the remainder of the elapsed time divided by 1/r) is
+           compensated in the next round. This ensures that the next idle I/O
+           thread will always select this class to serve until all accumulated
+           excess tokens are handled or there are no pending requests in the
+           class queue.</para>
+           <para>Command:</para>
+           <para>A new command format is added to enable the realtime feature
+           for a rule:</para>
+           <screen>lctl set_param x.x.x.nrs_tbf_rule=\
+"start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... realtime=1"</screen>
+           <para>Example:</para>
+           <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start realjob jobid={dd.0} rate=100 realtime=1"</screen>
+           <para>This example rule means the RPC requests whose JobID is dd.0
+           will be processed at the rate of 100 req/sec in realtime.</para>
+         </listitem>
+       </itemizedlist>
+      </section>
+    </section>
+    <section xml:id="dbdoclet.delaytuning" condition='l2A'>
+      <title>
+      <indexterm>
+        <primary>tuning</primary>
+        <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+        <tertiary>Delay policy</tertiary>
+      </indexterm>Delay policy</title>
+      <para>The NRS Delay policy seeks to perturb the timing of request
+      processing at the PtlRPC layer, with the goal of simulating high server
+      load, and finding and exposing timing related problems. When this policy
+      is active, upon arrival of a request the policy will calculate an offset,
+      within a defined, user-configurable range, from the request arrival
+      time, to determine a time after which the request should be handled.
+      The request is then stored using the cfs_binheap implementation,
+      which sorts the request according to the assigned start time.
+      Requests are removed from the binheap for handling once their start
+      time has been passed.</para>
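+      <para>As with the other NRS policies, the Delay policy is enabled via
+      the <literal>nrs_policies</literal> tunable of a service; for example,
+      to enable it on the ost_io service:</para>
+      <screen>$ lctl set_param ost.OSS.ost_io.nrs_policies="delay"</screen>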
+      <para>The Delay policy can be enabled on all types of PtlRPC services,
+      and has the following tunables that can be used to adjust its behavior:
+      </para>
       <itemizedlist>
         <listitem>
           <para>
-            <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>
+            <literal>{service}.nrs_delay_min</literal>
           </para>
-          <para>The format of the rule start command of TBF policy is as
-          follows:</para>
-          <screen>
-$ lctl set_param x.x.x.nrs_tbf_rule=
-                  "[reg|hp] start 
-<replaceable>rule_name</replaceable> 
-<replaceable>arguments</replaceable>..."
-</screen>
-          <para>The '
-          <replaceable>rule_name</replaceable>' argument is a string which
-          identifies a rule. The format of the '
-          <replaceable>arguments</replaceable>' is changing according to the
-          type of the TBF policy. For the NID based TBF policy, its format is
-          as follows:</para>
-          <screen>
-$ lctl set_param x.x.x.nrs_tbf_rule=
-                  "[reg|hp] start 
-<replaceable>rule_name</replaceable> {
-<replaceable>nidlist</replaceable>} 
-<replaceable>rate</replaceable>"
-</screen>
-          <para>The format of '
-          <replaceable>nidlist</replaceable>' argument is the same as the
-          format when configuring LNET route. The '
-          <replaceable>rate</replaceable>' argument is the RPC rate of the
-          rule, means the upper limit number of requests per second.</para>
-          <para>Following commands are valid. Please note that a newly started
-          rule is prior to old rules, so the order of starting rules is
-          critical too.</para>
-          <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
-                  "start other_clients {192.168.*.*@tcp} 50"
-</screen>
+          <para>The
+          <literal>{service}.nrs_delay_min</literal> tunable controls the
+          minimum amount of time, in seconds, that a request will be delayed by
+          this policy.  The default is 5 seconds. To read this value run:</para>
           <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
-                  "start loginnode {192.168.1.1@tcp} 100"
-</screen>
-          <para>General rule can be replaced by two rules (reg and hp) as
-          follows:</para>
-          <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
-                  "reg start loginnode {192.168.1.1@tcp} 100"
-</screen>
-          <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
-                  "hp start loginnode {192.168.1.1@tcp} 100"
-</screen>
-          <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
-                  "start computes {192.168.1.[2-128]@tcp} 500"
-</screen>
-          <para>The above rules will put an upper limit for servers to process
-          at most 5x as many RPCs from compute nodes as login nodes.</para>
-          <para>For the JobID (please see 
-          <xref xmlns:xlink="http://www.w3.org/1999/xlink"
-          linkend="dbdoclet.jobstats" />for more details) based TBF policy, its
-          format is as follows:</para>
-          <screen>
-$ lctl set_param x.x.x.nrs_tbf_rule=
-                  "[reg|hp] start 
-<replaceable>name</replaceable> {
-<replaceable>jobid_list</replaceable>} 
-<replaceable>rate</replaceable>"
-</screen>
-          <para>Following commands are valid:</para>
-          <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
-                  "start user1 {iozone.500 dd.500} 100"
-</screen>
-          <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
-                  "start iozone_user1 {iozone.500} 100"
-</screen>
-          <para>Same as nid, could use reg and hp rules separately:</para>
-          <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
-                  "hp start iozone_user1 {iozone.500} 100"
-</screen>
+lctl get_param {service}.nrs_delay_min</screen>
+          <para>For example, to read the minimum delay set on the ost_io
+          service, run:</para>
           <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
-                  "reg start iozone_user1 {iozone.500} 100"
-</screen>
-          <para>The format of the rule change command of TBF policy is as
-          follows:</para>
-          <screen>
-$ lctl set_param x.x.x.nrs_tbf_rule=
-                  "[reg|hp] change 
-<replaceable>rule_name</replaceable> 
-<replaceable>rate</replaceable>"
-</screen>
-          <para>Following commands are valid:</para>
-          <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="change loginnode 200"
-</screen>
-          <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg change loginnode 200"
-</screen>
-          <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp change loginnode 200"
-</screen>
-          <para>The format of the rule stop command of TBF policy is as
-          follows:</para>
+$ lctl get_param ost.OSS.ost_io.nrs_delay_min
+ost.OSS.ost_io.nrs_delay_min=reg_delay_min:5
+hp_delay_min:5</screen>
+        <para>To set the minimum delay in RPC processing, run:</para>
+        <screen>
+lctl set_param {service}.nrs_delay_min=<replaceable>0-65535</replaceable></screen>
+        <para>This will set the minimum delay time on a given service, for both
+        regular and high-priority RPCs (if the PtlRPC service supports
+        high-priority RPCs), to the indicated value.</para>
+        <para>For example, to set the minimum delay time on the ost_io service
+        to 10, run:</para>
+        <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_min=10
+ost.OSS.ost_io.nrs_delay_min=10</screen>
+        <para>For PtlRPC services that support high-priority RPCs, to set a
+        different minimum delay time for regular and high-priority RPCs, run:
+        </para>
+        <screen>
+lctl set_param {service}.nrs_delay_min=<replaceable>reg_delay_min|hp_delay_min</replaceable>:<replaceable>0-65535</replaceable>
+        </screen>
+        <para>For example, to set the minimum delay time on the ost_io service
+        for high-priority RPCs to 3, run:</para>
+        <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3
+ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3</screen>
+        <para>Note, in all cases the minimum delay time cannot exceed the
+        maximum delay time.</para>
+        </listitem>
+        <listitem>
+          <para>
+            <literal>{service}.nrs_delay_max</literal>
+          </para>
+          <para>The
+          <literal>{service}.nrs_delay_max</literal> tunable controls the
+          maximum amount of time, in seconds, that a request will be delayed by
+          this policy.  The default is 300 seconds. To read this value run:
+          </para>
+          <screen>lctl get_param {service}.nrs_delay_max</screen>
+          <para>For example, to read the maximum delay set on the ost_io
+          service, run:</para>
           <screen>
-$ lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop 
-<replaceable>rule_name</replaceable>"
+$ lctl get_param ost.OSS.ost_io.nrs_delay_max
+ost.OSS.ost_io.nrs_delay_max=reg_delay_max:300
+hp_delay_max:300</screen>
+        <para>To set the maximum delay in RPC processing, run:</para>
+        <screen>lctl set_param {service}.nrs_delay_max=<replaceable>0-65535</replaceable>
 </screen>
-          <para>Following commands are valid:</para>
+        <para>This will set the maximum delay time on a given service, for both
+        regular and high-priority RPCs (if the PtlRPC service supports
+        high-priority RPCs), to the indicated value.</para>
+        <para>For example, to set the maximum delay time on the ost_io service
+        to 60, run:</para>
+        <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_max=60
+ost.OSS.ost_io.nrs_delay_max=60</screen>
+        <para>For PtlRPC services that support high-priority RPCs, to set a
+        different maximum delay time for regular and high-priority RPCs, run:
+        </para>
+        <screen>lctl set_param {service}.nrs_delay_max=<replaceable>reg_delay_max|hp_delay_max</replaceable>:<replaceable>0-65535</replaceable></screen>
+        <para>For example, to set the maximum delay time on the ost_io service
+        for high-priority RPCs to 30, run:</para>
+        <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30
+ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30</screen>
+        <para>Note that in all cases the maximum delay time cannot be less than
+        the minimum delay time.</para>
+        </listitem>
+        <listitem>
+          <para>
+            <literal>{service}.nrs_delay_pct</literal>
+          </para>
+          <para>The
+          <literal>{service}.nrs_delay_pct</literal> tunable controls the
+          percentage of requests that will be delayed by this policy. The
+          default is 100. Note that when a request is not selected for handling
+          by the delay policy because of this percentage, the request will be
+          handled by whatever fallback policy is defined for that service. If
+          no other fallback policy is defined, the request will be handled by
+          the FIFO policy.  To read this value run:</para>
+          <screen>lctl get_param {service}.nrs_delay_pct</screen>
+          <para>For example, to read the percentage of requests being delayed on
+          the ost_io service, run:</para>
           <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
+$ lctl get_param ost.OSS.ost_io.nrs_delay_pct
+ost.OSS.ost_io.nrs_delay_pct=reg_delay_pct:100
+hp_delay_pct:100</screen>
+        <para>To set the percentage of delayed requests, run:</para>
+        <screen>
+lctl set_param {service}.nrs_delay_pct=<replaceable>0-100</replaceable></screen>
+        <para>This will set the percentage of requests delayed on a given
+        service, for both regular and high-priority RPCs (if the PtlRPC service
+        supports high-priority RPCs), to the indicated value.</para>
+        <para>For example, to set the percentage of delayed requests on the
+        ost_io service to 50, run:</para>
+        <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=50
+ost.OSS.ost_io.nrs_delay_pct=50
 </screen>
-          <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode"
+        <para>For PtlRPC services that support high-priority RPCs, to set a
+        different delay percentage for regular and high-priority RPCs, run:
+        </para>
+        <screen>lctl set_param {service}.nrs_delay_pct=<replaceable>reg_delay_pct|hp_delay_pct</replaceable>:<replaceable>0-100</replaceable>
 </screen>
-          <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"
+        <para>For example, to set the percentage of delayed requests on the
+        ost_io service for high-priority RPCs to 5, run:</para>
+        <screen>$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
+ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
 </screen>
         </listitem>
       </itemizedlist>
@@ -1277,8 +2083,8 @@ $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"
       <secondary>lockless I/O</secondary>
     </indexterm>Lockless I/O Tunables</title>
     <para>The lockless I/O tunable feature allows servers to ask clients to do
-    lockless I/O (liblustre-style where the server does the locking) on
-    contended files.</para>
+    lockless I/O (the server does the locking on behalf of clients) for
+    contended files to avoid lock ping-pong.</para>
     <para>The lockless I/O patch introduces these tunables:</para>
     <itemizedlist>
       <listitem>
@@ -1286,7 +2092,7 @@ $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"
           <emphasis role="bold">OST-side:</emphasis>
         </para>
         <screen>
-/proc/fs/lustre/ldlm/namespaces/filter-lustre-*
+ldlm.namespaces.filter-<replaceable>fsname</replaceable>-*.
 </screen>
         <para>
         <literal>contended_locks</literal>- If the number of lock conflicts in
@@ -1297,9 +2103,9 @@ $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"
         contended state as set in the parameter.</para>
         <para>
         <literal>max_nolock_bytes</literal>- Server-side locking set only for
-        requests less than the blocks set in the 
-        <literal>max_nolock_bytes</literal> parameter. If this tunable is set to
-        zero (0), it disables server-side locking for read/write
+        requests smaller than the value set in the
+        <literal>max_nolock_bytes</literal> parameter. If this tunable is
+        set to zero (0), it disables server-side locking for read/write
         requests.</para>
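+        <para>For example, server-side locking for read/write requests can be
+        disabled on the OSS by setting this tunable to zero, with a command
+        such as:</para>
+        <screen>oss# lctl set_param ldlm.namespaces.filter-<replaceable>fsname</replaceable>-*.max_nolock_bytes=0</screen>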
       </listitem>
       <listitem>
@@ -1332,28 +2138,253 @@ $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"
       </listitem>
     </itemizedlist>
   </section>
+  <section condition="l29">
+      <title>
+        <indexterm>
+          <primary>tuning</primary>
+          <secondary>with lfs ladvise</secondary>
+        </indexterm>
+        Server-Side Advice and Hinting
+      </title>
+      <section><title>Overview</title>
+      <para>Use the <literal>lfs ladvise</literal> command to give file access
+      advice or hints to servers.</para>
+      <screen>lfs ladvise [--advice|-a ADVICE ] [--background|-b]
+[--start|-s START[kMGT]]
+{[--end|-e END[kMGT]] | [--length|-l LENGTH[kMGT]]}
+<emphasis>file</emphasis> ...
+      </screen>
+      <para>
+        <informaltable frame="all">
+          <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Option</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para><literal>-a</literal>, <literal>--advice=</literal>
+                <literal>ADVICE</literal></para>
+              </entry>
+              <entry>
+                <para>Give advice or hint of type <literal>ADVICE</literal>.
+                Advice types are:</para>
+                <para><literal>willread</literal> to prefetch data into server
+                cache</para>
+                <para><literal>dontneed</literal> to clean up the data cache on
+                the server</para>
+                <para><literal>lockahead</literal> to request an LDLM extent lock
+                of the given mode on the given byte range</para>
+                <para><literal>noexpand</literal> to disable extent lock expansion
+                behavior for I/O to this file descriptor</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para><literal>-b</literal>, <literal>--background</literal>
+                </para>
+              </entry>
+              <entry>
+                <para>Enable the advice to be sent and handled asynchronously.
+                </para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para><literal>-s</literal>, <literal>--start=</literal>
+                        <literal>START_OFFSET</literal></para>
+              </entry>
+              <entry>
+                <para>File range starts from <literal>START_OFFSET</literal>
+                </para>
+                </entry>
+            </row>
+            <row>
+                <entry>
+                    <para><literal>-e</literal>, <literal>--end=</literal>
+                        <literal>END_OFFSET</literal></para>
+                </entry>
+                <entry>
+                    <para>File range ends at (not including)
+                    <literal>END_OFFSET</literal>.  This option may not be
+                    specified at the same time as the <literal>-l</literal>
+                    option.</para>
+                </entry>
+            </row>
+            <row>
+                <entry>
+                    <para><literal>-l</literal>, <literal>--length=</literal>
+                        <literal>LENGTH</literal></para>
+                </entry>
+                <entry>
+                  <para>File range has length of <literal>LENGTH</literal>.
+                  This option may not be specified at the same time as the
+                  <literal>-e</literal> option.</para>
+                </entry>
+            </row>
+            <row>
+                <entry>
+                    <para><literal>-m</literal>, <literal>--mode=</literal>
+                        <literal>MODE</literal></para>
+                </entry>
+                <entry>
+                  <para>Lockahead request mode <literal>{READ,WRITE}</literal>.
+                  Request a lock with this mode.</para>
+                </entry>
+            </row>
+          </tbody>
+          </tgroup>
+        </informaltable>
+      </para>
+      <para>Typically, <literal>lfs ladvise</literal> forwards the advice to
+      Lustre servers without guaranteeing when or how the servers will react to
+      the advice. Actions may or may not be triggered when the advice is
+      received, depending on the type of the advice and the real-time
+      decision of the affected server-side components.</para>
+      <para>A typical usage of ladvise is to enable applications and users with
+      external knowledge to intervene in server-side cache management. For
+      example, if many different clients are doing small random reads of a
+      file, prefetching pages into the OSS cache with large linear reads before
+      the random IO starts is a net benefit. Fetching that data into each
+      client cache with fadvise() may not be, because much more data would be
+      sent to each client.
+      </para>
+      <para>
+      <literal>ladvise lockahead</literal> is different in that it attempts to
+      control LDLM locking behavior by explicitly requesting LDLM locks in
+      advance of use.  This does not directly affect caching behavior; instead,
+      it is used in special cases to avoid pathological results (lock exchange)
+      that can arise from the normal LDLM locking behavior.
+      </para>
+      <para>
+      Note that the <literal>noexpand</literal> advice works on a specific
+      file descriptor, so using it via <literal>lfs</literal> has no effect; it
+      must be applied to the file descriptor that is actually used for I/O in
+      order to have any effect.
+      </para>
+      <para>The main difference between the Linux <literal>fadvise()</literal>
+      system call and <literal>lfs ladvise</literal> is that
+      <literal>fadvise()</literal> is only a client-side mechanism that does
+      not pass the advice to the filesystem, while <literal>ladvise</literal>
+      can send advice or hints to the Lustre servers.</para>
+      </section>
+      <section><title>Examples</title>
+        <para>The following example gives the OST(s) holding the first 1GB of
+        <literal>/mnt/lustre/file1</literal> a hint that the first 1GB of the
+        file will be read soon.</para>
+        <screen>client1$ lfs ladvise -a willread -s 0 -e 1048576000 /mnt/lustre/file1
+        </screen>
+        <para>The following example gives the OST(s) holding the first 1GB of
+        <literal>/mnt/lustre/file1</literal> a hint that the first 1GB of the
+        file will not be read in the near future, so the OST(s) can drop the
+        cached data for that part of the file from memory.</para>
+        <screen>client1$ lfs ladvise -a dontneed -s 0 -e 1048576000 /mnt/lustre/file1
+        </screen>
+        <para>The following example requests an LDLM read lock on the first
+       1 MiB of <literal>/mnt/lustre/file1</literal>.  This will attempt to
+       request a lock from the OST holding that region of the file.</para>
+        <screen>client1$ lfs ladvise -a lockahead -m READ -s 0 -e 1M /mnt/lustre/file1
+        </screen>
+        <para>The following example requests an LDLM write lock on
+       [3 MiB, 10 MiB] of <literal>/mnt/lustre/file1</literal>.  This will
+       attempt to request a lock from the OST holding that region of the
+       file.</para>
+        <screen>client1$ lfs ladvise -a lockahead -m WRITE -s 3M -e 10M /mnt/lustre/file1
+        </screen>
+      </section>
+  </section>
+  <section condition="l29">
+      <title>
+          <indexterm>
+              <primary>tuning</primary>
+              <secondary>Large Bulk IO</secondary>
+          </indexterm>
+          Large Bulk IO (16MB RPC)
+      </title>
+      <section><title>Overview</title>
+          <para>Beginning with Lustre 2.9, Lustre is extended to support RPCs up
+          to 16MB in size. By enabling a larger RPC size, fewer RPCs will be
+          required to transfer the same amount of data between clients and
+          servers.  With a larger RPC size, the OSS can submit more data to the
+          underlying disks at once, and can therefore produce larger disk I/Os
+          that fully utilize the bandwidth of the underlying disks.</para>
+          <para>At connection time, each client negotiates with the
+          server the maximum RPC size that can be used, but the
+         client can always send RPCs smaller than this maximum.</para>
+          <para>The parameter <literal>brw_size</literal> is used on the OST
+         to tell the client the maximum (preferred) IO size.  Clients
+          communicating with this target never send an RPC larger than this size.
+         Clients can individually set a smaller RPC size limit via the
+         <literal>osc.*.max_pages_per_rpc</literal> tunable.
+          </para>
+         <note>
+         <para>The smallest <literal>brw_size</literal> that can be set for
+         ZFS OSTs is the <literal>recordsize</literal> of that dataset.  This
+         ensures that the client can always write a full ZFS file block if it
+         has enough dirty data, and does not otherwise force it to do read-
+         modify-write operations for every RPC.
+          </para>
+         </note>
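+         <para>For example, the ZFS record size of an OST dataset can be
+         checked on the OSS (the dataset name shown here is only
+         illustrative):</para>
+         <screen>oss# zfs get recordsize <replaceable>fsname</replaceable>-ost0/ost0</screen>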
+      </section>
+      <section><title>Usage</title>
+          <para>In order to enable a larger RPC size,
+          <literal>brw_size</literal> must be changed to an IO size value up to
+          16MB.  To temporarily change <literal>brw_size</literal>, the
+          following command should be run on the OSS:</para>
+          <screen>oss# lctl set_param obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
+          <para>To persistently change <literal>brw_size</literal>, one of the following
+          commands should be run on the OSS:</para>
+          <screen>oss# lctl set_param -P obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
+          <screen>oss# lctl conf_param <replaceable>fsname</replaceable>-OST*.obdfilter.brw_size=16</screen>
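+          <para>To check the current <literal>brw_size</literal> value on the
+          OSS, run:</para>
+          <screen>oss# lctl get_param obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size</screen>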
+          <para>When a client connects to an OST target, it will fetch
+          <literal>brw_size</literal> from the target and use the smaller of
+          <literal>brw_size</literal> and its local setting for
+          <literal>max_pages_per_rpc</literal> as the actual maximum RPC size.
+          Therefore, the <literal>max_pages_per_rpc</literal> on the client side
+          would have to be set to 16M, or 4096 if the PAGESIZE is 4KB, to enable
+          a 16MB RPC.  To temporarily make the change, the following command
+          should be run on the client to set
+          <literal>max_pages_per_rpc</literal>:</para>
+          <screen>client$ lctl set_param osc.<replaceable>fsname</replaceable>-OST*.max_pages_per_rpc=16M</screen>
+          <para>To persistently make this change, the following command should
+          be run:</para>
+          <screen>client$ lctl conf_param <replaceable>fsname</replaceable>-OST*.osc.max_pages_per_rpc=16M</screen>
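+          <para>To verify the value in use on the client, run:</para>
+          <screen>client$ lctl get_param osc.<replaceable>fsname</replaceable>-OST*.max_pages_per_rpc</screen>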
+          <caution><para>The <literal>brw_size</literal> of an OST can be
+          changed on the fly.  However, clients have to be remounted to
+          renegotiate the new maximum RPC size.</para></caution>
+      </section>
+  </section>
   <section xml:id="dbdoclet.50438272_80545">
     <title>
     <indexterm>
       <primary>tuning</primary>
       <secondary>for small files</secondary>
-    </indexterm>Improving Lustre File System Performance When Working with
-    Small Files</title>
+    </indexterm>Improving Lustre I/O Performance for Small Files</title>
     <para>An environment where an application writes small file chunks from
-    many clients to a single file will result in bad I/O performance. To
+    many clients to a single file can result in poor I/O performance. To
     improve the performance of the Lustre file system with small files:</para>
     <itemizedlist>
       <listitem>
         <para>Have the application aggregate writes some amount before
         submitting them to the Lustre file system. By default, the Lustre
         software enforces POSIX coherency semantics, so it results in lock
-        ping-pong between client nodes if they are all writing to the same file
-        at one time.</para>
+        ping-pong between client nodes if they are all writing to the same
+        file at one time.</para>
+        <para>Using MPI-IO Collective Write functionality in
+        the Lustre ADIO driver is one way to achieve this in a straightforward
+        manner if the application is already using MPI-IO.</para>
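+        <para>For example, with ROMIO-based MPI-IO implementations, collective
+        buffering for writes can typically be enabled through an MPI-IO hints
+        file; the hint name and the <literal>ROMIO_HINTS</literal> mechanism
+        shown here are ROMIO conventions, not Lustre parameters:</para>
+        <screen>client$ echo "romio_cb_write enable" &gt; romio_hints
+client$ export ROMIO_HINTS=$PWD/romio_hints</screen>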
       </listitem>
       <listitem>
-        <para>Have the application do 4kB 
-        <literal>O_DIRECT</literal> sized I/O to the file and disable locking on
-        the output file. This avoids partial-page IO submissions and, by
+        <para>Have the application do 4kB
+        <literal>O_DIRECT</literal> sized I/O to the file and disable locking
+        on the output file. This avoids partial-page IO submissions and, by
         disabling locking, you avoid contention between clients.</para>
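+        <para>As an illustration only, a 4kB direct I/O write pattern can be
+        generated from the shell as shown below; disabling locking on the
+        output file must be done by the application itself:</para>
+        <screen>client$ dd if=/dev/zero of=/mnt/lustre/outfile bs=4k count=1024 oflag=direct</screen>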
       </listitem>
       <listitem>