LustreTuning.xml

   1 <?xml version='1.0' encoding='utf-8'?>
   2 <chapter xmlns="http://docbook.org/ns/docbook"
   3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
   4 xml:id="lustretuning">
   5   <title xml:id="lustretuning.title">Tuning a Lustre File System</title>
   6   <para>This chapter contains information about tuning a Lustre file system for
   7   better performance.</para>
   8   <note>
   9     <para>Many options in the Lustre software are set by means of kernel module
  10     parameters. These parameters are contained in the
  11     <literal>/etc/modprobe.d/lustre.conf</literal> file.</para>
  12   </note>
  13   <section xml:id="dbdoclet.50438272_55226">
  14     <title>
  15     <indexterm>
  16       <primary>tuning</primary>
  17     </indexterm>
  18     <indexterm>
  19       <primary>tuning</primary>
  20       <secondary>service threads</secondary>
  21     </indexterm>Optimizing the Number of Service Threads</title>
    <para>An OSS can have a minimum of two service threads and a maximum of 512
    service threads. The default number of service threads depends on how much
    RAM and how many CPUs are on each OSS node (one thread per 128 MB of RAM,
    multiplied by the number of CPUs). If the load on the OSS node is high, new
    service threads are started to process more requests concurrently, up to
    four times the initial number of threads (subject to the maximum of 512).
    For example, on a 2 GB, 2-CPU system, the default thread count is 32 and
    the maximum thread count is 128.</para>
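    <para>The sizing rule above can be sketched as a short calculation. This is
    an illustrative Python model of the rule as stated in this section, not the
    code used by the Lustre software. The function name
    <literal>oss_thread_counts</literal> and its arguments are hypothetical, and
    the sketch assumes that the rule means RAM divided by 128 MB, multiplied by
    the CPU count.</para>
    <programlisting language="python"><![CDATA[
OSS_THREADS_MIN = 2      # minimum number of OSS service threads
OSS_THREADS_MAX = 512    # maximum number of OSS service threads


def oss_thread_counts(ram_mb, num_cpus):
    """Return (initial, maximum) thread counts for the rule described above."""
    # One thread per 128 MB of RAM, multiplied by the number of CPUs.
    initial = (ram_mb // 128) * num_cpus
    # Keep the initial count within the documented bounds.
    initial = max(OSS_THREADS_MIN, min(initial, OSS_THREADS_MAX))
    # Under load, up to four times the initial count, capped at 512.
    maximum = min(4 * initial, OSS_THREADS_MAX)
    return initial, maximum


if __name__ == "__main__":
    # The 2 GB, 2-CPU example from the text.
    print(oss_thread_counts(2048, 2))
]]></programlisting>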
  29     <para>Increasing the size of the thread pool may help when:</para>
  30     <itemizedlist>
  31       <listitem>
  32         <para>Several OSTs are exported from a single OSS</para>
  33       </listitem>
  34       <listitem>
  35         <para>Back-end storage is running synchronously</para>
  36       </listitem>
  37       <listitem>
  38         <para>I/O completions take excessive time due to slow storage</para>
  39       </listitem>
  40     </itemizedlist>
  41     <para>Decreasing the size of the thread pool may help if:</para>
  42     <itemizedlist>
  43       <listitem>
  44         <para>Clients are overwhelming the storage capacity</para>
  45       </listitem>
  46       <listitem>
  47         <para>There are lots of "slow I/O" or similar messages</para>
  48       </listitem>
  49     </itemizedlist>
    <para>Increasing the number of I/O threads allows the kernel and storage to
    aggregate many writes together for more efficient disk I/O. The OSS thread
    pool is shared, and each thread allocates approximately 1.5 MB (maximum RPC
    size + 0.5 MB) for internal I/O buffers.</para>
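    <para>To see how quickly this buffer memory adds up, the per-thread figure
    can be multiplied by the thread count. The following Python sketch assumes
    a maximum RPC size of 1 MB, which is what the 1.5 MB figure above implies.
    The function name is hypothetical.</para>
    <programlisting language="python"><![CDATA[
def oss_buffer_memory_mb(num_threads, max_rpc_mb=1.0):
    """Estimate the I/O buffer memory (in MB) used by the OSS thread pool."""
    # Each thread allocates the maximum RPC size plus 0.5 MB.
    per_thread_mb = max_rpc_mb + 0.5
    return num_threads * per_thread_mb


if __name__ == "__main__":
    for threads in (32, 128, 512):
        print(threads, "threads:", oss_buffer_memory_mb(threads), "MB")
]]></programlisting>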
  54     <para>It is very important to consider memory consumption when increasing
  55     the thread pool size. Drives are only able to sustain a certain amount of
  56     parallel I/O activity before performance is degraded, due to the high
  57     number of seeks and the OST threads just waiting for I/O. In this
  58     situation, it may be advisable to decrease the load by decreasing the
  59     number of OST threads.</para>
    <para>Determining the optimum number of OSS threads is a process of trial
    and error, and varies for each particular configuration. Variables include
    the number of OSTs on each OSS, number and speed of disks, RAID
    configuration, and available RAM. You may want to start with a number of
    OST threads equal to the number of actual disk spindles on the node. If you
    use RAID, subtract any spindles not used for actual data (e.g., 1 of N
    spindles for RAID5, 2 of N spindles for RAID6), and monitor the
    performance of clients during usual workloads. If performance is
    unsatisfactory, increase the thread count and measure again, repeating
    until performance is satisfactory or begins to degrade.</para>
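    <para>The starting point suggested above can be written down as a small
    helper. This Python sketch assumes that each RAID array is described by its
    level and spindle count. The function name and the input format are
    hypothetical, and the result is only an initial value for the trial-and-error
    process described above.</para>
    <programlisting language="python"><![CDATA[
# Spindles per array that do not hold data, following the text above.
PARITY_SPINDLES = {"raid5": 1, "raid6": 2}


def starting_ost_threads(arrays):
    """Return a starting thread count for a list of (raid_level, spindles)."""
    total = 0
    for level, spindles in arrays:
        # Unknown levels are treated as having no parity spindles.
        parity = PARITY_SPINDLES.get(level.lower(), 0)
        total += max(spindles - parity, 0)
    return total


if __name__ == "__main__":
    # Two 10-disk RAID6 arrays on one OSS node.
    print(starting_ost_threads([("raid6", 10), ("raid6", 10)]))
]]></programlisting>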
  70     <note>
      <para>If there are too many threads, the latency for individual I/O
      requests can become very high; this situation should be avoided. Set the
      desired maximum thread count permanently using a kernel module parameter
      in <literal>/etc/modprobe.d/lustre.conf</literal>, as noted at the
      beginning of this chapter.</para>
  74     </note>
  75     <section>
  76       <title>
  77       <indexterm>
  78         <primary>tuning</primary>
  79         <secondary>OSS threads</secondary>
  80       </indexterm>Specifying the OSS Service Thread Count</title>
      <para>The
      <literal>oss_num_threads</literal> parameter enables the number of OST
      service threads to be specified at module load time on the OSS
      nodes:</para>
      <screen>
options ost oss_num_threads={N}
</screen>
      <para>After startup, the minimum and maximum number of OSS service
      threads can be set, and the number of running threads can be read, via
      the <literal>{service}.threads_{min,max,started}</literal> tunables. To
      read or change a tunable at runtime, run:</para>
      <para>
        <screen>
lctl {get,set}_param {service}.threads_{min,max,started}
</screen>
      </para>
      <para>OSS service threads can also be bound to CPU partitions (CPTs), in
      a similar fashion to the binding of threads on the MDS. MDS thread
      binding is covered in
      <xref linkend="dbdoclet.mdsbinding" />.</para>
 101       <itemizedlist>
 102         <listitem>
 103           <para>
 104           <literal>oss_cpts=[EXPRESSION]</literal> binds the default OSS service
 105           on CPTs defined by
 106           <literal>[EXPRESSION]</literal>.</para>
 107         </listitem>
 108         <listitem>
 109           <para>
 110           <literal>oss_io_cpts=[EXPRESSION]</literal> binds the IO OSS service
 111           on CPTs defined by
 112           <literal>[EXPRESSION]</literal>.</para>
 113         </listitem>
 114       </itemizedlist>
 115       <para>For further details, see
 116       <xref linkend="dbdoclet.50438271_87260" />.</para>
 117     </section>
 118     <section xml:id="dbdoclet.mdstuning">
 119       <title>
 120       <indexterm>
 121         <primary>tuning</primary>
 122         <secondary>MDS threads</secondary>
 123       </indexterm>Specifying the MDS Service Thread Count</title>
 124       <para>The
 125       <literal>mds_num_threads</literal> parameter enables the number of MDS
 126       service threads to be specified at module load time on the MDS
 127       node:</para>
 128       <screen>options mds mds_num_threads={N}</screen>
      <para>After startup, the minimum and maximum number of MDS service
      threads can be set, and the number of running threads can be read, via
      the <literal>{service}.threads_{min,max,started}</literal> tunables. To
      read or change a tunable at runtime, run:</para>
      <para>
        <screen>
lctl {get,set}_param {service}.threads_{min,max,started}
</screen>
      </para>
 138       <para>For details, see
 139       <xref linkend="dbdoclet.50438271_87260" />.</para>
 140       <para>The number of MDS service threads started depends on system size
 141       and the load on the server, and has a default maximum of 64. The
 142       maximum potential number of threads (<literal>MDS_MAX_THREADS</literal>)
 143       is 1024.</para>
 144       <note>
 145         <para>The OSS and MDS start two threads per service per CPT at mount
 146         time, and dynamically increase the number of running service threads in
 147         response to server load. Setting the <literal>*_num_threads</literal>
 148         module parameter starts the specified number of threads for that
 149         service immediately and disables automatic thread creation behavior.
 150         </para>
 151       </note>
      <para>The following module parameters give administrators control over
      the number of threads for specific MDS services:</para>
 154       <itemizedlist>
 155         <listitem>
 156           <para>
          <literal>mds_rdpg_num_threads</literal> controls the number of threads
          that provide the read page service. The read page service handles
          file close and readdir operations.</para>
 160         </listitem>
 161         <listitem>
 162           <para>
          <literal>mds_attr_num_threads</literal> controls the number of threads
          that provide the setattr service to clients running Lustre software
          release 1.8.</para>
 166         </listitem>
 167       </itemizedlist>
 168     </section>
 169   </section>
 170   <section xml:id="dbdoclet.mdsbinding">
 171     <title>
 172     <indexterm>
 173       <primary>tuning</primary>
 174       <secondary>MDS binding</secondary>
    </indexterm>Binding MDS Service Threads to CPU Partitions</title>
    <para>With the Node Affinity (<xref linkend="nodeaffdef" />) feature,
    MDS threads can be bound to particular CPU partitions (CPTs) to improve CPU
    cache usage and memory locality.  Default values for CPT counts and CPU core
    bindings are selected automatically to provide good overall performance for
    a given CPU count. However, an administrator can override these settings
    if needed.  For details on specifying the mapping of CPU cores to
    CPTs, see <xref linkend="dbdoclet.libcfstuning"/>.
    </para>
 184     <itemizedlist>
 185       <listitem>
 186         <para>
 187         <literal>mds_num_cpts=[EXPRESSION]</literal> binds the default MDS
 188         service threads to CPTs defined by
 189         <literal>EXPRESSION</literal>. For example
 190         <literal>mds_num_cpts=[0-3]</literal> will bind the MDS service threads
 191         to
 192         <literal>CPT[0,1,2,3]</literal>.</para>
 193       </listitem>
 194       <listitem>
 195         <para>
 196         <literal>mds_rdpg_num_cpts=[EXPRESSION]</literal> binds the read page
 197         service threads to CPTs defined by
 198         <literal>EXPRESSION</literal>. The read page service handles file close
 199         and readdir requests. For example
 200         <literal>mds_rdpg_num_cpts=[4]</literal> will bind the read page threads
 201         to
 202         <literal>CPT4</literal>.</para>
 203       </listitem>
 204       <listitem>
 205         <para>
 206         <literal>mds_attr_num_cpts=[EXPRESSION]</literal> binds the setattr
 207         service threads to CPTs defined by
 208         <literal>EXPRESSION</literal>.</para>
 209       </listitem>
 210     </itemizedlist>
    <para>Parameters must be set before module load in the file
    <literal>/etc/modprobe.d/lustre.conf</literal>. For example:
    <example><title>lustre.conf</title>
    <screen>options lnet networks=tcp0(eth0)
options mdt mds_num_cpts=[0]</screen>
    </example>
    </para>
 218   </section>
 219   <section xml:id="dbdoclet.50438272_73839">
 220     <title>
 221     <indexterm>
 222       <primary>LNet</primary>
 223       <secondary>tuning</secondary>
 224     </indexterm>
 225     <indexterm>
 226       <primary>tuning</primary>
 227       <secondary>LNet</secondary>
 228     </indexterm>Tuning LNet Parameters</title>
 229     <para>This section describes LNet tunables, the use of which may be
 230     necessary on some systems to improve performance. To test the performance
 231     of your Lustre network, see
 232     <xref linkend='lnetselftest' />.</para>
 233     <section remap="h3">
 234       <title>Transmit and Receive Buffer Size</title>
 235       <para>The kernel allocates buffers for sending and receiving messages on
 236       a network.</para>
 237       <para>
 238       <literal>ksocklnd</literal> has separate parameters for the transmit and
 239       receive buffers.</para>
 240       <screen>
 241 options ksocklnd tx_buffer_size=0 rx_buffer_size=0
 242 </screen>
 243       <para>If these parameters are left at the default value (0), the system
 244       automatically tunes the transmit and receive buffer size. In almost every
 245       case, this default produces the best performance. Do not attempt to tune
 246       these parameters unless you are a network expert.</para>
 247     </section>
 248     <section remap="h3">
 249       <title>Hardware Interrupts (
 250       <literal>enable_irq_affinity</literal>)</title>
 251       <para>The hardware interrupts that are generated by network adapters may
 252       be handled by any CPU in the system. In some cases, we would like network
 253       traffic to remain local to a single CPU to help keep the processor cache
 254       warm and minimize the impact of context switches. This is helpful when an
 255       SMP system has more than one network interface and ideal when the number
 256       of interfaces equals the number of CPUs. To enable the
 257       <literal>enable_irq_affinity</literal> parameter, enter:</para>
 258       <screen>
 259 options ksocklnd enable_irq_affinity=1
 260 </screen>
 261       <para>In other cases, if you have an SMP platform with a single fast
 262       interface such as 10 Gb Ethernet and more than two CPUs, you may see
 263       performance improve by turning this parameter off.</para>
 264       <screen>
 265 options ksocklnd enable_irq_affinity=0
 266 </screen>
 267       <para>By default, this parameter is off. As always, you should test the
 268       performance to compare the impact of changing this parameter.</para>
 269     </section>
 270     <section>
 271       <title>
 272       <indexterm>
 273         <primary>tuning</primary>
 274         <secondary>Network interface binding</secondary>
      </indexterm>Binding Network Interfaces to CPU Partitions</title>
      <para>Lustre allows an administrator to bind a network interface to one
      or more CPU partitions. Bindings are specified as options to the LNet
      modules. For more information on specifying module options, see
      <xref linkend="dbdoclet.50438293_15350" />.</para>
 281       <para>For example,
 282       <literal>o2ib0(ib0)[0,1]</literal> will ensure that all messages for
 283       <literal>o2ib0</literal> will be handled by LND threads executing on
 284       <literal>CPT0</literal> and
 285       <literal>CPT1</literal>. An additional example might be:
 286       <literal>tcp1(eth0)[0]</literal>. Messages for
 287       <literal>tcp1</literal> are handled by threads on
 288       <literal>CPT0</literal>.</para>
 289     </section>
 290     <section>
 291       <title>
 292       <indexterm>
 293         <primary>tuning</primary>
 294         <secondary>Network interface credits</secondary>
 295       </indexterm>Network Interface Credits</title>
      <para>Network interface (NI) credits are shared across all CPU partitions
      (CPTs). For example, if a machine has four CPTs and the number of NI
      credits is 512, then each partition has 128 credits. If a large number of
      CPTs exist on the system, LNet checks and validates the NI credits for
      each CPT to ensure each CPT has a workable number of credits. For
      example, if a machine has 16 CPTs and the number of NI credits is 256,
      then each partition has only 16 credits. 16 NI credits is low and could
      negatively impact performance. As a result, LNet automatically adjusts
      the credits to 8 *
      <literal>peer_credits</literal>
      (<literal>peer_credits</literal> is 8 by default), so each partition has 64
      credits.</para>
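      <para>The two examples above can be reproduced with a short calculation.
      This Python sketch models only the behavior described in this section: an
      even split of NI credits across CPTs, with a floor of
      8 * <literal>peer_credits</literal>. It is not the LNet implementation,
      and the function name is hypothetical.</para>
      <programlisting language="python"><![CDATA[
def credits_per_cpt(ni_credits, num_cpts, peer_credits=8):
    """Return the NI credits available to each CPT, as described above."""
    # NI credits are divided evenly across the CPU partitions.
    share = ni_credits // num_cpts
    # A share that is too low is raised to 8 * peer_credits.
    floor = 8 * peer_credits
    return max(share, floor)


if __name__ == "__main__":
    print(credits_per_cpt(512, 4))    # four CPTs, 512 credits
    print(credits_per_cpt(256, 16))   # 16 CPTs, 256 credits
]]></programlisting>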
 308       <para>Increasing the number of
 309       <literal>credits</literal>/
 310       <literal>peer_credits</literal> can improve the performance of high
 311       latency networks (at the cost of consuming more memory) by enabling LNet
 312       to send more inflight messages to a specific network/peer and keep the
 313       pipeline saturated.</para>
      <para>An administrator can modify the NI credit count using the
      <literal>ksocklnd</literal> or
      <literal>ko2iblnd</literal> module options. In the example below, 256
      credits are applied to TCP connections.</para>
      <screen>
options ksocklnd credits=256
</screen>
      <para>Applying 256 credits to IB connections can be achieved with:</para>
      <screen>
options ko2iblnd credits=256
</screen>
 325       <note>
 326         <para>LNet may revalidate the NI credits, so the administrator's
 327         request may not persist.</para>
 328       </note>
 329     </section>
 330     <section>
 331       <title>
 332       <indexterm>
 333         <primary>tuning</primary>
 334         <secondary>router buffers</secondary>
 335       </indexterm>Router Buffers</title>
 336       <para>When a node is set up as an LNet router, three pools of buffers are
 337       allocated: tiny, small and large. These pools are allocated per CPU
 338       partition and are used to buffer messages that arrive at the router to be
 339       forwarded to the next hop. The three different buffer sizes accommodate
 340       different size messages.</para>
      <para>If a message fits in a tiny buffer, a tiny buffer is used. If it
      does not fit in a tiny buffer but fits in a small buffer, a small buffer
      is used. Finally, if a message fits in neither a tiny buffer nor a small
      buffer, a large buffer is used.</para>
 346       <para>Router buffers are shared by all CPU partitions. For a machine with
 347       a large number of CPTs, the router buffer number may need to be specified
 348       manually for best performance. A low number of router buffers risks
 349       starving the CPU partitions of resources.</para>
 350       <itemizedlist>
 351         <listitem>
 352           <para>
 353           <literal>tiny_router_buffers</literal>: Zero payload buffers used for
 354           signals and acknowledgements.</para>
 355         </listitem>
 356         <listitem>
 357           <para>
 358           <literal>small_router_buffers</literal>: 4 KB payload buffers for
 359           small messages</para>
 360         </listitem>
 361         <listitem>
 362           <para>
 363           <literal>large_router_buffers</literal>: 1 MB maximum payload
 364           buffers, corresponding to the recommended RPC size of 1 MB.</para>
 365         </listitem>
 366       </itemizedlist>
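      <para>The buffer selection described in this section can be summarized in
      a few lines. This Python sketch uses the payload sizes listed above and
      is only a model of the selection rule, not router code. The function name
      and the returned pool labels are hypothetical.</para>
      <programlisting language="python"><![CDATA[
SMALL_PAYLOAD = 4 * 1024       # small router buffers hold up to 4 KB
LARGE_PAYLOAD = 1024 * 1024    # large router buffers hold up to 1 MB


def router_buffer_pool(payload_bytes):
    """Return the pool ("tiny", "small" or "large") used for a message."""
    if payload_bytes < 0:
        raise ValueError("payload size cannot be negative")
    if payload_bytes == 0:
        return "tiny"       # zero-payload signals and acknowledgements
    if payload_bytes <= SMALL_PAYLOAD:
        return "small"
    if payload_bytes <= LARGE_PAYLOAD:
        return "large"
    raise ValueError("payload exceeds the 1 MB large buffer size")


if __name__ == "__main__":
    for size in (0, 512, 4096, 65536, 1048576):
        print(size, router_buffer_pool(size))
]]></programlisting>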
      <para>The default setting for router buffers typically results in
      acceptable performance. LNet automatically sets a default value to reduce
      the likelihood of resource starvation. The number of router buffers can be
      modified as shown in the example below. In this example, the number of
      large buffers is set using the
      <literal>large_router_buffers</literal> parameter.</para>
      <screen>
options lnet large_router_buffers=8192
</screen>
 376       <note>
 377         <para>LNet may revalidate the router buffer setting, so the
 378         administrator's request may not persist.</para>
 379       </note>
 380     </section>
 381     <section>
 382       <title>
 383       <indexterm>
 384         <primary>tuning</primary>
 385         <secondary>portal round-robin</secondary>
 386       </indexterm>Portal Round-Robin</title>
      <para>Portal round-robin defines the policy LNet applies to deliver
      events and messages to the upper layers. The upper layers are the PTLRPC
      service or LNet selftest.</para>
      <para>If portal round-robin is disabled, LNet will deliver messages to
      CPTs based on a hash of the source NID. Hence, all messages from a
      specific peer will be handled by the same CPT. This can reduce data
      traffic between CPUs. However, for some workloads, this behavior may
      result in poor load balancing across CPUs.</para>
      <para>If portal round-robin is enabled, LNet will round-robin incoming
      events across all CPTs. This may balance load better across CPUs but
      can incur cross-CPU overhead.</para>
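      <para>The difference between the two policies can be illustrated with a
      toy model. In this Python sketch, <literal>zlib.crc32</literal> is only a
      stand-in for whatever hash LNet uses internally, and both function names
      are hypothetical. The point is that hashing maps a given NID to the same
      CPT every time, while round-robin cycles through the CPTs regardless of
      the sender.</para>
      <programlisting language="python"><![CDATA[
import itertools
import zlib


def cpt_by_hash(source_nid, num_cpts):
    """Pick a CPT from a hash of the source NID (round-robin disabled)."""
    # crc32 is an illustrative stand-in, not the hash used by LNet.
    return zlib.crc32(source_nid.encode("utf-8")) % num_cpts


def round_robin_cpts(num_cpts):
    """Yield CPT numbers in rotation (round-robin enabled)."""
    return itertools.cycle(range(num_cpts))


if __name__ == "__main__":
    nid = "192.168.1.10@tcp"
    # The same peer always lands on the same CPT.
    print([cpt_by_hash(nid, 4) for _ in range(3)])
    # Successive messages rotate through all CPTs.
    rotor = round_robin_cpts(4)
    print([next(rotor) for _ in range(6)])
]]></programlisting>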
 398       <para>The current policy can be changed by an administrator with
 399       <literal>echo
 400       <replaceable>value</replaceable>&gt;
 401       /proc/sys/lnet/portal_rotor</literal>. There are four options for
 402       <literal>
 403         <replaceable>value</replaceable>
 404       </literal>:</para>
 405       <itemizedlist>
 406         <listitem>
 407           <para>
 408             <literal>OFF</literal>
 409           </para>
 410           <para>Disable portal round-robin on all incoming requests.</para>
 411         </listitem>
 412         <listitem>
 413           <para>
 414             <literal>ON</literal>
 415           </para>
 416           <para>Enable portal round-robin on all incoming requests.</para>
 417         </listitem>
 418         <listitem>
 419           <para>
 420             <literal>RR_RT</literal>
 421           </para>
 422           <para>Enable portal round-robin only for routed messages.</para>
 423         </listitem>
 424         <listitem>
 425           <para>
 426             <literal>HASH_RT</literal>
 427           </para>
          <para>Routed messages will be delivered to the upper layer by hash of
          source NID (instead of the NID of the router). This is the default
          value.</para>
 431         </listitem>
 432       </itemizedlist>
 433     </section>
 434     <section>
 435       <title>LNet Peer Health</title>
 436       <para>Two options are available to help determine peer health:
 437       <itemizedlist>
 438         <listitem>
 439           <para>
          <literal>peer_timeout</literal> - The timeout (in seconds) before an
          aliveness query is sent to a peer. For example, if
          <literal>peer_timeout</literal> is set to
          <literal>180</literal>, an aliveness query is sent to the peer
          every 180 seconds. This feature only takes effect if the node is
          configured as an LNet router.</para>
 446           <para>In a routed environment, the
 447           <literal>peer_timeout</literal> feature should always be on (set to a
 448           value in seconds) on routers. If the router checker has been enabled,
 449           the feature should be turned off by setting it to 0 on clients and
 450           servers.</para>
 451           <para>For a non-routed scenario, enabling the
 452           <literal>peer_timeout</literal> option provides health information
 453           such as whether a peer is alive or not. For example, a client is able
 454           to determine if an MGS or OST is up when it sends it a message. If a
 455           response is received, the peer is alive; otherwise a timeout occurs
 456           when the request is made.</para>
 457           <para>In general,
 458           <literal>peer_timeout</literal> should be set to no less than the LND
 459           timeout setting. For more information about LND timeouts, see
 460           <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 461           linkend="section_c24_nt5_dl" />.</para>
          <para>When the
          <literal>o2iblnd</literal> (IB) driver is used,
          <literal>peer_timeout</literal> should be at least twice the value of
          the
          <literal>ko2iblnd</literal> keepalive option. For more information
          about keepalive options, see
          <xref xmlns:xlink="http://www.w3.org/1999/xlink"
          linkend="section_ngq_qhy_zl" />.</para>
 470         </listitem>
 471         <listitem>
 472           <para>
          <literal>avoid_asym_router_failure</literal> - When set to 1, the
          router checker running on the client or a server periodically pings
          all the routers corresponding to the NIDs identified in the routes
          parameter setting on the node to determine the status of each router
          interface. The default setting is 1. For more information about the
          LNet routes parameter, see
          <xref xmlns:xlink="http://www.w3.org/1999/xlink"
          linkend="lnet_module_routes" />.</para>
          <para>When
          <literal>avoid_asym_router_failure</literal> is set to 1, a router is
          considered down if any of its NIDs are down. For
          example, router X has three NIDs:
          <literal>Xnid1</literal>,
          <literal>Xnid2</literal>, and
          <literal>Xnid3</literal>. A client is connected to the router via
          <literal>Xnid1</literal>. The client has router checker enabled. The
          router checker periodically sends a ping to the router via
          <literal>Xnid1</literal>. The router responds to the ping with the
          status of each of its NIDs. In this case, it responds with
          <literal>Xnid1=up</literal>,
          <literal>Xnid2=up</literal>,
          <literal>Xnid3=down</literal>. If
          <literal>avoid_asym_router_failure==1</literal>, router X is
          considered down because one of its NIDs is down, and it will not be
          used for routing messages. If
          <literal>avoid_asym_router_failure==0</literal>, router X will
          continue to be used for routing messages.</para>
 498         </listitem>
 499       </itemizedlist></para>
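      <para>The router X example above can be expressed as a small decision
      function. This Python sketch is a model of the described behavior only.
      The function name and the status dictionary are hypothetical, and the
      sketch assumes that a router whose client-facing NID is down cannot answer
      the ping and is therefore treated as down in either mode.</para>
      <programlisting language="python"><![CDATA[
def router_is_usable(nid_status, connected_nid, avoid_asym_router_failure=1):
    """Decide whether a router is used, given the NID status it reports.

    nid_status maps each router NID to True (up) or False (down).
    """
    # Assumption: if the NID the client pings is down, the ping fails.
    if not nid_status.get(connected_nid, False):
        return False
    if avoid_asym_router_failure:
        # Any down NID marks the whole router as down.
        return all(nid_status.values())
    # Otherwise the router stays in use despite a down remote NID.
    return True


if __name__ == "__main__":
    status = {"Xnid1": True, "Xnid2": True, "Xnid3": False}
    print(router_is_usable(status, "Xnid1", avoid_asym_router_failure=1))
    print(router_is_usable(status, "Xnid1", avoid_asym_router_failure=0))
]]></programlisting>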
      <para>On routers, each of the following router checker parameters must be
      set to the maximum value configured for that parameter on any client or
      server:
 503       <itemizedlist>
 504         <listitem>
 505           <para>
 506             <literal>dead_router_check_interval</literal>
 507           </para>
 508         </listitem>
 509         <listitem>
 510           <para>
 511             <literal>live_router_check_interval</literal>
 512           </para>
 513         </listitem>
 514         <listitem>
 515           <para>
 516             <literal>router_ping_timeout</literal>
 517           </para>
 518         </listitem>
 519       </itemizedlist></para>
      <para>For example, the
      <literal>dead_router_check_interval</literal> parameter on any router must
      be set to the largest
      <literal>dead_router_check_interval</literal> value configured on any
      client or server.</para>
 523     </section>
 524   </section>
 525   <section xml:id="dbdoclet.libcfstuning">
 526     <title>
 527     <indexterm>
 528       <primary>tuning</primary>
 529       <secondary>libcfs</secondary>
 530     </indexterm>libcfs Tuning</title>
 531     <para>Lustre allows binding service threads via CPU Partition Tables
 532       (CPTs). This allows the system administrator to fine-tune on which CPU
 533       cores the Lustre service threads are run, for both OSS and MDS services,
 534       as well as on the client.
 535     </para>
 536     <para>CPTs are useful to reserve some cores on the OSS or MDS nodes for
 537     system functions such as system monitoring, HA heartbeat, or similar
 538     tasks.  On the client it may be useful to restrict Lustre RPC service
 539     threads to a small subset of cores so that they do not interfere with
 540     computation, or because these cores are directly attached to the network
 541     interfaces.
 542     </para>
 543     <para>By default, the Lustre software will automatically generate CPU
 544     partitions (CPT) based on the number of CPUs in the system.
 545     The CPT count can be explicitly set on the libcfs module using
 546     <literal>cpu_npartitions=<replaceable>NUMBER</replaceable></literal>.
 547     The value of <literal>cpu_npartitions</literal> must be an integer between
 548     1 and the number of online CPUs.
 549     </para>
 550     <para condition='l29'>In Lustre 2.9 and later the default is to use
 551     one CPT per NUMA node.  In earlier versions of Lustre, by default there
 552     was a single CPT if the online CPU core count was four or fewer, and
 553     additional CPTs would be created depending on the number of CPU cores,
 554     typically with 4-8 cores per CPT.
 555     </para>
 556     <tip>
 557       <para>Setting <literal>cpu_npartitions=1</literal> will disable most
 558       of the SMP Node Affinity functionality.</para>
 559     </tip>
 560     <section>
 561       <title>CPU Partition String Patterns</title>
 562       <para>CPU partitions can be described using string pattern notation.
 563       If <literal>cpu_pattern=N</literal> is used, then there will be one
 564       CPT for each NUMA node in the system, with each CPT mapping all of
 565       the CPU cores for that NUMA node.
 566       </para>
 567       <para>It is also possible to explicitly specify the mapping between
 568       CPU cores and CPTs, for example:</para>
 569       <itemizedlist>
 570         <listitem>
 571           <para>
            <literal>cpu_pattern="0[2,4,6] 1[3,5,7]"</literal>
 573           </para>
 574           <para>Create two CPTs, CPT0 contains cores 2, 4, and 6, while CPT1
 575           contains cores 3, 5, 7.  CPU cores 0 and 1 will not be used by Lustre
 576           service threads, and could be used for node services such as
 577           system monitoring, HA heartbeat threads, etc.  The binding of
 578           non-Lustre services to those CPU cores may be done in userspace
 579           using <literal>numactl(8)</literal> or other application-specific
 580           methods, but is beyond the scope of this document.</para>
 581         </listitem>
 582         <listitem>
 583           <para>
            <literal>cpu_pattern="N 0[0-3] 1[4-7]"</literal>
 585           </para>
          <para>Create two CPTs, with CPT0 containing all CPUs in NUMA
          nodes [0-3], while CPT1 contains all CPUs in NUMA nodes [4-7].</para>
 588         </listitem>
 589       </itemizedlist>
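      <para>To make the pattern notation concrete, the two examples above can
      be expanded into explicit tables. The following Python sketch handles only
      the forms shown in this section (an optional leading
      <literal>N</literal>, then <literal>CPT[list]</literal> elements with
      comma-separated values and ranges). It is not the libcfs parser, and the
      function names are hypothetical.</para>
      <programlisting language="python"><![CDATA[
import re


def expand_ranges(spec):
    """Expand a list such as "2,4,6" or "0-3" into a list of integers."""
    values = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            values.extend(range(int(first), int(last) + 1))
        else:
            values.append(int(part))
    return values


def parse_cpu_pattern(pattern):
    """Return (uses_numa_nodes, {cpt_number: [core or node numbers]})."""
    tokens = pattern.split()
    # A leading "N" means the numbers refer to NUMA nodes, not cores.
    uses_numa = bool(tokens) and tokens[0] == "N"
    if uses_numa:
        tokens = tokens[1:]
    table = {}
    for token in tokens:
        match = re.fullmatch(r"(\d+)\[([\d,\-]+)\]", token)
        if match is None:
            raise ValueError("unrecognized pattern element: %r" % token)
        table[int(match.group(1))] = expand_ranges(match.group(2))
    return uses_numa, table


if __name__ == "__main__":
    print(parse_cpu_pattern("0[2,4,6] 1[3,5,7]"))
    print(parse_cpu_pattern("N 0[0-3] 1[4-7]"))
]]></programlisting>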
      <para>The current configuration of the CPU partition can be read via
      <literal>lctl get_param cpu_partition_table</literal>.  For example,
      a simple 4-core system has a single CPT with all four CPU cores:
      <screen>$ lctl get_param cpu_partition_table
cpu_partition_table=0   : 0 1 2 3</screen>
      while a larger NUMA system with four 12-core CPUs may have four CPTs:
      <screen>$ lctl get_param cpu_partition_table
cpu_partition_table=
0       : 0 1 2 3 4 5 6 7 8 9 10 11
1       : 12 13 14 15 16 17 18 19 20 21 22 23
2       : 24 25 26 27 28 29 30 31 32 33 34 35
3       : 36 37 38 39 40 41 42 43 44 45 46 47
</screen>
      </para>
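      <para>When checking a configuration from a script, it can be convenient
      to turn this output into a data structure. The following Python sketch
      parses text in the two layouts shown above. The function name is
      hypothetical, and the output format is assumed to match these examples
      exactly.</para>
      <programlisting language="python"><![CDATA[
PREFIX = "cpu_partition_table="


def parse_cpt_table(text):
    """Parse cpu_partition_table output into {cpt_number: [core numbers]}."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        # The first line may carry the parameter name as a prefix.
        if line.startswith(PREFIX):
            line = line[len(PREFIX):]
        if ":" not in line:
            continue
        cpt, cores = line.split(":", 1)
        table[int(cpt)] = [int(core) for core in cores.split()]
    return table


if __name__ == "__main__":
    print(parse_cpt_table("cpu_partition_table=0   : 0 1 2 3"))
]]></programlisting>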
 604     </section>
 605   </section>
 606   <section xml:id="dbdoclet.lndtuning">
 607     <title>
 608     <indexterm>
 609       <primary>tuning</primary>
 610       <secondary>LND tuning</secondary>
 611     </indexterm>LND Tuning</title>
 612     <para>LND tuning allows the number of threads per CPU partition to be
 613     specified. An administrator can set the threads for both
 614     <literal>ko2iblnd</literal> and
 615     <literal>ksocklnd</literal> using the
 616     <literal>nscheds</literal> parameter. This adjusts the number of threads for
 617     each partition, not the overall number of threads on the LND.</para>
 618     <note>
 619       <para>Lustre software release 2.3 has greatly decreased the default
 620       number of threads for
 621       <literal>ko2iblnd</literal> and
 622       <literal>ksocklnd</literal> on high-core count machines. The current
 623       default values are automatically set and are chosen to work well across a
 624       number of typical scenarios.</para>
 625     </note>
 626     <section>
 627         <title>ko2iblnd Tuning</title>
 628         <para>The following table outlines the ko2iblnd module parameters to be used
 629     for tuning:</para>
 630         <informaltable frame="all">
 631           <tgroup cols="3">
 632             <colspec colname="c1" colwidth="50*" />
 633             <colspec colname="c2" colwidth="50*" />
 634             <colspec colname="c3" colwidth="50*" />
 635             <thead>
 636               <row>
 637                 <entry>
 638                   <para>
 639                     <emphasis role="bold">Module Parameter</emphasis>
 640                   </para>
 641                 </entry>
 642                 <entry>
 643                   <para>
 644                     <emphasis role="bold">Default Value</emphasis>
 645                   </para>
 646                 </entry>
 647                 <entry>
 648                   <para>
 649                     <emphasis role="bold">Description</emphasis>
 650                   </para>
 651                 </entry>
 652               </row>
 653             </thead>
 654             <tbody>
 655               <row>
 656                 <entry>
 657                   <para>
 658                     <literal>service</literal>
 659                   </para>
 660                 </entry>
 661                 <entry>
 662                   <para>
 663                     <literal>987</literal>
 664                   </para>
 665                 </entry>
 666                 <entry>
 667                   <para>Service number (within RDMA_PS_TCP).</para>
 668                 </entry>
 669               </row>
 670               <row>
 671                 <entry>
 672                   <para>
 673                     <literal>cksum</literal>
 674                   </para>
 675                 </entry>
 676                 <entry>
 677                   <para>
 678                     <literal>0</literal>
 679                   </para>
 680                 </entry>
 681                 <entry>
 682                   <para>Set non-zero to enable message (not RDMA) checksums.</para>
 683                 </entry>
 684               </row>
 685               <row>
 686                 <entry>
 687                   <para>
 688                     <literal>timeout</literal>
 689                   </para>
 690                 </entry>
 691                 <entry>
 692                 <para>
 693                   <literal>50</literal>
 694                 </para>
 695               </entry>
 696                 <entry>
 697                   <para>Timeout in seconds.</para>
 698                 </entry>
 699               </row>
 700               <row>
 701                 <entry>
 702                   <para>
 703                     <literal>nscheds</literal>
 704                   </para>
 705                 </entry>
 706                 <entry>
 707                   <para>
 708                     <literal>0</literal>
 709                   </para>
 710                 </entry>
 711                 <entry>
 712                   <para>Number of threads in each scheduler pool (per CPT).  A value
 713           of zero means the number is derived from the number of cores.</para>
 714                 </entry>
 715               </row>
 716               <row>
 717                 <entry>
 718                   <para>
 719                     <literal>conns_per_peer</literal>
 720                   </para>
 721                 </entry>
 722                 <entry>
 723                   <para>
 724                     <literal>4 (OmniPath), 1 (Everything else)</literal>
 725                   </para>
 726                 </entry>
 727                 <entry>
 728                   <para>Introduced in 2.10. Number of connections to each peer. Messages
 729           are sent round-robin over the connection pool.  Provides significant
 730           improvement with OmniPath.</para>
 731                 </entry>
 732               </row>
 733               <row>
 734                 <entry>
 735                   <para>
 736                     <literal>ntx</literal>
 737                   </para>
 738                 </entry>
 739                 <entry>
 740                   <para>
 741                     <literal>512</literal>
 742                   </para>
 743                 </entry>
 744                 <entry>
 745                   <para>Number of message descriptors allocated for each pool at
 746           startup. Grows at runtime. Shared by all CPTs.</para>
 747                 </entry>
 748               </row>
 749               <row>
 750                 <entry>
 751                   <para>
 752                     <literal>credits</literal>
 753                   </para>
 754                 </entry>
 755                 <entry>
 756                   <para>
 757                     <literal>256</literal>
 758                   </para>
 759                 </entry>
 760                 <entry>
 761                   <para>Number of concurrent sends on network.</para>
 762                 </entry>
 763               </row>
 764               <row>
 765                 <entry>
 766                   <para>
 767                     <literal>peer_credits</literal>
 768                   </para>
 769                 </entry>
 770                 <entry>
 771                   <para>
 772                     <literal>8</literal>
 773                   </para>
 774                 </entry>
 775                 <entry>
 776                   <para>Number of concurrent sends to a single peer. Related to, and
 777           limited by, the IB queue size.</para>
 778                 </entry>
 779               </row>
 780               <row>
 781                 <entry>
 782                   <para>
 783                     <literal>peer_credits_hiw</literal>
 784                   </para>
 785                 </entry>
 786                 <entry>
 787                   <para>
 788                     <literal>0</literal>
 789                   </para>
 790                 </entry>
 791                 <entry>
 792                   <para>High-water mark at which to eagerly return credits.</para>
 793                 </entry>
 794               </row>
 795               <row>
 796                 <entry>
 797                   <para>
 798                     <literal>peer_buffer_credits</literal>
 799                   </para>
 800                 </entry>
 801                 <entry>
 802                   <para>
 803                     <literal>0</literal>
 804                   </para>
 805                 </entry>
 806                 <entry>
 807                   <para>Number of per-peer router buffer credits.</para>
 808                 </entry>
 809               </row>
 810               <row>
 811                 <entry>
 812                   <para>
 813                     <literal>peer_timeout</literal>
 814                   </para>
 815                 </entry>
 816                 <entry>
 817                   <para>
 818                     <literal>180</literal>
 819                   </para>
 820                 </entry>
 821                 <entry>
 822                   <para>Seconds without aliveness news to declare peer dead (less than
 823           or equal to 0 to disable).</para>
 824                 </entry>
 825               </row>
 826               <row>
 827                 <entry>
 828                   <para>
 829                     <literal>ipif_name</literal>
 830                   </para>
 831                 </entry>
 832                 <entry>
 833                   <para>
 834                     <literal>ib0</literal>
 835                   </para>
 836                 </entry>
 837                 <entry>
 838                   <para>IPoIB interface name.</para>
 839                 </entry>
 840               </row>
 841               <row>
 842                 <entry>
 843                   <para>
 844                     <literal>retry_count</literal>
 845                   </para>
 846                 </entry>
 847                 <entry>
 848                   <para>
 849                     <literal>5</literal>
 850                   </para>
 851                 </entry>
 852                 <entry>
 853                   <para>Retransmissions when no ACK received.</para>
 854                 </entry>
 855               </row>
 856               <row>
 857                 <entry>
 858                   <para>
 859                     <literal>rnr_retry_count</literal>
 860                   </para>
 861                 </entry>
 862                 <entry>
 863                   <para>
 864                     <literal>6</literal>
 865                   </para>
 866                 </entry>
 867                 <entry>
 868                   <para>RNR retransmissions.</para>
 869                 </entry>
 870               </row>
 871               <row>
 872                 <entry>
 873                   <para>
 874                     <literal>keepalive</literal>
 875                   </para>
 876                 </entry>
 877                 <entry>
 878                   <para>
 879                     <literal>100</literal>
 880                   </para>
 881                 </entry>
 882                 <entry>
 883                   <para>Idle time in seconds before sending a keepalive.</para>
 884                 </entry>
 885               </row>
 886               <row>
 887                 <entry>
 888                   <para>
 889                     <literal>ib_mtu</literal>
 890                   </para>
 891                 </entry>
 892                 <entry>
 893                   <para>
 894                     <literal>0</literal>
 895                   </para>
 896                 </entry>
 897                 <entry>
 898                   <para>IB MTU 256/512/1024/2048/4096.</para>
 899                 </entry>
 900               </row>
 901               <row>
 902                 <entry>
 903                   <para>
 904                     <literal>concurrent_sends</literal>
 905                   </para>
 906                 </entry>
 907                 <entry>
 908                   <para>
 909                     <literal>0</literal>
 910                   </para>
 911                 </entry>
 912                 <entry>
 913                   <para>Send work-queue sizing. If zero, derived from
 914           <literal>map_on_demand</literal> and <literal>peer_credits</literal>.
 915           </para>
 916                 </entry>
 917               </row>
 918               <row>
 919                 <entry>
 920                   <para>
 921                     <literal>map_on_demand</literal>
 922                   </para>
 923                 </entry>
 924                 <entry>
 925                   <para>
 926             <literal>0 (pre-4.8 Linux), 1 (4.8 Linux onward), 32 (OmniPath)</literal>
 927                   </para>
 928                 </entry>
 929                 <entry>
 930                   <para>Number of fragments reserved for a connection.  If zero, use
 931           the global memory region (found to be a security issue).  If non-zero,
 932           use FMR or FastReg for memory registration.  The value must agree
 933           between both peers of a connection.</para>
 934                 </entry>
 935               </row>
 936               <row>
 937                 <entry>
 938                   <para>
 939                     <literal>fmr_pool_size</literal>
 940                   </para>
 941                 </entry>
 942                 <entry>
 943                   <para>
 944                     <literal>512</literal>
 945                   </para>
 946                 </entry>
 947                 <entry>
 948                   <para>Size of FMR pool on each CPT (>= ntx / 4).  Grows at runtime.
 949           </para>
 950                 </entry>
 951               </row>
 952               <row>
 953                 <entry>
 954                   <para>
 955                     <literal>fmr_flush_trigger</literal>
 956                   </para>
 957                 </entry>
 958                 <entry>
 959                   <para>
 960                     <literal>384</literal>
 961                   </para>
 962                 </entry>
 963                 <entry>
 964                   <para>Number of dirty FMRs that triggers a pool flush.</para>
 965                 </entry>
 966               </row>
 967               <row>
 968                 <entry>
 969                   <para>
 970                     <literal>fmr_cache</literal>
 971                   </para>
 972                 </entry>
 973                 <entry>
 974                   <para>
 975                     <literal>1</literal>
 976                   </para>
 977                 </entry>
 978                 <entry>
 979                   <para>Non-zero to enable FMR caching.</para>
 980                 </entry>
 981               </row>
 982               <row>
 983                 <entry>
 984                   <para>
 985                     <literal>dev_failover</literal>
 986                   </para>
 987                 </entry>
 988                 <entry>
 989                   <para>
 990                     <literal>0</literal>
 991                   </para>
 992                 </entry>
 993                 <entry>
 994                   <para>HCA failover for bonding (0 OFF, 1 ON, other values reserved).
 995           </para>
 996                 </entry>
 997               </row>
 998               <row>
 999                 <entry>
1000                   <para>
1001                     <literal>require_privileged_port</literal>
1002                   </para>
1003                 </entry>
1004                 <entry>
1005                   <para>
1006                     <literal>0</literal>
1007                   </para>
1008                 </entry>
1009                 <entry>
1010                   <para>Require privileged port when accepting connection.</para>
1011                 </entry>
1012               </row>
1013               <row>
1014                 <entry>
1015                   <para>
1016                     <literal>use_privileged_port</literal>
1017                   </para>
1018                 </entry>
1019                 <entry>
1020                   <para>
1021                     <literal>1</literal>
1022                   </para>
1023                 </entry>
1024                 <entry>
1025                   <para>Use privileged port when initiating connection.</para>
1026                 </entry>
1027               </row>
1028               <row>
1029                 <entry>
1030                   <para>
1031                     <literal>wrq_sge</literal>
1032                   </para>
1033                 </entry>
1034                 <entry>
1035                   <para>
1036                     <literal>2</literal>
1037                   </para>
1038                 </entry>
1039                 <entry>
1040                   <para>Introduced in 2.10. Number of scatter/gather element groups
1041           per work request.  Used to deal with fragmentation, which can consume
1042           double the number of work requests.</para>
1043                 </entry>
1044               </row>
1045             </tbody>
1046           </tgroup>
1047         </informaltable>
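        <para>As a hedged example of how the parameters in the table above
        are applied, several of them can be combined on a single
        <literal>options</literal> line in
        <literal>/etc/modprobe.d/lustre.conf</literal>. The values below are
        assumptions chosen purely for illustration and are not tuning
        recommendations:</para>
        <screen>
options ko2iblnd peer_credits=16 credits=512
</screen>
        <para>On systems where the loaded module exposes its parameters
        through sysfs, the value in effect can usually be checked after the
        module has been loaded:</para>
        <screen>
$ cat /sys/module/ko2iblnd/parameters/peer_credits
</screen>
        <para>Parameters such as
        <literal>map_on_demand</literal> must agree between both peers of a
        connection, so changes of this kind are normally rolled out
        consistently across clients and servers.</para>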
1048     </section>
1049   </section>
1050   <section xml:id="dbdoclet.nrstuning" condition='l24'>
1051     <title>
1052     <indexterm>
1053       <primary>tuning</primary>
1054       <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1055     </indexterm>Network Request Scheduler (NRS) Tuning</title>
1056     <para>The Network Request Scheduler (NRS) allows the administrator to
1057     influence the order in which RPCs are handled at servers, on a per-PTLRPC
1058     service basis, by providing different policies that can be activated and
1059     tuned. The aim is to provide better performance and, with future
1060     policies, possibly discrete performance characteristics.</para>
1062     <para>The NRS policy state of a PTLRPC service can be read and set via the
1063     <literal>{service}.nrs_policies</literal> tunable. To read a PTLRPC
1064     service's NRS policy state, run:</para>
1065     <screen>
1066 lctl get_param {service}.nrs_policies
1067 </screen>
1068     <para>For example, to read the NRS policy state of the
1069     <literal>ost_io</literal> service, run:</para>
1070     <screen>
1071 $ lctl get_param ost.OSS.ost_io.nrs_policies
1072 ost.OSS.ost_io.nrs_policies=
1073
1074 regular_requests:
1075   - name: fifo
1076     state: started
1077     fallback: yes
1078     queued: 0
1079     active: 0
1080
1081   - name: crrn
1082     state: stopped
1083     fallback: no
1084     queued: 0
1085     active: 0
1086
1087   - name: orr
1088     state: stopped
1089     fallback: no
1090     queued: 0
1091     active: 0
1092
1093   - name: trr
1094     state: started
1095     fallback: no
1096     queued: 2420
1097     active: 268
1098
1099   - name: tbf
1100     state: stopped
1101     fallback: no
1102     queued: 0
1103     active: 0
1104
1105   - name: delay
1106     state: stopped
1107     fallback: no
1108     queued: 0
1109     active: 0
1110
1111 high_priority_requests:
1112   - name: fifo
1113     state: started
1114     fallback: yes
1115     queued: 0
1116     active: 0
1117
1118   - name: crrn
1119     state: stopped
1120     fallback: no
1121     queued: 0
1122     active: 0
1123
1124   - name: orr
1125     state: stopped
1126     fallback: no
1127     queued: 0
1128     active: 0
1129
1130   - name: trr
1131     state: stopped
1132     fallback: no
1133     queued: 0
1134     active: 0
1135
1136   - name: tbf
1137     state: stopped
1138     fallback: no
1139     queued: 0
1140     active: 0
1141
1142   - name: delay
1143     state: stopped
1144     fallback: no
1145     queued: 0
1146     active: 0
1147
1148 </screen>
1149     <para>NRS policy state is shown in either one or two sections, depending on
1150     the PTLRPC service being queried. The first section is named
1151     <literal>regular_requests</literal> and is available for all PTLRPC
1152     services, optionally followed by a second section which is named
1153     <literal>high_priority_requests</literal>. This is because some PTLRPC
1154     services can treat certain types of RPCs as high priority, so that the
1155     server handles them ahead of other, regular RPC traffic. For PTLRPC
1156     services that do not support high-priority
1157     RPCs, you will only see the
1158     <literal>regular_requests</literal> section.</para>
1159     <para>There is a separate instance of each NRS policy on each PTLRPC
1160     service for handling regular and high-priority RPCs (if the service
1161     supports high-priority RPCs). For each policy instance, the following
1162     fields are shown:</para>
1163     <informaltable frame="all">
1164       <tgroup cols="2">
1165         <colspec colname="c1" colwidth="50*" />
1166         <colspec colname="c2" colwidth="50*" />
1167         <thead>
1168           <row>
1169             <entry>
1170               <para>
1171                 <emphasis role="bold">Field</emphasis>
1172               </para>
1173             </entry>
1174             <entry>
1175               <para>
1176                 <emphasis role="bold">Description</emphasis>
1177               </para>
1178             </entry>
1179           </row>
1180         </thead>
1181         <tbody>
1182           <row>
1183             <entry>
1184               <para>
1185                 <literal>name</literal>
1186               </para>
1187             </entry>
1188             <entry>
1189               <para>The name of the policy.</para>
1190             </entry>
1191           </row>
1192           <row>
1193             <entry>
1194               <para>
1195                 <literal>state</literal>
1196               </para>
1197             </entry>
1198             <entry>
1199               <para>The state of the policy; this can be any of
1200               <literal>invalid, stopping, stopped, starting, started</literal>.
1201               A fully enabled policy is in the
1202               <literal>started</literal> state.</para>
1203             </entry>
1204           </row>
1205           <row>
1206             <entry>
1207               <para>
1208                 <literal>fallback</literal>
1209               </para>
1210             </entry>
1211             <entry>
1212               <para>Whether the policy is acting as a fallback policy or not. A
1213               fallback policy is used to handle RPCs that other enabled
1214               policies fail to handle, or do not support the handling of. The
1215               possible values are
1216               <literal>no, yes</literal>. Currently, only the FIFO policy can
1217               act as a fallback policy.</para>
1218             </entry>
1219           </row>
1220           <row>
1221             <entry>
1222               <para>
1223                 <literal>queued</literal>
1224               </para>
1225             </entry>
1226             <entry>
1227               <para>The number of RPCs that the policy has waiting to be
1228               serviced.</para>
1229             </entry>
1230           </row>
1231           <row>
1232             <entry>
1233               <para>
1234                 <literal>active</literal>
1235               </para>
1236             </entry>
1237             <entry>
1238               <para>The number of RPCs that the policy is currently
1239               handling.</para>
1240             </entry>
1241           </row>
1242         </tbody>
1243       </tgroup>
1244     </informaltable>
1245     <para>To enable an NRS policy on a PTLRPC service, run:</para>
1246     <screen>
1247 lctl set_param {service}.nrs_policies=<replaceable>policy_name</replaceable>
1249 </screen>
1250     <para>This will enable the policy
1251     <replaceable>policy_name</replaceable> for both regular and high-priority
1252     RPCs (if the PTLRPC service supports high-priority RPCs) on the given
1253     service. For example, to enable the CRR-N NRS policy for the ldlm_cbd
1254     service, run:</para>
1255     <screen>
1256 $ lctl set_param ldlm.services.ldlm_cbd.nrs_policies=crrn
1257 ldlm.services.ldlm_cbd.nrs_policies=crrn
1258
1259 </screen>
1260     <para>For PTLRPC services that support high-priority RPCs, you can also
1261     supply an optional
1262     <replaceable>reg|hp</replaceable> token, in order to enable an NRS policy
1263     for handling only regular or high-priority RPCs on a given PTLRPC service,
1264     by running:</para>
1265     <screen>
1266 lctl set_param {service}.nrs_policies="<replaceable>policy_name</replaceable> <replaceable>reg|hp</replaceable>"
1269 </screen>
1270     <para>For example, to enable the TRR policy for handling only regular, but
1271     not high-priority RPCs on the
1272     <literal>ost_io</literal> service, run:</para>
1273     <screen>
1274 $ lctl set_param ost.OSS.ost_io.nrs_policies="trr reg"
1275 ost.OSS.ost_io.nrs_policies="trr reg"
1276
1277 </screen>
1278     <note>
1279       <para>When enabling an NRS policy, the policy name must be given in
1280       lower-case characters, otherwise the operation will fail with an error
1281       message.</para>
1282     </note>
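    <para>Following the same syntax, a minimal sketch of enabling a policy
    for high-priority RPCs only, and then re-reading the policy state to
    confirm the change, might look like this (the choice of the
    <literal>crrn</literal> policy and the
    <literal>ost_io</literal> service is illustrative):</para>
    <screen>
$ lctl set_param ost.OSS.ost_io.nrs_policies="crrn hp"
$ lctl get_param ost.OSS.ost_io.nrs_policies
</screen>
    <para>After such a change, the
    <literal>crrn</literal> entry under
    <literal>high_priority_requests</literal> would be expected to report
    the
    <literal>started</literal> state, while the entries under
    <literal>regular_requests</literal> remain as they were.</para>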
1283     <section>
1284       <title>
1285       <indexterm>
1286         <primary>tuning</primary>
1287         <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1288         <tertiary>first in, first out (FIFO) policy</tertiary>
1289       </indexterm>First In, First Out (FIFO) policy</title>
1290       <para>The first in, first out (FIFO) policy handles RPCs in a service in
1291       the same order as they arrive from the LNet layer, so no special
1292       processing takes place to modify the RPC handling stream. FIFO is the
1293       default policy for all types of RPCs on all PTLRPC services, and is
1294       always enabled irrespective of the state of other policies, so that it
1295       can be used as a backup policy, in case a more elaborate policy that has
1296       been enabled fails to handle an RPC, or does not support handling a given
1297       type of RPC.</para>
1298       <para>The FIFO policy has no tunables that adjust its behavior.</para>
1299     </section>
1300     <section>
1301       <title>
1302       <indexterm>
1303         <primary>tuning</primary>
1304         <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1305         <tertiary>client round-robin over NIDs (CRR-N) policy</tertiary>
1306       </indexterm>Client Round-Robin over NIDs (CRR-N) policy</title>
1307       <para>The client round-robin over NIDs (CRR-N) policy performs batched
1308       round-robin scheduling of all types of RPCs, with each batch consisting
1309       of RPCs originating from the same client node, as identified by its NID.
1310       CRR-N aims to provide for better resource utilization across the cluster,
1311       and to help shorten completion times of jobs in some cases, by
1312       distributing available bandwidth more evenly across all clients.</para>
1313       <para>The CRR-N policy can be enabled on all types of PTLRPC services,
1314       and has the following tunable that can be used to adjust its
1315       behavior:</para>
1316       <itemizedlist>
1317         <listitem>
1318           <para>
1319             <literal>{service}.nrs_crrn_quantum</literal>
1320           </para>
1321           <para>The
1322           <literal>{service}.nrs_crrn_quantum</literal> tunable determines the
1323           maximum allowed size of each batch of RPCs, measured in number of
1324           RPCs. To read the maximum allowed batch size of a CRR-N
1325           policy, run:</para>
1326           <screen>
1327 lctl get_param {service}.nrs_crrn_quantum
1328 </screen>
1329           <para>For example, to read the maximum allowed batch size of a CRR-N
1330           policy on the ost_io service, run:</para>
1331           <screen>
1332 $ lctl get_param ost.OSS.ost_io.nrs_crrn_quantum
1333 ost.OSS.ost_io.nrs_crrn_quantum=reg_quantum:16
1334 hp_quantum:8
1335
1336 </screen>
1337           <para>You can see that there is a separate maximum allowed batch size
1338           value for regular
1339           (<literal>reg_quantum</literal>) and high-priority
1340           (<literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
1341           high-priority RPCs).</para>
1342           <para>To set the maximum allowed batch size of a CRR-N policy on a
1343           given service, run:</para>
1344           <screen>
1345 lctl set_param {service}.nrs_crrn_quantum=<replaceable>1-65535</replaceable>
1347 </screen>
1348           <para>This will set the maximum allowed batch size on a given
1349           service, for both regular and high-priority RPCs (if the PTLRPC
1350           service supports high-priority RPCs), to the indicated value.</para>
1351           <para>For example, to set the maximum allowed batch size on the
1352           ldlm_canceld service to 16 RPCs, run:</para>
1353           <screen>
1354 $ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
1355 ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
1356
1357 </screen>
1358           <para>For PTLRPC services that support high-priority RPCs, you can
1359           also specify a different maximum allowed batch size for regular and
1360           high-priority RPCs, by running:</para>
1361           <screen>
1362 $ lctl set_param {service}.nrs_crrn_quantum="<replaceable>reg_quantum|hp_quantum</replaceable>:<replaceable>1-65535</replaceable>"
1365 </screen>
1366           <para>For example, to set the maximum allowed batch size for
1367           high-priority RPCs on the ldlm_canceld service to 32, run:</para>
1368           <screen>
1369 $ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum="hp_quantum:32"
1370 ldlm.services.ldlm_canceld.nrs_crrn_quantum=hp_quantum:32
1371
1372 </screen>
1373           <para>By using the last method, you can also set the maximum regular
1374           and high-priority RPC batch sizes to different values, in a single
1375           command invocation.</para>
1376         </listitem>
1377       </itemizedlist>
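      <para>As a sketch of the single-invocation form mentioned above, both
      batch sizes might be set together by passing both tokens in one quoted
      value. The space-separated form and the values
      <literal>8</literal> and
      <literal>4</literal> are assumptions made for illustration; verify the
      result with
      <literal>lctl get_param</literal> before relying on it:</para>
      <screen>
$ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum="reg_quantum:8 hp_quantum:4"
$ lctl get_param ldlm.services.ldlm_canceld.nrs_crrn_quantum
</screen>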
1378     </section>
1379     <section>
1380       <title>
1381       <indexterm>
1382         <primary>tuning</primary>
1383         <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1384         <tertiary>object-based round-robin (ORR) policy</tertiary>
1385       </indexterm>Object-based Round-Robin (ORR) policy</title>
1386       <para>The object-based round-robin (ORR) policy performs batched
1387       round-robin scheduling of bulk read/write (brw) RPCs, with each batch
1388       consisting of RPCs that pertain to the same backend file system object,
1389       as identified by its OST FID.</para>
1390       <para>The ORR policy is only available for use on the ost_io service. The
1391       RPC batches it forms can potentially consist of mixed bulk read and bulk
1392       write RPCs. The RPCs in each batch are ordered in an ascending manner,
1393       based on either the file offsets, or the physical disk offsets of each
1394       RPC (only applicable to bulk read RPCs).</para>
1395       <para>The aim of the ORR policy is to provide for increased bulk read
1396       throughput in some cases, by ordering bulk read RPCs (and potentially
1397       bulk write RPCs), and thus minimizing costly disk seek operations.
1398       Performance may also benefit from any resulting improvement in resource
1399       utilization, or by taking advantage of better locality of reference
1400       between RPCs.</para>
1401       <para>The ORR policy has the following tunables that can be used to
1402       adjust its behavior:</para>
1403       <itemizedlist>
1404         <listitem>
1405           <para>
1406             <literal>ost.OSS.ost_io.nrs_orr_quantum</literal>
1407           </para>
1408           <para>The
1409           <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable determines
1410           the maximum allowed size of each batch of RPCs, measured in number
1411           of RPCs. To read the maximum allowed batch size of the
1412           ORR policy, run:</para>
1413           <screen>
1414 $ lctl get_param ost.OSS.ost_io.nrs_orr_quantum
1415 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:256
1416 hp_quantum:16
1417
1418 </screen>
1419           <para>You can see that there is a separate maximum allowed batch size
1420           value for regular
1421           (<literal>reg_quantum</literal>) and high-priority
1422           (<literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
1423           high-priority RPCs).</para>
1424           <para>To set the maximum allowed batch size for the ORR policy,
1425           run:</para>
1426           <screen>
1427 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=<replaceable>1-65535</replaceable>
1429 </screen>
1430           <para>This will set the maximum allowed batch size for both regular
1431           and high-priority RPCs, to the indicated value.</para>
1432           <para>You can also specify a different maximum allowed batch size for
1433           regular and high-priority RPCs, by running:</para>
1434           <screen>
1435 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=<replaceable>reg_quantum|hp_quantum</replaceable>:<replaceable>1-65535</replaceable>
1438 </screen>
1439           <para>For example, to set the maximum allowed batch size for regular
1440           RPCs to 128, run:</para>
1441           <screen>
1442 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
1443 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
1444
1445 </screen>
1446           <para>By using the last method, you can also set the maximum regular
1447           and high-priority RPC batch sizes to different values, in a single
1448           command invocation.</para>
1449         </listitem>
1450         <listitem>
1451           <para>
1452             <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal>
1453           </para>
1454           <para>The
1455           <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable
1456           determines whether the ORR policy orders RPCs within each batch based
1457           on logical file offsets or physical disk offsets. To read the offset
1458           type value for the ORR policy, run:</para>
1459           <screen>
1460 $ lctl get_param ost.OSS.ost_io.nrs_orr_offset_type
1461 ost.OSS.ost_io.nrs_orr_offset_type=reg_offset_type:physical
1462 hp_offset_type:logical
1463
1464 </screen>
1465           <para>You can see that there is a separate offset type value for
1466           regular (
1467           <literal>reg_offset_type</literal>) and high-priority (
1468           <literal>hp_offset_type</literal>) RPCs.</para>
1469           <para>To set the ordering type for the ORR policy, run:</para>
1470           <screen>
1471 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=<replaceable>physical|logical</replaceable>
1473 </screen>
1474           <para>This will set the offset type for both regular and
1475           high-priority RPCs, to the indicated value.</para>
1476           <para>You can also specify a different offset type for regular and
1477           high-priority RPCs, by running:</para>
1478           <screen>
1479 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=<replaceable>reg_offset_type|hp_offset_type</replaceable>:<replaceable>physical|logical</replaceable>
1482 </screen>
1483           <para>For example, to set the offset type for high-priority RPCs to
1484           physical disk offsets, run:</para>
1485           <screen>
1486 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
1487 ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
1488 </screen>
1489           <para>By using the last method, you can also set offset type for
1490           regular and high-priority RPCs to different values, in a single
1491           command invocation.</para>
1492           <note>
1493             <para>Irrespective of the value of this tunable, only logical
1494             offsets can be, and are, used for ordering bulk write RPCs.</para>
1495           </note>
1496         </listitem>
1497         <listitem>
1498           <para>
1499             <literal>ost.OSS.ost_io.nrs_orr_supported</literal>
1500           </para>
1501           <para>The
1502           <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable determines
1503           the type of RPCs that the ORR policy will handle. To read the types
1504           of RPCs supported by the ORR policy, run:</para>
1505           <screen>
1506 $ lctl get_param ost.OSS.ost_io.nrs_orr_supported
1507 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads
1508 hp_supported:reads_and_writes
1509
1510 </screen>
1511           <para>You can see that there is a separate supported 'RPC types'
1512           value for regular (
1513           <literal>reg_supported</literal>) and high-priority (
1514           <literal>hp_supported</literal>) RPCs.</para>
1515           <para>To set the supported RPC types for the ORR policy, run:</para>
1516           <screen>
1517 $ lctl set_param ost.OSS.ost_io.nrs_orr_supported=<replaceable>reads|writes|reads_and_writes</replaceable>
1519 </screen>
1520           <para>This will set the supported RPC types for both regular and
1521           high-priority RPCs, to the indicated value.</para>
1522           <para>You can also specify a different supported 'RPC types' value
1523           for regular and high-priority RPCs, by running:</para>
1524           <screen>
1525 $ lctl set_param ost.OSS.ost_io.nrs_orr_supported=<replaceable>reg_supported|hp_supported</replaceable>:<replaceable>reads|writes|reads_and_writes</replaceable>
1528 </screen>
1529           <para>For example, to set the supported RPC types to bulk read and
1530           bulk write RPCs for regular requests, run:</para>
1531           <screen>
1532 $ lctl set_param ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
1534 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
1535
1536 </screen>
1537           <para>By using the last method, you can also set the supported RPC
1538           types for regular and high-priority RPCs to different values, in a
1539           single command invocation.</para>
1540         </listitem>
1541       </itemizedlist>
1542     </section>
1543     <section>
1544       <title>
1545       <indexterm>
1546         <primary>tuning</primary>
1547         <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1548         <tertiary>Target-based round-robin (TRR) policy</tertiary>
1549       </indexterm>Target-based Round-Robin (TRR) policy</title>
1550       <para>The target-based round-robin (TRR) policy performs batched
1551       round-robin scheduling of brw RPCs, with each batch consisting of RPCs
1552       that pertain to the same OST, as identified by its OST index.</para>
1553       <para>The TRR policy is identical to the object-based round-robin (ORR)
1554       policy, apart from using the brw RPC's target OST index instead of the
1555       backend-fs object's OST FID, for determining the RPC scheduling order.
1556       The goals of TRR are effectively the same as for ORR, and it uses the
1557       following tunables to adjust its behavior:</para>
1558       <itemizedlist>
1559         <listitem>
1560           <para>
1561             <literal>ost.OSS.ost_io.nrs_trr_quantum</literal>
1562           </para>
1563           <para>The purpose of this tunable is exactly the same as for the
1564           <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable for the ORR
1565           policy, and you can use it in exactly the same way.</para>
1566         </listitem>
1567         <listitem>
1568           <para>
1569             <literal>ost.OSS.ost_io.nrs_trr_offset_type</literal>
1570           </para>
1571           <para>The purpose of this tunable is exactly the same as for the
1572           <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable for the
1573           ORR policy, and you can use it in exactly the same way.</para>
1574         </listitem>
1575         <listitem>
1576           <para>
1577             <literal>ost.OSS.ost_io.nrs_trr_supported</literal>
1578           </para>
1579           <para>The purpose of this tunable is exactly the same as for the
1580           <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable for the
1581           ORR policy, and you can use it in exactly the same way.</para>
1582         </listitem>
1583       </itemizedlist>
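      <para>Because the TRR tunables are used in the same way as their ORR
      counterparts, the ORR examples carry over by substituting
      <literal>trr</literal> for <literal>orr</literal> in the parameter
      name. A brief sketch, assuming the TRR policy is enabled on the
      <literal>ost_io</literal> service and reusing the illustrative values
      from the ORR examples:</para>
      <screen>
$ lctl get_param ost.OSS.ost_io.nrs_trr_quantum
$ lctl set_param ost.OSS.ost_io.nrs_trr_quantum=reg_quantum:128
$ lctl set_param ost.OSS.ost_io.nrs_trr_offset_type=hp_offset_type:physical
$ lctl set_param ost.OSS.ost_io.nrs_trr_supported=reg_supported:reads_and_writes
</screen>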
1584     </section>
1585     <section xml:id="dbdoclet.tbftuning" condition='l26'>
1586       <title>
1587       <indexterm>
1588         <primary>tuning</primary>
1589         <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1590         <tertiary>Token Bucket Filter (TBF) policy</tertiary>
1591       </indexterm>Token Bucket Filter (TBF) policy</title>
1592       <para>The TBF (Token Bucket Filter) is a Lustre NRS policy which enables
1593       Lustre services to enforce the RPC rate limit on clients/jobs for QoS
1594       (Quality of Service) purposes.</para>
1595       <figure>
1596         <title>The internal structure of TBF policy</title>
1597         <mediaobject>
1598           <imageobject>
1599             <imagedata scalefit="1" width="100%"
1600             fileref="figures/TBF_policy.svg" />
1601           </imageobject>
1602           <textobject>
1603             <phrase>The internal structure of TBF policy</phrase>
1604           </textobject>
1605         </mediaobject>
1606       </figure>
1607       <para>When an RPC request arrives, the TBF policy places it in a
1608       waiting queue according to its classification. RPC requests are
1609       classified by either the NID or the JobID of the RPC, depending on how
1610       TBF is configured. The TBF policy maintains multiple queues in the
1611       system, one for each category in the classification of RPC requests.
1612       Requests wait for tokens in their FIFO queue before they are handled,
1613       which keeps the RPC rates under the limits.</para>
1614       <para>When Lustre services are too busy to handle all of the requests in
1615       time, not all of the specified queue rates can be satisfied. The only
1616       consequence is that some RPC rates are lower than configured. In this
1617       case, queues with higher rates have an advantage over queues with
1618       lower rates, but none of them will be
1619       starved.</para>
1620       <para>The RPC rate of each queue does not need to be set manually.
1621       Instead, you define rules that the TBF policy matches to determine RPC
1622       rate limits. All of the defined rules are organized as an ordered
1623       list. Whenever a queue is newly created, it goes through the rule
1624       list and takes the first matching rule as its rule, so that the queue
1625       knows its RPC token rate. A rule can be added to or removed from the list
1626       at run time. Whenever the list of rules is changed, the queues
1627       update their matched rules.</para>
1628       <section remap="h4">
1629         <title>Enable TBF policy</title>
1630         <para>Command:</para>
1631         <screen>lctl set_param ost.OSS.ost_io.nrs_policies="tbf &lt;<replaceable>policy</replaceable>&gt;"
1632         </screen>
1633         <para>Currently, RPCs can be classified into different types
1634         according to their NID, JobID, opcode, and UID/GID. When enabling the
1635         TBF policy, you can specify one of these types, or just use "tbf" to
1636         enable all of them for fine-grained classification of RPC requests.</para>
1637         <para>Example:</para>
1638         <screen>$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf"
1639 $ lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
1640 $ lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
1641 $ lctl set_param ost.OSS.ost_io.nrs_policies="tbf opcode"
1642 $ lctl set_param ost.OSS.ost_io.nrs_policies="tbf uid"
1643 $ lctl set_param ost.OSS.ost_io.nrs_policies="tbf gid"</screen>
1644       </section>
1645       <section remap="h4">
1646         <title>Start a TBF rule</title>
1647         <para>The TBF rule is defined in the parameter
1648         <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>.</para>
1649         <para>Command:</para>
1650         <screen>lctl set_param x.x.x.nrs_tbf_rule=\
1651 "[reg|hp] start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>..."
1652         </screen>
1653         <para>'<replaceable>rule_name</replaceable>' is a string of the TBF
1654         policy rule's name and '<replaceable>arguments</replaceable>' is a
1655         string to specify the detailed rule according to the different types.
1656         </para>
1657         <itemizedlist>
1658         <para>Next, the different types of TBF policies will be described.</para>
1659           <listitem>
1660             <para><emphasis role="bold">NID based TBF policy</emphasis></para>
1661             <para>Command:</para>
1662             <screen>lctl set_param x.x.x.nrs_tbf_rule=\
1663 "[reg|hp] start <replaceable>rule_name</replaceable> nid={<replaceable>nidlist</replaceable>} rate=<replaceable>rate</replaceable>"
1664             </screen>
1665             <para>'<replaceable>nidlist</replaceable>' uses the same format
1666             as that used to configure LNet routes. '<replaceable>rate</replaceable>'
1667             is the upper limit of the RPC rate for the rule.</para>
1668             <para>Example:</para>
1669             <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1670 "start other_clients nid={192.168.*.*@tcp} rate=50"
1671 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1672 "start computes nid={192.168.1.[2-128]@tcp} rate=500"
1673 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1674 "start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
1675             <para>In this example, RPC requests from compute nodes are
1676             processed at up to five times the rate of those from login nodes.
1677             The output of <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>
1678             looks like:</para>
1679             <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
1680 ost.OSS.ost_io.nrs_tbf_rule=
1681 regular_requests:
1682 CPT 0:
1683 loginnode {192.168.1.1@tcp} 100, ref 0
1684 computes {192.168.1.[2-128]@tcp} 500, ref 0
1685 other_clients {192.168.*.*@tcp} 50, ref 0
1686 default {*} 10000, ref 0
1687 high_priority_requests:
1688 CPT 0:
1689 loginnode {192.168.1.1@tcp} 100, ref 0
1690 computes {192.168.1.[2-128]@tcp} 500, ref 0
1691 other_clients {192.168.*.*@tcp} 50, ref 0
1692 default {*} 10000, ref 0</screen>
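            <para>As an illustration of how the ordered list above is
            applied, the following sketch shows which rule a few client NIDs
            would take, assuming that each queue takes the first rule that
            matches when the list is read from top to bottom. The NIDs are
            hypothetical and chosen only for the example:</para>
            <screen>
# client NID           first matching rule    rate
# 192.168.1.1@tcp      loginnode              100
# 192.168.1.50@tcp     computes               500
# 192.168.7.9@tcp      other_clients          50
# 10.0.0.1@tcp         default                10000
</screen>
            <para>Note that <literal>other_clients</literal>, the broadest
            pattern, was started first and therefore sits below the more
            specific rules in the list.</para>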
1693             <para>Also, the rule can be written in <literal>reg</literal> and
1694             <literal>hp</literal> formats:</para>
1695             <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1696 "reg start loginnode nid={192.168.1.1@tcp} rate=100"
1697 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1698 "hp start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
1699           </listitem>
1700           <listitem>
1701             <para><emphasis role="bold">JobID based TBF policy</emphasis></para>
1702             <para>For the JobID, please see
1703             <xref xmlns:xlink="http://www.w3.org/1999/xlink"
1704             linkend="dbdoclet.jobstats" /> for more details.</para>
1705             <para>Command:</para>
1706             <screen>lctl set_param x.x.x.nrs_tbf_rule=\
1707 "[reg|hp] start <replaceable>rule_name</replaceable> jobid={<replaceable>jobid_list</replaceable>} rate=<replaceable>rate</replaceable>"
1708             </screen>
1709             <para>Wildcard is supported in
1710             {<replaceable>jobid_list</replaceable>}.</para>
1711             <para>Example:</para>
1712             <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1713 "start iozone_user jobid={iozone.500} rate=100"
1714 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1715 "start dd_user jobid={dd.*} rate=50"
1716 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1717 "start user1 jobid={*.600} rate=10"
1718 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1719 "start user2 jobid={io*.10* *.500} rate=200"</screen>
1720             <para>Also, the rule can be written in <literal>reg</literal> and
1721             <literal>hp</literal> formats:</para>
1722             <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1723 "hp start iozone_user1 jobid={iozone.500} rate=100"
1724 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1725 "reg start iozone_user1 jobid={iozone.500} rate=100"</screen>
1726           </listitem>
1727           <listitem>
1728             <para><emphasis role="bold">Opcode based TBF policy</emphasis></para>
1729             <para>Command:</para>
1730             <screen>$ lctl set_param x.x.x.nrs_tbf_rule=\
1731 "[reg|hp] start <replaceable>rule_name</replaceable> opcode={<replaceable>opcode_list</replaceable>} rate=<replaceable>rate</replaceable>"
1732             </screen>
1733             <para>Example:</para>
1734             <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1735 "start user1 opcode={ost_read} rate=100"
1736 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1737 "start iozone_user1 opcode={ost_read ost_write} rate=200"</screen>
1738             <para>Also, the rule can be written in <literal>reg</literal> and
1739             <literal>hp</literal> formats:</para>
1740             <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1741 "hp start iozone_user1 opcode={ost_read} rate=100"
1742 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1743 "reg start iozone_user1 opcode={ost_read} rate=100"</screen>
1744           </listitem>
1745           <listitem>
1746             <para><emphasis role="bold">UID/GID based TBF policy</emphasis></para>
1747             <para>Command:</para>
1748             <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
1749 "[reg|hp] start <replaceable>rule_name</replaceable> uid={<replaceable>uid</replaceable>} rate=<replaceable>rate</replaceable>"
1750 $ lctl set_param ost.OSS.*.nrs_tbf_rule=\
1751 "[reg|hp] start <replaceable>rule_name</replaceable> gid={<replaceable>gid</replaceable>} rate=<replaceable>rate</replaceable>"</screen>
1752             <para>Example:</para>
1753             <para>Limit the rate of RPC requests of the uid 500:</para>
1754             <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
1755 "start tbf_name uid={500} rate=100"</screen>
1756             <para>Limit the rate of RPC requests of the gid 500:</para>
1757             <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
1758 "start tbf_name gid={500} rate=100"</screen>
1759             <para>You can also use the following rules to control all
1760             requests to the MDS:</para>
1761             <para>Start the tbf uid QoS on MDS:</para>
1762             <screen>$ lctl set_param mds.MDS.*.nrs_policies="tbf uid"</screen>
1763             <para>Limit the rate of RPC requests of the uid 500:</para>
1764             <screen>$ lctl set_param mds.MDS.*.nrs_tbf_rule=\
1765 "start tbf_name uid={500} rate=100"</screen>
1766           </listitem>
1767           <listitem>
1768             <para><emphasis role="bold">Policy combination</emphasis></para>
1769             <para>To support TBF rules with complex condition expressions,
1770             the TBF classifier is extended to classify RPCs in a more
1771             fine-grained way. This feature supports logical conjunction and
1772             disjunction of conditions of different types.
1773             In a rule,
1774             "&amp;" represents conjunction (logical AND) and
1775             "," represents disjunction (logical OR).</para>
1776             <para>Example:</para>
1777             <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1778 "start comp_rule opcode={ost_write}&amp;jobid={dd.0},\
1779 nid={192.168.1.[1-128]@tcp 0@lo} rate=100"</screen>
1780             <para>In this example, those RPCs whose <literal>opcode</literal> is
1781             ost_write and <literal>jobid</literal> is dd.0, or
1782             <literal>nid</literal> satisfies the condition of
1783             {192.168.1.[1-128]@tcp 0@lo} will be processed at the rate of 100
1784             req/sec.
1785             The output of <literal>ost.OSS.ost_io.nrs_tbf_rule</literal> looks like:
1786             </para>
1787             <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
1788 ost.OSS.ost_io.nrs_tbf_rule=
1789 regular_requests:
1790 CPT 0:
1791 comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
1792 default * 10000, ref 0
1793 CPT 1:
1794 comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
1795 default * 10000, ref 0
1796 high_priority_requests:
1797 CPT 0:
1798 comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
1799 default * 10000, ref 0
1800 CPT 1:
1801 comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
1802 default * 10000, ref 0</screen>
1803             <para>Example:</para>
1804             <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
1805 "start tbf_name uid={500}&amp;gid={500} rate=100"</screen>
1806             <para>In this example, those RPC requests whose uid is 500 and
1807             gid is 500 will be processed at the rate of 100 req/sec.</para>
1808           </listitem>
1809         </itemizedlist>
1810       </section>
1811       <section remap="h4">
1812           <title>Change a TBF rule</title>
1813           <para>Command:</para>
1814           <screen>lctl set_param x.x.x.nrs_tbf_rule=\
1815 "[reg|hp] change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable>"
1816           </screen>
1817           <para>Example:</para>
1818           <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1819 "change loginnode rate=200"
1820 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1821 "reg change loginnode rate=200"
1822 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1823 "hp change loginnode rate=200"
1824 </screen>
1825       </section>
1826       <section remap="h4">
1827           <title>Stop a TBF rule</title>
1828           <para>Command:</para>
1829           <screen>lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop <replaceable>rule_name</replaceable>"</screen>
1831           <para>Example:</para>
1832           <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
1833 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode"
1834 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"</screen>
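          <para>Taken together, the start, change, and stop commands give a
          rule a simple life cycle. The following sketch only strings
          together commands already shown in this section; the rule name
          <literal>loginnode</literal>, the NID, and the rates are the
          illustrative values used earlier:</para>
          <screen>
# Enable NID-based TBF classification on the ost_io service
$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
# Start a rule that limits the login node to a rate of 100
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"start loginnode nid={192.168.1.1@tcp} rate=100"
# Raise the limit, then inspect the rule list
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"change loginnode rate=200"
$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
# Remove the rule again
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
</screen>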
1835       </section>
1836       <section remap="h4">
1837         <title>Rule options</title>
1838         <para>To support more flexible rule conditions, the following options
1839         are added.</para>
1840         <itemizedlist>
1841           <listitem>
1842             <para><emphasis role="bold">Reordering of TBF rules</emphasis></para>
1843             <para>By default, a newly started rule is placed ahead of the
1844             existing rules. You can change its position by specifying the
1845             '<literal>rank=</literal>' argument when inserting a new rule with
1846             the "<literal>start</literal>" command. The rank of an existing rule
1847             can also be changed with the "<literal>change</literal>" command.
1848             </para>
1849             <para>Command:</para>
1850             <screen>lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1851 "start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... rank=<replaceable>obj_rule_name</replaceable>"
1852 lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1853 "change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable> rank=<replaceable>obj_rule_name</replaceable>"
1854 </screen>
1855             <para>By specifying the existing rule
1856             '<replaceable>obj_rule_name</replaceable>', the new rule
1857             '<replaceable>rule_name</replaceable>' is placed in front of
1858             '<replaceable>obj_rule_name</replaceable>'.</para>
1859             <para>Example:</para>
1860             <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1861 "start computes nid={192.168.1.[2-128]@tcp} rate=500"
1862 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1863 "start user1 jobid={iozone.500 dd.500} rate=100"
1864 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1865 "start iozone_user1 opcode={ost_read ost_write} rate=200 rank=computes"</screen>
1866             <para>In this example, rule "iozone_user1" is inserted in front of
1867             rule "computes". The resulting order can be displayed with the following command:
1868             </para>
1869             <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
1870 ost.OSS.ost_io.nrs_tbf_rule=
1871 regular_requests:
1872 CPT 0:
1873 user1 jobid={iozone.500 dd.500} 100, ref 0
1874 iozone_user1 opcode={ost_read ost_write} 200, ref 0
1875 computes nid={192.168.1.[2-128]@tcp} 500, ref 0
1876 default * 10000, ref 0
1877 CPT 1:
1878 user1 jobid={iozone.500 dd.500} 100, ref 0
1879 iozone_user1 opcode={ost_read ost_write} 200, ref 0
1880 computes nid={192.168.1.[2-128]@tcp} 500, ref 0
1881 default * 10000, ref 0
1882 high_priority_requests:
1883 CPT 0:
1884 user1 jobid={iozone.500 dd.500} 100, ref 0
1885 iozone_user1 opcode={ost_read ost_write} 200, ref 0
1886 computes nid={192.168.1.[2-128]@tcp} 500, ref 0
1887 default * 10000, ref 0
1888 CPT 1:
1889 user1 jobid={iozone.500 dd.500} 100, ref 0
1890 iozone_user1 opcode={ost_read ost_write} 200, ref 0
1891 computes nid={192.168.1.[2-128]@tcp} 500, ref 0
1892 default * 10000, ref 0</screen>
1893           </listitem>
1894           <listitem>
1895             <para><emphasis role="bold">TBF realtime policies under congestion
1896             </emphasis></para>
1897             <para>Evaluation of TBF showed that when the sum of the I/O
1898             bandwidth requirements of all classes exceeds the system capacity,
1899             classes with the same rate limit get less bandwidth than they would
1900             if the bandwidth were preconfigured evenly. The reason is that the
1901             heavy load on a congested server causes some classes to miss
1902             deadlines, so the number of calculated tokens may be larger than 1
1903             during dequeuing. In the original implementation, all classes are
1904             handled equally and the excess tokens are simply discarded.</para>
1905             <para>Thus, a Hard Token Compensation (HTC) strategy has been
1906             implemented. A class can be configured with the HTC feature by the
1907             rule it matches. This feature means that requests in this kind of
1908             class queues have high real-time requirements and that the bandwidth
1909             assignment must be satisfied as well as possible. When deadline
1910             misses happen, the class keeps the deadline unchanged and the time
1911             residue (the remainder of elapsed time divided by 1/r) is compensated
1912             to the next round. This ensures that the next idle I/O thread will
1913             always select this class to serve until all accumulated exceeding
1914             tokens are handled or there are no pending requests in the class
1915             queue.</para>
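            <para>To make the time residue concrete, the following shell
            arithmetic works through one case. It is only an illustration of
            the quantities named above, not the server implementation; the
            elapsed time of 35 ms is a made-up value and the variable names
            are hypothetical:</para>
            <screen>
rate=100                                   # r: tokens per second
interval_us=$((1000000 / rate))            # 1/r = 10000 microseconds
elapsed_us=35000                           # hypothetical elapsed time
tokens=$((elapsed_us / interval_us))       # whole tokens accumulated
residue_us=$((elapsed_us % interval_us))   # time residue for the next round
echo "tokens=$tokens residue_us=$residue_us"
</screen>
            <para>Here three tokens have accumulated rather than one, and
            the residue of 5000 microseconds is what HTC compensates to the
            next round.</para>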
1916             <para>Command:</para>
1917             <para>A new command format is added to enable the realtime feature
1918             for a rule:</para>
1919             <screen>lctl set_param x.x.x.nrs_tbf_rule=\
1920 "start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... realtime=1"</screen>
1921             <para>Example:</para>
1922             <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1923 "start realjob jobid={dd.0} rate=100 realtime=1"</screen>
1924             <para>This example rule means that RPC requests whose JobID is
1925             dd.0 will be processed at the rate of 100 req/sec in realtime.</para>
1926           </listitem>
1927         </itemizedlist>
1928       </section>
1929     </section>
1930     <section xml:id="dbdoclet.delaytuning" condition='l2A'>
1931       <title>
1932       <indexterm>
1933         <primary>tuning</primary>
1934         <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1935         <tertiary>Delay policy</tertiary>
1936       </indexterm>Delay policy</title>
1937       <para>The NRS Delay policy seeks to perturb the timing of request
1938       processing at the PtlRPC layer, with the goal of simulating high server
1939       load, and finding and exposing timing related problems. When this policy
1940       is active, upon arrival of a request the policy will calculate an offset,
1941       within a defined, user-configurable range, from the request arrival
1942       time, to determine a time after which the request should be handled.
1943       The request is then stored using the cfs_binheap implementation,
1944       which sorts the request according to the assigned start time.
1945       Requests are removed from the binheap for handling once their start
1946       time has been passed.</para>
1947       <para>The Delay policy can be enabled on all types of PtlRPC services,
1948       and has the following tunables that can be used to adjust its behavior:
1949       </para>
1950       <itemizedlist>
1951         <listitem>
1952           <para>
1953             <literal>{service}.nrs_delay_min</literal>
1954           </para>
1955           <para>The
1956           <literal>{service}.nrs_delay_min</literal> tunable controls the
1957           minimum amount of time, in seconds, that a request will be delayed by
1958           this policy.  The default is 5 seconds. To read this value, run:</para>
1959           <screen>
1960 lctl get_param {service}.nrs_delay_min</screen>
1961           <para>For example, to read the minimum delay set on the ost_io
1962           service, run:</para>
1963           <screen>
1964 $ lctl get_param ost.OSS.ost_io.nrs_delay_min
1965 ost.OSS.ost_io.nrs_delay_min=reg_delay_min:5
1966 hp_delay_min:5</screen>
1967         <para>To set the minimum delay in RPC processing, run:</para>
1968         <screen>
1969 lctl set_param {service}.nrs_delay_min=<replaceable>0-65535</replaceable></screen>
1970         <para>This will set the minimum delay time on a given service, for both
1971         regular and high-priority RPCs (if the PtlRPC service supports
1972         high-priority RPCs), to the indicated value.</para>
1973         <para>For example, to set the minimum delay time on the ost_io service
1974         to 10, run:</para>
1975         <screen>
1976 $ lctl set_param ost.OSS.ost_io.nrs_delay_min=10
1977 ost.OSS.ost_io.nrs_delay_min=10</screen>
1978         <para>For PtlRPC services that support high-priority RPCs, to set a
1979         different minimum delay time for regular and high-priority RPCs, run:
1980         </para>
1981         <screen>
1982 lctl set_param {service}.nrs_delay_min=<replaceable>reg_delay_min|hp_delay_min</replaceable>:<replaceable>0-65535</replaceable>
1983         </screen>
1984         <para>For example, to set the minimum delay time on the ost_io service
1985         for high-priority RPCs to 3, run:</para>
1986         <screen>
1987 $ lctl set_param ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3
1988 ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3</screen>
1989         <para>Note that in all cases, the minimum delay time cannot exceed the
1990         maximum delay time.</para>
1991         </listitem>
1992         <listitem>
1993           <para>
1994             <literal>{service}.nrs_delay_max</literal>
1995           </para>
1996           <para>The
1997           <literal>{service}.nrs_delay_max</literal> tunable controls the
1998           maximum amount of time, in seconds, that a request will be delayed by
1999           this policy.  The default is 300 seconds. To read this value, run:
2000           </para>
2001           <screen>lctl get_param {service}.nrs_delay_max</screen>
2002           <para>For example, to read the maximum delay set on the ost_io
2003           service, run:</para>
2004           <screen>
2005 $ lctl get_param ost.OSS.ost_io.nrs_delay_max
2006 ost.OSS.ost_io.nrs_delay_max=reg_delay_max:300
2007 hp_delay_max:300</screen>
2008         <para>To set the maximum delay in RPC processing, run:</para>
2009         <screen>lctl set_param {service}.nrs_delay_max=<replaceable>0-65535</replaceable>
2010 </screen>
2011         <para>This will set the maximum delay time on a given service, for both
2012         regular and high-priority RPCs (if the PtlRPC service supports
2013         high-priority RPCs), to the indicated value.</para>
2014         <para>For example, to set the maximum delay time on the ost_io service
2015         to 60, run:</para>
2016         <screen>
2017 $ lctl set_param ost.OSS.ost_io.nrs_delay_max=60
2018 ost.OSS.ost_io.nrs_delay_max=60</screen>
2019         <para>For PtlRPC services that support high-priority RPCs, to set a
2020         different maximum delay time for regular and high-priority RPCs, run:
2021         </para>
2022         <screen>lctl set_param {service}.nrs_delay_max=<replaceable>reg_delay_max|hp_delay_max</replaceable>:<replaceable>0-65535</replaceable></screen>
2023         <para>For example, to set the maximum delay time on the ost_io service
2024         for high-priority RPCs to 30, run:</para>
2025         <screen>
2026 $ lctl set_param ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30
2027 ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30</screen>
        <para>Note that, in all cases, the maximum delay time cannot be less
        than the minimum delay time.</para>
2030         </listitem>
2031         <listitem>
2032           <para>
2033             <literal>{service}.nrs_delay_pct</literal>
2034           </para>
          <para>The
          <literal>{service}.nrs_delay_pct</literal> tunable controls the
          percentage of requests that will be delayed by this policy. The
          default is 100. A request that is not selected for handling by the
          delay policy because of this setting is handled by whatever fallback
          policy is defined for that service. If no other fallback policy is
          defined, the request is handled by the FIFO policy. To read this
          value, run:</para>
2043           <screen>lctl get_param {service}.nrs_delay_pct</screen>
2044           <para>For example, to read the percentage of requests being delayed on
2045           the ost_io service, run:</para>
2046           <screen>
2047 $ lctl get_param ost.OSS.ost_io.nrs_delay_pct
2048 ost.OSS.ost_io.nrs_delay_pct=reg_delay_pct:100
2049 hp_delay_pct:100</screen>
2050         <para>To set the percentage of delayed requests, run:</para>
2051         <screen>
2052 lctl set_param {service}.nrs_delay_pct=<replaceable>0-100</replaceable></screen>
2053         <para>This will set the percentage of requests delayed on a given
2054         service, for both regular and high-priority RPCs (if the PtlRPC service
2055         supports high-priority RPCs), to the indicated value.</para>
2056         <para>For example, to set the percentage of delayed requests on the
2057         ost_io service to 50, run:</para>
2058         <screen>
2059 $ lctl set_param ost.OSS.ost_io.nrs_delay_pct=50
2060 ost.OSS.ost_io.nrs_delay_pct=50
2061 </screen>
2062         <para>For PtlRPC services that support high-priority RPCs, to set a
2063         different delay percentage for regular and high-priority RPCs, run:
2064         </para>
2065         <screen>lctl set_param {service}.nrs_delay_pct=<replaceable>reg_delay_pct|hp_delay_pct</replaceable>:<replaceable>0-100</replaceable>
2066 </screen>
2067         <para>For example, to set the percentage of delayed requests on the
2068         ost_io service for high-priority RPCs to 5, run:</para>
2069         <screen>$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
2070 ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
2071 </screen>
2072         </listitem>
2073       </itemizedlist>
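      <para>The constraint that the maximum delay time cannot be less than the
      minimum delay time can be checked before a new value is applied. The
      following is a minimal sketch only: the
      <literal>check_delay_range</literal> helper is hypothetical and is not
      part of <literal>lctl</literal>, it assumes that both values use the
      0-65535 range shown above, and the commented-out
      <literal>lctl set_param</literal> line assumes an
      <literal>ost_io</literal> service on a running OSS.</para>
      <screen>
# check_delay_range MIN MAX (hypothetical helper)
# Succeeds only if both values are integers in 0-65535 and MAX is not
# less than MIN.
check_delay_range() {
    for v in "$1" "$2"; do
        case "$v" in
            ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
        esac
        [ "$v" -le 65535 ] || return 1
    done
    [ "$2" -ge "$1" ]
}

if check_delay_range 10 60; then
    echo "range accepted: minimum 10, maximum 60"
    # On an OSS, the maximum could then be applied, for example:
    # lctl set_param ost.OSS.ost_io.nrs_delay_max=60
fi</screen>
      <para>Validating the pair first avoids issuing a
      <literal>set_param</literal> call with a maximum that is below the
      minimum currently in effect.</para>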
2074     </section>
2075   </section>
2076   <section xml:id="dbdoclet.50438272_25884">
2077     <title>
2078     <indexterm>
2079       <primary>tuning</primary>
2080       <secondary>lockless I/O</secondary>
2081     </indexterm>Lockless I/O Tunables</title>
2082     <para>The lockless I/O tunable feature allows servers to ask clients to do
2083     lockless I/O (the server does the locking on behalf of clients) for
2084     contended files to avoid lock ping-pong.</para>
2085     <para>The lockless I/O patch introduces these tunables:</para>
2086     <itemizedlist>
2087       <listitem>
2088         <para>
2089           <emphasis role="bold">OST-side:</emphasis>
2090         </para>
2091         <screen>
2092 ldlm.namespaces.filter-<replaceable>fsname</replaceable>-*.
2093 </screen>
        <para>
        <literal>contended_locks</literal> - If the number of lock conflicts
        found while scanning the granted and waiting queues exceeds
        <literal>contended_locks</literal>, the resource is considered to be
        contended.</para>
        <para>
        <literal>contention_seconds</literal> - The resource remains in the
        contended state for the time set in this parameter.</para>
        <para>
        <literal>max_nolock_bytes</literal> - Server-side locking is used only
        for requests smaller than the size set in the
        <literal>max_nolock_bytes</literal> parameter. Setting this tunable to
        zero (0) disables server-side locking for read/write requests.</para>
2107       </listitem>
2108       <listitem>
2109         <para>
2110           <emphasis role="bold">Client-side:</emphasis>
2111         </para>
2112         <screen>
2113 /proc/fs/lustre/llite/lustre-*
2114 </screen>
        <para>
        <literal>contention_seconds</literal> - The
        <literal>llite</literal> inode remembers its contended state for the
        time specified in this parameter.</para>
2119       </listitem>
2120       <listitem>
2121         <para>
2122           <emphasis role="bold">Client-side statistics:</emphasis>
2123         </para>
2124         <para>The
2125         <literal>/proc/fs/lustre/llite/lustre-*/stats</literal> file has new
2126         rows for lockless I/O statistics.</para>
        <para>
        <literal>lockless_read_bytes</literal> and
        <literal>lockless_write_bytes</literal> - Count the total bytes read
        or written without the client acquiring locks. The client makes its
        own decision based on the request size, and does not communicate with
        the server if the request size is smaller than
        <literal>min_nolock_size</literal>.</para>
2135       </listitem>
2136     </itemizedlist>
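    <para>As an illustration, the tunables listed above could be inspected
    with <literal>lctl get_param</literal>. This is a sketch under
    assumptions: the parameter names are taken directly from the paths above,
    the wildcards stand in for the file system name, and a given Lustre
    release may not expose all of these parameters.</para>
    <screen>
# On an OSS: show the contention thresholds for every OST namespace.
oss# lctl get_param ldlm.namespaces.filter-*.contended_locks
oss# lctl get_param ldlm.namespaces.filter-*.contention_seconds
oss# lctl get_param ldlm.namespaces.filter-*.max_nolock_bytes

# On a client: show the contention timeout and the lockless I/O counters.
client$ lctl get_param llite.*.contention_seconds
client$ lctl get_param llite.*.stats | grep lockless</screen>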
2137   </section>
2138   <section condition="l29">
2139       <title>
2140         <indexterm>
2141           <primary>tuning</primary>
2142           <secondary>with lfs ladvise</secondary>
2143         </indexterm>
2144         Server-Side Advice and Hinting
2145       </title>
2146       <section><title>Overview</title>
      <para>Use the <literal>lfs ladvise</literal> command to give file access
      advice or hints to servers.</para>
      <screen>lfs ladvise [--advice|-a ADVICE ] [--background|-b]
[--start|-s START[kMGT]]
{[--end|-e END[kMGT]] | [--length|-l LENGTH[kMGT]]}
[--mode|-m MODE]
<emphasis>file</emphasis> ...
      </screen>
2154       <para>
2155         <informaltable frame="all">
2156           <tgroup cols="2">
2157           <colspec colname="c1" colwidth="50*"/>
2158           <colspec colname="c2" colwidth="50*"/>
2159           <thead>
2160             <row>
2161               <entry>
2162                 <para><emphasis role="bold">Option</emphasis></para>
2163               </entry>
2164               <entry>
2165                 <para><emphasis role="bold">Description</emphasis></para>
2166               </entry>
2167             </row>
2168           </thead>
2169           <tbody>
2170             <row>
2171               <entry>
2172                 <para><literal>-a</literal>, <literal>--advice=</literal>
2173                 <literal>ADVICE</literal></para>
2174               </entry>
2175               <entry>
2176                 <para>Give advice or hint of type <literal>ADVICE</literal>.
2177                 Advice types are:</para>
2178                 <para><literal>willread</literal> to prefetch data into server
2179                 cache</para>
2180                 <para><literal>dontneed</literal> to cleanup data cache on
2181                 server</para>
2182                 <para><literal>lockahead</literal> Request an LDLM extent lock
2183                 of the given mode on the given byte range </para>
2184                 <para><literal>noexpand</literal> Disable extent lock expansion
2185                 behavior for I/O to this file descriptor</para>
2186               </entry>
2187             </row>
2188             <row>
2189               <entry>
2190                 <para><literal>-b</literal>, <literal>--background</literal>
2191                 </para>
2192               </entry>
2193               <entry>
                <para>Enable the advice to be sent and handled asynchronously.
                </para>
2196               </entry>
2197             </row>
2198             <row>
2199               <entry>
2200                 <para><literal>-s</literal>, <literal>--start=</literal>
2201                         <literal>START_OFFSET</literal></para>
2202               </entry>
2203               <entry>
2204                 <para>File range starts from <literal>START_OFFSET</literal>
2205                 </para>
2206                 </entry>
2207             </row>
2208             <row>
2209                 <entry>
2210                     <para><literal>-e</literal>, <literal>--end=</literal>
2211                         <literal>END_OFFSET</literal></para>
2212                 </entry>
2213                 <entry>
2214                     <para>File range ends at (not including)
2215                     <literal>END_OFFSET</literal>.  This option may not be
2216                     specified at the same time as the <literal>-l</literal>
2217                     option.</para>
2218                 </entry>
2219             </row>
2220             <row>
2221                 <entry>
2222                     <para><literal>-l</literal>, <literal>--length=</literal>
2223                         <literal>LENGTH</literal></para>
2224                 </entry>
2225                 <entry>
2226                   <para>File range has length of <literal>LENGTH</literal>.
2227                   This option may not be specified at the same time as the
2228                   <literal>-e</literal> option.</para>
2229                 </entry>
2230             </row>
2231             <row>
2232                 <entry>
2233                     <para><literal>-m</literal>, <literal>--mode=</literal>
2234                         <literal>MODE</literal></para>
2235                 </entry>
2236                 <entry>
2237                   <para>Lockahead request mode <literal>{READ,WRITE}</literal>.
2238                   Request a lock with this mode.</para>
2239                 </entry>
2240             </row>
2241           </tbody>
2242           </tgroup>
2243         </informaltable>
2244       </para>
      <para>Typically, <literal>lfs ladvise</literal> forwards the advice to
      Lustre servers without guaranteeing when and how the servers will react
      to the advice. Actions may or may not be triggered when the advice is
      received, depending on the type of the advice, as well as the real-time
      decision of the affected server-side components.</para>
      <para>A typical usage of ladvise is to enable applications and users with
      external knowledge to intervene in server-side cache management. For
      example, if many different clients are doing small random reads of a
      file, prefetching pages into OSS cache with big linear reads before the
      random I/O is a net benefit. Fetching that data into each client cache
      with fadvise() may not be, because much more data is sent to the clients.
      </para>
      <para>
      <literal>ladvise lockahead</literal> is different in that it attempts to
      control LDLM locking behavior by explicitly requesting LDLM locks in
      advance of use. This does not directly affect caching behavior; instead,
      it is used in special cases to avoid pathological results (lock exchange)
      from the normal LDLM locking behavior.
      </para>
      <para>
      Note that the <literal>noexpand</literal> advice works on a specific
      file descriptor, so using it via <literal>lfs</literal> has no effect.
      It must be applied to the particular file descriptor that is used for
      I/O to have any effect.
      </para>
      <para>The main difference between the Linux <literal>fadvise()</literal>
      system call and <literal>lfs ladvise</literal> is that
      <literal>fadvise()</literal> is only a client-side mechanism that does
      not pass the advice to the file system, while <literal>ladvise</literal>
      can send advice or hints to the Lustre server side.</para>
2274       </section>
2275       <section><title>Examples</title>
        <para>The following example gives the OST(s) holding the first 1GB of
        <literal>/mnt/lustre/file1</literal> a hint that the first 1GB of the
        file will be read soon.</para>
        <screen>client1$ lfs ladvise -a willread -s 0 -e 1G /mnt/lustre/file1
        </screen>
        <para>The following example gives the OST(s) holding the first 1GB of
        <literal>/mnt/lustre/file1</literal> a hint that the first 1GB of the
        file will not be read in the near future, so the OST(s) can drop the
        cached data for that range from memory.</para>
        <screen>client1$ lfs ladvise -a dontneed -s 0 -e 1G /mnt/lustre/file1
        </screen>
2287         <para>The following example requests an LDLM read lock on the first
2288         1 MiB of <literal>/mnt/lustre/file1</literal>.  This will attempt to
2289         request a lock from the OST holding that region of the file.</para>
2290         <screen>client1$ lfs ladvise -a lockahead -m READ -s 0 -e 1M /mnt/lustre/file1
2291         </screen>
        <para>The following example requests an LDLM write lock on
        [3 MiB, 10 MiB) of <literal>/mnt/lustre/file1</literal> (the end
        offset is not included in the range).  This will attempt to request a
        lock from the OST holding that region of the file.</para>
2296         <screen>client1$ lfs ladvise -a lockahead -m WRITE -s 3M -e 10M /mnt/lustre/file1
2297         </screen>
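        <para>Because the end offset is exclusive, a range given with
        <literal>-s</literal> and <literal>-e</literal> can also be expressed
        with <literal>-s</literal> and <literal>-l</literal>, where the length
        is the end offset minus the start offset. The following sketch works
        through that arithmetic for the write lock example; the variable names
        are illustrative only, and the commented-out command assumes the same
        hypothetical file on a mounted client.</para>
        <screen>
# Byte offsets for the range [3 MiB, 10 MiB).
start=$((3 * 1024 * 1024))
end=$((10 * 1024 * 1024))

# The end offset is exclusive, so the length is simply end - start.
length=$((end - start))
echo "start=${start} length=${length} end=${end}"

# Equivalent request expressed with a length instead of an end offset:
# lfs ladvise -a lockahead -m WRITE -s "$start" -l "$length" /mnt/lustre/file1</screen>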
2298       </section>
2299   </section>
2300   <section condition="l29">
2301       <title>
2302           <indexterm>
2303               <primary>tuning</primary>
2304               <secondary>Large Bulk IO</secondary>
2305           </indexterm>
2306           Large Bulk IO (16MB RPC)
2307       </title>
2308       <section><title>Overview</title>
          <para>Beginning with Lustre 2.9, Lustre supports RPCs up to 16MB in
          size. With a larger RPC size, fewer RPCs are required to transfer
          the same amount of data between clients and servers, and the OSS can
          submit more data to the underlying disks at once. This produces
          larger disk I/Os that make fuller use of the increasing bandwidth of
          disks.</para>
          <para>At connection time, each client negotiates with the servers
          the maximum RPC size that can be used, but the client can always
          send RPCs smaller than this maximum.</para>
2318           <para>The parameter <literal>brw_size</literal> is used on the OST
2319           to tell the client the maximum (preferred) IO size.  All clients that
2320           talk to this target should never send an RPC greater than this size.
2321           Clients can individually set a smaller RPC size limit via the
2322           <literal>osc.*.max_pages_per_rpc</literal> tunable.
2323           </para>
2324           <note>
          <para>The smallest <literal>brw_size</literal> that can be set for
          ZFS OSTs is the <literal>recordsize</literal> of that dataset.  This
          ensures that the client can always write a full ZFS file block if it
          has enough dirty data, and does not otherwise force it to do
          read-modify-write operations for every RPC.
          </para>
2330           </para>
2331           </note>
2332       </section>
2333       <section><title>Usage</title>
2334           <para>In order to enable a larger RPC size,
2335           <literal>brw_size</literal> must be changed to an IO size value up to
2336           16MB.  To temporarily change <literal>brw_size</literal>, the
2337           following command should be run on the OSS:</para>
2338           <screen>oss# lctl set_param obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
          <para>To persistently change <literal>brw_size</literal>, the
          following command should be run on the MGS:</para>
          <screen>mgs# lctl set_param -P obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
          <para>When a client connects to an OST target, it will fetch
          <literal>brw_size</literal> from the target and pick the smaller
          of <literal>brw_size</literal> and its local setting for
          <literal>max_pages_per_rpc</literal> as the actual RPC size.
          Therefore, the <literal>max_pages_per_rpc</literal> on the client side
          must be set to 16M, or 4096 if the PAGESIZE is 4KB, to enable
          a 16MB RPC.  To temporarily make the change, run the following
          command on the client to set
          <literal>max_pages_per_rpc</literal>:</para>
2351           <screen>client$ lctl set_param osc.<replaceable>fsname</replaceable>-OST*.max_pages_per_rpc=16M</screen>
          <para>To persistently make this change, the following command should
          be run on the MGS:</para>
          <screen>mgs# lctl set_param -P osc.<replaceable>fsname</replaceable>-OST*.max_pages_per_rpc=16M</screen>
2355           <caution><para>The <literal>brw_size</literal> of an OST can be
2356           changed on the fly.  However, clients have to be remounted to
2357           renegotiate the new maximum RPC size.</para></caution>
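          <para>The page count that corresponds to a 16MB RPC depends on the
          client page size. The following sketch derives it from
          <literal>getconf PAGESIZE</literal>; the variable names are
          illustrative only, and the commented-out
          <literal>lctl</literal> line assumes a mounted client.</para>
          <screen>
# Page size of this client, in bytes (for example, 4096 on many x86_64 hosts).
page_size=$(getconf PAGESIZE)

# A 16MB RPC expressed in bytes, then in pages of this size.
rpc_bytes=$((16 * 1024 * 1024))
pages_per_rpc=$((rpc_bytes / page_size))
echo "page size ${page_size} bytes: max_pages_per_rpc=${pages_per_rpc}"

# The page count could then be applied on the client, for example:
# lctl set_param osc.*.max_pages_per_rpc=${pages_per_rpc}</screen>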
2358       </section>
2359   </section>
2360   <section xml:id="dbdoclet.50438272_80545">
2361     <title>
2362     <indexterm>
2363       <primary>tuning</primary>
2364       <secondary>for small files</secondary>
2365     </indexterm>Improving Lustre I/O Performance for Small Files</title>
2366     <para>An environment where an application writes small file chunks from
2367     many clients to a single file can result in poor I/O performance. To
2368     improve the performance of the Lustre file system with small files:</para>
2369     <itemizedlist>
2370       <listitem>
        <para>Have the application aggregate writes to some extent before
        submitting them to the Lustre file system. By default, the Lustre
        software enforces POSIX coherency semantics, which results in lock
        ping-pong between client nodes if they are all writing to the same
        file at one time.</para>
        <para>Using MPI-IO Collective Write functionality in
        the Lustre ADIO driver is one way to achieve this in a
        straightforward manner if the application is already using MPI-IO.</para>
2379       </listitem>
2380       <listitem>
        <para>Have the application do 4KB-sized
        <literal>O_DIRECT</literal> I/O to the file and disable locking
        on the output file. This avoids partial-page I/O submissions and,
        because locking is disabled, contention between clients.</para>
2385       </listitem>
2386       <listitem>
2387         <para>Have the application write contiguous data.</para>
2388       </listitem>
2389       <listitem>
        <para>Add more disks or use SSDs for the OSTs. This dramatically
        improves the IOPS rate. Consider creating larger OSTs rather than many
        smaller OSTs, because larger OSTs have less overhead (journal,
        connections, etc.).</para>
2393       </listitem>
2394       <listitem>
2395         <para>Use RAID-1+0 OSTs instead of RAID-5/6. There is RAID parity
2396         overhead for writing small chunks of data to disk.</para>
2397       </listitem>
2398     </itemizedlist>
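    <para>The first recommendation, aggregating small writes before they reach
    the file system, can be illustrated with <literal>dd</literal>, which
    re-blocks its input when the input and output block sizes differ. This is
    a sketch only: the temporary files stand in for the real data source and
    for a file on a Lustre mount, and the 256-byte record size and 1 MiB write
    size are arbitrary choices for illustration.</para>
    <screen>
# Temporary files stand in for the data source and for a file on Lustre.
src=$(mktemp)
out=$(mktemp)

# Create 4096 records of 256 bytes each (1 MiB in total).
dd if=/dev/zero of="$src" bs=256 count=4096

# ibs is the small record size read from the source; obs is the size of each
# write issued to the output, so records are aggregated before being written.
dd if="$src" of="$out" ibs=256 obs=1048576

out_bytes=$(wc -c "$out" | awk '{print $1}')
echo "wrote ${out_bytes} bytes"
rm -f "$src" "$out"</screen>
    <para>An application would normally do the equivalent buffering
    internally, or rely on MPI-IO collective writes as described above.</para>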
2399   </section>
2400   <section xml:id="dbdoclet.50438272_45406">
2401     <title>
2402     <indexterm>
2403       <primary>tuning</primary>
2404       <secondary>write performance</secondary>
2405     </indexterm>Understanding Why Write Performance is Better Than Read
2406     Performance</title>
2407     <para>Typically, the performance of write operations on a Lustre cluster is
2408     better than read operations. When doing writes, all clients are sending
2409     write RPCs asynchronously. The RPCs are allocated, and written to disk in
2410     the order they arrive. In many cases, this allows the back-end storage to
2411     aggregate writes efficiently.</para>
    <para>In the case of read operations, the reads from clients may arrive in
    a different order and may require a lot of seeking to read the data from
    disk. This noticeably hampers the read throughput.</para>
    <para>Currently, there is no readahead on the OSTs themselves, though the
    clients do readahead. If there are lots of clients doing reads, it would
    not be possible to do any readahead in any case because of memory
    consumption. Consider that even a single 1 MB RPC of readahead for each of
    1000 clients would consume 1 GB of RAM.</para>
    <para>For file systems that use socklnd (TCP, Ethernet) as the
    interconnect, there is also additional CPU overhead because the client
    cannot receive data without copying it from the network buffers. In the
    write case, the client can send data without the additional data copy.
    This means that the client is more likely to become CPU-bound during reads
    than writes.</para>
2425   </section>
2426 </chapter>