1 <?xml version='1.0' encoding='utf-8'?>
2 <chapter xmlns="http://docbook.org/ns/docbook"
3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
5 <title xml:id="lustretuning.title">Tuning a Lustre File System</title>
6 <para>This chapter contains information about tuning a Lustre file system for
7 better performance.</para>
9 <para>Many options in the Lustre software are set by means of kernel module
10 parameters. These parameters are contained in the
11 <literal>/etc/modprobe.d/lustre.conf</literal> file.</para>
13 <section xml:id="dbdoclet.50438272_55226">
16 <primary>tuning</primary>
19 <primary>tuning</primary>
20 <secondary>service threads</secondary>
21 </indexterm>Optimizing the Number of Service Threads</title>
22 <para>An OSS can have a minimum of two service threads and a maximum of 512
23 service threads. The number of service threads is a function of how much
24 RAM and how many CPUs are on each OSS node (1 thread / 128MB * num_cpus).
25 If the load on the OSS node is high, new service threads will be started in
26 order to process more requests concurrently, up to 4x the initial number of
27 threads (subject to the maximum of 512). For a 2GB 2-CPU system, the
28 default thread count is 32 and the maximum thread count is 128.</para>
29 <para>Increasing the size of the thread pool may help when:</para>
32 <para>Several OSTs are exported from a single OSS</para>
35 <para>Back-end storage is running synchronously</para>
38 <para>I/O completions take excessive time due to slow storage</para>
41 <para>Decreasing the size of the thread pool may help if:</para>
44 <para>Clients are overwhelming the storage capacity</para>
47 <para>There are lots of "slow I/O" or similar messages</para>
50 <para>Increasing the number of I/O threads allows the kernel and storage to
51 aggregate many writes together for more efficient disk I/O. The OSS thread
52 pool is shared--each thread allocates approximately 1.5 MB (maximum RPC
53 size + 0.5 MB) for internal I/O buffers.</para>
54 <para>It is very important to consider memory consumption when increasing
55 the thread pool size. Drives are only able to sustain a certain amount of
56 parallel I/O activity before performance is degraded, due to the high
57 number of seeks and the OST threads just waiting for I/O. In this
58 situation, it may be advisable to decrease the load by decreasing the
59 number of OST threads.</para>
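<para>As a rough sizing example based on the figures above, running the
maximum of 512 service threads would consume approximately
512 * 1.5 MB = 768 MB of RAM for I/O buffers alone.</para>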
60 <para>Determining the optimum number of OSS threads is a process of trial
61 and error, and varies for each particular configuration. Variables include
62 the number of OSTs on each OSS, number and speed of disks, RAID
63 configuration, and available RAM. You may want to start with a number of
64 OST threads equal to the number of actual disk spindles on the node. If you
65 use RAID, subtract any dead spindles not used for actual data (e.g., 1 of N
66 of spindles for RAID5, 2 of N spindles for RAID6), and monitor the
67 performance of clients during usual workloads. If performance is degraded,
68 increase the thread count and see how that works until performance is
69 degraded again or you reach satisfactory performance.</para>
<para>If there are too many threads, the latency for individual I/O
requests can become very high; this situation should be avoided. Set the
desired maximum thread count permanently using the method described
above.</para>
78 <primary>tuning</primary>
79 <secondary>OSS threads</secondary>
80 </indexterm>Specifying the OSS Service Thread Count</title>
82 <literal>oss_num_threads</literal> parameter enables the number of OST
83 service threads to be specified at module load time on the OSS
86 options ost oss_num_threads={N}
88 <para>After startup, the minimum and maximum number of OSS thread counts
90 <literal>{service}.thread_{min,max,started}</literal> tunable. To change
91 the tunable at runtime, run:</para>
94 lctl {get,set}_param {service}.thread_{min,max,started}
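# For example (parameter names are illustrative and may vary by Lustre
# version; verify them with: lctl get_param -N ost.*.*.threads_max):
lctl get_param ost.OSS.ost_io.threads_max
lctl set_param ost.OSS.ost_io.threads_max=256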
97 <para condition='l23'>Lustre software release 2.3 introduced binding
service threads to CPU partitions. This works in a similar fashion to
99 binding of threads on MDS. MDS thread tuning is covered in
100 <xref linkend="dbdoclet.mdsbinding" />.</para>
104 <literal>oss_cpts=[EXPRESSION]</literal> binds the default OSS service
106 <literal>[EXPRESSION]</literal>.</para>
110 <literal>oss_io_cpts=[EXPRESSION]</literal> binds the IO OSS service
112 <literal>[EXPRESSION]</literal>.</para>
115 <para>For further details, see
116 <xref linkend="dbdoclet.50438271_87260" />.</para>
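<para>For example, a possible
<literal>/etc/modprobe.d/lustre.conf</literal> entry (the thread count and
CPT values shown are illustrative only and should be chosen to match the
hardware) might be:</para>
<screen>options ost oss_num_threads=64 oss_cpts=[0,1] oss_io_cpts=[2,3]</screen>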
118 <section xml:id="dbdoclet.mdstuning">
121 <primary>tuning</primary>
122 <secondary>MDS threads</secondary>
123 </indexterm>Specifying the MDS Service Thread Count</title>
125 <literal>mds_num_threads</literal> parameter enables the number of MDS
126 service threads to be specified at module load time on the MDS
129 options mds mds_num_threads={N}
131 <para>After startup, the minimum and maximum number of MDS thread counts
133 <literal>{service}.thread_{min,max,started}</literal> tunable. To change
134 the tunable at runtime, run:</para>
137 lctl {get,set}_param {service}.thread_{min,max,started}
140 <para>For details, see
141 <xref linkend="dbdoclet.50438271_87260" />.</para>
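<para>For example, to read and then raise the maximum number of MDS service
threads at runtime, commands similar to the following could be used (the
service name shown is illustrative; the exact name can be listed with
<literal>lctl get_param -N mds.*.*.threads_max</literal>):</para>
<screen>$ lctl get_param mds.MDS.mdt.threads_max
$ lctl set_param mds.MDS.mdt.threads_max=128</screen>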
142 <para>At this time, no testing has been done to determine the optimal
143 number of MDS threads. The default value varies, based on server size, up
144 to a maximum of 32. The maximum number of threads (
145 <literal>MDS_MAX_THREADS</literal>) is 512.</para>
147 <para>The OSS and MDS automatically start new service threads
148 dynamically, in response to server load within a factor of 4. The
149 default value is calculated the same way as before. Setting the
<literal>*_num_threads</literal> module parameter disables automatic
151 thread creation behavior.</para>
153 <para>Lustre software release 2.3 introduced new parameters to provide
154 more control to administrators.</para>
<literal>mds_rdpg_num_threads</literal> controls the number of threads
used to provide the read page service. The read page service handles
file close and readdir operations.</para>
<literal>mds_attr_num_threads</literal> controls the number of threads
used to provide the setattr service to clients running Lustre software
170 <para>Default values for the thread counts are automatically selected.
171 The values are chosen to best exploit the number of CPUs present in the
172 system and to provide best overall performance for typical
177 <section xml:id="dbdoclet.mdsbinding" condition='l23'>
180 <primary>tuning</primary>
181 <secondary>MDS binding</secondary>
182 </indexterm>Binding MDS Service Thread to CPU Partitions</title>
183 <para>With the introduction of Node Affinity (
184 <xref linkend="nodeaffdef" />) in Lustre software release 2.3, MDS threads
185 can be bound to particular CPU partitions (CPTs). Default values for
186 bindings are selected automatically to provide good overall performance for
a given CPU count. However, an administrator can deviate from these settings
188 if they choose.</para>
192 <literal>mds_num_cpts=[EXPRESSION]</literal> binds the default MDS
193 service threads to CPTs defined by
194 <literal>EXPRESSION</literal>. For example
195 <literal>mds_num_cpts=[0-3]</literal> will bind the MDS service threads
197 <literal>CPT[0,1,2,3]</literal>.</para>
201 <literal>mds_rdpg_num_cpts=[EXPRESSION]</literal> binds the read page
202 service threads to CPTs defined by
203 <literal>EXPRESSION</literal>. The read page service handles file close
204 and readdir requests. For example
205 <literal>mds_rdpg_num_cpts=[4]</literal> will bind the read page threads
207 <literal>CPT4</literal>.</para>
211 <literal>mds_attr_num_cpts=[EXPRESSION]</literal> binds the setattr
212 service threads to CPTs defined by
213 <literal>EXPRESSION</literal>.</para>
216 <para>Parameters must be set before module load in the file
217 <literal>/etc/modprobe.d/lustre.conf</literal>. For example:
218 <example><title>lustre.conf</title>
219 <screen>options lnet networks=tcp0(eth0)
220 options mdt mds_num_cpts=[0]</screen>
224 <section xml:id="dbdoclet.50438272_73839">
227 <primary>LNET</primary>
228 <secondary>tuning</secondary>
231 <primary>tuning</primary>
232 <secondary>LNET</secondary>
233 </indexterm>Tuning LNET Parameters</title>
234 <para>This section describes LNET tunables, the use of which may be
235 necessary on some systems to improve performance. To test the performance
236 of your Lustre network, see
237 <xref linkend='lnetselftest' />.</para>
239 <title>Transmit and Receive Buffer Size</title>
240 <para>The kernel allocates buffers for sending and receiving messages on
243 <literal>ksocklnd</literal> has separate parameters for the transmit and
244 receive buffers.</para>
246 options ksocklnd tx_buffer_size=0 rx_buffer_size=0
248 <para>If these parameters are left at the default value (0), the system
249 automatically tunes the transmit and receive buffer size. In almost every
250 case, this default produces the best performance. Do not attempt to tune
251 these parameters unless you are a network expert.</para>
254 <title>Hardware Interrupts (
255 <literal>enable_irq_affinity</literal>)</title>
256 <para>The hardware interrupts that are generated by network adapters may
257 be handled by any CPU in the system. In some cases, we would like network
258 traffic to remain local to a single CPU to help keep the processor cache
259 warm and minimize the impact of context switches. This is helpful when an
260 SMP system has more than one network interface and ideal when the number
261 of interfaces equals the number of CPUs. To enable the
262 <literal>enable_irq_affinity</literal> parameter, enter:</para>
264 options ksocklnd enable_irq_affinity=1
266 <para>In other cases, if you have an SMP platform with a single fast
267 interface such as 10 Gb Ethernet and more than two CPUs, you may see
268 performance improve by turning this parameter off.</para>
270 options ksocklnd enable_irq_affinity=0
272 <para>By default, this parameter is off. As always, you should test the
273 performance to compare the impact of changing this parameter.</para>
275 <section condition='l23'>
278 <primary>tuning</primary>
279 <secondary>Network interface binding</secondary>
280 </indexterm>Binding Network Interface Against CPU Partitions</title>
281 <para>Lustre software release 2.3 and beyond provide enhanced network
282 interface control. The enhancement means that an administrator can bind
283 an interface to one or more CPU partitions. Bindings are specified as
284 options to the LNET modules. For more information on specifying module
286 <xref linkend="dbdoclet.50438293_15350" /></para>
288 <literal>o2ib0(ib0)[0,1]</literal> will ensure that all messages for
289 <literal>o2ib0</literal> will be handled by LND threads executing on
290 <literal>CPT0</literal> and
291 <literal>CPT1</literal>. An additional example might be:
292 <literal>tcp1(eth0)[0]</literal>. Messages for
293 <literal>tcp1</literal> are handled by threads on
294 <literal>CPT0</literal>.</para>
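<para>Assuming the bindings above are expressed through the LNET
<literal>networks</literal> module option (an illustrative example; adjust
the network type, interface, and CPT list to the local configuration), the
corresponding <literal>/etc/modprobe.d/lustre.conf</literal> entry would
be:</para>
<screen>options lnet networks="o2ib0(ib0)[0,1]"</screen>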
299 <primary>tuning</primary>
300 <secondary>Network interface credits</secondary>
301 </indexterm>Network Interface Credits</title>
302 <para>Network interface (NI) credits are shared across all CPU partitions
303 (CPT). For example, if a machine has four CPTs and the number of NI
304 credits is 512, then each partition has 128 credits. If a large number of
305 CPTs exist on the system, LNET checks and validates the NI credits for
306 each CPT to ensure each CPT has a workable number of credits. For
307 example, if a machine has 16 CPTs and the number of NI credits is 256,
308 then each partition only has 16 credits. 16 NI credits is low and could
309 negatively impact performance. As a result, LNET automatically adjusts
311 <literal>peer_credits</literal>(
312 <literal>peer_credits</literal> is 8 by default), so each partition has 64
314 <para>Increasing the number of
315 <literal>credits</literal>/
316 <literal>peer_credits</literal> can improve the performance of high
317 latency networks (at the cost of consuming more memory) by enabling LNET
318 to send more inflight messages to a specific network/peer and keep the
319 pipeline saturated.</para>
320 <para>An administrator can modify the NI credit count using
<literal>ksocklnd</literal> or
322 <literal>ko2iblnd</literal>. In the example below, 256 credits are
323 applied to TCP connections.</para>
327 <para>Applying 256 credits to IB connections can be achieved with:</para>
331 <note condition="l23">
332 <para>In Lustre software release 2.3 and beyond, LNET may revalidate
333 the NI credits, so the administrator's request may not persist.</para>
339 <primary>tuning</primary>
340 <secondary>router buffers</secondary>
341 </indexterm>Router Buffers</title>
342 <para>When a node is set up as an LNET router, three pools of buffers are
343 allocated: tiny, small and large. These pools are allocated per CPU
344 partition and are used to buffer messages that arrive at the router to be
345 forwarded to the next hop. The three different buffer sizes accommodate
346 different size messages.</para>
<para>If a message fits in a tiny buffer, a tiny buffer is used. If a
message does not fit in a tiny buffer but fits in a small buffer, then a
small buffer is used. Finally, if a message fits in neither a tiny buffer
nor a small buffer, a large buffer is
352 <para>Router buffers are shared by all CPU partitions. For a machine with
353 a large number of CPTs, the router buffer number may need to be specified
354 manually for best performance. A low number of router buffers risks
355 starving the CPU partitions of resources.</para>
359 <literal>tiny_router_buffers</literal>: Zero payload buffers used for
360 signals and acknowledgements.</para>
364 <literal>small_router_buffers</literal>: 4 KB payload buffers for
365 small messages</para>
369 <literal>large_router_buffers</literal>: 1 MB maximum payload
370 buffers, corresponding to the recommended RPC size of 1 MB.</para>
373 <para>The default setting for router buffers typically results in
374 acceptable performance. LNET automatically sets a default value to reduce
375 the likelihood of resource starvation. The size of a router buffer can be
376 modified as shown in the example below. In this example, the size of the
377 large buffer is modified using the
378 <literal>large_router_buffers</literal> parameter.</para>
380 lnet large_router_buffers=8192
382 <note condition="l23">
383 <para>In Lustre software release 2.3 and beyond, LNET may revalidate
384 the router buffer setting, so the administrator's request may not
391 <primary>tuning</primary>
392 <secondary>portal round-robin</secondary>
393 </indexterm>Portal Round-Robin</title>
394 <para>Portal round-robin defines the policy LNET applies to deliver
events and messages to the upper layers. The upper layers are PTLRPC
396 service or LNET selftest.</para>
397 <para>If portal round-robin is disabled, LNET will deliver messages to
398 CPTs based on a hash of the source NID. Hence, all messages from a
399 specific peer will be handled by the same CPT. This can reduce data
400 traffic between CPUs. However, for some workloads, this behavior may
result in a poorly balanced load across CPUs.</para>
402 <para>If portal round-robin is enabled, LNET will round-robin incoming
events across all CPTs. This may balance load better across CPUs but
can incur cross-CPU overhead.</para>
405 <para>The current policy can be changed by an administrator with
407 <replaceable>value</replaceable>>
408 /proc/sys/lnet/portal_rotor</literal>. There are four options for
410 <replaceable>value</replaceable>
415 <literal>OFF</literal>
417 <para>Disable portal round-robin on all incoming requests.</para>
421 <literal>ON</literal>
423 <para>Enable portal round-robin on all incoming requests.</para>
427 <literal>RR_RT</literal>
429 <para>Enable portal round-robin only for routed messages.</para>
433 <literal>HASH_RT</literal>
<para>Routed messages will be delivered to the upper layer by hash of
source NID (instead of the NID of the router). This is the default
442 <title>LNET Peer Health</title>
443 <para>Two options are available to help determine peer health:
447 <literal>peer_timeout</literal>- The timeout (in seconds) before an
448 aliveness query is sent to a peer. For example, if
449 <literal>peer_timeout</literal> is set to
450 <literal>180sec</literal>, an aliveness query is sent to the peer
451 every 180 seconds. This feature only takes effect if the node is
452 configured as an LNET router.</para>
453 <para>In a routed environment, the
454 <literal>peer_timeout</literal> feature should always be on (set to a
455 value in seconds) on routers. If the router checker has been enabled,
456 the feature should be turned off by setting it to 0 on clients and
458 <para>For a non-routed scenario, enabling the
459 <literal>peer_timeout</literal> option provides health information
460 such as whether a peer is alive or not. For example, a client is able
461 to determine if an MGS or OST is up when it sends it a message. If a
462 response is received, the peer is alive; otherwise a timeout occurs
463 when the request is made.</para>
465 <literal>peer_timeout</literal> should be set to no less than the LND
466 timeout setting. For more information about LND timeouts, see
467 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
468 linkend="section_c24_nt5_dl" />.</para>
470 <literal>o2iblnd</literal>(IB) driver is used,
471 <literal>peer_timeout</literal> should be at least twice the value of
<literal>ko2iblnd</literal> keepalive option. For more information
474 about keepalive options, see
475 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
476 linkend="section_ngq_qhy_zl" />.</para>
<literal>avoid_asym_router_failure</literal>- When set to 1, the
481 router checker running on the client or a server periodically pings
482 all the routers corresponding to the NIDs identified in the routes
483 parameter setting on the node to determine the status of each router
484 interface. The default setting is 1. (For more information about the
485 LNET routes parameter, see
486 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
487 linkend="dbdoclet.50438216_71227" /></para>
488 <para>A router is considered down if any of its NIDs are down. For
489 example, router X has three NIDs:
490 <literal>Xnid1</literal>,
491 <literal>Xnid2</literal>, and
492 <literal>Xnid3</literal>. A client is connected to the router via
493 <literal>Xnid1</literal>. The client has router checker enabled. The
494 router checker periodically sends a ping to the router via
495 <literal>Xnid1</literal>. The router responds to the ping with the
496 status of each of its NIDs. In this case, it responds with
497 <literal>Xnid1=up</literal>,
498 <literal>Xnid2=up</literal>,
499 <literal>Xnid3=down</literal>. If
500 <literal>avoid_asym_router_failure==1</literal>, the router is
501 considered down if any of its NIDs are down, so router X is
502 considered down and will not be used for routing messages. If
503 <literal>avoid_asym_router_failure==0</literal>, router X will
504 continue to be used for routing messages.</para>
506 </itemizedlist></para>
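<para>As an illustration only (the module that owns each parameter can
differ between LND types and Lustre versions, so verify the names with
<literal>modinfo lnet</literal> and <literal>modinfo ksocklnd</literal>),
a TCP client in a routed configuration with the router checker enabled
might use entries such as:</para>
<screen>options lnet avoid_asym_router_failure=1
options ksocklnd peer_timeout=0</screen>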
507 <para>The following router checker parameters must be set to the maximum
508 value of the corresponding setting for this option on any client or
513 <literal>dead_router_check_interval</literal>
518 <literal>live_router_check_interval</literal>
523 <literal>router_ping_timeout</literal>
526 </itemizedlist></para>
527 <para>For example, the
528 <literal>dead_router_check_interval</literal> parameter on any router must
532 <section xml:id="dbdoclet.libcfstuning">
535 <primary>tuning</primary>
536 <secondary>libcfs</secondary>
537 </indexterm>libcfs Tuning</title>
538 <para>By default, the Lustre software will automatically generate CPU
539 partitions (CPT) based on the number of CPUs in the system. The CPT number
540 will be 1 if the online CPU number is less than five.</para>
541 <para>The CPT number can be explicitly set on the libcfs module using
542 <literal>cpu_npartitions=NUMBER</literal>. The value of
543 <literal>cpu_npartitions</literal> must be an integer between 1 and the
544 number of online CPUs.</para>
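<para>For example, to explicitly create four CPU partitions (an
illustrative value; choose a partition count appropriate for the number of
online CPUs), add the following line to
<literal>/etc/modprobe.d/lustre.conf</literal>:</para>
<screen>options libcfs cpu_npartitions=4</screen>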
546 <para>Setting CPT to 1 will disable most of the SMP Node Affinity
547 functionality.</para>
550 <title>CPU Partition String Patterns</title>
551 <para>CPU partitions can be described using string pattern notation. For
<literal>cpu_pattern="0[0,2,4,6] 1[1,3,5,7]"</literal>
558 <para>Create two CPTs, CPT0 contains CPU[0, 2, 4, 6]. CPT1 contains
<literal>cpu_pattern="N 0[0-3] 1[4-7]"</literal>
565 <para>Create two CPTs, CPT0 contains all CPUs in NUMA node[0-3], CPT1
566 contains all CPUs in NUMA node [4-7].</para>
569 <para>The current configuration of the CPU partition can be read from
<literal>/proc/sys/lnet/cpu_partition_table</literal>.</para>
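<para>For example, the second pattern above could be applied at module
load time with an entry such as the following (illustrative; the NUMA node
numbers must match the local topology):</para>
<screen>options libcfs cpu_pattern="N 0[0-3] 1[4-7]"</screen>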
573 <section xml:id="dbdoclet.lndtuning">
576 <primary>tuning</primary>
577 <secondary>LND tuning</secondary>
578 </indexterm>LND Tuning</title>
579 <para>LND tuning allows the number of threads per CPU partition to be
580 specified. An administrator can set the threads for both
581 <literal>ko2iblnd</literal> and
582 <literal>ksocklnd</literal> using the
583 <literal>nscheds</literal> parameter. This adjusts the number of threads for
584 each partition, not the overall number of threads on the LND.</para>
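<para>For example, a possible setting (the value is illustrative) that
assigns four scheduler threads per CPU partition to the InfiniBand LND
would be:</para>
<screen>options ko2iblnd nscheds=4</screen>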
586 <para>Lustre software release 2.3 has greatly decreased the default
587 number of threads for
588 <literal>ko2iblnd</literal> and
589 <literal>ksocklnd</literal> on high-core count machines. The current
590 default values are automatically set and are chosen to work well across a
591 number of typical scenarios.</para>
594 <section xml:id="dbdoclet.nrstuning" condition='l24'>
597 <primary>tuning</primary>
598 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
599 </indexterm>Network Request Scheduler (NRS) Tuning</title>
600 <para>The Network Request Scheduler (NRS) allows the administrator to
601 influence the order in which RPCs are handled at servers, on a per-PTLRPC
602 service basis, by providing different policies that can be activated and
603 tuned in order to influence the RPC ordering. The aim of this is to provide
604 for better performance, and possibly discrete performance characteristics
605 using future policies.</para>
606 <para>The NRS policy state of a PTLRPC service can be read and set via the
607 <literal>{service}.nrs_policies</literal> tunable. To read a PTLRPC
608 service's NRS policy state, run:</para>
610 lctl get_param {service}.nrs_policies
612 <para>For example, to read the NRS policy state of the
613 <literal>ost_io</literal> service, run:</para>
615 $ lctl get_param ost.OSS.ost_io.nrs_policies
616 ost.OSS.ost_io.nrs_policies=
643 high_priority_requests:
669 <para>NRS policy state is shown in either one or two sections, depending on
670 the PTLRPC service being queried. The first section is named
671 <literal>regular_requests</literal> and is available for all PTLRPC
672 services, optionally followed by a second section which is named
673 <literal>high_priority_requests</literal>. This is because some PTLRPC
674 services are able to treat some types of RPCs as higher priority ones, such
675 that they are handled by the server with higher priority compared to other,
676 regular RPC traffic. For PTLRPC services that do not support high-priority
677 RPCs, you will only see the
678 <literal>regular_requests</literal> section.</para>
679 <para>There is a separate instance of each NRS policy on each PTLRPC
680 service for handling regular and high-priority RPCs (if the service
681 supports high-priority RPCs). For each policy instance, the following
682 fields are shown:</para>
683 <informaltable frame="all">
685 <colspec colname="c1" colwidth="50*" />
686 <colspec colname="c2" colwidth="50*" />
691 <emphasis role="bold">Field</emphasis>
696 <emphasis role="bold">Description</emphasis>
705 <literal>name</literal>
709 <para>The name of the policy.</para>
715 <literal>state</literal>
719 <para>The state of the policy; this can be any of
720 <literal>invalid, stopping, stopped, starting, started</literal>.
721 A fully enabled policy is in the
722 <literal>started</literal> state.</para>
728 <literal>fallback</literal>
732 <para>Whether the policy is acting as a fallback policy or not. A
733 fallback policy is used to handle RPCs that other enabled
734 policies fail to handle, or do not support the handling of. The
736 <literal>no, yes</literal>. Currently, only the FIFO policy can
737 act as a fallback policy.</para>
743 <literal>queued</literal>
747 <para>The number of RPCs that the policy has waiting to be
754 <literal>active</literal>
758 <para>The number of RPCs that the policy is currently
765 <para>To enable an NRS policy on a PTLRPC service run:</para>
767 lctl set_param {service}.nrs_policies=
768 <replaceable>policy_name</replaceable>
770 <para>This will enable the policy
<replaceable>policy_name</replaceable> for both regular and high-priority
RPCs (if the PTLRPC service supports high-priority RPCs) on the given
773 service. For example, to enable the CRR-N NRS policy for the ldlm_cbd
776 $ lctl set_param ldlm.services.ldlm_cbd.nrs_policies=crrn
777 ldlm.services.ldlm_cbd.nrs_policies=crrn
780 <para>For PTLRPC services that support high-priority RPCs, you can also
<replaceable>reg|hp</replaceable> token, in order to enable an NRS policy
783 for handling only regular or high-priority RPCs on a given PTLRPC service,
786 lctl set_param {service}.nrs_policies="
787 <replaceable>policy_name</replaceable>
788 <replaceable>reg|hp</replaceable>"
790 <para>For example, to enable the TRR policy for handling only regular, but
791 not high-priority RPCs on the
792 <literal>ost_io</literal> service, run:</para>
794 $ lctl set_param ost.OSS.ost_io.nrs_policies="trr reg"
795 ost.OSS.ost_io.nrs_policies="trr reg"
799 <para>When enabling an NRS policy, the policy name must be given in
800 lower-case characters, otherwise the operation will fail with an error
806 <primary>tuning</primary>
807 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
808 <tertiary>first in, first out (FIFO) policy</tertiary>
809 </indexterm>First In, First Out (FIFO) policy</title>
810 <para>The first in, first out (FIFO) policy handles RPCs in a service in
811 the same order as they arrive from the LNET layer, so no special
812 processing takes place to modify the RPC handling stream. FIFO is the
813 default policy for all types of RPCs on all PTLRPC services, and is
814 always enabled irrespective of the state of other policies, so that it
815 can be used as a backup policy, in case a more elaborate policy that has
816 been enabled fails to handle an RPC, or does not support handling a given
818 <para>The FIFO policy has no tunables that adjust its behaviour.</para>
823 <primary>tuning</primary>
824 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
825 <tertiary>client round-robin over NIDs (CRR-N) policy</tertiary>
826 </indexterm>Client Round-Robin over NIDs (CRR-N) policy</title>
827 <para>The client round-robin over NIDs (CRR-N) policy performs batched
828 round-robin scheduling of all types of RPCs, with each batch consisting
829 of RPCs originating from the same client node, as identified by its NID.
830 CRR-N aims to provide for better resource utilization across the cluster,
831 and to help shorten completion times of jobs in some cases, by
832 distributing available bandwidth more evenly across all clients.</para>
833 <para>The CRR-N policy can be enabled on all types of PTLRPC services,
834 and has the following tunable that can be used to adjust its
839 <literal>{service}.nrs_crrn_quantum</literal>
842 <literal>{service}.nrs_crrn_quantum</literal> tunable determines the
843 maximum allowed size of each batch of RPCs; the unit of measure is in
844 number of RPCs. To read the maximum allowed batch size of a CRR-N
847 lctl get_param {service}.nrs_crrn_quantum
849 <para>For example, to read the maximum allowed batch size of a CRR-N
850 policy on the ost_io service, run:</para>
852 $ lctl get_param ost.OSS.ost_io.nrs_crrn_quantum
853 ost.OSS.ost_io.nrs_crrn_quantum=reg_quantum:16
857 <para>You can see that there is a separate maximum allowed batch size
859 <literal>reg_quantum</literal>) and high-priority (
860 <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
861 high-priority RPCs).</para>
862 <para>To set the maximum allowed batch size of a CRR-N policy on a
863 given service, run:</para>
865 lctl set_param {service}.nrs_crrn_quantum=
866 <replaceable>1-65535</replaceable>
868 <para>This will set the maximum allowed batch size on a given
service, for both regular and high-priority RPCs (if the PTLRPC
870 service supports high-priority RPCs), to the indicated value.</para>
871 <para>For example, to set the maximum allowed batch size on the
872 ldlm_canceld service to 16 RPCs, run:</para>
874 $ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
875 ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
878 <para>For PTLRPC services that support high-priority RPCs, you can
879 also specify a different maximum allowed batch size for regular and
880 high-priority RPCs, by running:</para>
882 $ lctl set_param {service}.nrs_crrn_quantum=
883 <replaceable>reg_quantum|hp_quantum</replaceable>:
<replaceable>1-65535</replaceable>
886 <para>For example, to set the maximum allowed batch size on the
887 ldlm_canceld service, for high-priority RPCs to 32, run:</para>
889 $ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum="hp_quantum:32"
890 ldlm.services.ldlm_canceld.nrs_crrn_quantum=hp_quantum:32
893 <para>By using the last method, you can also set the maximum regular
894 and high-priority RPC batch sizes to different values, in a single
895 command invocation.</para>
902 <primary>tuning</primary>
903 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
904 <tertiary>object-based round-robin (ORR) policy</tertiary>
905 </indexterm>Object-based Round-Robin (ORR) policy</title>
906 <para>The object-based round-robin (ORR) policy performs batched
907 round-robin scheduling of bulk read write (brw) RPCs, with each batch
908 consisting of RPCs that pertain to the same backend-file system object,
909 as identified by its OST FID.</para>
910 <para>The ORR policy is only available for use on the ost_io service. The
911 RPC batches it forms can potentially consist of mixed bulk read and bulk
912 write RPCs. The RPCs in each batch are ordered in an ascending manner,
913 based on either the file offsets, or the physical disk offsets of each
914 RPC (only applicable to bulk read RPCs).</para>
915 <para>The aim of the ORR policy is to provide for increased bulk read
916 throughput in some cases, by ordering bulk read RPCs (and potentially
917 bulk write RPCs), and thus minimizing costly disk seek operations.
918 Performance may also benefit from any resulting improvement in resource
919 utilization, or by taking advantage of better locality of reference
921 <para>The ORR policy has the following tunables that can be used to
922 adjust its behaviour:</para>
926 <literal>ost.OSS.ost_io.nrs_orr_quantum</literal>
929 <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable determines
930 the maximum allowed size of each batch of RPCs; the unit of measure
931 is in number of RPCs. To read the maximum allowed batch size of the
932 ORR policy, run:</para>
934 $ lctl get_param ost.OSS.ost_io.nrs_orr_quantum
935 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:256
939 <para>You can see that there is a separate maximum allowed batch size
941 <literal>reg_quantum</literal>) and high-priority (
942 <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
943 high-priority RPCs).</para>
944 <para>To set the maximum allowed batch size for the ORR policy,
947 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
948 <replaceable>1-65535</replaceable>
950 <para>This will set the maximum allowed batch size for both regular
951 and high-priority RPCs, to the indicated value.</para>
952 <para>You can also specify a different maximum allowed batch size for
953 regular and high-priority RPCs, by running:</para>
955 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
956 <replaceable>reg_quantum|hp_quantum</replaceable>:
957 <replaceable>1-65535</replaceable>
959 <para>For example, to set the maximum allowed batch size for regular
960 RPCs to 128, run:</para>
962 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
963 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
966 <para>By using the last method, you can also set the maximum regular
967 and high-priority RPC batch sizes to different values, in a single
968 command invocation.</para>
972 <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal>
975 <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable
976 determines whether the ORR policy orders RPCs within each batch based
977 on logical file offsets or physical disk offsets. To read the offset
978 type value for the ORR policy, run:</para>
980 $ lctl get_param ost.OSS.ost_io.nrs_orr_offset_type
981 ost.OSS.ost_io.nrs_orr_offset_type=reg_offset_type:physical
982 hp_offset_type:logical
985 <para>You can see that there is a separate offset type value for
987 <literal>reg_offset_type</literal>) and high-priority (
988 <literal>hp_offset_type</literal>) RPCs.</para>
989 <para>To set the ordering type for the ORR policy, run:</para>
991 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
992 <replaceable>physical|logical</replaceable>
994 <para>This will set the offset type for both regular and
995 high-priority RPCs, to the indicated value.</para>
996 <para>You can also specify a different offset type for regular and
997 high-priority RPCs, by running:</para>
999 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
1000 <replaceable>reg_offset_type|hp_offset_type</replaceable>:
1001 <replaceable>physical|logical</replaceable>
1003 <para>For example, to set the offset type for high-priority RPCs to
1004 physical disk offsets, run:</para>
1006 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
1007 ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
1009 <para>By using the last method, you can also set offset type for
1010 regular and high-priority RPCs to different values, in a single
1011 command invocation.</para>
<para>Irrespective of the value of this tunable, only logical
offsets can be, and are, used for ordering bulk write RPCs.</para>
1019 <literal>ost.OSS.ost_io.nrs_orr_supported</literal>
1022 <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable determines
1023 the type of RPCs that the ORR policy will handle. To read the types
1024 of supported RPCs by the ORR policy, run:</para>
1026 $ lctl get_param ost.OSS.ost_io.nrs_orr_supported
1027 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads
1028 hp_supported=reads_and_writes
1031 <para>You can see that there is a separate supported 'RPC types'
1033 <literal>reg_supported</literal>) and high-priority (
1034 <literal>hp_supported</literal>) RPCs.</para>
1035 <para>To set the supported RPC types for the ORR policy, run:</para>
1037 $ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
1038 <replaceable>reads|writes|reads_and_writes</replaceable>
1040 <para>This will set the supported RPC types for both regular and
1041 high-priority RPCs, to the indicated value.</para>
1042 <para>You can also specify a different supported 'RPC types' value
1043 for regular and high-priority RPCs, by running:</para>
1045 $ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
1046 <replaceable>reg_supported|hp_supported</replaceable>:
1047 <replaceable>reads|writes|reads_and_writes</replaceable>
1049 <para>For example, to set the supported RPC types to bulk read and
1050 bulk write RPCs for regular requests, run:</para>
1053 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
1054 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
1057 <para>By using the last method, you can also set the supported RPC
1058 types for regular and high-priority RPC to different values, in a
1059 single command invocation.</para>
1066 <primary>tuning</primary>
1067 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1068 <tertiary>Target-based round-robin (TRR) policy</tertiary>
1069 </indexterm>Target-based Round-Robin (TRR) policy</title>
1070 <para>The target-based round-robin (TRR) policy performs batched
1071 round-robin scheduling of brw RPCs, with each batch consisting of RPCs
1072 that pertain to the same OST, as identified by its OST index.</para>
1073 <para>The TRR policy is identical to the object-based round-robin (ORR)
1074 policy, apart from using the brw RPC's target OST index instead of the
1075 backend-fs object's OST FID, for determining the RPC scheduling order.
1076 The goals of TRR are effectively the same as for ORR, and it uses the
1077 following tunables to adjust its behaviour:</para>
1081 <literal>ost.OSS.ost_io.nrs_trr_quantum</literal>
1083 <para>The purpose of this tunable is exactly the same as for the
1084 <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable for the ORR
1085 policy, and you can use it in exactly the same way.</para>
1089 <literal>ost.OSS.ost_io.nrs_trr_offset_type</literal>
1091 <para>The purpose of this tunable is exactly the same as for the
1092 <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable for the
1093 ORR policy, and you can use it in exactly the same way.</para>
1097 <literal>ost.OSS.ost_io.nrs_trr_supported</literal>
1099 <para>The purpose of this tunable is exactly the same as for the
1100 <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable for the
ORR policy, and you can use it in exactly the same way.</para>
1105 <section condition='l26'>
1108 <primary>tuning</primary>
1109 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1110 <tertiary>Token Bucket Filter (TBF) policy</tertiary>
1111 </indexterm>Token Bucket Filter (TBF) policy</title>
1112 <para>The TBF (Token Bucket Filter) is a Lustre NRS policy which enables
1113 Lustre services to enforce the RPC rate limit on clients/jobs for QoS
1114 (Quality of Service) purposes.</para>
1116 <title>The internal structure of TBF policy</title>
1119 <imagedata scalefit="1" width="100%"
1120 fileref="figures/TBF_policy.svg" />
1123 <phrase>The internal structure of TBF policy</phrase>
<para>When an RPC request arrives, the TBF policy puts it into a waiting
queue according to its classification. The classification of RPC requests
is based on either the NID or the JobID of the RPC, depending on how TBF
is configured. The TBF policy maintains multiple queues in the system, one
queue for each category in the classification of RPC requests. Requests
wait for tokens in their FIFO queue before they are handled, so as to keep
the RPC rates under the limits.</para>
<para>When Lustre services are too busy to handle all of the requests in
time, the specified rates of the queues cannot all be satisfied. Nothing
bad will happen except that some of the RPC rates are slower than
configured. In this case, a queue with a higher rate will have an
advantage over queues with lower rates, but none of them will be
<para>To manage the RPC rate of the queues, the rate of each queue does
not need to be set manually. Instead, rules are defined which the TBF
policy matches to determine RPC rate limits. All of the defined rules are
organized as an ordered list. Whenever a queue is newly created, it goes
through the rule list and takes the first matching rule as its rule, so
that the queue knows its RPC token rate. A rule can be added to or removed
from the list at run time. Whenever the list of rules is changed, the
queues will update their matched rules.</para>
1151 <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>
<para>The format of the rule start command of the TBF policy is as
1156 $ lctl set_param x.x.x.nrs_tbf_rule=
1158 <replaceable>rule_name</replaceable>
1159 <replaceable>arguments</replaceable>..."
1162 <replaceable>rule_name</replaceable>' argument is a string which
1163 identifies a rule. The format of the '
<replaceable>arguments</replaceable>' varies according to the
type of the TBF policy. For the NID-based TBF policy, the format is
1168 $ lctl set_param x.x.x.nrs_tbf_rule=
1170 <replaceable>rule_name</replaceable> {
1171 <replaceable>nidlist</replaceable>}
1172 <replaceable>rate</replaceable>"
<para>The format of the '
<replaceable>nidlist</replaceable>' argument is the same as the
format used when configuring an LNET route. The '
<replaceable>rate</replaceable>' argument is the RPC rate of the
rule, that is, the upper limit on the number of requests per second.</para>
<para>The following commands are valid. Please note that a newly started
rule takes precedence over older rules, so the order in which rules are
started is also critical.</para>
1183 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1184 "start other_clients {192.168.*.*@tcp} 50"
1187 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1188 "start loginnode {192.168.1.1@tcp} 100"
<para>A general rule can be replaced by two rules (reg and hp) as
1193 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1194 "reg start loginnode {192.168.1.1@tcp} 100"
1197 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1198 "hp start loginnode {192.168.1.1@tcp} 100"
1201 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1202 "start computes {192.168.1.[2-128]@tcp} 500"
<para>The above rules set an upper limit such that the servers will process
at most 5x as many RPCs from compute nodes as from login nodes.</para>
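<para>The rules that are currently configured, and the state of each queue,
can be read back at any time (the exact output format varies between Lustre
versions) by running:</para>
<screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule</screen>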
<para>For the JobID-based TBF policy (please see
<xref xmlns:xlink="http://www.w3.org/1999/xlink"
linkend="dbdoclet.jobstats" /> for more details), the
format is as follows:</para>
1211 $ lctl set_param x.x.x.nrs_tbf_rule=
1213 <replaceable>name</replaceable> {
1214 <replaceable>jobid_list</replaceable>}
1215 <replaceable>rate</replaceable>"
<para>The following commands are valid:</para>
1219 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1220 "start user1 {iozone.500 dd.500} 100"
1223 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1224 "start iozone_user1 {iozone.500} 100"
<para>As with NID-based rules, reg and hp rules can be used separately:</para>
1228 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1229 "hp start iozone_user1 {iozone.500} 100"
1232 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1233 "reg start iozone_user1 {iozone.500} 100"
<para>The format of the rule change command of the TBF policy is as
1238 $ lctl set_param x.x.x.nrs_tbf_rule=
1240 <replaceable>rule_name</replaceable>
1241 <replaceable>rate</replaceable>"
<para>The following commands are valid:</para>
1245 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="change loginnode 200"
1248 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg change loginnode 200"
1251 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp change loginnode 200"
<para>The format of the rule stop command of the TBF policy is as
1256 $ lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop
1257 <replaceable>rule_name</replaceable>"
<para>The following commands are valid:</para>
1261 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
1264 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode"
1267 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"
1273 <section xml:id="dbdoclet.50438272_25884">
1276 <primary>tuning</primary>
1277 <secondary>lockless I/O</secondary>
1278 </indexterm>Lockless I/O Tunables</title>
1279 <para>The lockless I/O tunable feature allows servers to ask clients to do
1280 lockless I/O (liblustre-style where the server does the locking) on
1281 contended files.</para>
1282 <para>The lockless I/O patch introduces these tunables:</para>
1286 <emphasis role="bold">OST-side:</emphasis>
1289 /proc/fs/lustre/ldlm/namespaces/filter-lustre-*
1292 <literal>contended_locks</literal>- If the number of lock conflicts in
1293 the scan of granted and waiting queues at contended_locks is exceeded,
1294 the resource is considered to be contended.</para>
<literal>contention_seconds</literal>- The resource keeps itself in a
contended state for the time set in this parameter.</para>
1299 <literal>max_nolock_bytes</literal>- Server-side locking set only for
1300 requests less than the blocks set in the
1301 <literal>max_nolock_bytes</literal> parameter. If this tunable is set to
1302 zero (0), it disables server-side locking for read/write
1307 <emphasis role="bold">Client-side:</emphasis>
1310 /proc/fs/lustre/llite/lustre-*
1313 <literal>contention_seconds</literal>-
The <literal>llite</literal> inode remembers its contended state for the
1315 time specified in this parameter.</para>
1319 <emphasis role="bold">Client-side statistics:</emphasis>
1322 <literal>/proc/fs/lustre/llite/lustre-*/stats</literal> file has new
1323 rows for lockless I/O statistics.</para>
1325 <literal>lockless_read_bytes</literal> and
1326 <literal>lockless_write_bytes</literal>- To count the total bytes read
1327 or written, the client makes its own decisions based on the request
1328 size. The client does not communicate with the server if the request
1329 size is smaller than the
1330 <literal>min_nolock_size</literal>, without acquiring locks by the
1335 <section xml:id="dbdoclet.50438272_80545">
1338 <primary>tuning</primary>
1339 <secondary>for small files</secondary>
1340 </indexterm>Improving Lustre File System Performance When Working with
1342 <para>An environment where an application writes small file chunks from
many clients to a single file will result in poor I/O performance. To
1344 improve the performance of the Lustre file system with small files:</para>
1347 <para>Have the application aggregate writes some amount before
1348 submitting them to the Lustre file system. By default, the Lustre
1349 software enforces POSIX coherency semantics, so it results in lock
1350 ping-pong between client nodes if they are all writing to the same file
1354 <para>Have the application do 4kB
1355 <literal>O_DIRECT</literal> sized I/O to the file and disable locking on
1356 the output file. This avoids partial-page IO submissions and, by
1357 disabling locking, you avoid contention between clients.</para>
1360 <para>Have the application write contiguous data.</para>
1363 <para>Add more disks or use SSD disks for the OSTs. This dramatically
1364 improves the IOPS rate. Consider creating larger OSTs rather than many
smaller OSTs due to lower overhead (journal, connections, etc.).</para>
1368 <para>Use RAID-1+0 OSTs instead of RAID-5/6. There is RAID parity
1369 overhead for writing small chunks of data to disk.</para>
1373 <section xml:id="dbdoclet.50438272_45406">
1376 <primary>tuning</primary>
1377 <secondary>write performance</secondary>
1378 </indexterm>Understanding Why Write Performance is Better Than Read
1380 <para>Typically, the performance of write operations on a Lustre cluster is
1381 better than read operations. When doing writes, all clients are sending
1382 write RPCs asynchronously. The RPCs are allocated, and written to disk in
1383 the order they arrive. In many cases, this allows the back-end storage to
1384 aggregate writes efficiently.</para>
1385 <para>In the case of read operations, the reads from clients may come in a
1386 different order and need a lot of seeking to get read from the disk. This
1387 noticeably hampers the read throughput.</para>
1388 <para>Currently, there is no readahead on the OSTs themselves, though the
1389 clients do readahead. If there are lots of clients doing reads it would not
1390 be possible to do any readahead in any case because of memory consumption
1391 (consider that even a single RPC (1 MB) readahead for 1000 clients would
1392 consume 1 GB of RAM).</para>
1393 <para>For file systems that use socklnd (TCP, Ethernet) as interconnect,
1394 there is also additional CPU overhead because the client cannot receive
1395 data without copying it from the network buffers. In the write case, the
1396 client CAN send data without the additional data copy. This means that the
1397 client is more likely to become CPU-bound during reads than writes.</para>