LustreTuning.xml

   1 <?xml version='1.0' encoding='utf-8'?>
   2 <chapter xmlns="http://docbook.org/ns/docbook"
   3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
   4 xml:id="lustretuning">
   5   <title xml:id="lustretuning.title">Tuning a Lustre File System</title>
   6   <para>This chapter contains information about tuning a Lustre file system for
   7   better performance.</para>
   8   <note>
   9     <para>Many options in the Lustre software are set by means of kernel module
  10     parameters. These parameters are contained in the
  11     <literal>/etc/modprobe.d/lustre.conf</literal> file.</para>
  12   </note>
  13   <section xml:id="dbdoclet.50438272_55226">
  14     <title>
  15     <indexterm>
  16       <primary>tuning</primary>
  17     </indexterm>
  18     <indexterm>
  19       <primary>tuning</primary>
  20       <secondary>service threads</secondary>
  21     </indexterm>Optimizing the Number of Service Threads</title>
  22     <para>An OSS can have a minimum of two service threads and a maximum of 512
  23     service threads. The number of service threads is a function of how much
  24     RAM and how many CPUs are on each OSS node (1 thread / 128MB * num_cpus).
  25     If the load on the OSS node is high, new service threads will be started in
  26     order to process more requests concurrently, up to 4x the initial number of
  27     threads (subject to the maximum of 512). For a 2GB 2-CPU system, the
  28     default thread count is 32 and the maximum thread count is 128.</para>
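    <para>As a rough illustration of the rule above, the arithmetic can be
    laid out as follows. The 8 GB node is a hypothetical example, and the
    figures simply apply the formula as stated here:</para>
    <screen>
initial threads = (RAM in MB / 128) * num_cpus
maximum threads = min(4 * initial threads, 512)

2 GB, 2 CPUs:   (2048 / 128) * 2 = 32 initial,   4 * 32  = 128 maximum
8 GB, 4 CPUs:   (8192 / 128) * 4 = 256 initial,  4 * 256 = 1024, capped at 512
</screen>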
  29     <para>Increasing the size of the thread pool may help when:</para>
  30     <itemizedlist>
  31       <listitem>
  32         <para>Several OSTs are exported from a single OSS</para>
  33       </listitem>
  34       <listitem>
  35         <para>Back-end storage is running synchronously</para>
  36       </listitem>
  37       <listitem>
  38         <para>I/O completions take excessive time due to slow storage</para>
  39       </listitem>
  40     </itemizedlist>
  41     <para>Decreasing the size of the thread pool may help if:</para>
  42     <itemizedlist>
  43       <listitem>
  44         <para>Clients are overwhelming the storage capacity</para>
  45       </listitem>
  46       <listitem>
  47         <para>There are lots of "slow I/O" or similar messages</para>
  48       </listitem>
  49     </itemizedlist>
  50     <para>Increasing the number of I/O threads allows the kernel and storage to
  51     aggregate many writes together for more efficient disk I/O. The OSS thread
  52     pool is shared--each thread allocates approximately 1.5 MB (maximum RPC
  53     size + 0.5 MB) for internal I/O buffers.</para>
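    <para>To give a sense of scale, the per-thread buffer figure above can be
    multiplied out for a few thread counts. This is only an estimate based on
    the approximate 1.5 MB per thread quoted here, not a measured value:</para>
    <screen>
 32 threads * 1.5 MB =  48 MB
128 threads * 1.5 MB = 192 MB
512 threads * 1.5 MB = 768 MB
</screen>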
  54     <para>It is very important to consider memory consumption when increasing
  55     the thread pool size. Drives are only able to sustain a certain amount of
  56     parallel I/O activity before performance is degraded, due to the high
  57     number of seeks and the OST threads just waiting for I/O. In this
  58     situation, it may be advisable to decrease the load by decreasing the
  59     number of OST threads.</para>
  60     <para>Determining the optimum number of OSS threads is a process of trial
  61     and error, and varies for each particular configuration. Variables include
  62     the number of OSTs on each OSS, number and speed of disks, RAID
  63     configuration, and available RAM. You may want to start with a number of
  64     OST threads equal to the number of actual disk spindles on the node. If you
  65     use RAID, subtract any dead spindles not used for actual data (e.g., 1 of N
  66     of spindles for RAID5, 2 of N spindles for RAID6), and monitor the
  67     performance of clients during usual workloads. If performance is degraded,
  68     increase the thread count and see how that works until performance is
  69     degraded again or you reach satisfactory performance.</para>
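    <para>As a sketch of this starting point, consider a hypothetical OSS
    serving two RAID6 arrays of ten disks each. The disk counts are
    assumptions chosen for illustration; the
    <literal>oss_num_threads</literal> option is the module parameter
    described in this chapter:</para>
    <screen>
data spindles = 2 arrays * (10 disks - 2 parity disks) = 16

# /etc/modprobe.d/lustre.conf on the OSS
options ost oss_num_threads=16
</screen>
    <para>From there, the thread count would be raised step by step while
    client performance is monitored under the usual workload.</para>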
  70     <note>
  71       <para>Too many threads can make the latency of individual I/O
  72       requests very high, which should be avoided. Set the desired maximum
  73       thread count permanently using the method described below.</para>
  74     </note>
  75     <section>
  76       <title>
  77       <indexterm>
  78         <primary>tuning</primary>
  79         <secondary>OSS threads</secondary>
  80       </indexterm>Specifying the OSS Service Thread Count</title>
  81       <para>The
  82       <literal>oss_num_threads</literal> parameter enables the number of OST
  83       service threads to be specified at module load time on the OSS
  84       nodes:</para>
  85       <screen>
  86 options ost oss_num_threads={N}
  87 </screen>
  88       <para>After startup, the minimum and maximum number of OSS service
  89       threads can be set via the
  90       <literal>{service}.threads_{min,max,started}</literal> tunable. To change
  91       the tunable at runtime, run:</para>
  92       <para>
  93         <screen>
  94 lctl {get,set}_param {service}.threads_{min,max,started}
  95 </screen>
  96       </para>
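      <para>For instance, substituting the
      <literal>ost.OSS.ost_io</literal> service for
      <literal>{service}</literal>, the commands might look like the
      following. The value 256 is purely illustrative, and the exact tunable
      names available on a given node can be listed with
      <literal>lctl list_param ost.OSS.ost_io.*</literal>:</para>
      <screen>
# Read the current maximum and the number of threads already started
lctl get_param ost.OSS.ost_io.threads_max
lctl get_param ost.OSS.ost_io.threads_started

# Raise the maximum thread count for the ost_io service at runtime
lctl set_param ost.OSS.ost_io.threads_max=256
</screen>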
  97       <para>
  98       OSS service threads can also be bound to CPU partitions (CPTs), in a
  99       similar fashion to the binding of threads on the MDS. MDS thread binding is covered in
 100       <xref linkend="dbdoclet.mdsbinding" />.</para>
 101       <itemizedlist>
 102         <listitem>
 103           <para>
 104           <literal>oss_cpts=[EXPRESSION]</literal> binds the default OSS service
 105           on CPTs defined by
 106           <literal>[EXPRESSION]</literal>.</para>
 107         </listitem>
 108         <listitem>
 109           <para>
 110           <literal>oss_io_cpts=[EXPRESSION]</literal> binds the IO OSS service
 111           on CPTs defined by
 112           <literal>[EXPRESSION]</literal>.</para>
 113         </listitem>
 114       </itemizedlist>
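      <para>A minimal sketch of such a binding in
      <literal>/etc/modprobe.d/lustre.conf</literal> is shown below. The CPT
      numbers are illustrative and assume a node with at least four CPTs; the
      bracket syntax mirrors the MDS binding examples in this chapter:</para>
      <screen>
options ost oss_cpts=[0-1] oss_io_cpts=[2-3]
</screen>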
 115       <para>For further details, see
 116       <xref linkend="dbdoclet.50438271_87260" />.</para>
 117     </section>
 118     <section xml:id="dbdoclet.mdstuning">
 119       <title>
 120       <indexterm>
 121         <primary>tuning</primary>
 122         <secondary>MDS threads</secondary>
 123       </indexterm>Specifying the MDS Service Thread Count</title>
 124       <para>The
 125       <literal>mds_num_threads</literal> parameter enables the number of MDS
 126       service threads to be specified at module load time on the MDS
 127       node:</para>
 128       <screen>options mds mds_num_threads={N}</screen>
 129       <para>After startup, the minimum and maximum number of MDS service
 130       threads can be set via the
 131       <literal>{service}.threads_{min,max,started}</literal> tunable. To change
 132       the tunable at runtime, run:</para>
 133       <para>
 134         <screen>
 135 lctl {get,set}_param {service}.threads_{min,max,started}
 136 </screen>
 137       </para>
 138       <para>For details, see
 139       <xref linkend="dbdoclet.50438271_87260" />.</para>
 140       <para>The number of MDS service threads started depends on system size
 141       and the load on the server, and has a default maximum of 64. The
 142       maximum potential number of threads (<literal>MDS_MAX_THREADS</literal>)
 143       is 1024.</para>
 144       <note>
 145         <para>The OSS and MDS start two threads per service per CPT at mount
 146         time, and dynamically increase the number of running service threads in
 147         response to server load. Setting the <literal>*_num_threads</literal>
 148         module parameter starts the specified number of threads for that
 149         service immediately and disables automatic thread creation behavior.
 150         </para>
 151       </note>
 152       <para condition='l23'>Lustre software release 2.3 introduced new
 153       parameters to provide more control to administrators.</para>
 154       <itemizedlist>
 155         <listitem>
 156           <para>
 157           <literal>mds_rdpg_num_threads</literal> controls the number of threads
 158           in providing the read page service. The read page service handles
 159           file close and readdir operations.</para>
 160         </listitem>
 161         <listitem>
 162           <para>
 163           <literal>mds_attr_num_threads</literal> controls the number of threads
 164           in providing the setattr service to clients running Lustre software
 165           release 1.8.</para>
 166         </listitem>
 167       </itemizedlist>
 168     </section>
 169   </section>
 170   <section xml:id="dbdoclet.mdsbinding" condition='l23'>
 171     <title>
 172     <indexterm>
 173       <primary>tuning</primary>
 174       <secondary>MDS binding</secondary>
 175     </indexterm>Binding MDS Service Thread to CPU Partitions</title>
 176     <para>With the introduction of Node Affinity (
 177     <xref linkend="nodeaffdef" />) in Lustre software release 2.3, MDS threads
 178     can be bound to particular CPU partitions (CPTs) to improve CPU cache
 179     usage and memory locality.  Default values for CPT counts and CPU core
 180     bindings are selected automatically to provide good overall performance for
 181     a given CPU count. However, an administrator can deviate from these settings
 182     if they choose.  For details on specifying the mapping of CPU cores to
 183     CPTs see <xref linkend="dbdoclet.libcfstuning"/>.
 184     </para>
 185     <itemizedlist>
 186       <listitem>
 187         <para>
 188         <literal>mds_num_cpts=[EXPRESSION]</literal> binds the default MDS
 189         service threads to CPTs defined by
 190         <literal>EXPRESSION</literal>. For example
 191         <literal>mds_num_cpts=[0-3]</literal> will bind the MDS service threads
 192         to
 193         <literal>CPT[0,1,2,3]</literal>.</para>
 194       </listitem>
 195       <listitem>
 196         <para>
 197         <literal>mds_rdpg_num_cpts=[EXPRESSION]</literal> binds the read page
 198         service threads to CPTs defined by
 199         <literal>EXPRESSION</literal>. The read page service handles file close
 200         and readdir requests. For example
 201         <literal>mds_rdpg_num_cpts=[4]</literal> will bind the read page threads
 202         to
 203         <literal>CPT4</literal>.</para>
 204       </listitem>
 205       <listitem>
 206         <para>
 207         <literal>mds_attr_num_cpts=[EXPRESSION]</literal> binds the setattr
 208         service threads to CPTs defined by
 209         <literal>EXPRESSION</literal>.</para>
 210       </listitem>
 211     </itemizedlist>
 212         <para>Parameters must be set before module load in the file
 213     <literal>/etc/modprobe.d/lustre.conf</literal>. For example:
 214     <example><title>lustre.conf</title>
 215     <screen>options lnet networks=tcp0(eth0)
 216 options mdt mds_num_cpts=[0]</screen>
 217     </example>
 218     </para>
 219   </section>
 220   <section xml:id="dbdoclet.50438272_73839">
 221     <title>
 222     <indexterm>
 223       <primary>LNet</primary>
 224       <secondary>tuning</secondary>
 225     </indexterm>
 226     <indexterm>
 227       <primary>tuning</primary>
 228       <secondary>LNet</secondary>
 229     </indexterm>Tuning LNet Parameters</title>
 230     <para>This section describes LNet tunables, the use of which may be
 231     necessary on some systems to improve performance. To test the performance
 232     of your Lustre network, see
 233     <xref linkend='lnetselftest' />.</para>
 234     <section remap="h3">
 235       <title>Transmit and Receive Buffer Size</title>
 236       <para>The kernel allocates buffers for sending and receiving messages on
 237       a network.</para>
 238       <para>
 239       <literal>ksocklnd</literal> has separate parameters for the transmit and
 240       receive buffers.</para>
 241       <screen>
 242 options ksocklnd tx_buffer_size=0 rx_buffer_size=0
 243 </screen>
 244       <para>If these parameters are left at the default value (0), the system
 245       automatically tunes the transmit and receive buffer size. In almost every
 246       case, this default produces the best performance. Do not attempt to tune
 247       these parameters unless you are a network expert.</para>
 248     </section>
 249     <section remap="h3">
 250       <title>Hardware Interrupts (
 251       <literal>enable_irq_affinity</literal>)</title>
 252       <para>The hardware interrupts that are generated by network adapters may
 253       be handled by any CPU in the system. In some cases, we would like network
 254       traffic to remain local to a single CPU to help keep the processor cache
 255       warm and minimize the impact of context switches. This is helpful when an
 256       SMP system has more than one network interface and ideal when the number
 257       of interfaces equals the number of CPUs. To enable the
 258       <literal>enable_irq_affinity</literal> parameter, enter:</para>
 259       <screen>
 260 options ksocklnd enable_irq_affinity=1
 261 </screen>
 262       <para>In other cases, if you have an SMP platform with a single fast
 263       interface such as 10 Gb Ethernet and more than two CPUs, you may see
 264       performance improve by turning this parameter off.</para>
 265       <screen>
 266 options ksocklnd enable_irq_affinity=0
 267 </screen>
 268       <para>By default, this parameter is off. As always, you should test the
 269       performance to compare the impact of changing this parameter.</para>
 270     </section>
 271     <section condition='l23'>
 272       <title>
 273       <indexterm>
 274         <primary>tuning</primary>
 275         <secondary>Network interface binding</secondary>
 276       </indexterm>Binding Network Interface Against CPU Partitions</title>
 277       <para>Lustre software release 2.3 and beyond provide enhanced network
 278       interface control. The enhancement means that an administrator can bind
 279       an interface to one or more CPU partitions. Bindings are specified as
 280       options to the LNet modules. For more information on specifying module
 281       options, see
 282       <xref linkend="dbdoclet.50438293_15350" />.</para>
 283       <para>For example,
 284       <literal>o2ib0(ib0)[0,1]</literal> will ensure that all messages for
 285       <literal>o2ib0</literal> will be handled by LND threads executing on
 286       <literal>CPT0</literal> and
 287       <literal>CPT1</literal>. An additional example might be:
 288       <literal>tcp1(eth0)[0]</literal>. Messages for
 289       <literal>tcp1</literal> are handled by threads on
 290       <literal>CPT0</literal>.</para>
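      <para>In
      <literal>/etc/modprobe.d/lustre.conf</literal>, the first example above
      might be written as follows. This is a sketch that assumes the
      interface <literal>ib0</literal> exists and that the node has at least
      two CPTs:</para>
      <screen>
options lnet networks="o2ib0(ib0)[0,1]"
</screen>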
 291     </section>
 292     <section>
 293       <title>
 294       <indexterm>
 295         <primary>tuning</primary>
 296         <secondary>Network interface credits</secondary>
 297       </indexterm>Network Interface Credits</title>
 298       <para>Network interface (NI) credits are shared across all CPU partitions
 299       (CPT). For example, if a machine has four CPTs and the number of NI
 300       credits is 512, then each partition has 128 credits. If a large number of
 301       CPTs exist on the system, LNet checks and validates the NI credits for
 302       each CPT to ensure each CPT has a workable number of credits. For
 303       example, if a machine has 16 CPTs and the number of NI credits is 256,
 304       then each partition only has 16 credits. 16 NI credits is low and could
 305       negatively impact performance. As a result, LNet automatically adjusts
 306       the credits to 8 *
 307       <literal>peer_credits</literal> (where
 308       <literal>peer_credits</literal> is 8 by default), so each partition has 64
 309       credits.</para>
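      <para>The two cases above can be laid out as a short calculation, using
      the default
      <literal>peer_credits</literal> value of 8 mentioned here. This is
      simply the arithmetic from the text, not output from any tool:</para>
      <screen>
credits per CPT = NI credits / number of CPTs
minimum per CPT = 8 * peer_credits = 8 * 8 = 64

512 credits /  4 CPTs = 128 per CPT  (at least 64, left unchanged)
256 credits / 16 CPTs =  16 per CPT  (below 64, raised to 64)
</screen>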
 310       <para>Increasing the number of
 311       <literal>credits</literal>/
 312       <literal>peer_credits</literal> can improve the performance of high
 313       latency networks (at the cost of consuming more memory) by enabling LNet
 314       to send more inflight messages to a specific network/peer and keep the
 315       pipeline saturated.</para>
 316       <para>An administrator can modify the NI credit count using
 317       <literal>ksocklnd</literal> or
 318       <literal>ko2iblnd</literal>. In the example below, 256 credits are
 319       applied to TCP connections.</para>
 320       <screen>
 321 ksocklnd credits=256
 322 </screen>
 323       <para>Applying 256 credits to IB connections can be achieved with:</para>
 324       <screen>
 325 ko2iblnd credits=256
 326 </screen>
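      <para>In
      <literal>/etc/modprobe.d/lustre.conf</literal>, these settings take the
      usual
      <literal>options</literal> form. The sketch below also raises
      <literal>peer_credits</literal>; both values are illustrative rather
      than recommendations:</para>
      <screen>
options ksocklnd credits=256 peer_credits=16
</screen>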
 327       <note condition="l23">
 328         <para>In Lustre software release 2.3 and beyond, LNet may revalidate
 329         the NI credits, so the administrator's request may not persist.</para>
 330       </note>
 331     </section>
 332     <section>
 333       <title>
 334       <indexterm>
 335         <primary>tuning</primary>
 336         <secondary>router buffers</secondary>
 337       </indexterm>Router Buffers</title>
 338       <para>When a node is set up as an LNet router, three pools of buffers are
 339       allocated: tiny, small and large. These pools are allocated per CPU
 340       partition and are used to buffer messages that arrive at the router to be
 341       forwarded to the next hop. The three different buffer sizes accommodate
 342       different size messages.</para>
 343       <para>If a message arrives that can fit in a tiny buffer, then a tiny
 344       buffer is used. If a message doesn’t fit in a tiny buffer but fits in a
 345       small buffer, then a small buffer is used. Finally, if a message does not
 346       fit in either a tiny buffer or a small buffer, a large buffer is
 347       used.</para>
 348       <para>Router buffers are shared by all CPU partitions. For a machine with
 349       a large number of CPTs, the router buffer number may need to be specified
 350       manually for best performance. A low number of router buffers risks
 351       starving the CPU partitions of resources.</para>
 352       <itemizedlist>
 353         <listitem>
 354           <para>
 355           <literal>tiny_router_buffers</literal>: Zero payload buffers used for
 356           signals and acknowledgements.</para>
 357         </listitem>
 358         <listitem>
 359           <para>
 360           <literal>small_router_buffers</literal>: 4 KB payload buffers for
 361           small messages</para>
 362         </listitem>
 363         <listitem>
 364           <para>
 365           <literal>large_router_buffers</literal>: 1 MB maximum payload
 366           buffers, corresponding to the recommended RPC size of 1 MB.</para>
 367         </listitem>
 368       </itemizedlist>
 369       <para>The default setting for router buffers typically results in
 370       acceptable performance. LNet automatically sets a default value to reduce
 371       the likelihood of resource starvation. The number of router buffers can be
 372       modified as shown in the example below. In this example, the number of
 373       large buffers is set using the
 374       <literal>large_router_buffers</literal> parameter.</para>
 375       <screen>
 376 lnet large_router_buffers=8192
 377 </screen>
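      <para>Written as a complete line in
      <literal>/etc/modprobe.d/lustre.conf</literal> on the router, the
      setting might look like the sketch below. The memory estimate in the
      comment is a rough upper bound derived from the 1 MB payload size given
      above, so available RAM should be checked before using a value this
      large:</para>
      <screen>
# 8192 large buffers * 1 MB payload each = up to about 8 GB of buffer memory
options lnet large_router_buffers=8192
</screen>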
 378       <note condition="l23">
 379         <para>In Lustre software release 2.3 and beyond, LNet may revalidate
 380         the router buffer setting, so the administrator's request may not
 381         persist.</para>
 382       </note>
 383     </section>
 384     <section>
 385       <title>
 386       <indexterm>
 387         <primary>tuning</primary>
 388         <secondary>portal round-robin</secondary>
 389       </indexterm>Portal Round-Robin</title>
 390       <para>Portal round-robin defines the policy LNet applies to deliver
 391       events and messages to the upper layers. The upper layers are the PTLRPC
 392       service or LNet selftest.</para>
 393       <para>If portal round-robin is disabled, LNet will deliver messages to
 394       CPTs based on a hash of the source NID. Hence, all messages from a
 395       specific peer will be handled by the same CPT. This can reduce data
 396       traffic between CPUs. However, for some workloads, this behavior may
 397       result in poorly balanced load across the CPUs.</para>
 398       <para>If portal round-robin is enabled, LNet will round-robin incoming
 399       events across all CPTs. This may balance load better across the CPUs but
 400       can incur cross-CPU overhead.</para>
 401       <para>The current policy can be changed by an administrator with
 402       <literal>echo
 403       <replaceable>value</replaceable>&gt;
 404       /proc/sys/lnet/portal_rotor</literal>. There are four options for
 405       <literal>
 406         <replaceable>value</replaceable>
 407       </literal>:</para>
 408       <itemizedlist>
 409         <listitem>
 410           <para>
 411             <literal>OFF</literal>
 412           </para>
 413           <para>Disable portal round-robin on all incoming requests.</para>
 414         </listitem>
 415         <listitem>
 416           <para>
 417             <literal>ON</literal>
 418           </para>
 419           <para>Enable portal round-robin on all incoming requests.</para>
 420         </listitem>
 421         <listitem>
 422           <para>
 423             <literal>RR_RT</literal>
 424           </para>
 425           <para>Enable portal round-robin only for routed messages.</para>
 426         </listitem>
 427         <listitem>
 428           <para>
 429             <literal>HASH_RT</literal>
 430           </para>
 431           <para>Routed messages will be delivered to the upper layer by hash of
 432           source NID (instead of the NID of the router). This is the default
 433           value.</para>
 434         </listitem>
 435       </itemizedlist>
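      <para>For example, the policy could be switched and then read back as
      sketched below. The commands use only the path given above and would
      need to be run as root on the node in question:</para>
      <screen>
# Enable round-robin delivery for all incoming requests
echo ON &gt; /proc/sys/lnet/portal_rotor

# Show the policy currently in effect
cat /proc/sys/lnet/portal_rotor
</screen>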
 436     </section>
 437     <section>
 438       <title>LNet Peer Health</title>
 439       <para>Two options are available to help determine peer health:
 440       <itemizedlist>
 441         <listitem>
 442           <para>
 443           <literal>peer_timeout</literal> - The timeout (in seconds) before an
 444           aliveness query is sent to a peer. For example, if
 445           <literal>peer_timeout</literal> is set to
 446           <literal>180sec</literal>, an aliveness query is sent to the peer
 447           every 180 seconds. This feature only takes effect if the node is
 448           configured as an LNet router.</para>
 449           <para>In a routed environment, the
 450           <literal>peer_timeout</literal> feature should always be on (set to a
 451           value in seconds) on routers. If the router checker has been enabled,
 452           the feature should be turned off by setting it to 0 on clients and
 453           servers.</para>
 454           <para>For a non-routed scenario, enabling the
 455           <literal>peer_timeout</literal> option provides health information
 456           such as whether a peer is alive or not. For example, a client is able
 457           to determine if an MGS or OST is up when it sends it a message. If a
 458           response is received, the peer is alive; otherwise a timeout occurs
 459           when the request is made.</para>
 460           <para>In general,
 461           <literal>peer_timeout</literal> should be set to no less than the LND
 462           timeout setting. For more information about LND timeouts, see
 463           <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 464           linkend="section_c24_nt5_dl" />.</para>
 465           <para>When the
 466           <literal>o2iblnd</literal> (IB) driver is used,
 467           <literal>peer_timeout</literal> should be at least twice the value of
 468           the
 469           <literal>ko2iblnd</literal> keepalive option. For more information
 470           about keepalive options, see
 471           <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 472           linkend="section_ngq_qhy_zl" />.</para>
 473         </listitem>
 474         <listitem>
 475           <para>
 476           <literal>avoid_asym_router_failure</literal> - When set to 1, the
 477           router checker running on the client or a server periodically pings
 478           all the routers corresponding to the NIDs identified in the routes
 479           parameter setting on the node to determine the status of each router
 480           interface. The default setting is 1. (For more information about the
 481           LNet routes parameter, see
 482           <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 483           linkend="dbdoclet.50438216_71227" />.)</para>
 484           <para>A router is considered down if any of its NIDs are down. For
 485           example, router X has three NIDs:
 486           <literal>Xnid1</literal>,
 487           <literal>Xnid2</literal>, and
 488           <literal>Xnid3</literal>. A client is connected to the router via
 489           <literal>Xnid1</literal>. The client has router checker enabled. The
 490           router checker periodically sends a ping to the router via
 491           <literal>Xnid1</literal>. The router responds to the ping with the
 492           status of each of its NIDs. In this case, it responds with
 493           <literal>Xnid1=up</literal>,
 494           <literal>Xnid2=up</literal>,
 495           <literal>Xnid3=down</literal>. If
 496           <literal>avoid_asym_router_failure==1</literal>, the router is
 497           considered down if any of its NIDs are down, so router X is
 498           considered down and will not be used for routing messages. If
 499           <literal>avoid_asym_router_failure==0</literal>, router X will
 500           continue to be used for routing messages.</para>
 501         </listitem>
 502       </itemizedlist></para>
 503       <para>The following router checker parameters must be set to the maximum
 504       value of the corresponding setting for this option on any client or
 505       server:
 506       <itemizedlist>
 507         <listitem>
 508           <para>
 509             <literal>dead_router_check_interval</literal>
 510           </para>
 511         </listitem>
 512         <listitem>
 513           <para>
 514             <literal>live_router_check_interval</literal>
 515           </para>
 516         </listitem>
 517         <listitem>
 518           <para>
 519             <literal>router_ping_timeout</literal>
 520           </para>
 521         </listitem>
 522       </itemizedlist></para>
 523       <para>For example, the
 524       <literal>dead_router_check_interval</literal> parameter on any router must
 525       be set to the maximum value used for that parameter on any client or server.</para>
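      <para>As a sketch with hypothetical values: if the largest settings
      found on any client or server were 60, 60 and 50 seconds respectively,
      the router's
      <literal>/etc/modprobe.d/lustre.conf</literal> would carry the same
      values, assuming these parameters are set as
      <literal>lnet</literal> module options:</para>
      <screen>
options lnet dead_router_check_interval=60 live_router_check_interval=60 router_ping_timeout=50
</screen>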
 526     </section>
 527   </section>
 528   <section xml:id="dbdoclet.libcfstuning" condition='l23'>
 529     <title>
 530     <indexterm>
 531       <primary>tuning</primary>
 532       <secondary>libcfs</secondary>
 533     </indexterm>libcfs Tuning</title>
 534     <para>Lustre software release 2.3 introduced binding service threads via
 535     CPU Partition Tables (CPTs). This allows the system administrator to
 536     fine-tune on which CPU cores the Lustre service threads are run, for both
 537     OSS and MDS services, as well as on the client.
 538     </para>
 539     <para>CPTs are useful to reserve some cores on the OSS or MDS nodes for
 540     system functions such as system monitoring, HA heartbeat, or similar
 541     tasks.  On the client it may be useful to restrict Lustre RPC service
 542     threads to a small subset of cores so that they do not interfere with
 543     computation, or because these cores are directly attached to the network
 544     interfaces.
 545     </para>
 546     <para>By default, the Lustre software will automatically generate CPU
 547     partitions (CPT) based on the number of CPUs in the system.
 548     The CPT count can be explicitly set on the libcfs module using
 549     <literal>cpu_npartitions=<replaceable>NUMBER</replaceable></literal>.
 550     The value of <literal>cpu_npartitions</literal> must be an integer between
 551     1 and the number of online CPUs.
 552     </para>
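    <para>For example, a request for four CPTs might be placed in
    <literal>/etc/modprobe.d/lustre.conf</literal> as sketched below. The
    value 4 is illustrative and must not exceed the number of online
    CPUs:</para>
    <screen>
options libcfs cpu_npartitions=4
</screen>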
 553     <para condition='l29'>In Lustre 2.9 and later the default is to use
 554     one CPT per NUMA node.  In earlier versions of Lustre, by default there
 555     was a single CPT if the online CPU core count was four or fewer, and
 556     additional CPTs would be created depending on the number of CPU cores,
 557     typically with 4-8 cores per CPT.
 558     </para>
 559     <tip>
 560       <para>Setting <literal>cpu_npartitions=1</literal> will disable most
 561       of the SMP Node Affinity functionality.</para>
 562     </tip>
 563     <section>
 564       <title>CPU Partition String Patterns</title>
 565       <para>CPU partitions can be described using string pattern notation.
 566       If <literal>cpu_pattern=N</literal> is used, then there will be one
 567       CPT for each NUMA node in the system, with each CPT mapping all of
 568       the CPU cores for that NUMA node.
 569       </para>
 570       <para>It is also possible to explicitly specify the mapping between
 571       CPU cores and CPTs, for example:</para>
 572       <itemizedlist>
 573         <listitem>
 574           <para>
 575             <literal>cpu_pattern="0[2,4,6] 1[3,5,7]"</literal>
 576           </para>
 577           <para>Create two CPTs, CPT0 contains cores 2, 4, and 6, while CPT1
 578           contains cores 3, 5, 7.  CPU cores 0 and 1 will not be used by Lustre
 579           service threads, and could be used for node services such as
 580           system monitoring, HA heartbeat threads, etc.  The binding of
 581           non-Lustre services to those CPU cores may be done in userspace
 582           using <literal>numactl(8)</literal> or other application-specific
 583           methods, but is beyond the scope of this document.</para>
 584         </listitem>
 585         <listitem>
 586           <para>
 587             <literal>cpu_pattern="N 0[0-3] 1[4-7]"</literal>
 588           </para>
 589           <para>Create two CPTs, with CPT0 containing all CPUs in NUMA
 590           nodes [0-3], while CPT1 contains all CPUs in NUMA nodes [4-7].</para>
 591         </listitem>
 592       </itemizedlist>
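      <para>In
      <literal>/etc/modprobe.d/lustre.conf</literal>, the first pattern above
      might be written as follows. This sketch assumes the pattern is passed
      as a
      <literal>libcfs</literal> module option, in the same way as
      <literal>cpu_npartitions</literal>, and that the node has at least
      eight cores:</para>
      <screen>
options libcfs cpu_pattern="0[2,4,6] 1[3,5,7]"
</screen>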
 593       <para>The current configuration of the CPU partition can be read via
 594       <literal>lctl get_param cpu_partition_table</literal>.  For example,
 595       a simple 4-core system has a single CPT with all four CPU cores:
 596       <screen>$ lctl get_param cpu_partition_table
 597 cpu_partition_table=0   : 0 1 2 3</screen>
 598       while a larger NUMA system with four 12-core CPUs may have four CPTs:
 599       <screen>$ lctl get_param cpu_partition_table
 600 cpu_partition_table=
 601 0       : 0 1 2 3 4 5 6 7 8 9 10 11
 602 1       : 12 13 14 15 16 17 18 19 20 21 22 23
 603 2       : 24 25 26 27 28 29 30 31 32 33 34 35
 604 3       : 36 37 38 39 40 41 42 43 44 45 46 47
 605 </screen>
 606       </para>
 607     </section>
 608   </section>
 609   <section xml:id="dbdoclet.lndtuning">
 610     <title>
 611     <indexterm>
 612       <primary>tuning</primary>
 613       <secondary>LND tuning</secondary>
 614     </indexterm>LND Tuning</title>
 615     <para>LND tuning allows the number of threads per CPU partition to be
 616     specified. An administrator can set the threads for both
 617     <literal>ko2iblnd</literal> and
 618     <literal>ksocklnd</literal> using the
 619     <literal>nscheds</literal> parameter. This adjusts the number of threads for
 620     each partition, not the overall number of threads on the LND.</para>
 621     <note>
 622       <para>Lustre software release 2.3 has greatly decreased the default
 623       number of threads for
 624       <literal>ko2iblnd</literal> and
 625       <literal>ksocklnd</literal> on high-core count machines. The current
 626       default values are automatically set and are chosen to work well across a
 627       number of typical scenarios.</para>
 628     </note>
 629   </section>
 630   <section xml:id="dbdoclet.nrstuning" condition='l24'>
 631     <title>
 632     <indexterm>
 633       <primary>tuning</primary>
 634       <secondary>Network Request Scheduler (NRS) Tuning</secondary>
 635     </indexterm>Network Request Scheduler (NRS) Tuning</title>
    <para>The Network Request Scheduler (NRS) allows the administrator to
    influence the order in which RPCs are handled at servers, on a per-PTLRPC
    service basis, by providing different policies that can be activated and
    tuned. The aim is to provide better performance, and possibly discrete
    performance characteristics using future policies.</para>
 642     <para>The NRS policy state of a PTLRPC service can be read and set via the
 643     <literal>{service}.nrs_policies</literal> tunable. To read a PTLRPC
 644     service's NRS policy state, run:</para>
 645     <screen>
 646 lctl get_param {service}.nrs_policies
 647 </screen>
 648     <para>For example, to read the NRS policy state of the
 649     <literal>ost_io</literal> service, run:</para>
 650     <screen>
 651 $ lctl get_param ost.OSS.ost_io.nrs_policies
 652 ost.OSS.ost_io.nrs_policies=
 653
 654 regular_requests:
 655   - name: fifo
 656     state: started
 657     fallback: yes
 658     queued: 0
 659     active: 0
 660
 661   - name: crrn
 662     state: stopped
 663     fallback: no
 664     queued: 0
 665     active: 0
 666
 667   - name: orr
 668     state: stopped
 669     fallback: no
 670     queued: 0
 671     active: 0
 672
 673   - name: trr
 674     state: started
 675     fallback: no
 676     queued: 2420
 677     active: 268
 678
 679 high_priority_requests:
 680   - name: fifo
 681     state: started
 682     fallback: yes
 683     queued: 0
 684     active: 0
 685
 686   - name: crrn
 687     state: stopped
 688     fallback: no
 689     queued: 0
 690     active: 0
 691
 692   - name: orr
 693     state: stopped
 694     fallback: no
 695     queued: 0
 696     active: 0
 697
 698   - name: trr
 699     state: stopped
 700     fallback: no
 701     queued: 0
 702     active: 0
 703
 704 </screen>
 705     <para>NRS policy state is shown in either one or two sections, depending on
 706     the PTLRPC service being queried. The first section is named
 707     <literal>regular_requests</literal> and is available for all PTLRPC
 708     services, optionally followed by a second section which is named
 709     <literal>high_priority_requests</literal>. This is because some PTLRPC
 710     services are able to treat some types of RPCs as higher priority ones, such
 711     that they are handled by the server with higher priority compared to other,
 712     regular RPC traffic. For PTLRPC services that do not support high-priority
 713     RPCs, you will only see the
 714     <literal>regular_requests</literal> section.</para>
 715     <para>There is a separate instance of each NRS policy on each PTLRPC
 716     service for handling regular and high-priority RPCs (if the service
 717     supports high-priority RPCs). For each policy instance, the following
 718     fields are shown:</para>
 719     <informaltable frame="all">
 720       <tgroup cols="2">
 721         <colspec colname="c1" colwidth="50*" />
 722         <colspec colname="c2" colwidth="50*" />
 723         <thead>
 724           <row>
 725             <entry>
 726               <para>
 727                 <emphasis role="bold">Field</emphasis>
 728               </para>
 729             </entry>
 730             <entry>
 731               <para>
 732                 <emphasis role="bold">Description</emphasis>
 733               </para>
 734             </entry>
 735           </row>
 736         </thead>
 737         <tbody>
 738           <row>
 739             <entry>
 740               <para>
 741                 <literal>name</literal>
 742               </para>
 743             </entry>
 744             <entry>
 745               <para>The name of the policy.</para>
 746             </entry>
 747           </row>
 748           <row>
 749             <entry>
 750               <para>
 751                 <literal>state</literal>
 752               </para>
 753             </entry>
 754             <entry>
 755               <para>The state of the policy; this can be any of
 756               <literal>invalid, stopping, stopped, starting, started</literal>.
 757               A fully enabled policy is in the
 758               <literal>started</literal> state.</para>
 759             </entry>
 760           </row>
 761           <row>
 762             <entry>
 763               <para>
 764                 <literal>fallback</literal>
 765               </para>
 766             </entry>
 767             <entry>
 768               <para>Whether the policy is acting as a fallback policy or not. A
 769               fallback policy is used to handle RPCs that other enabled
 770               policies fail to handle, or do not support the handling of. The
 771               possible values are
 772               <literal>no, yes</literal>. Currently, only the FIFO policy can
 773               act as a fallback policy.</para>
 774             </entry>
 775           </row>
 776           <row>
 777             <entry>
 778               <para>
 779                 <literal>queued</literal>
 780               </para>
 781             </entry>
 782             <entry>
 783               <para>The number of RPCs that the policy has waiting to be
 784               serviced.</para>
 785             </entry>
 786           </row>
 787           <row>
 788             <entry>
 789               <para>
 790                 <literal>active</literal>
 791               </para>
 792             </entry>
 793             <entry>
 794               <para>The number of RPCs that the policy is currently
 795               handling.</para>
 796             </entry>
 797           </row>
 798         </tbody>
 799       </tgroup>
 800     </informaltable>
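    <para>For monitoring scripts, the policy state can be parsed into a
    dictionary keyed by section. The following Python sketch is an
    illustration only, not part of the Lustre tools; the function names are
    hypothetical, and it assumes the output layout matches the
    <literal>ost_io</literal> example above (abbreviated here).</para>

```python
def parse_nrs_policies(text):
    """Return {section_name: [policy_dict, ...]} from nrs_policies output."""
    sections = {}
    current_section = None
    current_policy = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" in line:
            continue  # skip blank lines and the "param=" header line
        if line.startswith("- name:"):
            current_policy = {"name": line.split(":", 1)[1].strip()}
            sections[current_section].append(current_policy)
        elif line.endswith(":"):
            current_section = line[:-1]  # e.g. "regular_requests"
            sections[current_section] = []
        else:
            key, value = line.split(":", 1)
            value = value.strip()
            current_policy[key.strip()] = int(value) if value.isdigit() else value
    return sections


def started_policies(sections, section):
    """Names of the policies in the given section that are fully enabled."""
    return [p["name"] for p in sections[section] if p["state"] == "started"]


sample = """ost.OSS.ost_io.nrs_policies=

regular_requests:
  - name: fifo
    state: started
    fallback: yes
    queued: 0
    active: 0

  - name: trr
    state: started
    fallback: no
    queued: 2420
    active: 268

high_priority_requests:
  - name: fifo
    state: started
    fallback: yes
    queued: 0
    active: 0
"""

state = parse_nrs_policies(sample)
print(started_policies(state, "regular_requests"))
```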
 801     <para>To enable an NRS policy on a PTLRPC service run:</para>
 802     <screen>
lctl set_param {service}.nrs_policies=<replaceable>policy_name</replaceable>
 805 </screen>
    <para>This will enable the policy
    <replaceable>policy_name</replaceable> for both regular and high-priority
    RPCs (if the PTLRPC service supports high-priority RPCs) on the given
    service. For example, to enable the CRR-N NRS policy for the
    <literal>ldlm_cbd</literal> service, run:</para>
 811     <screen>
 812 $ lctl set_param ldlm.services.ldlm_cbd.nrs_policies=crrn
 813 ldlm.services.ldlm_cbd.nrs_policies=crrn
 814
 815 </screen>
    <para>For PTLRPC services that support high-priority RPCs, you can also
    supply an optional
    <replaceable>reg|hp</replaceable> token, in order to enable an NRS policy
    for handling only regular or only high-priority RPCs on a given PTLRPC
    service, by running:</para>
 821     <screen>
lctl set_param {service}.nrs_policies="<replaceable>policy_name</replaceable> <replaceable>reg|hp</replaceable>"
 825 </screen>
 826     <para>For example, to enable the TRR policy for handling only regular, but
 827     not high-priority RPCs on the
 828     <literal>ost_io</literal> service, run:</para>
 829     <screen>
 830 $ lctl set_param ost.OSS.ost_io.nrs_policies="trr reg"
 831 ost.OSS.ost_io.nrs_policies="trr reg"
 832
 833 </screen>
 834     <note>
 835       <para>When enabling an NRS policy, the policy name must be given in
 836       lower-case characters, otherwise the operation will fail with an error
 837       message.</para>
 838     </note>
 839     <section>
 840       <title>
 841       <indexterm>
 842         <primary>tuning</primary>
 843         <secondary>Network Request Scheduler (NRS) Tuning</secondary>
 844         <tertiary>first in, first out (FIFO) policy</tertiary>
 845       </indexterm>First In, First Out (FIFO) policy</title>
 846       <para>The first in, first out (FIFO) policy handles RPCs in a service in
 847       the same order as they arrive from the LNet layer, so no special
 848       processing takes place to modify the RPC handling stream. FIFO is the
 849       default policy for all types of RPCs on all PTLRPC services, and is
 850       always enabled irrespective of the state of other policies, so that it
 851       can be used as a backup policy, in case a more elaborate policy that has
 852       been enabled fails to handle an RPC, or does not support handling a given
 853       type of RPC.</para>
      <para>The FIFO policy has no tunables that adjust its behavior.</para>
 855     </section>
 856     <section>
 857       <title>
 858       <indexterm>
 859         <primary>tuning</primary>
 860         <secondary>Network Request Scheduler (NRS) Tuning</secondary>
 861         <tertiary>client round-robin over NIDs (CRR-N) policy</tertiary>
 862       </indexterm>Client Round-Robin over NIDs (CRR-N) policy</title>
 863       <para>The client round-robin over NIDs (CRR-N) policy performs batched
 864       round-robin scheduling of all types of RPCs, with each batch consisting
 865       of RPCs originating from the same client node, as identified by its NID.
 866       CRR-N aims to provide for better resource utilization across the cluster,
 867       and to help shorten completion times of jobs in some cases, by
 868       distributing available bandwidth more evenly across all clients.</para>
 869       <para>The CRR-N policy can be enabled on all types of PTLRPC services,
 870       and has the following tunable that can be used to adjust its
 871       behavior:</para>
 872       <itemizedlist>
 873         <listitem>
 874           <para>
 875             <literal>{service}.nrs_crrn_quantum</literal>
 876           </para>
 877           <para>The
 878           <literal>{service}.nrs_crrn_quantum</literal> tunable determines the
 879           maximum allowed size of each batch of RPCs; the unit of measure is in
 880           number of RPCs. To read the maximum allowed batch size of a CRR-N
 881           policy, run:</para>
 882           <screen>
 883 lctl get_param {service}.nrs_crrn_quantum
 884 </screen>
 885           <para>For example, to read the maximum allowed batch size of a CRR-N
 886           policy on the ost_io service, run:</para>
 887           <screen>
 888 $ lctl get_param ost.OSS.ost_io.nrs_crrn_quantum
 889 ost.OSS.ost_io.nrs_crrn_quantum=reg_quantum:16
 890 hp_quantum:8
 891
 892 </screen>
 893           <para>You can see that there is a separate maximum allowed batch size
 894           value for regular (
 895           <literal>reg_quantum</literal>) and high-priority (
 896           <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
 897           high-priority RPCs).</para>
 898           <para>To set the maximum allowed batch size of a CRR-N policy on a
 899           given service, run:</para>
 900           <screen>
lctl set_param {service}.nrs_crrn_quantum=<replaceable>1-65535</replaceable>
 903 </screen>
          <para>This will set the maximum allowed batch size on a given
          service, for both regular and high-priority RPCs (if the PTLRPC
          service supports high-priority RPCs), to the indicated value.</para>
 907           <para>For example, to set the maximum allowed batch size on the
 908           ldlm_canceld service to 16 RPCs, run:</para>
 909           <screen>
 910 $ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
 911 ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
 912
 913 </screen>
 914           <para>For PTLRPC services that support high-priority RPCs, you can
 915           also specify a different maximum allowed batch size for regular and
 916           high-priority RPCs, by running:</para>
 917           <screen>
$ lctl set_param {service}.nrs_crrn_quantum="<replaceable>reg_quantum|hp_quantum</replaceable>:<replaceable>1-65535</replaceable>"
 921 </screen>
 922           <para>For example, to set the maximum allowed batch size on the
 923           ldlm_canceld service, for high-priority RPCs to 32, run:</para>
 924           <screen>
 925 $ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum="hp_quantum:32"
 926 ldlm.services.ldlm_canceld.nrs_crrn_quantum=hp_quantum:32
 927
 928 </screen>
 929           <para>By using the last method, you can also set the maximum regular
 930           and high-priority RPC batch sizes to different values, in a single
 931           command invocation.</para>
 932         </listitem>
 933       </itemizedlist>
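      <para>The effect of the quantum on dispatch order can be illustrated
      with a small simulation. The following Python sketch is a simplified
      model, not the Lustre implementation: it assumes all RPCs are already
      queued when scheduling starts, and the function name
      <literal>crrn_order</literal> and the sample NIDs are
      hypothetical.</para>

```python
from collections import deque


def crrn_order(rpcs, quantum):
    """Return the dispatch order for rpcs, a list of (nid, rpc_id) tuples
    given in arrival order, using batched round-robin over NIDs."""
    queues = {}  # one FIFO queue per client NID, in first-arrival order
    for nid, rpc_id in rpcs:
        queues.setdefault(nid, deque()).append((nid, rpc_id))

    order = []
    while queues:
        for nid in list(queues):
            queue = queues[nid]
            # Serve at most `quantum` RPCs from this client, then move on.
            for _ in range(min(quantum, len(queue))):
                order.append(queue.popleft())
            if not queue:
                del queues[nid]
    return order


# Client A sent four RPCs before client B sent two.
arrivals = [("A@tcp", 1), ("A@tcp", 2), ("A@tcp", 3), ("A@tcp", 4),
            ("B@tcp", 1), ("B@tcp", 2)]
print(crrn_order(arrivals, quantum=2))
```

      <para>With a quantum of 2, client B is served after the first two RPCs
      from client A instead of waiting behind all four, as it would under
      FIFO.</para>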
 934     </section>
 935     <section>
 936       <title>
 937       <indexterm>
 938         <primary>tuning</primary>
 939         <secondary>Network Request Scheduler (NRS) Tuning</secondary>
 940         <tertiary>object-based round-robin (ORR) policy</tertiary>
 941       </indexterm>Object-based Round-Robin (ORR) policy</title>
      <para>The object-based round-robin (ORR) policy performs batched
      round-robin scheduling of bulk read/write (brw) RPCs, with each batch
      consisting of RPCs that pertain to the same backend file system object,
      as identified by its OST FID.</para>
 946       <para>The ORR policy is only available for use on the ost_io service. The
 947       RPC batches it forms can potentially consist of mixed bulk read and bulk
 948       write RPCs. The RPCs in each batch are ordered in an ascending manner,
 949       based on either the file offsets, or the physical disk offsets of each
 950       RPC (only applicable to bulk read RPCs).</para>
 951       <para>The aim of the ORR policy is to provide for increased bulk read
 952       throughput in some cases, by ordering bulk read RPCs (and potentially
 953       bulk write RPCs), and thus minimizing costly disk seek operations.
 954       Performance may also benefit from any resulting improvement in resource
 955       utilization, or by taking advantage of better locality of reference
 956       between RPCs.</para>
      <para>The ORR policy has the following tunables that can be used to
      adjust its behavior:</para>
 959       <itemizedlist>
 960         <listitem>
 961           <para>
 962             <literal>ost.OSS.ost_io.nrs_orr_quantum</literal>
 963           </para>
 964           <para>The
 965           <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable determines
 966           the maximum allowed size of each batch of RPCs; the unit of measure
 967           is in number of RPCs. To read the maximum allowed batch size of the
 968           ORR policy, run:</para>
 969           <screen>
 970 $ lctl get_param ost.OSS.ost_io.nrs_orr_quantum
 971 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:256
 972 hp_quantum:16
 973
 974 </screen>
 975           <para>You can see that there is a separate maximum allowed batch size
 976           value for regular (
 977           <literal>reg_quantum</literal>) and high-priority (
 978           <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
 979           high-priority RPCs).</para>
 980           <para>To set the maximum allowed batch size for the ORR policy,
 981           run:</para>
 982           <screen>
$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=<replaceable>1-65535</replaceable>
 985 </screen>
 986           <para>This will set the maximum allowed batch size for both regular
 987           and high-priority RPCs, to the indicated value.</para>
 988           <para>You can also specify a different maximum allowed batch size for
 989           regular and high-priority RPCs, by running:</para>
 990           <screen>
$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=<replaceable>reg_quantum|hp_quantum</replaceable>:<replaceable>1-65535</replaceable>
 994 </screen>
 995           <para>For example, to set the maximum allowed batch size for regular
 996           RPCs to 128, run:</para>
 997           <screen>
 998 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
 999 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
1000
1001 </screen>
1002           <para>By using the last method, you can also set the maximum regular
1003           and high-priority RPC batch sizes to different values, in a single
1004           command invocation.</para>
1005         </listitem>
1006         <listitem>
1007           <para>
1008             <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal>
1009           </para>
1010           <para>The
1011           <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable
1012           determines whether the ORR policy orders RPCs within each batch based
1013           on logical file offsets or physical disk offsets. To read the offset
1014           type value for the ORR policy, run:</para>
1015           <screen>
1016 $ lctl get_param ost.OSS.ost_io.nrs_orr_offset_type
1017 ost.OSS.ost_io.nrs_orr_offset_type=reg_offset_type:physical
1018 hp_offset_type:logical
1019
1020 </screen>
1021           <para>You can see that there is a separate offset type value for
1022           regular (
1023           <literal>reg_offset_type</literal>) and high-priority (
1024           <literal>hp_offset_type</literal>) RPCs.</para>
1025           <para>To set the ordering type for the ORR policy, run:</para>
1026           <screen>
$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=<replaceable>physical|logical</replaceable>
1029 </screen>
1030           <para>This will set the offset type for both regular and
1031           high-priority RPCs, to the indicated value.</para>
1032           <para>You can also specify a different offset type for regular and
1033           high-priority RPCs, by running:</para>
1034           <screen>
$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=<replaceable>reg_offset_type|hp_offset_type</replaceable>:<replaceable>physical|logical</replaceable>
1038 </screen>
1039           <para>For example, to set the offset type for high-priority RPCs to
1040           physical disk offsets, run:</para>
1041           <screen>
1042 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
1043 ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
1044 </screen>
1045           <para>By using the last method, you can also set offset type for
1046           regular and high-priority RPCs to different values, in a single
1047           command invocation.</para>
1048           <note>
            <para>Irrespective of the value of this tunable, only logical
            offsets can be, and are, used for ordering bulk write RPCs.</para>
1051           </note>
1052         </listitem>
1053         <listitem>
1054           <para>
1055             <literal>ost.OSS.ost_io.nrs_orr_supported</literal>
1056           </para>
1057           <para>The
1058           <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable determines
1059           the type of RPCs that the ORR policy will handle. To read the types
1060           of supported RPCs by the ORR policy, run:</para>
1061           <screen>
1062 $ lctl get_param ost.OSS.ost_io.nrs_orr_supported
1063 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads
hp_supported:reads_and_writes
1065
1066 </screen>
1067           <para>You can see that there is a separate supported 'RPC types'
1068           value for regular (
1069           <literal>reg_supported</literal>) and high-priority (
1070           <literal>hp_supported</literal>) RPCs.</para>
1071           <para>To set the supported RPC types for the ORR policy, run:</para>
1072           <screen>
$ lctl set_param ost.OSS.ost_io.nrs_orr_supported=<replaceable>reads|writes|reads_and_writes</replaceable>
1075 </screen>
1076           <para>This will set the supported RPC types for both regular and
1077           high-priority RPCs, to the indicated value.</para>
1078           <para>You can also specify a different supported 'RPC types' value
1079           for regular and high-priority RPCs, by running:</para>
1080           <screen>
$ lctl set_param ost.OSS.ost_io.nrs_orr_supported=<replaceable>reg_supported|hp_supported</replaceable>:<replaceable>reads|writes|reads_and_writes</replaceable>
1084 </screen>
1085           <para>For example, to set the supported RPC types to bulk read and
1086           bulk write RPCs for regular requests, run:</para>
1087           <screen>
$ lctl set_param ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
1090 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
1091
1092 </screen>
          <para>By using the last method, you can also set the supported RPC
          types for regular and high-priority RPCs to different values, in a
          single command invocation.</para>
1096         </listitem>
1097       </itemizedlist>
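      <para>The combined effect of per-object batching and offset ordering
      can be sketched as follows. This Python model is a simplification for
      illustration, not the Lustre implementation: it assumes all RPCs are
      queued up front, sorts each object's pending RPCs by a single offset
      value, and uses the hypothetical names
      <literal>orr_order</literal>,
      <literal>obj1</literal> and
      <literal>obj2</literal>.</para>

```python
def orr_order(rpcs, quantum):
    """Return the dispatch order for rpcs, a list of (object_id, offset)
    tuples in arrival order: round-robin over objects, at most `quantum`
    RPCs per batch, ascending offsets within each object."""
    per_object = {}
    for obj, offset in rpcs:
        per_object.setdefault(obj, []).append((obj, offset))

    # Order each object's pending RPCs by ascending offset.
    for pending in per_object.values():
        pending.sort(key=lambda rpc: rpc[1])

    order = []
    while per_object:
        for obj in list(per_object):
            pending = per_object[obj]
            order.extend(pending[:quantum])  # one batch for this object
            del pending[:quantum]
            if not pending:
                del per_object[obj]
    return order


arrivals = [("obj1", 3), ("obj2", 0), ("obj1", 1), ("obj1", 2), ("obj2", 1)]
print(orr_order(arrivals, quantum=2))
```

      <para>A larger quantum keeps the scheduler on one object for longer,
      which favors sequential access at the cost of making other objects wait
      longer for their turn.</para>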
1098     </section>
1099     <section>
1100       <title>
1101       <indexterm>
1102         <primary>tuning</primary>
1103         <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1104         <tertiary>Target-based round-robin (TRR) policy</tertiary>
1105       </indexterm>Target-based Round-Robin (TRR) policy</title>
1106       <para>The target-based round-robin (TRR) policy performs batched
1107       round-robin scheduling of brw RPCs, with each batch consisting of RPCs
1108       that pertain to the same OST, as identified by its OST index.</para>
      <para>The TRR policy is identical to the object-based round-robin (ORR)
      policy, apart from using the brw RPC's target OST index instead of the
      backend file system object's OST FID, for determining the RPC
      scheduling order. The goals of TRR are effectively the same as for ORR,
      and it uses the following tunables to adjust its behavior:</para>
1114       <itemizedlist>
1115         <listitem>
1116           <para>
1117             <literal>ost.OSS.ost_io.nrs_trr_quantum</literal>
1118           </para>
1119           <para>The purpose of this tunable is exactly the same as for the
1120           <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable for the ORR
1121           policy, and you can use it in exactly the same way.</para>
1122         </listitem>
1123         <listitem>
1124           <para>
1125             <literal>ost.OSS.ost_io.nrs_trr_offset_type</literal>
1126           </para>
1127           <para>The purpose of this tunable is exactly the same as for the
1128           <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable for the
1129           ORR policy, and you can use it in exactly the same way.</para>
1130         </listitem>
1131         <listitem>
1132           <para>
1133             <literal>ost.OSS.ost_io.nrs_trr_supported</literal>
1134           </para>
          <para>The purpose of this tunable is exactly the same as for the
          <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable for the
          ORR policy, and you can use it in exactly the same way.</para>
1138         </listitem>
1139       </itemizedlist>
1140     </section>
1141     <section xml:id="dbdoclet.tbftuning" condition='l26'>
1142       <title>
1143       <indexterm>
1144         <primary>tuning</primary>
1145         <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1146         <tertiary>Token Bucket Filter (TBF) policy</tertiary>
1147       </indexterm>Token Bucket Filter (TBF) policy</title>
      <para>The Token Bucket Filter (TBF) is a Lustre NRS policy that enables
      Lustre services to enforce RPC rate limits on clients or jobs for
      Quality of Service (QoS) purposes.</para>
1151       <figure>
1152         <title>The internal structure of TBF policy</title>
1153         <mediaobject>
1154           <imageobject>
1155             <imagedata scalefit="1" width="100%"
1156             fileref="figures/TBF_policy.svg" />
1157           </imageobject>
1158           <textobject>
1159             <phrase>The internal structure of TBF policy</phrase>
1160           </textobject>
1161         </mediaobject>
1162       </figure>
      <para>When an RPC request arrives, the TBF policy puts it into a waiting
      queue according to its classification. RPC requests are classified by
      either the NID or the JobID of the RPC, depending on how TBF is
      configured. The TBF policy maintains multiple queues in the system, one
      queue for each category in the classification of RPC requests. Requests
      wait for tokens in their FIFO queue before they are handled, which keeps
      the RPC rates under the limits.</para>
      <para>When Lustre services are too busy to handle all of the requests in
      time, not all of the specified queue rates can be satisfied. This is
      harmless, except that some of the RPC rates are lower than configured.
      In this case, queues with higher rates have an advantage over queues
      with lower rates, but none of them will be starved.</para>
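      <para>The token mechanism itself can be illustrated with a generic
      token bucket. The Python sketch below shows the general technique only
      and is not the Lustre implementation; the class name, the
      <literal>depth</literal> parameter and the sample rate are
      assumptions made for the example.</para>

```python
class TokenBucket:
    """Generic token bucket: tokens accrue at `rate` per second, up to
    `depth`, and each dispatched RPC consumes one token."""

    def __init__(self, rate, depth=1.0):
        self.rate = float(rate)    # tokens added per second
        self.depth = float(depth)  # most tokens that can accumulate (assumed)
        self.tokens = self.depth   # start with a full bucket
        self.last = 0.0            # time of the previous refill

    def try_consume(self, now):
        """Refill for the elapsed time, then take one token if available."""
        elapsed = now - self.last
        self.tokens = min(self.depth, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the request keeps waiting in its queue


def dispatch_times(rate, ticks, tick_len):
    """Times at which a permanently backlogged queue gets to dispatch."""
    bucket = TokenBucket(rate)
    return [i * tick_len for i in range(ticks)
            if bucket.try_consume(i * tick_len)]


# A queue limited to 2 RPCs per second, polled every 0.25 s for 2 s.
print(dispatch_times(rate=2, ticks=8, tick_len=0.25))
```

      <para>Even though a request is always waiting in this example, one is
      dispatched only every half second, which is how the queue stays at or
      below its configured rate.</para>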
      <para>To manage the RPC rates of the queues, it is not necessary to set
      the rate of each queue manually. Instead, rules are defined, which the
      TBF policy matches to determine RPC rate limits. All of the defined
      rules are organized as an ordered list. Whenever a queue is newly
      created, it goes through the rule list and takes the first matching rule
      as its rule, so that the queue knows its RPC token rate. A rule can be
      added to or removed from the list at run time. Whenever the list of
      rules changes, the queues update their matched rules.</para>
1184       <itemizedlist>
1185         <listitem>
1186           <para>
1187             <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>
1188           </para>
          <para>The format of the rule start command of the TBF policy is as
          follows:</para>
          <screen>
$ lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>..."
</screen>
          <para>The '
          <replaceable>rule_name</replaceable>' argument is a string which
          identifies a rule. The format of the '
          <replaceable>arguments</replaceable>' depends on the type of the TBF
          policy. For the NID-based TBF policy, the format is as
          follows:</para>
          <screen>
$ lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] start <replaceable>rule_name</replaceable> {<replaceable>nidlist</replaceable>} <replaceable>rate</replaceable>"
</screen>
          <para>The format of the '
          <replaceable>nidlist</replaceable>' argument is the same as the
          format used when configuring LNet routes. The '
          <replaceable>rate</replaceable>' argument is the RPC rate of the
          rule, that is, the upper limit on the number of requests per
          second.</para>
          <para>The following commands are valid. Note that a newly started
          rule takes precedence over older rules, so the order in which rules
          are started matters.</para>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start other_clients {192.168.*.*@tcp} 50"
</screen>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start loginnode {192.168.1.1@tcp} 100"
</screen>
          <para>A general rule can be replaced by two rules (one for
          <literal>reg</literal> and one for
          <literal>hp</literal>) as follows:</para>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg start loginnode {192.168.1.1@tcp} 100"
</screen>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp start loginnode {192.168.1.1@tcp} 100"
</screen>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start computes {192.168.1.[2-128]@tcp} 500"
</screen>
          <para>The above rules limit the servers to processing at most 5x as
          many RPCs from compute nodes as from login nodes.</para>
          <para>For the JobID-based TBF policy (see
          <xref xmlns:xlink="http://www.w3.org/1999/xlink"
                linkend="dbdoclet.jobstats" /> for more details on JobIDs),
          the format is as follows:</para>
          <screen>
$ lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] start <replaceable>name</replaceable> {<replaceable>jobid_list</replaceable>} <replaceable>rate</replaceable>"
</screen>
          <para>The following commands are valid:</para>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start user1 {iozone.500 dd.500} 100"
</screen>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start iozone_user1 {iozone.500} 100"
</screen>
          <para>As with NID-based rules,
          <literal>reg</literal> and
          <literal>hp</literal> rules can be started separately:</para>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp start iozone_user1 {iozone.500} 100"
</screen>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg start iozone_user1 {iozone.500} 100"
</screen>
          <para>The format of the rule change command of the TBF policy is as
          follows:</para>
          <screen>
$ lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] change <replaceable>rule_name</replaceable> <replaceable>rate</replaceable>"
</screen>
          <para>The following commands are valid:</para>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="change loginnode 200"
</screen>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg change loginnode 200"
</screen>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp change loginnode 200"
</screen>
          <para>The format of the rule stop command of the TBF policy is as
          follows:</para>
          <screen>
$ lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop <replaceable>rule_name</replaceable>"
</screen>
          <para>The following commands are valid:</para>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
</screen>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode"
</screen>
          <screen>
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"
</screen>
1295         </listitem>
1296       </itemizedlist>
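      <para>The interaction between rule order and matching can be modelled
      in a few lines. The following Python sketch is illustrative only and is
      not how Lustre stores its rules: the class
      <literal>RuleList</literal> is hypothetical, it handles JobID rules by
      exact match only, and it returns
      <literal>None</literal> when no rule matches, which is an assumption of
      this model.</para>

```python
class RuleList:
    """Ordered TBF-style rule list in which the newest rule is tried first."""

    def __init__(self):
        self.rules = []  # index 0 is the most recently started rule

    def start(self, name, jobids, rate):
        # A newly started rule takes precedence over older rules.
        self.rules.insert(0, {"name": name, "jobids": set(jobids), "rate": rate})

    def change(self, name, rate):
        for rule in self.rules:
            if rule["name"] == name:
                rule["rate"] = rate

    def stop(self, name):
        self.rules = [rule for rule in self.rules if rule["name"] != name]

    def match(self, jobid):
        """Return the first rule whose JobID list contains jobid, else None."""
        for rule in self.rules:
            if jobid in rule["jobids"]:
                return rule
        return None


rules = RuleList()
rules.start("user1", ["iozone.500", "dd.500"], 100)
rules.start("iozone_user1", ["iozone.500"], 200)  # hypothetical rate

print(rules.match("iozone.500")["name"])  # newest matching rule is used
print(rules.match("dd.500")["name"])
```

      <para>Stopping the newer rule makes
      <literal>iozone.500</literal> fall back to the older
      <literal>user1</literal> rule, which mirrors how queues update their
      matched rules when the list changes.</para>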
1297     </section>
1298   </section>
1299   <section xml:id="dbdoclet.50438272_25884">
1300     <title>
1301     <indexterm>
1302       <primary>tuning</primary>
1303       <secondary>lockless I/O</secondary>
1304     </indexterm>Lockless I/O Tunables</title>
1305     <para>The lockless I/O tunable feature allows servers to ask clients to do
1306     lockless I/O (the server does the locking on behalf of clients) for
1307     contended files to avoid lock ping-pong.</para>
1308     <para>The lockless I/O patch introduces these tunables:</para>
1309     <itemizedlist>
1310       <listitem>
1311         <para>
1312           <emphasis role="bold">OST-side:</emphasis>
1313         </para>
1314         <screen>
1315 ldlm.namespaces.filter-<replaceable>fsname</replaceable>-*.
1316 </screen>
        <para>
        <literal>contended_locks</literal> - If the number of lock conflicts
        found in the scan of the granted and waiting queues exceeds
        <literal>contended_locks</literal>, the resource is considered to be
        contended.</para>
        <para>
        <literal>contention_seconds</literal> - The resource keeps itself in a
        contended state for the time set in this parameter.</para>
        <para>
        <literal>max_nolock_bytes</literal> - Server-side locking is used only
        for requests smaller than the size set in the
        <literal>max_nolock_bytes</literal> parameter. If this tunable is
        set to zero (0), it disables server-side locking for read/write
        requests.</para>
1330       </listitem>
1331       <listitem>
1332         <para>
1333           <emphasis role="bold">Client-side:</emphasis>
1334         </para>
1335         <screen>
1336 /proc/fs/lustre/llite/lustre-*
1337 </screen>
        <para>
        <literal>contention_seconds</literal> - The
        <literal>llite</literal> inode remembers its contended state for the
        number of seconds specified in this parameter.</para>
1342       </listitem>
1343       <listitem>
1344         <para>
1345           <emphasis role="bold">Client-side statistics:</emphasis>
1346         </para>
1347         <para>The
1348         <literal>/proc/fs/lustre/llite/lustre-*/stats</literal> file has new
1349         rows for lockless I/O statistics.</para>
        <para>
        <literal>lockless_read_bytes</literal> and
        <literal>lockless_write_bytes</literal> - These rows count the total
        bytes read or written without the client acquiring locks. The client
        makes its own decision based on the request size: it does not
        communicate with the server if the request size is smaller than
        <literal>min_nolock_size</literal>.</para>
1358       </listitem>
1359     </itemizedlist>
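    <para>The tunables above can be inspected and adjusted with
    <literal>lctl</literal>. The following is a minimal sketch rather than a
    set of recommended settings. It assumes a file system named
    <literal>testfs</literal> (a hypothetical name), that the OST-side
    parameters appear under the namespace prefix shown above, and that the
    client-side files under <literal>/proc/fs/lustre/llite</literal> map to
    <literal>llite.*</literal> parameters. The values are illustrative
    only.</para>
    <screen>
# On the OSS: show the current contention thresholds
oss# lctl get_param ldlm.namespaces.filter-testfs-*.contended_locks
oss# lctl get_param ldlm.namespaces.filter-testfs-*.contention_seconds
# On the OSS: treat a resource as contended after 32 lock conflicts (example value)
oss# lctl set_param ldlm.namespaces.filter-testfs-*.contended_locks=32
# On a client: look for the lockless I/O rows in the statistics
client$ lctl get_param llite.testfs-*.stats | grep lockless
</screen>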
1360   </section>
1361   <section condition="l29">
1362       <title>
1363         <indexterm>
1364           <primary>tuning</primary>
1365           <secondary>with lfs ladvise</secondary>
1366         </indexterm>
1367         Server-Side Advice and Hinting
1368       </title>
1369       <section><title>Overview</title>
      <para>Use the <literal>lfs ladvise</literal> command to give file access
      advice or hints to servers.</para>
1372       <screen>lfs ladvise [--advice|-a ADVICE ] [--background|-b]
1373 [--start|-s START[kMGT]]
1374 {[--end|-e END[kMGT]] | [--length|-l LENGTH[kMGT]]}
1375 <emphasis>file</emphasis> ...
1376       </screen>
1377       <para>
1378         <informaltable frame="all">
1379           <tgroup cols="2">
1380           <colspec colname="c1" colwidth="50*"/>
1381           <colspec colname="c2" colwidth="50*"/>
1382           <thead>
1383             <row>
1384               <entry>
1385                 <para><emphasis role="bold">Option</emphasis></para>
1386               </entry>
1387               <entry>
1388                 <para><emphasis role="bold">Description</emphasis></para>
1389               </entry>
1390             </row>
1391           </thead>
1392           <tbody>
1393             <row>
1394               <entry>
1395                 <para><literal>-a</literal>, <literal>--advice=</literal>
1396                 <literal>ADVICE</literal></para>
1397               </entry>
1398               <entry>
1399                 <para>Give advice or hint of type <literal>ADVICE</literal>.
1400                 Advice types are:</para>
                <para><literal>willread</literal> to prefetch data into the
                server cache</para>
                <para><literal>dontneed</literal> to clean up the data cache on
                the server</para>
1405               </entry>
1406             </row>
1407             <row>
1408               <entry>
1409                 <para><literal>-b</literal>, <literal>--background</literal>
1410                 </para>
1411               </entry>
1412               <entry>
                <para>Enable the advice to be sent and handled asynchronously.
                </para>
1415               </entry>
1416             </row>
1417             <row>
1418               <entry>
1419                 <para><literal>-s</literal>, <literal>--start=</literal>
1420                         <literal>START_OFFSET</literal></para>
1421               </entry>
1422               <entry>
                <para>File range starts from <literal>START_OFFSET</literal>.
                </para>
1425                 </entry>
1426             </row>
1427             <row>
1428                 <entry>
1429                     <para><literal>-e</literal>, <literal>--end=</literal>
1430                         <literal>END_OFFSET</literal></para>
1431                 </entry>
1432                 <entry>
1433                     <para>File range ends at (not including)
1434                     <literal>END_OFFSET</literal>.  This option may not be
1435                     specified at the same time as the <literal>-l</literal>
1436                     option.</para>
1437                 </entry>
1438             </row>
1439             <row>
1440                 <entry>
1441                     <para><literal>-l</literal>, <literal>--length=</literal>
1442                         <literal>LENGTH</literal></para>
1443                 </entry>
1444                 <entry>
1445                   <para>File range has length of <literal>LENGTH</literal>.
1446                   This option may not be specified at the same time as the
1447                   <literal>-e</literal> option.</para>
1448                 </entry>
1449             </row>
1450           </tbody>
1451           </tgroup>
1452         </informaltable>
1453       </para>
      <para>Typically, <literal>lfs ladvise</literal> forwards the advice to
      Lustre servers without guaranteeing when and how the servers will react
      to it. Actions may or may not be triggered when the advice is
      received, depending on the type of advice and on the real-time
      decisions of the affected server-side components.</para>
      <para>A typical use of <literal>ladvise</literal> is to let applications
      and users with external knowledge intervene in server-side cache
      management. For example, if many clients are doing small random reads of
      a file, prefetching pages into the OSS cache with large linear reads
      before the random I/O is a net benefit. Fetching that data into each
      client cache with <literal>fadvise()</literal> may not be, because much
      more data would be sent to the clients.</para>
      <para>The main difference between the Linux <literal>fadvise()</literal>
      system call and <literal>lfs ladvise</literal> is that
      <literal>fadvise()</literal> is only a client-side mechanism that does
      not pass the advice to the file system, while
      <literal>ladvise</literal> can send advice or hints to the Lustre
      servers.</para>
1471       </section>
1472       <section><title>Examples</title>
        <para>The following example gives the OST(s) holding the first 1GB of
        <literal>/mnt/lustre/file1</literal> a hint that the first 1GB of the
        file will be read soon.</para>
1476         <screen>client1$ lfs ladvise -a willread -s 0 -e 1048576000 /mnt/lustre/file1
1477         </screen>
        <para>The following example gives the OST(s) holding the first 1GB of
        <literal>/mnt/lustre/file1</literal> a hint that the first 1GB of the
        file will not be read in the near future, so the OST(s) can drop that
        part of the file from their memory cache.</para>
1482         <screen>client1$ lfs ladvise -a dontneed -s 0 -e 1048576000 /mnt/lustre/file1
1483         </screen>
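        <para>As a further sketch, the range can also be given with the unit
        suffixes and the <literal>-l</literal> and <literal>-b</literal>
        options described in the overview. The following command asks the
        OST(s) holding the 512MB range that starts at offset 1GB of
        <literal>/mnt/lustre/file1</literal> to prefetch it, and sends the
        advice asynchronously. The offsets are illustrative only, and the
        servers may or may not act on the advice.</para>
        <screen>client1$ lfs ladvise -a willread -b -s 1G -l 512M /mnt/lustre/file1
        </screen>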
1484       </section>
1485   </section>
1486   <section condition="l29">
1487       <title>
1488           <indexterm>
1489               <primary>tuning</primary>
1490               <secondary>Large Bulk IO</secondary>
1491           </indexterm>
1492           Large Bulk IO (16MB RPC)
1493       </title>
1494       <section><title>Overview</title>
          <para>Beginning with Lustre 2.9, the Lustre software supports RPCs
          up to 16MB in size. With a larger RPC size, fewer RPCs are required
          to transfer the same amount of data between clients and servers.
          The OST can also submit more data to the underlying disks at once,
          producing larger disk I/Os that make fuller use of the increasing
          bandwidth of disks.</para>
          <para>At connection time, each client negotiates with the servers
          the RPC size it is going to use.</para>
          <para>A new parameter, <literal>brw_size</literal>, is introduced on
          the OST to tell the client the preferred I/O size. Clients that
          talk to this target should never send an RPC larger than this size.
          </para>
1507       </section>
1508       <section><title>Usage</title>
1509           <para>In order to enable a larger RPC size,
1510           <literal>brw_size</literal> must be changed to an IO size value up to
1511           16MB.  To temporarily change <literal>brw_size</literal>, the
1512           following command should be run on the OSS:</para>
1513           <screen>oss# lctl set_param obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
1514           <para>To persistently change <literal>brw_size</literal>, one of the following
1515           commands should be run on the OSS:</para>
1516           <screen>oss# lctl set_param -P obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
1517           <screen>oss# lctl conf_param <replaceable>fsname</replaceable>-OST*.obdfilter.brw_size=16</screen>
          <para>When a client connects to an OST target, it fetches
          <literal>brw_size</literal> from the target and uses the smaller
          of <literal>brw_size</literal> and its local setting for
          <literal>max_pages_per_rpc</literal> as the actual RPC size.
          Therefore, <literal>max_pages_per_rpc</literal> on the client side
          must be set to 16M, or 4096 if the PAGESIZE is 4KB, to enable
          a 16MB RPC. To make the change temporarily, run the following
          command on the client to set
          <literal>max_pages_per_rpc</literal>:</para>
1527           <screen>client$ lctl set_param osc.<replaceable>fsname</replaceable>-OST*.max_pages_per_rpc=16M</screen>
1528           <para>To persistently make this change, the following command should
1529           be run:</para>
1530           <screen>client$ lctl conf_param <replaceable>fsname</replaceable>-OST*.osc.max_pages_per_rpc=16M</screen>
1531           <caution><para>The <literal>brw_size</literal> of an OST can be
1532           changed on the fly.  However, clients have to be remounted to
1533           renegotiate the new RPC size.</para></caution>
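          <para>The page count that corresponds to a 16MB RPC depends on the
          client page size. A minimal sketch of the arithmetic, followed by a
          check of the resulting settings, is shown below. It assumes a file
          system named <literal>testfs</literal> (a hypothetical name) and a
          shell with <literal>getconf</literal> available.</para>
          <screen># Pages per 16MB RPC = 16MB / page size (4096 when PAGESIZE is 4KB)
client$ echo $(( 16 * 1024 * 1024 / $(getconf PAGESIZE) ))
# Confirm the values in effect on the client and on the OSS
client$ lctl get_param osc.testfs-OST*.max_pages_per_rpc
oss# lctl get_param obdfilter.testfs-OST*.brw_size</screen>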
1534       </section>
1535   </section>
1536   <section xml:id="dbdoclet.50438272_80545">
1537     <title>
1538     <indexterm>
1539       <primary>tuning</primary>
1540       <secondary>for small files</secondary>
1541     </indexterm>Improving Lustre I/O Performance for Small Files</title>
1542     <para>An environment where an application writes small file chunks from
1543     many clients to a single file can result in poor I/O performance. To
1544     improve the performance of the Lustre file system with small files:</para>
1545     <itemizedlist>
1546       <listitem>
        <para>Have the application aggregate writes to some extent before
        submitting them to the Lustre file system. By default, the Lustre
        software enforces POSIX coherency semantics, which results in lock
        ping-pong between client nodes if they are all writing to the same
        file at the same time.</para>
        <para>Using the MPI-IO Collective Write functionality in
        the Lustre ADIO driver is one way to achieve this in a
        straightforward manner if the application is already using
        MPI-IO.</para>
        <para>Have the application do 4kB-sized
        <literal>O_DIRECT</literal> I/O to the file and disable locking
        on the output file. This avoids partial-page I/O submissions, and
        disabling locking avoids contention between clients.</para>
1561       </listitem>
1562       <listitem>
1563         <para>Have the application write contiguous data.</para>
        <para>Add more disks or use SSDs for the OSTs. This dramatically
        improves the IOPS rate. Consider creating larger OSTs rather than many
        smaller OSTs, because larger OSTs incur less overhead (journal,
        connections, etc.).</para>
1569       </listitem>
1570       <listitem>
1571         <para>Use RAID-1+0 OSTs instead of RAID-5/6. There is RAID parity
1572         overhead for writing small chunks of data to disk.</para>
1573       </listitem>
1574     </itemizedlist>
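    <para>The effect of write size can be explored with a simple
    <literal>dd</literal> comparison. This is only a rough sketch: it assumes
    GNU <literal>dd</literal>, a client mount at
    <literal>/mnt/lustre</literal>, and hypothetical test file names. It runs
    on a single client and does not disable locking, so it illustrates
    per-write overhead only, and results depend heavily on the
    configuration.</para>
    <screen>
# 64MB written as 16384 page-sized (4kB) O_DIRECT writes
client$ dd if=/dev/zero of=/mnt/lustre/ddtest.4k bs=4k count=16384 oflag=direct
# The same 64MB written as 64 aggregated 1MB writes
client$ dd if=/dev/zero of=/mnt/lustre/ddtest.1m bs=1M count=64 oflag=direct
</screen>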
1575   </section>
1576   <section xml:id="dbdoclet.50438272_45406">
1577     <title>
1578     <indexterm>
1579       <primary>tuning</primary>
1580       <secondary>write performance</secondary>
1581     </indexterm>Understanding Why Write Performance is Better Than Read
1582     Performance</title>
1583     <para>Typically, the performance of write operations on a Lustre cluster is
1584     better than read operations. When doing writes, all clients are sending
1585     write RPCs asynchronously. The RPCs are allocated, and written to disk in
    <para>In the case of read operations, the read requests from clients may
    arrive in a different order and require a lot of seeking to be read from
    the disk. This noticeably hampers read throughput.</para>
    <para>Currently, there is no readahead on the OSTs themselves, though the
    clients do readahead. If many clients are doing reads, readahead on the
    OSTs would not be practical in any case because of memory consumption
    (consider that even a single RPC (1 MB) of readahead for each of 1000
    clients would consume 1 GB of RAM).</para>
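    <para>The memory estimate above is a simple product of the per-client
    readahead size and the number of clients. As a sketch, the same arithmetic
    is shown below for 1 MB and for a hypothetical 16 MB of readahead per
    client:</para>
    <screen>
# 1 MB of readahead for each of 1000 clients, in MB (about 1 GB)
$ echo $(( 1 * 1000 ))
1000
# 16 MB of readahead for each of 1000 clients, in MB (about 16 GB)
$ echo $(( 16 * 1000 ))
16000
</screen>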
    <para>For file systems that use socklnd (TCP, Ethernet) as the
    interconnect, there is also additional CPU overhead because the client
    cannot receive data without copying it from the network buffers. In the
    write case, the client can send data without the additional data copy.
    This means that the client is more likely to become CPU-bound during
    reads than during writes.</para>
1601   </section>
1602 </chapter>