1 <?xml version='1.0' encoding='utf-8'?>
2 <chapter xmlns="http://docbook.org/ns/docbook"
3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
5 <title xml:id="lustretuning.title">Tuning a Lustre File System</title>
6 <para>This chapter contains information about tuning a Lustre file system for
7 better performance.</para>
9 <para>Many options in the Lustre software are set by means of kernel module
10 parameters. These parameters are contained in the
11 <literal>/etc/modprobe.d/lustre.conf</literal> file.</para>
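<para>For example, a minimal <literal>/etc/modprobe.d/lustre.conf</literal>
might contain entries such as the following (the interface name and option
values shown are illustrative only, not recommendations):</para>
<screen>options lnet networks=tcp0(eth0)
options ost oss_num_threads=64</screen>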
13 <section xml:id="dbdoclet.50438272_55226">
16 <primary>tuning</primary>
19 <primary>tuning</primary>
20 <secondary>service threads</secondary>
21 </indexterm>Optimizing the Number of Service Threads</title>
22 <para>An OSS can have a minimum of two service threads and a maximum of 512
23 service threads. The number of service threads is a function of how much
24 RAM and how many CPUs are on each OSS node (1 thread / 128MB * num_cpus).
25 If the load on the OSS node is high, new service threads will be started in
26 order to process more requests concurrently, up to 4x the initial number of
27 threads (subject to the maximum of 512). For a 2GB 2-CPU system, the
28 default thread count is 32 and the maximum thread count is 128.</para>
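<para>As a worked example of the formula above, for the 2 GB, 2-CPU OSS
node:</para>
<screen>2048 MB / 128 MB          = 16
16 * 2 CPUs               = 32 threads (default)
32 * 4                    = 128 threads (maximum)</screen>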
29 <para>Increasing the size of the thread pool may help when:</para>
32 <para>Several OSTs are exported from a single OSS</para>
35 <para>Back-end storage is running synchronously</para>
38 <para>I/O completions take excessive time due to slow storage</para>
41 <para>Decreasing the size of the thread pool may help if:</para>
44 <para>Clients are overwhelming the storage capacity</para>
47 <para>There are lots of "slow I/O" or similar messages</para>
50 <para>Increasing the number of I/O threads allows the kernel and storage to
51 aggregate many writes together for more efficient disk I/O. The OSS thread
pool is shared; each thread allocates approximately 1.5 MB (maximum RPC
53 size + 0.5 MB) for internal I/O buffers.</para>
54 <para>It is very important to consider memory consumption when increasing
55 the thread pool size. Drives are only able to sustain a certain amount of
parallel I/O activity before performance degrades, due to the high number
of seeks and to OST threads left waiting for I/O to complete. In this
58 situation, it may be advisable to decrease the load by decreasing the
59 number of OST threads.</para>
60 <para>Determining the optimum number of OST threads is a process of trial
61 and error, and varies for each particular configuration. Variables include
62 the number of OSTs on each OSS, number and speed of disks, RAID
63 configuration, and available RAM. You may want to start with a number of
64 OST threads equal to the number of actual disk spindles on the node. If you
use RAID, subtract any dead spindles not used for actual data (e.g., 1 of N
spindles for RAID5, 2 of N spindles for RAID6), and monitor the
67 performance of clients during usual workloads. If performance is degraded,
increase the thread count and test again, repeating until performance
either reaches a satisfactory level or begins to degrade again.</para>
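<para>For example, in a hypothetical configuration with one OSS serving two
6-disk RAID6 OSTs, the starting thread count would be:</para>
<screen>(6 - 2) data spindles * 2 OSTs = 8 OST threads</screen>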
<para>If there are too many threads, the latency for individual I/O
requests can become very high; this situation should be avoided. Set the
desired maximum thread count permanently using the method described
above.</para>
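<para>For example, to permanently cap the OST service thread count, a
module option such as the following could be added to
<literal>/etc/modprobe.d/lustre.conf</literal> (the value 256 is
illustrative only):</para>
<screen>options ost oss_num_threads=256</screen>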
78 <primary>tuning</primary>
79 <secondary>OSS threads</secondary>
80 </indexterm>Specifying the OSS Service Thread Count</title>
82 <literal>oss_num_threads</literal> parameter enables the number of OST
83 service threads to be specified at module load time on the OSS
86 options ost oss_num_threads={N}
<para>After startup, the minimum and maximum OSS thread counts
90 <literal>{service}.thread_{min,max,started}</literal> tunable. To change
91 the tunable at runtime, run:</para>
94 lctl {get,set}_param {service}.thread_{min,max,started}
<para>Lustre software release 2.3 introduced the binding of service
threads to CPU partitions. This works in a similar fashion to the binding
of threads on the MDS. MDS thread tuning is covered in
100 <xref linkend="dbdoclet.mdsbinding" />.</para>
104 <literal>oss_cpts=[EXPRESSION]</literal> binds the default OSS service
106 <literal>[EXPRESSION]</literal>.</para>
110 <literal>oss_io_cpts=[EXPRESSION]</literal> binds the IO OSS service
112 <literal>[EXPRESSION]</literal>.</para>
115 <para>For further details, see
116 <xref linkend="dbdoclet.50438271_87260" />.</para>
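<para>For example, the thread count and both CPT bindings can be combined
in a single module option line (the thread count and CPT expressions shown
are illustrative only):</para>
<screen>options ost oss_num_threads=64 oss_cpts="[0,1]" oss_io_cpts="[2,3]"</screen>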
118 <section xml:id="dbdoclet.mdstuning">
121 <primary>tuning</primary>
122 <secondary>MDS threads</secondary>
123 </indexterm>Specifying the MDS Service Thread Count</title>
125 <literal>mds_num_threads</literal> parameter enables the number of MDS
126 service threads to be specified at module load time on the MDS
129 options mds mds_num_threads={N}
<para>After startup, the minimum and maximum MDS thread counts
133 <literal>{service}.thread_{min,max,started}</literal> tunable. To change
134 the tunable at runtime, run:</para>
137 lctl {get,set}_param {service}.thread_{min,max,started}
140 <para>For details, see
141 <xref linkend="dbdoclet.50438271_87260" />.</para>
142 <para>At this time, no testing has been done to determine the optimal
143 number of MDS threads. The default value varies, based on server size, up
144 to a maximum of 32. The maximum number of threads (
145 <literal>MDS_MAX_THREADS</literal>) is 512.</para>
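<para>For example, assuming the metadata service is named
<literal>mds.MDS.mdt</literal> and the tunable is
<literal>threads_max</literal> (names may vary between releases), the
maximum could be adjusted at runtime with:</para>
<screen>$ lctl set_param mds.MDS.mdt.threads_max=64</screen>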
147 <para>The OSS and MDS automatically start new service threads
148 dynamically, in response to server load within a factor of 4. The
149 default value is calculated the same way as before. Setting the
150 <literal>_mu_threads</literal> module parameter disables automatic
151 thread creation behavior.</para>
153 <para>Lustre software release 2.3 introduced new parameters to provide
154 more control to administrators.</para>
<literal>mds_rdpg_num_threads</literal> controls the number of threads
providing the read page service, which handles file close and readdir
operations.</para>
<literal>mds_attr_num_threads</literal> controls the number of threads
providing the setattr service to clients running Lustre software
170 <para>Default values for the thread counts are automatically selected.
171 The values are chosen to best exploit the number of CPUs present in the
172 system and to provide best overall performance for typical
177 <section xml:id="dbdoclet.mdsbinding" condition='l23'>
180 <primary>tuning</primary>
181 <secondary>MDS binding</secondary>
182 </indexterm>Binding MDS Service Thread to CPU Partitions</title>
183 <para>With the introduction of Node Affinity (
184 <xref linkend="nodeaffdef" />) in Lustre software release 2.3, MDS threads
185 can be bound to particular CPU partitions (CPTs). Default values for
186 bindings are selected automatically to provide good overall performance for
a given CPU count. However, an administrator can deviate from these
settings if they choose.</para>
192 <literal>mds_num_cpts=[EXPRESSION]</literal> binds the default MDS
193 service threads to CPTs defined by
194 <literal>EXPRESSION</literal>. For example
<literal>mds_num_cpts=[0-3]</literal> will bind the MDS service threads
197 <literal>CPT[0,1,2,3]</literal>.</para>
201 <literal>mds_rdpg_num_cpts=[EXPRESSION]</literal> binds the read page
202 service threads to CPTs defined by
203 <literal>EXPRESSION</literal>. The read page service handles file close
204 and readdir requests. For example
<literal>mds_rdpg_num_cpts=[4]</literal> will bind the read page threads
207 <literal>CPT4</literal>.</para>
211 <literal>mds_attr_num_cpts=[EXPRESSION]</literal> binds the setattr
212 service threads to CPTs defined by
213 <literal>EXPRESSION</literal>.</para>
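<para>For example, all three bindings can be specified together at module
load time, using the <literal>mds</literal> module as in the examples
above (the CPT expressions shown are illustrative only):</para>
<screen>options mds mds_num_cpts=[0-3] mds_rdpg_num_cpts=[4] mds_attr_num_cpts=[5]</screen>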
217 <section xml:id="dbdoclet.50438272_73839">
220 <primary>LNET</primary>
221 <secondary>tuning</secondary>
224 <primary>tuning</primary>
225 <secondary>LNET</secondary>
226 </indexterm>Tuning LNET Parameters</title>
227 <para>This section describes LNET tunables, the use of which may be
228 necessary on some systems to improve performance. To test the performance
229 of your Lustre network, see
230 <xref linkend='lnetselftest' />.</para>
232 <title>Transmit and Receive Buffer Size</title>
233 <para>The kernel allocates buffers for sending and receiving messages on
236 <literal>ksocklnd</literal> has separate parameters for the transmit and
237 receive buffers.</para>
239 options ksocklnd tx_buffer_size=0 rx_buffer_size=0
241 <para>If these parameters are left at the default value (0), the system
242 automatically tunes the transmit and receive buffer size. In almost every
243 case, this default produces the best performance. Do not attempt to tune
244 these parameters unless you are a network expert.</para>
247 <title>Hardware Interrupts (
248 <literal>enable_irq_affinity</literal>)</title>
249 <para>The hardware interrupts that are generated by network adapters may
250 be handled by any CPU in the system. In some cases, we would like network
251 traffic to remain local to a single CPU to help keep the processor cache
252 warm and minimize the impact of context switches. This is helpful when an
253 SMP system has more than one network interface and ideal when the number
254 of interfaces equals the number of CPUs. To enable the
255 <literal>enable_irq_affinity</literal> parameter, enter:</para>
257 options ksocklnd enable_irq_affinity=1
259 <para>In other cases, if you have an SMP platform with a single fast
260 interface such as 10 Gb Ethernet and more than two CPUs, you may see
261 performance improve by turning this parameter off.</para>
263 options ksocklnd enable_irq_affinity=0
265 <para>By default, this parameter is off. As always, you should test the
266 performance to compare the impact of changing this parameter.</para>
268 <section condition='l23'>
271 <primary>tuning</primary>
272 <secondary>Network interface binding</secondary>
273 </indexterm>Binding Network Interface Against CPU Partitions</title>
274 <para>Lustre software release 2.3 and beyond provide enhanced network
275 interface control. The enhancement means that an administrator can bind
276 an interface to one or more CPU partitions. Bindings are specified as
277 options to the LNET modules. For more information on specifying module
279 <xref linkend="dbdoclet.50438293_15350" /></para>
281 <literal>o2ib0(ib0)[0,1]</literal> will ensure that all messages for
282 <literal>o2ib0</literal> will be handled by LND threads executing on
283 <literal>CPT0</literal> and
284 <literal>CPT1</literal>. An additional example might be:
285 <literal>tcp1(eth0)[0]</literal>. Messages for
286 <literal>tcp1</literal> are handled by threads on
287 <literal>CPT0</literal>.</para>
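<para>As a sketch, such a binding is expressed in the LNET
<literal>networks</literal> module option; for example:</para>
<screen>options lnet networks="o2ib0(ib0)[0,1]"</screen>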
292 <primary>tuning</primary>
293 <secondary>Network interface credits</secondary>
294 </indexterm>Network Interface Credits</title>
295 <para>Network interface (NI) credits are shared across all CPU partitions
296 (CPT). For example, if a machine has four CPTs and the number of NI
297 credits is 512, then each partition has 128 credits. If a large number of
298 CPTs exist on the system, LNET checks and validates the NI credits for
299 each CPT to ensure each CPT has a workable number of credits. For
300 example, if a machine has 16 CPTs and the number of NI credits is 256,
301 then each partition only has 16 credits. 16 NI credits is low and could
302 negatively impact performance. As a result, LNET automatically adjusts
304 <literal>peer_credits</literal>(
305 <literal>peer_credits</literal> is 8 by default), so each partition has 64
307 <para>Increasing the number of
308 <literal>credits</literal>/
309 <literal>peer_credits</literal> can improve the performance of high
310 latency networks (at the cost of consuming more memory) by enabling LNET
311 to send more inflight messages to a specific network/peer and keep the
312 pipeline saturated.</para>
313 <para>An administrator can modify the NI credit count using
<literal>ksocklnd</literal> or
315 <literal>ko2iblnd</literal>. In the example below, 256 credits are
316 applied to TCP connections.</para>
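<para>Assuming the standard <literal>credits</literal> module parameter,
the setting would be:</para>
<screen>options ksocklnd credits=256</screen>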
320 <para>Applying 256 credits to IB connections can be achieved with:</para>
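<para>Again assuming the standard <literal>credits</literal> module
parameter:</para>
<screen>options ko2iblnd credits=256</screen>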
324 <note condition="l23">
325 <para>In Lustre software release 2.3 and beyond, LNET may revalidate
326 the NI credits, so the administrator's request may not persist.</para>
332 <primary>tuning</primary>
333 <secondary>router buffers</secondary>
334 </indexterm>Router Buffers</title>
335 <para>When a node is set up as an LNET router, three pools of buffers are
336 allocated: tiny, small and large. These pools are allocated per CPU
337 partition and are used to buffer messages that arrive at the router to be
338 forwarded to the next hop. The three different buffer sizes accommodate
339 different size messages.</para>
<para>If a message arrives that can fit in a tiny buffer, a tiny buffer
is used. If a message does not fit in a tiny buffer but fits in a small
buffer, a small buffer is used. Finally, if a message fits in neither a
tiny buffer nor a small buffer, a large buffer is
345 <para>Router buffers are shared by all CPU partitions. For a machine with
346 a large number of CPTs, the router buffer number may need to be specified
347 manually for best performance. A low number of router buffers risks
348 starving the CPU partitions of resources.</para>
352 <literal>tiny_router_buffers</literal>: Zero payload buffers used for
353 signals and acknowledgements.</para>
357 <literal>small_router_buffers</literal>: 4 KB payload buffers for
358 small messages</para>
362 <literal>large_router_buffers</literal>: 1 MB maximum payload
363 buffers, corresponding to the recommended RPC size of 1 MB.</para>
366 <para>The default setting for router buffers typically results in
367 acceptable performance. LNET automatically sets a default value to reduce
368 the likelihood of resource starvation. The size of a router buffer can be
369 modified as shown in the example below. In this example, the size of the
370 large buffer is modified using the
371 <literal>large_router_buffers</literal> parameter.</para>
373 lnet large_router_buffers=8192
375 <note condition="l23">
376 <para>In Lustre software release 2.3 and beyond, LNET may revalidate
377 the router buffer setting, so the administrator's request may not
384 <primary>tuning</primary>
385 <secondary>portal round-robin</secondary>
386 </indexterm>Portal Round-Robin</title>
387 <para>Portal round-robin defines the policy LNET applies to deliver
events and messages to the upper layers. The upper layers are PTLRPC
services or LNET selftest.</para>
390 <para>If portal round-robin is disabled, LNET will deliver messages to
391 CPTs based on a hash of the source NID. Hence, all messages from a
392 specific peer will be handled by the same CPT. This can reduce data
traffic between CPUs. However, for some workloads, this behavior may
result in poorly balanced load across CPUs.</para>
395 <para>If portal round-robin is enabled, LNET will round-robin incoming
events across all CPTs. This may balance load better across CPUs but can
incur cross-CPU overhead.</para>
398 <para>The current policy can be changed by an administrator with
400 <replaceable>value</replaceable>>
401 /proc/sys/lnet/portal_rotor</literal>. There are four options for
403 <replaceable>value</replaceable>
408 <literal>OFF</literal>
410 <para>Disable portal round-robin on all incoming requests.</para>
414 <literal>ON</literal>
416 <para>Enable portal round-robin on all incoming requests.</para>
420 <literal>RR_RT</literal>
422 <para>Enable portal round-robin only for routed messages.</para>
426 <literal>HASH_RT</literal>
<para>Routed messages will be delivered to the upper layer by hash of
source NID (instead of the NID of the router). This is the default
435 <title>LNET Peer Health</title>
436 <para>Two options are available to help determine peer health:
<literal>peer_timeout</literal> - The timeout (in seconds) before an
441 aliveness query is sent to a peer. For example, if
442 <literal>peer_timeout</literal> is set to
443 <literal>180sec</literal>, an aliveness query is sent to the peer
444 every 180 seconds. This feature only takes effect if the node is
445 configured as an LNET router.</para>
446 <para>In a routed environment, the
447 <literal>peer_timeout</literal> feature should always be on (set to a
448 value in seconds) on routers. If the router checker has been enabled,
449 the feature should be turned off by setting it to 0 on clients and
451 <para>For a non-routed scenario, enabling the
452 <literal>peer_timeout</literal> option provides health information
453 such as whether a peer is alive or not. For example, a client is able
454 to determine if an MGS or OST is up when it sends it a message. If a
455 response is received, the peer is alive; otherwise a timeout occurs
456 when the request is made.</para>
458 <literal>peer_timeout</literal> should be set to no less than the LND
459 timeout setting. For more information about LND timeouts, see
460 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
461 linkend="section_c24_nt5_dl" />.</para>
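<para>As an illustrative sketch, the timeout could be set for the socket
LND via a module option such as the following (the value mirrors the
180-second example above):</para>
<screen>options ksocklnd peer_timeout=180</screen>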
<literal>o2iblnd</literal> (IB) driver is used,
464 <literal>peer_timeout</literal> should be at least twice the value of
<literal>ko2iblnd</literal> keepalive option. For more information
467 about keepalive options, see
468 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
469 linkend="section_ngq_qhy_zl" />.</para>
<literal>avoid_asym_router_failure</literal> - When set to 1, the
474 router checker running on the client or a server periodically pings
475 all the routers corresponding to the NIDs identified in the routes
476 parameter setting on the node to determine the status of each router
477 interface. The default setting is 1. (For more information about the
478 LNET routes parameter, see
479 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
linkend="dbdoclet.50438216_71227" />.)</para>
481 <para>A router is considered down if any of its NIDs are down. For
482 example, router X has three NIDs:
483 <literal>Xnid1</literal>,
484 <literal>Xnid2</literal>, and
485 <literal>Xnid3</literal>. A client is connected to the router via
486 <literal>Xnid1</literal>. The client has router checker enabled. The
487 router checker periodically sends a ping to the router via
488 <literal>Xnid1</literal>. The router responds to the ping with the
489 status of each of its NIDs. In this case, it responds with
490 <literal>Xnid1=up</literal>,
491 <literal>Xnid2=up</literal>,
492 <literal>Xnid3=down</literal>. If
493 <literal>avoid_asym_router_failure==1</literal>, the router is
494 considered down if any of its NIDs are down, so router X is
495 considered down and will not be used for routing messages. If
496 <literal>avoid_asym_router_failure==0</literal>, router X will
497 continue to be used for routing messages.</para>
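<para>For example, assuming
<literal>avoid_asym_router_failure</literal> is exposed as an LNET module
parameter as its name suggests, the default can be made explicit at
module load time:</para>
<screen>options lnet avoid_asym_router_failure=1</screen>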
499 </itemizedlist></para>
500 <para>The following router checker parameters must be set to the maximum
501 value of the corresponding setting for this option on any client or
506 <literal>dead_router_check_interval</literal>
511 <literal>live_router_check_interval</literal>
516 <literal>router_ping_timeout</literal>
519 </itemizedlist></para>
520 <para>For example, the
521 <literal>dead_router_check_interval</literal> parameter on any router must
525 <section xml:id="dbdoclet.libcfstuning">
528 <primary>tuning</primary>
529 <secondary>libcfs</secondary>
530 </indexterm>libcfs Tuning</title>
531 <para>By default, the Lustre software will automatically generate CPU
532 partitions (CPT) based on the number of CPUs in the system. The CPT number
533 will be 1 if the online CPU number is less than five.</para>
534 <para>The CPT number can be explicitly set on the libcfs module using
535 <literal>cpu_npartitions=NUMBER</literal>. The value of
536 <literal>cpu_npartitions</literal> must be an integer between 1 and the
537 number of online CPUs.</para>
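<para>For example, to force four CPU partitions regardless of the online
CPU count:</para>
<screen>options libcfs cpu_npartitions=4</screen>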
539 <para>Setting CPT to 1 will disable most of the SMP Node Affinity
540 functionality.</para>
543 <title>CPU Partition String Patterns</title>
544 <para>CPU partitions can be described using string pattern notation. For
<literal>cpu_pattern="0[0,2,4,6] 1[1,3,5,7]"</literal>
<para>Create two CPTs: CPT0 contains CPU[0, 2, 4, 6]. CPT1 contains
<literal>cpu_pattern="N 0[0-3] 1[4-7]"</literal>
<para>Create two CPTs: CPT0 contains all CPUs in NUMA nodes [0-3]; CPT1
contains all CPUs in NUMA nodes [4-7].</para>
562 <para>The current configuration of the CPU partition can be read from
<literal>/proc/sys/lnet/cpu_partitions</literal>.</para>
566 <section xml:id="dbdoclet.lndtuning">
569 <primary>tuning</primary>
570 <secondary>LND tuning</secondary>
571 </indexterm>LND Tuning</title>
572 <para>LND tuning allows the number of threads per CPU partition to be
573 specified. An administrator can set the threads for both
574 <literal>ko2iblnd</literal> and
575 <literal>ksocklnd</literal> using the
576 <literal>nscheds</literal> parameter. This adjusts the number of threads for
577 each partition, not the overall number of threads on the LND.</para>
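<para>For example, assuming four scheduler threads per partition are
desired for the IB LND (the value is illustrative only):</para>
<screen>options ko2iblnd nscheds=4</screen>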
579 <para>Lustre software release 2.3 has greatly decreased the default
580 number of threads for
581 <literal>ko2iblnd</literal> and
582 <literal>ksocklnd</literal> on high-core count machines. The current
583 default values are automatically set and are chosen to work well across a
584 number of typical scenarios.</para>
587 <section xml:id="dbdoclet.nrstuning" condition='l24'>
590 <primary>tuning</primary>
591 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
592 </indexterm>Network Request Scheduler (NRS) Tuning</title>
593 <para>The Network Request Scheduler (NRS) allows the administrator to
594 influence the order in which RPCs are handled at servers, on a per-PTLRPC
595 service basis, by providing different policies that can be activated and
596 tuned in order to influence the RPC ordering. The aim of this is to provide
for better performance, and possibly distinct performance characteristics
598 using future policies.</para>
599 <para>The NRS policy state of a PTLRPC service can be read and set via the
600 <literal>{service}.nrs_policies</literal> tunable. To read a PTLRPC
601 service's NRS policy state, run:</para>
603 lctl get_param {service}.nrs_policies
605 <para>For example, to read the NRS policy state of the
606 <literal>ost_io</literal> service, run:</para>
608 $ lctl get_param ost.OSS.ost_io.nrs_policies
609 ost.OSS.ost_io.nrs_policies=
636 high_priority_requests:
662 <para>NRS policy state is shown in either one or two sections, depending on
663 the PTLRPC service being queried. The first section is named
664 <literal>regular_requests</literal> and is available for all PTLRPC
665 services, optionally followed by a second section which is named
666 <literal>high_priority_requests</literal>. This is because some PTLRPC
667 services are able to treat some types of RPCs as higher priority ones, such
668 that they are handled by the server with higher priority compared to other,
669 regular RPC traffic. For PTLRPC services that do not support high-priority
670 RPCs, you will only see the
671 <literal>regular_requests</literal> section.</para>
672 <para>There is a separate instance of each NRS policy on each PTLRPC
673 service for handling regular and high-priority RPCs (if the service
674 supports high-priority RPCs). For each policy instance, the following
675 fields are shown:</para>
676 <informaltable frame="all">
678 <colspec colname="c1" colwidth="50*" />
679 <colspec colname="c2" colwidth="50*" />
684 <emphasis role="bold">Field</emphasis>
689 <emphasis role="bold">Description</emphasis>
698 <literal>name</literal>
702 <para>The name of the policy.</para>
708 <literal>state</literal>
712 <para>The state of the policy; this can be any of
713 <literal>invalid, stopping, stopped, starting, started</literal>.
714 A fully enabled policy is in the
715 <literal>started</literal> state.</para>
721 <literal>fallback</literal>
725 <para>Whether the policy is acting as a fallback policy or not. A
726 fallback policy is used to handle RPCs that other enabled
policies fail to handle or do not support handling. The
729 <literal>no, yes</literal>. Currently, only the FIFO policy can
730 act as a fallback policy.</para>
736 <literal>queued</literal>
740 <para>The number of RPCs that the policy has waiting to be
747 <literal>active</literal>
751 <para>The number of RPCs that the policy is currently
758 <para>To enable an NRS policy on a PTLRPC service run:</para>
760 lctl set_param {service}.nrs_policies=
761 <replaceable>policy_name</replaceable>
<para>This will enable the policy
<replaceable>policy_name</replaceable> for both regular and high-priority
RPCs (if the PTLRPC service supports high-priority RPCs) on the given
service. For example, to enable the CRR-N NRS policy for the ldlm_cbd
769 $ lctl set_param ldlm.services.ldlm_cbd.nrs_policies=crrn
770 ldlm.services.ldlm_cbd.nrs_policies=crrn
773 <para>For PTLRPC services that support high-priority RPCs, you can also
<replaceable>reg|hp</replaceable> token, in order to enable an NRS policy
776 for handling only regular or high-priority RPCs on a given PTLRPC service,
779 lctl set_param {service}.nrs_policies="
780 <replaceable>policy_name</replaceable>
781 <replaceable>reg|hp</replaceable>"
783 <para>For example, to enable the TRR policy for handling only regular, but
784 not high-priority RPCs on the
785 <literal>ost_io</literal> service, run:</para>
787 $ lctl set_param ost.OSS.ost_io.nrs_policies="trr reg"
788 ost.OSS.ost_io.nrs_policies="trr reg"
792 <para>When enabling an NRS policy, the policy name must be given in
lower-case characters; otherwise, the operation will fail with an error
799 <primary>tuning</primary>
800 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
801 <tertiary>first in, first out (FIFO) policy</tertiary>
802 </indexterm>First In, First Out (FIFO) policy</title>
803 <para>The first in, first out (FIFO) policy handles RPCs in a service in
804 the same order as they arrive from the LNET layer, so no special
805 processing takes place to modify the RPC handling stream. FIFO is the
806 default policy for all types of RPCs on all PTLRPC services, and is
807 always enabled irrespective of the state of other policies, so that it
808 can be used as a backup policy, in case a more elaborate policy that has
809 been enabled fails to handle an RPC, or does not support handling a given
811 <para>The FIFO policy has no tunables that adjust its behaviour.</para>
816 <primary>tuning</primary>
817 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
818 <tertiary>client round-robin over NIDs (CRR-N) policy</tertiary>
819 </indexterm>Client Round-Robin over NIDs (CRR-N) policy</title>
820 <para>The client round-robin over NIDs (CRR-N) policy performs batched
821 round-robin scheduling of all types of RPCs, with each batch consisting
822 of RPCs originating from the same client node, as identified by its NID.
823 CRR-N aims to provide for better resource utilization across the cluster,
824 and to help shorten completion times of jobs in some cases, by
825 distributing available bandwidth more evenly across all clients.</para>
826 <para>The CRR-N policy can be enabled on all types of PTLRPC services,
827 and has the following tunable that can be used to adjust its
832 <literal>{service}.nrs_crrn_quantum</literal>
835 <literal>{service}.nrs_crrn_quantum</literal> tunable determines the
836 maximum allowed size of each batch of RPCs; the unit of measure is in
837 number of RPCs. To read the maximum allowed batch size of a CRR-N
840 lctl get_param {service}.nrs_crrn_quantum
842 <para>For example, to read the maximum allowed batch size of a CRR-N
843 policy on the ost_io service, run:</para>
845 $ lctl get_param ost.OSS.ost_io.nrs_crrn_quantum
846 ost.OSS.ost_io.nrs_crrn_quantum=reg_quantum:16
850 <para>You can see that there is a separate maximum allowed batch size
852 <literal>reg_quantum</literal>) and high-priority (
853 <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
854 high-priority RPCs).</para>
855 <para>To set the maximum allowed batch size of a CRR-N policy on a
856 given service, run:</para>
858 lctl set_param {service}.nrs_crrn_quantum=
859 <replaceable>1-65535</replaceable>
861 <para>This will set the maximum allowed batch size on a given
service, for both regular and high-priority RPCs (if the PTLRPC
863 service supports high-priority RPCs), to the indicated value.</para>
864 <para>For example, to set the maximum allowed batch size on the
865 ldlm_canceld service to 16 RPCs, run:</para>
867 $ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
868 ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
871 <para>For PTLRPC services that support high-priority RPCs, you can
872 also specify a different maximum allowed batch size for regular and
873 high-priority RPCs, by running:</para>
875 $ lctl set_param {service}.nrs_crrn_quantum=
876 <replaceable>reg_quantum|hp_quantum</replaceable>:
<replaceable>1-65535</replaceable>
879 <para>For example, to set the maximum allowed batch size on the
880 ldlm_canceld service, for high-priority RPCs to 32, run:</para>
882 $ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum="hp_quantum:32"
883 ldlm.services.ldlm_canceld.nrs_crrn_quantum=hp_quantum:32
886 <para>By using the last method, you can also set the maximum regular
887 and high-priority RPC batch sizes to different values, in a single
888 command invocation.</para>
895 <primary>tuning</primary>
896 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
897 <tertiary>object-based round-robin (ORR) policy</tertiary>
898 </indexterm>Object-based Round-Robin (ORR) policy</title>
899 <para>The object-based round-robin (ORR) policy performs batched
900 round-robin scheduling of bulk read write (brw) RPCs, with each batch
901 consisting of RPCs that pertain to the same backend-file system object,
902 as identified by its OST FID.</para>
903 <para>The ORR policy is only available for use on the ost_io service. The
904 RPC batches it forms can potentially consist of mixed bulk read and bulk
write RPCs. The RPCs in each batch are ordered in an ascending manner,
based either on the file offsets or on the physical disk offsets of each
RPC (the latter applicable only to bulk read RPCs).</para>
908 <para>The aim of the ORR policy is to provide for increased bulk read
909 throughput in some cases, by ordering bulk read RPCs (and potentially
910 bulk write RPCs), and thus minimizing costly disk seek operations.
911 Performance may also benefit from any resulting improvement in resource
912 utilization, or by taking advantage of better locality of reference
914 <para>The ORR policy has the following tunables that can be used to
915 adjust its behaviour:</para>
919 <literal>ost.OSS.ost_io.nrs_orr_quantum</literal>
922 <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable determines
923 the maximum allowed size of each batch of RPCs; the unit of measure
924 is in number of RPCs. To read the maximum allowed batch size of the
925 ORR policy, run:</para>
927 $ lctl get_param ost.OSS.ost_io.nrs_orr_quantum
928 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:256
932 <para>You can see that there is a separate maximum allowed batch size
934 <literal>reg_quantum</literal>) and high-priority (
935 <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
936 high-priority RPCs).</para>
937 <para>To set the maximum allowed batch size for the ORR policy,
940 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
941 <replaceable>1-65535</replaceable>
943 <para>This will set the maximum allowed batch size for both regular
944 and high-priority RPCs, to the indicated value.</para>
945 <para>You can also specify a different maximum allowed batch size for
946 regular and high-priority RPCs, by running:</para>
948 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
949 <replaceable>reg_quantum|hp_quantum</replaceable>:
950 <replaceable>1-65535</replaceable>
952 <para>For example, to set the maximum allowed batch size for regular
953 RPCs to 128, run:</para>
955 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
956 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
<para>By using the latter form, you can also set the maximum regular
and high-priority RPC batch sizes to different values in a single
command invocation.</para>
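<para>For illustration, batched round-robin with a per-object quantum can
be sketched as follows. This is a simplified Python model, not Lustre
code; the object IDs, offsets, and dispatch loop are hypothetical:</para>

```python
from collections import OrderedDict, deque

def orr_schedule(rpcs, quantum):
    """Sketch of ORR-style scheduling: RPCs are grouped per backend
    object, and at most `quantum` RPCs are dispatched for one object
    before moving round-robin to the next object.  Each RPC is an
    (object_id, offset) tuple; returns the dispatch order."""
    queues = OrderedDict()
    for rpc in rpcs:
        queues.setdefault(rpc[0], deque()).append(rpc)
    order = []
    while queues:
        for obj in list(queues):
            take = min(quantum, len(queues[obj]))
            batch = [queues[obj].popleft() for _ in range(take)]
            # Within a batch, RPCs are ordered by ascending offset,
            # mirroring the in-batch ordering described above.
            order.extend(sorted(batch, key=lambda r: r[1]))
            if not queues[obj]:
                del queues[obj]
    return order

print(orr_schedule([("objA", 8), ("objB", 0), ("objA", 0),
                    ("objA", 4), ("objB", 4)], quantum=2))
```

<para>A larger quantum lets more RPCs for the same object be dispatched
back to back, which favours sequential disk access at the cost of
fairness between objects.</para>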
965 <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal>
<para>The <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable
969 determines whether the ORR policy orders RPCs within each batch based
970 on logical file offsets or physical disk offsets. To read the offset
971 type value for the ORR policy, run:</para>
973 $ lctl get_param ost.OSS.ost_io.nrs_orr_offset_type
974 ost.OSS.ost_io.nrs_orr_offset_type=reg_offset_type:physical
975 hp_offset_type:logical
<para>You can see that there is a separate offset type value for
regular (<literal>reg_offset_type</literal>) and high-priority
(<literal>hp_offset_type</literal>) RPCs.</para>
982 <para>To set the ordering type for the ORR policy, run:</para>
984 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
985 <replaceable>physical|logical</replaceable>
<para>This will set the offset type for both regular and
high-priority RPCs to the indicated value.</para>
989 <para>You can also specify a different offset type for regular and
990 high-priority RPCs, by running:</para>
992 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
993 <replaceable>reg_offset_type|hp_offset_type</replaceable>:
994 <replaceable>physical|logical</replaceable>
996 <para>For example, to set the offset type for high-priority RPCs to
997 physical disk offsets, run:</para>
999 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
1000 ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
<para>By using the latter form, you can also set the offset type for
regular and high-priority RPCs to different values in a single
command invocation.</para>
<para>Irrespective of the value of this tunable, only logical
offsets can be, and are, used for ordering bulk write RPCs.</para>
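<para>The effect of the offset type on in-batch ordering can be sketched
as follows (illustrative Python; the RPC fields are hypothetical, and
bulk writes always fall back to logical offsets, as noted above):</para>

```python
def order_batch(batch, offset_type):
    """Order one batch of bulk RPCs ascending by the chosen offset.
    Each RPC is a dict with 'op', 'logical' and (for reads) 'physical'
    offsets; bulk writes are always ordered by their logical offset,
    since their physical destination is not known at scheduling time."""
    def key(rpc):
        if offset_type == "physical" and rpc["op"] == "read":
            return rpc["physical"]
        return rpc["logical"]
    return sorted(batch, key=key)

batch = [
    {"op": "read",  "logical": 0, "physical": 900},
    {"op": "read",  "logical": 8, "physical": 100},
    {"op": "write", "logical": 4, "physical": None},
]
print([r["logical"] for r in order_batch(batch, "physical")])
```

<para>Ordering by physical offsets approximates the disk head movement
more closely when the mapping from logical to physical blocks is
fragmented.</para>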
1012 <literal>ost.OSS.ost_io.nrs_orr_supported</literal>
<para>The <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable determines
1016 the type of RPCs that the ORR policy will handle. To read the types
1017 of supported RPCs by the ORR policy, run:</para>
1019 $ lctl get_param ost.OSS.ost_io.nrs_orr_supported
1020 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads
hp_supported:reads_and_writes
<para>You can see that there is a separate supported 'RPC types'
value for regular (<literal>reg_supported</literal>) and high-priority
(<literal>hp_supported</literal>) RPCs.</para>
1028 <para>To set the supported RPC types for the ORR policy, run:</para>
1030 $ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
1031 <replaceable>reads|writes|reads_and_writes</replaceable>
<para>This will set the supported RPC types for both regular and
high-priority RPCs to the indicated value.</para>
1035 <para>You can also specify a different supported 'RPC types' value
1036 for regular and high-priority RPCs, by running:</para>
1038 $ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
1039 <replaceable>reg_supported|hp_supported</replaceable>:
1040 <replaceable>reads|writes|reads_and_writes</replaceable>
1042 <para>For example, to set the supported RPC types to bulk read and
1043 bulk write RPCs for regular requests, run:</para>
$ lctl set_param ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
1047 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
<para>By using the latter form, you can also set the supported RPC
types for regular and high-priority RPCs to different values in a
single command invocation.</para>
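<para>The supported 'RPC types' value acts as a simple filter on which
brw RPCs the policy will handle, which can be sketched as follows
(illustrative Python, not Lustre's actual check):</para>

```python
def orr_accepts(rpc_type, supported):
    """Return True if an RPC of `rpc_type` ('read' or 'write') is
    handled by the policy for a given `supported` setting ('reads',
    'writes', or 'reads_and_writes').  Illustrative sketch only."""
    return supported == "reads_and_writes" or supported == rpc_type + "s"

print(orr_accepts("read", "reads"))    # → True
print(orr_accepts("write", "reads"))   # → False
```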
1059 <primary>tuning</primary>
1060 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1061 <tertiary>Target-based round-robin (TRR) policy</tertiary>
1062 </indexterm>Target-based Round-Robin (TRR) policy</title>
1063 <para>The target-based round-robin (TRR) policy performs batched
1064 round-robin scheduling of brw RPCs, with each batch consisting of RPCs
1065 that pertain to the same OST, as identified by its OST index.</para>
<para>The TRR policy is identical to the object-based round-robin (ORR)
policy, apart from using the brw RPC's target OST index instead of the
backend file system object's OST FID to determine the RPC scheduling
order. The goals of TRR are effectively the same as for ORR, and it
uses the following tunables to adjust its behaviour:</para>
1074 <literal>ost.OSS.ost_io.nrs_trr_quantum</literal>
1076 <para>The purpose of this tunable is exactly the same as for the
1077 <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable for the ORR
1078 policy, and you can use it in exactly the same way.</para>
1082 <literal>ost.OSS.ost_io.nrs_trr_offset_type</literal>
1084 <para>The purpose of this tunable is exactly the same as for the
1085 <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable for the
1086 ORR policy, and you can use it in exactly the same way.</para>
1090 <literal>ost.OSS.ost_io.nrs_trr_supported</literal>
1092 <para>The purpose of this tunable is exactly the same as for the
1093 <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable for the
ORR policy, and you can use it in exactly the same way.</para>
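<para>The difference between ORR and TRR amounts to the key used to
group brw RPCs into round-robin batches, which can be sketched as
follows (illustrative Python with hypothetical RPC fields):</para>

```python
def batch_key(rpc, policy):
    """Return the round-robin grouping key for a brw RPC: the target
    OST index for TRR, or the backend object's OST FID for ORR.
    The RPC representation here is a hypothetical simplification."""
    return rpc["ost_index"] if policy == "trr" else rpc["fid"]

# Two RPCs to different objects on the same OST:
rpcs = [{"fid": "0x100:0x2", "ost_index": 0},
        {"fid": "0x100:0x3", "ost_index": 0}]
print(len({batch_key(r, "trr") for r in rpcs}))  # → 1 (one batch group)
print(len({batch_key(r, "orr") for r in rpcs}))  # → 2 (two batch groups)
```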
1098 <section condition='l26'>
1101 <primary>tuning</primary>
1102 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1103 <tertiary>Token Bucket Filter (TBF) policy</tertiary>
1104 </indexterm>Token Bucket Filter (TBF) policy</title>
<para>The TBF (Token Bucket Filter) is a Lustre NRS policy that enables
Lustre services to enforce RPC rate limits on clients or jobs for QoS
(Quality of Service) purposes.</para>
1109 <title>The internal structure of TBF policy</title>
1112 <imagedata scalefit="1" width="100%"
1113 fileref="figures/TBF_policy.svg" />
1116 <phrase>The internal structure of TBF policy</phrase>
<para>When an RPC request arrives, the TBF policy assigns it to a
waiting queue according to its classification. Requests are classified
by either the NID or the JobID of the RPC, depending on how TBF is
configured. The TBF policy maintains one queue for each category in
the classification, and requests wait in their FIFO queue for tokens
before being handled, which keeps the RPC rates under the configured
limits.</para>
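<para>The rate limiting applied to each queue behaves like a classic
token bucket, sketched below. This is an illustrative Python model of
the general technique, not the actual TBF implementation:</para>

```python
class TokenBucket:
    """Minimal token-bucket sketch: tokens accrue at `rate` per second
    up to a maximum `depth`; a request is dispatched only if a whole
    token is available, otherwise it keeps waiting in its FIFO queue."""
    def __init__(self, rate, depth=1.0):
        self.rate = rate
        self.depth = depth
        self.tokens = depth
        self.last = 0.0

    def allow(self, now):
        elapsed = now - self.last
        self.tokens = min(self.depth, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=4)           # cf. the 'rate' argument of a rule
granted = sum(bucket.allow(k * 0.125)  # 80 arrivals over 10 seconds
              for k in range(80))
print(granted)   # → 40, i.e. 4 RPCs/second for 10 seconds
```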
<para>When the Lustre services are too busy to handle all of the
requests in time, not all of the configured rates of the queues can be
satisfied. Nothing bad happens in this case, except that some RPC rates
are slower than configured. Queues with higher rates will have an
advantage over queues with lower rates, but none of them will be
starved.</para>
<para>To manage the RPC rates of the queues, you do not set the rate of
each queue manually. Instead, you define rules that the TBF policy
matches to determine RPC rate limits. All of the defined rules are
organized as an ordered list. Whenever a queue is created, it goes
through the rule list and takes the first matching rule as its rule, so
that the queue knows its RPC token rate. A rule can be added to or
removed from the list at run time. Whenever the list of rules changes,
the queues update their matched rules.</para>
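<para>The first-match semantics of the ordered rule list can be sketched
as follows (illustrative Python; the wildcard matching and the default
rate are assumptions made for the example, not Lustre's actual
matcher):</para>

```python
import fnmatch

def match_rate(rules, nid, default_rate=10000):
    """Walk the ordered rule list and return the name and rate of the
    first rule whose NID pattern matches.  Newly started rules sit at
    the head of the list, so they take precedence over older rules."""
    for name, pattern, rate in rules:
        if fnmatch.fnmatch(nid, pattern):
            return name, rate
    return "default", default_rate

rules = [
    ("loginnode", "192.168.1.1@tcp", 100),   # started last: checked first
    ("other_clients", "192.168.*@tcp", 50),
]
print(match_rate(rules, "192.168.1.1@tcp"))   # → ('loginnode', 100)
print(match_rate(rules, "192.168.2.7@tcp"))   # → ('other_clients', 50)
```

<para>This is why the order of starting rules matters: had the
wildcard rule been checked first, it would also have matched the login
node's NID.</para>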
1144 <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>
<para>The format of the rule start command of TBF policy is as
follows:</para>
$ lctl set_param x.x.x.nrs_tbf_rule=
"start <replaceable>rule_name</replaceable>
<replaceable>arguments</replaceable>..."
<para>The '<replaceable>rule_name</replaceable>' argument is a string
that identifies a rule. The format of the
'<replaceable>arguments</replaceable>' varies according to the type of
the TBF policy. For the NID-based TBF policy, the format is as
follows:</para>
$ lctl set_param x.x.x.nrs_tbf_rule=
"start <replaceable>rule_name</replaceable> {
<replaceable>nidlist</replaceable>}
<replaceable>rate</replaceable>"
<para>The format of the '<replaceable>nidlist</replaceable>' argument
is the same as the format used when configuring an LNET route. The
'<replaceable>rate</replaceable>' argument is the RPC rate of the
rule, meaning the upper limit on the number of requests per
second.</para>
<para>The following commands are valid. Note that a newly started
rule takes precedence over older rules, so the order in which rules
are started is critical.</para>
1176 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1177 "start other_clients {192.168.*.*@tcp} 50"
1180 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1181 "start loginnode {192.168.1.1@tcp} 100"
<para>A general rule can be replaced by two rules (reg and hp) as
follows:</para>
1186 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1187 "reg start loginnode {192.168.1.1@tcp} 100"
1190 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1191 "hp start loginnode {192.168.1.1@tcp} 100"
1194 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1195 "start computes {192.168.1.[2-128]@tcp} 500"
<para>The above rules set an upper limit such that the servers will
process at most five times as many RPCs from compute nodes as from
login nodes.</para>
<para>For the JobID-based TBF policy (see
<xref xmlns:xlink="http://www.w3.org/1999/xlink"
linkend="dbdoclet.jobstats" /> for more details on the JobID), the
format is as follows:</para>
$ lctl set_param x.x.x.nrs_tbf_rule=
"start <replaceable>name</replaceable> {
<replaceable>jobid_list</replaceable>}
<replaceable>rate</replaceable>"
<para>The following commands are valid:</para>
1212 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1213 "start user1 {iozone.500 dd.500} 100"
1216 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1217 "start iozone_user1 {iozone.500} 100"
<para>As with NID-based rules, regular (reg) and high-priority (hp)
rules can be used separately:</para>
1221 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1222 "hp start iozone_user1 {iozone.500} 100"
1225 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1226 "reg start iozone_user1 {iozone.500} 100"
<para>The format of the rule change command of TBF policy is as
follows:</para>
$ lctl set_param x.x.x.nrs_tbf_rule=
"change <replaceable>rule_name</replaceable>
<replaceable>rate</replaceable>"
<para>The following commands are valid:</para>
1238 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="change loginnode 200"
1241 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg change loginnode 200"
1244 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp change loginnode 200"
<para>The format of the rule stop command of TBF policy is as
follows:</para>
1249 $ lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop
1250 <replaceable>rule_name</replaceable>"
<para>The following commands are valid:</para>
1254 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
1257 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode"
1260 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"
1266 <section xml:id="dbdoclet.50438272_25884">
1269 <primary>tuning</primary>
1270 <secondary>lockless I/O</secondary>
1271 </indexterm>Lockless I/O Tunables</title>
1272 <para>The lockless I/O tunable feature allows servers to ask clients to do
lockless I/O (liblustre-style, where the server does the locking) on
1274 contended files.</para>
1275 <para>The lockless I/O patch introduces these tunables:</para>
1279 <emphasis role="bold">OST-side:</emphasis>
1282 /proc/fs/lustre/ldlm/namespaces/filter-lustre-*
<literal>contended_locks</literal> - If the number of lock conflicts
found while scanning the granted and waiting queues exceeds
<literal>contended_locks</literal>, the resource is considered to be
contended.</para>
<literal>contention_seconds</literal> - The number of seconds that the
resource keeps itself in a contended state, as set in this
parameter.</para>
<literal>max_nolock_bytes</literal> - Server-side locking is performed
only for requests smaller than the number of bytes set in the
<literal>max_nolock_bytes</literal> parameter. If this tunable is set to
zero (0), it disables server-side locking for read/write
requests.</para>
1300 <emphasis role="bold">Client-side:</emphasis>
1303 /proc/fs/lustre/llite/lustre-*
<literal>contention_seconds</literal> - The
<literal>llite</literal> inode remembers its contended state for the
time specified in this parameter.</para>
1312 <emphasis role="bold">Client-side statistics:</emphasis>
1315 <literal>/proc/fs/lustre/llite/lustre-*/stats</literal> file has new
1316 rows for lockless I/O statistics.</para>
<literal>lockless_read_bytes</literal> and
<literal>lockless_write_bytes</literal> - Counters for the total bytes
read or written locklessly. The client makes this decision on its own,
based on the request size: if the request size is smaller than
<literal>min_nolock_size</literal>, the client performs the I/O
without acquiring locks and without communicating with the
server.</para>
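<para>The decision logic described by these tunables can be summarized
in a sketch (illustrative Python; the threshold names mirror the
tunables above, but the exact decision path in the Lustre client is
more involved than this):</para>

```python
def io_mode(request_bytes, min_nolock_size, max_nolock_bytes):
    """Sketch: very small requests are done locklessly without
    consulting the server; requests below the server's
    max_nolock_bytes may use server-side locking; anything larger
    uses normal client DLM locks.  Thresholds are illustrative."""
    if request_bytes < min_nolock_size:
        return "lockless"                # client decides on its own
    if max_nolock_bytes and request_bytes < max_nolock_bytes:
        return "server-side lock"
    return "client DLM lock"             # 0 disables server-side locking

print(io_mode(2048, 4096, 32768))    # → lockless
print(io_mode(8192, 4096, 32768))    # → server-side lock
print(io_mode(8192, 4096, 0))        # → client DLM lock
```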
1328 <section xml:id="dbdoclet.50438272_80545">
1331 <primary>tuning</primary>
1332 <secondary>for small files</secondary>
1333 </indexterm>Improving Lustre File System Performance When Working with
<para>An environment in which many clients write small chunks to a
single file will result in poor I/O performance. To improve the
performance of the Lustre file system with small files:</para>
<para>Have the application aggregate writes for some time before
submitting them to the Lustre file system. By default, the Lustre
software enforces POSIX coherency semantics, so it results in lock
ping-pong between client nodes if they are all writing to the same file
at one time.</para>
1347 <para>Have the application do 4kB
1348 <literal>O_DIRECT</literal> sized I/O to the file and disable locking on
the output file. This avoids partial-page I/O submissions and, by
disabling locking, avoids contention between clients.</para>
1353 <para>Have the application write contiguous data.</para>
<para>Add more disks or use SSD disks for the OSTs. This dramatically
improves the IOPS rate. Consider creating larger OSTs rather than many
smaller OSTs because of the reduced overhead (journal, connections,
etc.).</para>
<para>Use RAID-1+0 OSTs instead of RAID-5/6. Writing small chunks of
data to disk incurs RAID parity overhead on RAID-5/6 arrays.</para>
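<para>The first suggestion above, aggregating writes before submission,
can be sketched as a simple client-side buffer (illustrative Python; a
real application would also need to handle non-sequential writes and
errors):</para>

```python
class AggregatingWriter:
    """Buffer small sequential writes and flush them as one large
    chunk once `chunk_size` bytes accumulate, so that the file system
    sees a few large I/Os instead of many small ones."""
    def __init__(self, sink, chunk_size=1 << 20):   # 1 MB: one full RPC
        self.sink = sink            # callable that takes one bytes object
        self.chunk_size = chunk_size
        self.buf = bytearray()

    def write(self, data):
        self.buf += data
        while len(self.buf) >= self.chunk_size:
            self.sink(bytes(self.buf[:self.chunk_size]))
            del self.buf[:self.chunk_size]

    def flush(self):
        if self.buf:
            self.sink(bytes(self.buf))
            self.buf.clear()

chunks = []
writer = AggregatingWriter(chunks.append, chunk_size=8)
for piece in (b"ab", b"cd", b"efgh", b"ij"):
    writer.write(piece)
writer.flush()
print(chunks)   # → [b'abcdefgh', b'ij']
```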
1366 <section xml:id="dbdoclet.50438272_45406">
1369 <primary>tuning</primary>
1370 <secondary>write performance</secondary>
1371 </indexterm>Understanding Why Write Performance is Better Than Read
<para>Typically, the performance of write operations on a Lustre
cluster is better than that of read operations. When doing writes, all
clients send write RPCs asynchronously. The RPCs are allocated and
written to disk in the order they arrive. In many cases, this allows
the back-end storage to aggregate writes efficiently.</para>
<para>In the case of read operations, the reads from clients may arrive
in a different order, requiring a lot of seeking to read the data from
disk. This noticeably hampers read throughput.</para>
<para>Currently, there is no readahead on the OSTs themselves, though
the clients do readahead. If there are many clients doing reads, it
would not be possible to do any readahead in any case because of the
memory consumption (consider that even a single 1 MB RPC of readahead
for 1000 clients would consume 1 GB of RAM).</para>
<para>For file systems that use socklnd (TCP, Ethernet) as the
interconnect, there is also additional CPU overhead because the client
cannot receive data without copying it from the network buffers. In the
write case, the client <emphasis>can</emphasis> send data without this
additional copy, which means that the client is more likely to become
CPU-bound during reads than during writes.</para>