1 <?xml version='1.0' encoding='utf-8'?>
2 <chapter xmlns="http://docbook.org/ns/docbook"
3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
5 <title xml:id="lustretuning.title">Tuning a Lustre File System</title>
6 <para>This chapter contains information about tuning a Lustre file system for
7 better performance.</para>
9 <para>Many options in the Lustre software are set by means of kernel module
10 parameters. These parameters are contained in the
11 <literal>/etc/modprobe.d/lustre.conf</literal> file.</para>
13 <section xml:id="dbdoclet.50438272_55226">
16 <primary>tuning</primary>
19 <primary>tuning</primary>
20 <secondary>service threads</secondary>
21 </indexterm>Optimizing the Number of Service Threads</title>
22 <para>An OSS can have a minimum of two service threads and a maximum of 512
23 service threads. The number of service threads is a function of how much
24 RAM and how many CPUs are on each OSS node (1 thread / 128MB * num_cpus).
25 If the load on the OSS node is high, new service threads will be started in
26 order to process more requests concurrently, up to 4x the initial number of
27 threads (subject to the maximum of 512). For a 2GB 2-CPU system, the
28 default thread count is 32 and the maximum thread count is 128.</para>
29 <para>Increasing the size of the thread pool may help when:</para>
32 <para>Several OSTs are exported from a single OSS</para>
35 <para>Back-end storage is running synchronously</para>
38 <para>I/O completions take excessive time due to slow storage</para>
41 <para>Decreasing the size of the thread pool may help if:</para>
44 <para>Clients are overwhelming the storage capacity</para>
47 <para>There are lots of "slow I/O" or similar messages</para>
50 <para>Increasing the number of I/O threads allows the kernel and storage to
51 aggregate many writes together for more efficient disk I/O. The OSS thread
52 pool is shared--each thread allocates approximately 1.5 MB (maximum RPC
53 size + 0.5 MB) for internal I/O buffers.</para>
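<para>For example, at the maximum of 512 service threads, the internal
I/O buffers alone would consume roughly 512 x 1.5 MB = 768 MB of RAM on
the OSS.</para>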
54 <para>It is very important to consider memory consumption when increasing
55 the thread pool size. Drives are only able to sustain a certain amount of
56 parallel I/O activity before performance is degraded, due to the high
57 number of seeks and the OST threads just waiting for I/O. In this
58 situation, it may be advisable to decrease the load by decreasing the
59 number of OST threads.</para>
60 <para>Determining the optimum number of OSS threads is a process of trial
61 and error, and varies for each particular configuration. Variables include
62 the number of OSTs on each OSS, number and speed of disks, RAID
63 configuration, and available RAM. You may want to start with a number of
64 OST threads equal to the number of actual disk spindles on the node. If you
65 use RAID, subtract any dead spindles not used for actual data (e.g., 1 of N
spindles for RAID5, 2 of N spindles for RAID6), and monitor the
67 performance of clients during usual workloads. If performance is degraded,
increase the thread count and test again, repeating until performance
degrades or reaches a satisfactory level.</para>
<para>If there are too many threads, the latency for individual I/O
requests can become very high; this situation should be avoided. Set the desired
73 maximum thread count permanently using the method described above.</para>
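<para>As an illustrative sketch (the <literal>ost_io</literal> service
name, the <literal>threads_max</literal> tunable, and the use of
<literal>lctl set_param -P</literal> to make the value persistent are
assumptions in this example), the maximum OSS I/O service thread count
could be capped with:</para>
<screen>
oss# lctl set_param -P ost.OSS.ost_io.threads_max=128
</screen>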
78 <primary>tuning</primary>
79 <secondary>OSS threads</secondary>
80 </indexterm>Specifying the OSS Service Thread Count</title>
82 <literal>oss_num_threads</literal> parameter enables the number of OST
83 service threads to be specified at module load time on the OSS
86 options ost oss_num_threads={N}
88 <para>After startup, the minimum and maximum number of OSS thread counts
<literal>{service}.threads_{min,max,started}</literal> tunable. To change
91 the tunable at runtime, run:</para>
lctl {get,set}_param {service}.threads_{min,max,started}
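<para>For example (using the <literal>ost_io</literal> service as an
illustration; the service names available depend on the services
configured on the node), the current OSS thread counts could be
inspected with:</para>
<screen>
oss# lctl get_param ost.OSS.ost_io.threads_*
</screen>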
98 This works in a similar fashion to
99 binding of threads on MDS. MDS thread tuning is covered in
100 <xref linkend="dbdoclet.mdsbinding" />.</para>
104 <literal>oss_cpts=[EXPRESSION]</literal> binds the default OSS service
106 <literal>[EXPRESSION]</literal>.</para>
110 <literal>oss_io_cpts=[EXPRESSION]</literal> binds the IO OSS service
112 <literal>[EXPRESSION]</literal>.</para>
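<para>For illustration only (the CPT numbers here are hypothetical), a
module option line binding the default OSS service threads to CPTs 0 and
1 and the I/O service threads to CPTs 2 and 3 might look like:</para>
<screen>
options ost oss_cpts=[0,1] oss_io_cpts=[2,3]
</screen>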
115 <para>For further details, see
116 <xref linkend="dbdoclet.50438271_87260" />.</para>
118 <section xml:id="dbdoclet.mdstuning">
121 <primary>tuning</primary>
122 <secondary>MDS threads</secondary>
123 </indexterm>Specifying the MDS Service Thread Count</title>
125 <literal>mds_num_threads</literal> parameter enables the number of MDS
126 service threads to be specified at module load time on the MDS
128 <screen>options mds mds_num_threads={N}</screen>
129 <para>After startup, the minimum and maximum number of MDS thread counts
<literal>{service}.threads_{min,max,started}</literal> tunable. To change
132 the tunable at runtime, run:</para>
lctl {get,set}_param {service}.threads_{min,max,started}
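<para>For example, the metadata service thread counts could be read on
the MDS as follows (the <literal>mds.MDS.mdt</literal> service name is
given for illustration):</para>
<screen>
mds# lctl get_param mds.MDS.mdt.threads_*
</screen>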
138 <para>For details, see
139 <xref linkend="dbdoclet.50438271_87260" />.</para>
140 <para>The number of MDS service threads started depends on system size
141 and the load on the server, and has a default maximum of 64. The
142 maximum potential number of threads (<literal>MDS_MAX_THREADS</literal>)
145 <para>The OSS and MDS start two threads per service per CPT at mount
146 time, and dynamically increase the number of running service threads in
147 response to server load. Setting the <literal>*_num_threads</literal>
148 module parameter starts the specified number of threads for that
149 service immediately and disables automatic thread creation behavior.
152 <para condition='l23'>Lustre software release 2.3 introduced new
153 parameters to provide more control to administrators.</para>
157 <literal>mds_rdpg_num_threads</literal> controls the number of threads
used to provide the read page service. The read page service handles
159 file close and readdir operations.</para>
163 <literal>mds_attr_num_threads</literal> controls the number of threads
used to provide the setattr service to clients running Lustre software
170 <section xml:id="dbdoclet.mdsbinding" condition='l23'>
173 <primary>tuning</primary>
174 <secondary>MDS binding</secondary>
175 </indexterm>Binding MDS Service Thread to CPU Partitions</title>
176 <para>With the introduction of Node Affinity (
177 <xref linkend="nodeaffdef" />) in Lustre software release 2.3, MDS threads
178 can be bound to particular CPU partitions (CPTs) to improve CPU cache
179 usage and memory locality. Default values for CPT counts and CPU core
180 bindings are selected automatically to provide good overall performance for
a given CPU count. However, an administrator can deviate from these settings
182 if they choose. For details on specifying the mapping of CPU cores to
183 CPTs see <xref linkend="dbdoclet.libcfstuning"/>.
188 <literal>mds_num_cpts=[EXPRESSION]</literal> binds the default MDS
189 service threads to CPTs defined by
190 <literal>EXPRESSION</literal>. For example
191 <literal>mds_num_cpts=[0-3]</literal> will bind the MDS service threads
193 <literal>CPT[0,1,2,3]</literal>.</para>
197 <literal>mds_rdpg_num_cpts=[EXPRESSION]</literal> binds the read page
198 service threads to CPTs defined by
199 <literal>EXPRESSION</literal>. The read page service handles file close
200 and readdir requests. For example
201 <literal>mds_rdpg_num_cpts=[4]</literal> will bind the read page threads
203 <literal>CPT4</literal>.</para>
207 <literal>mds_attr_num_cpts=[EXPRESSION]</literal> binds the setattr
208 service threads to CPTs defined by
209 <literal>EXPRESSION</literal>.</para>
212 <para>Parameters must be set before module load in the file
213 <literal>/etc/modprobe.d/lustre.conf</literal>. For example:
214 <example><title>lustre.conf</title>
215 <screen>options lnet networks=tcp0(eth0)
216 options mdt mds_num_cpts=[0]</screen>
220 <section xml:id="dbdoclet.50438272_73839">
223 <primary>LNet</primary>
224 <secondary>tuning</secondary>
227 <primary>tuning</primary>
228 <secondary>LNet</secondary>
229 </indexterm>Tuning LNet Parameters</title>
230 <para>This section describes LNet tunables, the use of which may be
231 necessary on some systems to improve performance. To test the performance
232 of your Lustre network, see
233 <xref linkend='lnetselftest' />.</para>
235 <title>Transmit and Receive Buffer Size</title>
236 <para>The kernel allocates buffers for sending and receiving messages on
239 <literal>ksocklnd</literal> has separate parameters for the transmit and
240 receive buffers.</para>
242 options ksocklnd tx_buffer_size=0 rx_buffer_size=0
244 <para>If these parameters are left at the default value (0), the system
245 automatically tunes the transmit and receive buffer size. In almost every
246 case, this default produces the best performance. Do not attempt to tune
247 these parameters unless you are a network expert.</para>
250 <title>Hardware Interrupts (
251 <literal>enable_irq_affinity</literal>)</title>
252 <para>The hardware interrupts that are generated by network adapters may
253 be handled by any CPU in the system. In some cases, we would like network
254 traffic to remain local to a single CPU to help keep the processor cache
255 warm and minimize the impact of context switches. This is helpful when an
256 SMP system has more than one network interface and ideal when the number
257 of interfaces equals the number of CPUs. To enable the
258 <literal>enable_irq_affinity</literal> parameter, enter:</para>
260 options ksocklnd enable_irq_affinity=1
262 <para>In other cases, if you have an SMP platform with a single fast
263 interface such as 10 Gb Ethernet and more than two CPUs, you may see
264 performance improve by turning this parameter off.</para>
266 options ksocklnd enable_irq_affinity=0
268 <para>By default, this parameter is off. As always, you should test the
269 performance to compare the impact of changing this parameter.</para>
271 <section condition='l23'>
274 <primary>tuning</primary>
275 <secondary>Network interface binding</secondary>
276 </indexterm>Binding Network Interface Against CPU Partitions</title>
277 <para>Lustre software release 2.3 and beyond provide enhanced network
278 interface control. The enhancement means that an administrator can bind
279 an interface to one or more CPU partitions. Bindings are specified as
280 options to the LNet modules. For more information on specifying module
282 <xref linkend="dbdoclet.50438293_15350" /></para>
284 <literal>o2ib0(ib0)[0,1]</literal> will ensure that all messages for
285 <literal>o2ib0</literal> will be handled by LND threads executing on
286 <literal>CPT0</literal> and
287 <literal>CPT1</literal>. An additional example might be:
288 <literal>tcp1(eth0)[0]</literal>. Messages for
289 <literal>tcp1</literal> are handled by threads on
290 <literal>CPT0</literal>.</para>
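<para>A hedged example of a complete LNet module option combining
interface and CPT bindings (the interface names are illustrative) might
be:</para>
<screen>
options lnet networks="tcp1(eth0)[0],o2ib0(ib0)[0,1]"
</screen>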
295 <primary>tuning</primary>
296 <secondary>Network interface credits</secondary>
297 </indexterm>Network Interface Credits</title>
298 <para>Network interface (NI) credits are shared across all CPU partitions
299 (CPT). For example, if a machine has four CPTs and the number of NI
300 credits is 512, then each partition has 128 credits. If a large number of
301 CPTs exist on the system, LNet checks and validates the NI credits for
302 each CPT to ensure each CPT has a workable number of credits. For
303 example, if a machine has 16 CPTs and the number of NI credits is 256,
304 then each partition only has 16 credits. 16 NI credits is low and could
negatively impact performance. As a result, LNet automatically adjusts
the NI credits so that each partition has at least 8 *
<literal>peer_credits</literal> (
<literal>peer_credits</literal> is 8 by default), that is, at least 64
credits.</para>
310 <para>Increasing the number of
311 <literal>credits</literal>/
312 <literal>peer_credits</literal> can improve the performance of high
313 latency networks (at the cost of consuming more memory) by enabling LNet
314 to send more inflight messages to a specific network/peer and keep the
315 pipeline saturated.</para>
316 <para>An administrator can modify the NI credit count using
<literal>ksocklnd</literal> or
318 <literal>ko2iblnd</literal>. In the example below, 256 credits are
319 applied to TCP connections.</para>
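<para>A minimal sketch of such a setting, assuming the
<literal>ksocklnd</literal> <literal>credits</literal> module parameter,
is:</para>
<screen>
options ksocklnd credits=256
</screen>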
323 <para>Applying 256 credits to IB connections can be achieved with:</para>
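<para>A corresponding sketch for InfiniBand, assuming the
<literal>ko2iblnd</literal> <literal>credits</literal> module parameter,
is:</para>
<screen>
options ko2iblnd credits=256
</screen>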
327 <note condition="l23">
328 <para>In Lustre software release 2.3 and beyond, LNet may revalidate
329 the NI credits, so the administrator's request may not persist.</para>
335 <primary>tuning</primary>
336 <secondary>router buffers</secondary>
337 </indexterm>Router Buffers</title>
338 <para>When a node is set up as an LNet router, three pools of buffers are
339 allocated: tiny, small and large. These pools are allocated per CPU
340 partition and are used to buffer messages that arrive at the router to be
341 forwarded to the next hop. The three different buffer sizes accommodate
342 different size messages.</para>
<para>If a message arrives that can fit in a tiny buffer, then a tiny
buffer is used. If a message does not fit in a tiny buffer but fits in a
small buffer, then a small buffer is used. Finally, if a message does not
fit in either a tiny buffer or a small buffer, a large buffer is
used.</para>
348 <para>Router buffers are shared by all CPU partitions. For a machine with
349 a large number of CPTs, the router buffer number may need to be specified
350 manually for best performance. A low number of router buffers risks
351 starving the CPU partitions of resources.</para>
355 <literal>tiny_router_buffers</literal>: Zero payload buffers used for
356 signals and acknowledgements.</para>
360 <literal>small_router_buffers</literal>: 4 KB payload buffers for
361 small messages</para>
365 <literal>large_router_buffers</literal>: 1 MB maximum payload
366 buffers, corresponding to the recommended RPC size of 1 MB.</para>
369 <para>The default setting for router buffers typically results in
370 acceptable performance. LNet automatically sets a default value to reduce
371 the likelihood of resource starvation. The size of a router buffer can be
372 modified as shown in the example below. In this example, the size of the
373 large buffer is modified using the
374 <literal>large_router_buffers</literal> parameter.</para>
376 lnet large_router_buffers=8192
378 <note condition="l23">
379 <para>In Lustre software release 2.3 and beyond, LNet may revalidate
380 the router buffer setting, so the administrator's request may not
387 <primary>tuning</primary>
388 <secondary>portal round-robin</secondary>
389 </indexterm>Portal Round-Robin</title>
390 <para>Portal round-robin defines the policy LNet applies to deliver
events and messages to the upper layers. The upper layers are PTLRPC
services or LNet selftest.</para>
393 <para>If portal round-robin is disabled, LNet will deliver messages to
394 CPTs based on a hash of the source NID. Hence, all messages from a
395 specific peer will be handled by the same CPT. This can reduce data
396 traffic between CPUs. However, for some workloads, this behavior may
result in poorly balanced loads across the CPUs.</para>
398 <para>If portal round-robin is enabled, LNet will round-robin incoming
399 events across all CPTs. This may balance load better across the CPU but
400 can incur a cross CPU overhead.</para>
<para>The current policy can be changed by an administrator with
<literal>echo <replaceable>value</replaceable> &gt;
/proc/sys/lnet/portal_rotor</literal>. There are four options for
<replaceable>value</replaceable>:
411 <literal>OFF</literal>
413 <para>Disable portal round-robin on all incoming requests.</para>
417 <literal>ON</literal>
419 <para>Enable portal round-robin on all incoming requests.</para>
423 <literal>RR_RT</literal>
425 <para>Enable portal round-robin only for routed messages.</para>
429 <literal>HASH_RT</literal>
<para>Routed messages will be delivered to the upper layer by hash of
source NID (instead of the NID of the router). This is the default
value.</para>
438 <title>LNet Peer Health</title>
439 <para>Two options are available to help determine peer health:
443 <literal>peer_timeout</literal>- The timeout (in seconds) before an
444 aliveness query is sent to a peer. For example, if
445 <literal>peer_timeout</literal> is set to
446 <literal>180sec</literal>, an aliveness query is sent to the peer
447 every 180 seconds. This feature only takes effect if the node is
448 configured as an LNet router.</para>
449 <para>In a routed environment, the
450 <literal>peer_timeout</literal> feature should always be on (set to a
451 value in seconds) on routers. If the router checker has been enabled,
452 the feature should be turned off by setting it to 0 on clients and
454 <para>For a non-routed scenario, enabling the
455 <literal>peer_timeout</literal> option provides health information
456 such as whether a peer is alive or not. For example, a client is able
457 to determine if an MGS or OST is up when it sends it a message. If a
458 response is received, the peer is alive; otherwise a timeout occurs
459 when the request is made.</para>
461 <literal>peer_timeout</literal> should be set to no less than the LND
462 timeout setting. For more information about LND timeouts, see
463 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
464 linkend="section_c24_nt5_dl" />.</para>
<literal>o2iblnd</literal> (IB) driver is used,
<literal>peer_timeout</literal> should be at least twice the value of the
<literal>ko2iblnd</literal> keepalive option. For more information
470 about keepalive options, see
471 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
472 linkend="section_ngq_qhy_zl" />.</para>
476 <literal>avoid_asym_router_failure</literal>– When set to 1, the
477 router checker running on the client or a server periodically pings
478 all the routers corresponding to the NIDs identified in the routes
479 parameter setting on the node to determine the status of each router
480 interface. The default setting is 1. (For more information about the
481 LNet routes parameter, see
482 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
483 linkend="dbdoclet.50438216_71227" /></para>
484 <para>A router is considered down if any of its NIDs are down. For
485 example, router X has three NIDs:
486 <literal>Xnid1</literal>,
487 <literal>Xnid2</literal>, and
488 <literal>Xnid3</literal>. A client is connected to the router via
489 <literal>Xnid1</literal>. The client has router checker enabled. The
490 router checker periodically sends a ping to the router via
491 <literal>Xnid1</literal>. The router responds to the ping with the
492 status of each of its NIDs. In this case, it responds with
493 <literal>Xnid1=up</literal>,
494 <literal>Xnid2=up</literal>,
495 <literal>Xnid3=down</literal>. If
496 <literal>avoid_asym_router_failure==1</literal>, the router is
497 considered down if any of its NIDs are down, so router X is
498 considered down and will not be used for routing messages. If
499 <literal>avoid_asym_router_failure==0</literal>, router X will
500 continue to be used for routing messages.</para>
502 </itemizedlist></para>
503 <para>The following router checker parameters must be set to the maximum
504 value of the corresponding setting for this option on any client or
509 <literal>dead_router_check_interval</literal>
514 <literal>live_router_check_interval</literal>
519 <literal>router_ping_timeout</literal>
522 </itemizedlist></para>
523 <para>For example, the
524 <literal>dead_router_check_interval</literal> parameter on any router must
528 <section xml:id="dbdoclet.libcfstuning" condition='l23'>
531 <primary>tuning</primary>
532 <secondary>libcfs</secondary>
533 </indexterm>libcfs Tuning</title>
534 <para>Lustre software release 2.3 introduced binding service threads via
535 CPU Partition Tables (CPTs). This allows the system administrator to
536 fine-tune on which CPU cores the Lustre service threads are run, for both
537 OSS and MDS services, as well as on the client.
539 <para>CPTs are useful to reserve some cores on the OSS or MDS nodes for
540 system functions such as system monitoring, HA heartbeat, or similar
541 tasks. On the client it may be useful to restrict Lustre RPC service
542 threads to a small subset of cores so that they do not interfere with
543 computation, or because these cores are directly attached to the network
546 <para>By default, the Lustre software will automatically generate CPU
547 partitions (CPT) based on the number of CPUs in the system.
548 The CPT count can be explicitly set on the libcfs module using
549 <literal>cpu_npartitions=<replaceable>NUMBER</replaceable></literal>.
550 The value of <literal>cpu_npartitions</literal> must be an integer between
551 1 and the number of online CPUs.
553 <para condition='l29'>In Lustre 2.9 and later the default is to use
554 one CPT per NUMA node. In earlier versions of Lustre, by default there
555 was a single CPT if the online CPU core count was four or fewer, and
556 additional CPTs would be created depending on the number of CPU cores,
557 typically with 4-8 cores per CPT.
560 <para>Setting <literal>cpu_npartitions=1</literal> will disable most
561 of the SMP Node Affinity functionality.</para>
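<para>A minimal sketch of explicitly setting the CPT count in the libcfs
module options (the value 4 is only an example) would be:</para>
<screen>
options libcfs cpu_npartitions=4
</screen>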
564 <title>CPU Partition String Patterns</title>
565 <para>CPU partitions can be described using string pattern notation.
566 If <literal>cpu_pattern=N</literal> is used, then there will be one
567 CPT for each NUMA node in the system, with each CPT mapping all of
568 the CPU cores for that NUMA node.
570 <para>It is also possible to explicitly specify the mapping between
571 CPU cores and CPTs, for example:</para>
<literal>cpu_pattern="0[2,4,6] 1[3,5,7]"</literal>
<para>Create two CPTs: CPT0 contains cores 2, 4, and 6, while CPT1
contains cores 3, 5, and 7. CPU cores 0 and 1 will not be used by Lustre
579 service threads, and could be used for node services such as
580 system monitoring, HA heartbeat threads, etc. The binding of
581 non-Lustre services to those CPU cores may be done in userspace
582 using <literal>numactl(8)</literal> or other application-specific
583 methods, but is beyond the scope of this document.</para>
<literal>cpu_pattern="N 0[0-3] 1[4-7]"</literal>
<para>Create two CPTs, with CPT0 containing all CPUs in NUMA
nodes [0-3], while CPT1 contains all CPUs in NUMA nodes [4-7].</para>
<para>The current configuration of the CPU partitions can be read via
<literal>lctl get_param cpu_partition_table</literal>. For example,
595 a simple 4-core system has a single CPT with all four CPU cores:
596 <screen>$ lctl get_param cpu_partition_table
597 cpu_partition_table=0 : 0 1 2 3</screen>
598 while a larger NUMA system with four 12-core CPUs may have four CPTs:
599 <screen>$ lctl get_param cpu_partition_table
601 0 : 0 1 2 3 4 5 6 7 8 9 10 11
602 1 : 12 13 14 15 16 17 18 19 20 21 22 23
603 2 : 24 25 26 27 28 29 30 31 32 33 34 35
604 3 : 36 37 38 39 40 41 42 43 44 45 46 47
609 <section xml:id="dbdoclet.lndtuning">
612 <primary>tuning</primary>
613 <secondary>LND tuning</secondary>
614 </indexterm>LND Tuning</title>
615 <para>LND tuning allows the number of threads per CPU partition to be
616 specified. An administrator can set the threads for both
617 <literal>ko2iblnd</literal> and
618 <literal>ksocklnd</literal> using the
619 <literal>nscheds</literal> parameter. This adjusts the number of threads for
620 each partition, not the overall number of threads on the LND.</para>
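<para>For example (the value is illustrative only), the number of
scheduler threads per CPT for the InfiniBand LND could be set at module
load time with:</para>
<screen>
options ko2iblnd nscheds=4
</screen>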
622 <para>Lustre software release 2.3 has greatly decreased the default
623 number of threads for
624 <literal>ko2iblnd</literal> and
625 <literal>ksocklnd</literal> on high-core count machines. The current
626 default values are automatically set and are chosen to work well across a
627 number of typical scenarios.</para>
630 <title>ko2iblnd Tuning</title>
<para>The following table outlines the ko2iblnd module parameters to be used for tuning:</para>
633 <informaltable frame="all">
635 <colspec colname="c1" colwidth="50*" />
636 <colspec colname="c2" colwidth="50*" />
637 <colspec colname="c3" colwidth="50*" />
642 <emphasis role="bold">Module Parameter</emphasis>
647 <emphasis role="bold">Default Value</emphasis>
652 <emphasis role="bold">Description</emphasis>
661 <literal>service</literal>
666 <literal>987</literal>
670 <para>Service number (within RDMA_PS_TCP).</para>
676 <literal>cksum</literal>
685 <para>Set non-zero to enable message (not RDMA) checksums.</para>
691 <literal>timeout</literal>
696 <literal>50</literal>
700 <para>Timeout in seconds.</para>
706 <literal>nscheds</literal>
715 <para>Number of threads in each scheduler pool (per CPT). Value of
716 zero means we derive the number from the number of cores.</para>
722 <literal>conns_per_peer</literal>
727 <literal>4 (OmniPath), 1 (Everything else)</literal>
731 <para>Introduced in 2.10. Number of connections to each peer. Messages
are sent round-robin over the connection pool. Provides significant
733 improvement with OmniPath.</para>
739 <literal>ntx</literal>
744 <literal>512</literal>
748 <para>Number of message descriptors allocated for each pool at
749 startup. Grows at runtime. Shared by all CPTs.</para>
755 <literal>credits</literal>
760 <literal>256</literal>
764 <para>Number of concurrent sends on network.</para>
770 <literal>peer_credits</literal>
779 <para>Number of concurrent sends to 1 peer. Related/limited by IB
786 <literal>peer_credits_hiw</literal>
<para>High-water mark at which credits are eagerly returned to the peer.</para>
801 <literal>peer_buffer_credits</literal>
<para>Number of per-peer router buffer credits.</para>
816 <literal>peer_timeout</literal>
821 <literal>180</literal>
825 <para>Seconds without aliveness news to declare peer dead (less than
826 or equal to 0 to disable).</para>
832 <literal>ipif_name</literal>
837 <literal>ib0</literal>
841 <para>IPoIB interface name.</para>
847 <literal>retry_count</literal>
856 <para>Retransmissions when no ACK received.</para>
862 <literal>rnr_retry_count</literal>
871 <para>RNR retransmissions.</para>
877 <literal>keepalive</literal>
882 <literal>100</literal>
886 <para>Idle time in seconds before sending a keepalive.</para>
892 <literal>ib_mtu</literal>
901 <para>IB MTU 256/512/1024/2048/4096.</para>
907 <literal>concurrent_sends</literal>
916 <para>Send work-queue sizing. If zero, derived from
917 <literal>map_on_demand</literal> and <literal>peer_credits</literal>.
924 <literal>map_on_demand</literal>
929 <literal>0 (pre-4.8 Linux) 1 (4.8 Linux onward) 32 (OmniPath)</literal>
933 <para>Number of fragments reserved for connection. If zero, use
global memory region (found to be a security issue). If non-zero, use
935 FMR or FastReg for memory registration. Value needs to agree between
936 both peers of connection.</para>
942 <literal>fmr_pool_size</literal>
947 <literal>512</literal>
951 <para>Size of fmr pool on each CPT (>= ntx / 4). Grows at runtime.
958 <literal>fmr_flush_trigger</literal>
963 <literal>384</literal>
<para>Number of dirty FMRs that triggers a pool flush.</para>
973 <literal>fmr_cache</literal>
982 <para>Non-zero to enable FMR caching.</para>
988 <literal>dev_failover</literal>
997 <para>HCA failover for bonding (0 OFF, 1 ON, other values reserved).
1004 <literal>require_privileged_port</literal>
1009 <literal>0</literal>
1013 <para>Require privileged port when accepting connection.</para>
1019 <literal>use_privileged_port</literal>
1024 <literal>1</literal>
1028 <para>Use privileged port when initiating connection.</para>
1034 <literal>wrq_sge</literal>
1039 <literal>2</literal>
<para>Introduced in 2.10. Number of scatter/gather element groups per
work request. Used to deal with fragmentation, which can consume
double the number of work requests.</para>
1053 <section xml:id="dbdoclet.nrstuning" condition='l24'>
1056 <primary>tuning</primary>
1057 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1058 </indexterm>Network Request Scheduler (NRS) Tuning</title>
1059 <para>The Network Request Scheduler (NRS) allows the administrator to
1060 influence the order in which RPCs are handled at servers, on a per-PTLRPC
1061 service basis, by providing different policies that can be activated and
1062 tuned in order to influence the RPC ordering. The aim of this is to provide
1063 for better performance, and possibly discrete performance characteristics
1064 using future policies.</para>
1065 <para>The NRS policy state of a PTLRPC service can be read and set via the
1066 <literal>{service}.nrs_policies</literal> tunable. To read a PTLRPC
1067 service's NRS policy state, run:</para>
1069 lctl get_param {service}.nrs_policies
1071 <para>For example, to read the NRS policy state of the
1072 <literal>ost_io</literal> service, run:</para>
1074 $ lctl get_param ost.OSS.ost_io.nrs_policies
1075 ost.OSS.ost_io.nrs_policies=
1102 high_priority_requests:
1128 <para>NRS policy state is shown in either one or two sections, depending on
1129 the PTLRPC service being queried. The first section is named
1130 <literal>regular_requests</literal> and is available for all PTLRPC
1131 services, optionally followed by a second section which is named
1132 <literal>high_priority_requests</literal>. This is because some PTLRPC
1133 services are able to treat some types of RPCs as higher priority ones, such
1134 that they are handled by the server with higher priority compared to other,
1135 regular RPC traffic. For PTLRPC services that do not support high-priority
1136 RPCs, you will only see the
1137 <literal>regular_requests</literal> section.</para>
1138 <para>There is a separate instance of each NRS policy on each PTLRPC
1139 service for handling regular and high-priority RPCs (if the service
1140 supports high-priority RPCs). For each policy instance, the following
1141 fields are shown:</para>
1142 <informaltable frame="all">
1144 <colspec colname="c1" colwidth="50*" />
1145 <colspec colname="c2" colwidth="50*" />
1150 <emphasis role="bold">Field</emphasis>
1155 <emphasis role="bold">Description</emphasis>
1164 <literal>name</literal>
1168 <para>The name of the policy.</para>
1174 <literal>state</literal>
1178 <para>The state of the policy; this can be any of
1179 <literal>invalid, stopping, stopped, starting, started</literal>.
1180 A fully enabled policy is in the
1181 <literal>started</literal> state.</para>
1187 <literal>fallback</literal>
1191 <para>Whether the policy is acting as a fallback policy or not. A
1192 fallback policy is used to handle RPCs that other enabled
1193 policies fail to handle, or do not support the handling of. The
1195 <literal>no, yes</literal>. Currently, only the FIFO policy can
1196 act as a fallback policy.</para>
1202 <literal>queued</literal>
1206 <para>The number of RPCs that the policy has waiting to be
1213 <literal>active</literal>
1217 <para>The number of RPCs that the policy is currently
1224 <para>To enable an NRS policy on a PTLRPC service run:</para>
1226 lctl set_param {service}.nrs_policies=
1227 <replaceable>policy_name</replaceable>
1229 <para>This will enable the policy
<replaceable>policy_name</replaceable> for both regular and high-priority
RPCs (if the PTLRPC service supports high-priority RPCs) on the given
1232 service. For example, to enable the CRR-N NRS policy for the ldlm_cbd
1233 service, run:</para>
1235 $ lctl set_param ldlm.services.ldlm_cbd.nrs_policies=crrn
1236 ldlm.services.ldlm_cbd.nrs_policies=crrn
<para>For PTLRPC services that support high-priority RPCs, you can also
specify the optional <replaceable>reg|hp</replaceable> token, in order to enable an NRS policy
1242 for handling only regular or high-priority RPCs on a given PTLRPC service,
1245 lctl set_param {service}.nrs_policies="
1246 <replaceable>policy_name</replaceable>
1247 <replaceable>reg|hp</replaceable>"
1249 <para>For example, to enable the TRR policy for handling only regular, but
1250 not high-priority RPCs on the
1251 <literal>ost_io</literal> service, run:</para>
1253 $ lctl set_param ost.OSS.ost_io.nrs_policies="trr reg"
1254 ost.OSS.ost_io.nrs_policies="trr reg"
1258 <para>When enabling an NRS policy, the policy name must be given in
1259 lower-case characters, otherwise the operation will fail with an error
1265 <primary>tuning</primary>
1266 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1267 <tertiary>first in, first out (FIFO) policy</tertiary>
1268 </indexterm>First In, First Out (FIFO) policy</title>
1269 <para>The first in, first out (FIFO) policy handles RPCs in a service in
1270 the same order as they arrive from the LNet layer, so no special
1271 processing takes place to modify the RPC handling stream. FIFO is the
1272 default policy for all types of RPCs on all PTLRPC services, and is
1273 always enabled irrespective of the state of other policies, so that it
1274 can be used as a backup policy, in case a more elaborate policy that has
1275 been enabled fails to handle an RPC, or does not support handling a given
1277 <para>The FIFO policy has no tunables that adjust its behaviour.</para>
1282 <primary>tuning</primary>
1283 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1284 <tertiary>client round-robin over NIDs (CRR-N) policy</tertiary>
1285 </indexterm>Client Round-Robin over NIDs (CRR-N) policy</title>
1286 <para>The client round-robin over NIDs (CRR-N) policy performs batched
1287 round-robin scheduling of all types of RPCs, with each batch consisting
1288 of RPCs originating from the same client node, as identified by its NID.
1289 CRR-N aims to provide for better resource utilization across the cluster,
1290 and to help shorten completion times of jobs in some cases, by
1291 distributing available bandwidth more evenly across all clients.</para>
1292 <para>The CRR-N policy can be enabled on all types of PTLRPC services,
1293 and has the following tunable that can be used to adjust its
1298 <literal>{service}.nrs_crrn_quantum</literal>
1301 <literal>{service}.nrs_crrn_quantum</literal> tunable determines the
1302 maximum allowed size of each batch of RPCs; the unit of measure is in
1303 number of RPCs. To read the maximum allowed batch size of a CRR-N
1306 lctl get_param {service}.nrs_crrn_quantum
1308 <para>For example, to read the maximum allowed batch size of a CRR-N
1309 policy on the ost_io service, run:</para>
1311 $ lctl get_param ost.OSS.ost_io.nrs_crrn_quantum
1312 ost.OSS.ost_io.nrs_crrn_quantum=reg_quantum:16
1316 <para>You can see that there is a separate maximum allowed batch size
1318 <literal>reg_quantum</literal>) and high-priority (
1319 <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
1320 high-priority RPCs).</para>
1321 <para>To set the maximum allowed batch size of a CRR-N policy on a
1322 given service, run:</para>
1324 lctl set_param {service}.nrs_crrn_quantum=
1325 <replaceable>1-65535</replaceable>
1327 <para>This will set the maximum allowed batch size on a given
service, for both regular and high-priority RPCs (if the PTLRPC
1329 service supports high-priority RPCs), to the indicated value.</para>
1330 <para>For example, to set the maximum allowed batch size on the
1331 ldlm_canceld service to 16 RPCs, run:</para>
1333 $ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
1334 ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
1337 <para>For PTLRPC services that support high-priority RPCs, you can
1338 also specify a different maximum allowed batch size for regular and
1339 high-priority RPCs, by running:</para>
1341 $ lctl set_param {service}.nrs_crrn_quantum=
1342 <replaceable>reg_quantum|hp_quantum</replaceable>:
<replaceable>1-65535</replaceable>
1345 <para>For example, to set the maximum allowed batch size on the
1346 ldlm_canceld service, for high-priority RPCs to 32, run:</para>
1348 $ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum="hp_quantum:32"
1349 ldlm.services.ldlm_canceld.nrs_crrn_quantum=hp_quantum:32
1352 <para>By using the last method, you can also set the maximum regular
1353 and high-priority RPC batch sizes to different values, in a single
1354 command invocation.</para>
1361 <primary>tuning</primary>
1362 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1363 <tertiary>object-based round-robin (ORR) policy</tertiary>
1364 </indexterm>Object-based Round-Robin (ORR) policy</title>
1365 <para>The object-based round-robin (ORR) policy performs batched
1366 round-robin scheduling of bulk read write (brw) RPCs, with each batch
1367 consisting of RPCs that pertain to the same backend-file system object,
1368 as identified by its OST FID.</para>
1369 <para>The ORR policy is only available for use on the ost_io service. The
1370 RPC batches it forms can potentially consist of mixed bulk read and bulk
1371 write RPCs. The RPCs in each batch are ordered in an ascending manner,
1372 based on either the file offsets, or the physical disk offsets of each
1373 RPC (only applicable to bulk read RPCs).</para>
1374 <para>The aim of the ORR policy is to provide for increased bulk read
1375 throughput in some cases, by ordering bulk read RPCs (and potentially
1376 bulk write RPCs), and thus minimizing costly disk seek operations.
1377 Performance may also benefit from any resulting improvement in resource
1378 utilization, or by taking advantage of better locality of reference
1379 between RPCs.</para>
1380 <para>The ORR policy has the following tunables that can be used to
1381 adjust its behaviour:</para>
1385 <literal>ost.OSS.ost_io.nrs_orr_quantum</literal>
1388 <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable determines
1389 the maximum allowed size of each batch of RPCs; the unit of measure
1390 is in number of RPCs. To read the maximum allowed batch size of the
1391 ORR policy, run:</para>
1393 $ lctl get_param ost.OSS.ost_io.nrs_orr_quantum
1394 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:256
1398 <para>You can see that there is a separate maximum allowed batch size
1400 <literal>reg_quantum</literal>) and high-priority (
1401 <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
1402 high-priority RPCs).</para>
1403 <para>To set the maximum allowed batch size for the ORR policy,
1406 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
1407 <replaceable>1-65535</replaceable>
1409 <para>This will set the maximum allowed batch size for both regular
1410 and high-priority RPCs, to the indicated value.</para>
1411 <para>You can also specify a different maximum allowed batch size for
1412 regular and high-priority RPCs, by running:</para>
1414 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
1415 <replaceable>reg_quantum|hp_quantum</replaceable>:
1416 <replaceable>1-65535</replaceable>
1418 <para>For example, to set the maximum allowed batch size for regular
1419 RPCs to 128, run:</para>
1421 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
1422 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
1425 <para>By using the last method, you can also set the maximum regular
1426 and high-priority RPC batch sizes to different values, in a single
1427 command invocation.</para>
1431 <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal>
1434 <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable
1435 determines whether the ORR policy orders RPCs within each batch based
1436 on logical file offsets or physical disk offsets. To read the offset
1437 type value for the ORR policy, run:</para>
1439 $ lctl get_param ost.OSS.ost_io.nrs_orr_offset_type
1440 ost.OSS.ost_io.nrs_orr_offset_type=reg_offset_type:physical
1441 hp_offset_type:logical
1444 <para>You can see that there is a separate offset type value for
1446 <literal>reg_offset_type</literal>) and high-priority (
1447 <literal>hp_offset_type</literal>) RPCs.</para>
1448 <para>To set the ordering type for the ORR policy, run:</para>
1450 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
1451 <replaceable>physical|logical</replaceable>
1453 <para>This will set the offset type for both regular and
1454 high-priority RPCs, to the indicated value.</para>
1455 <para>You can also specify a different offset type for regular and
1456 high-priority RPCs, by running:</para>
1458 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
1459 <replaceable>reg_offset_type|hp_offset_type</replaceable>:
1460 <replaceable>physical|logical</replaceable>
1462 <para>For example, to set the offset type for high-priority RPCs to
1463 physical disk offsets, run:</para>
1465 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
1466 ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
1468 <para>By using the last method, you can also set offset type for
1469 regular and high-priority RPCs to different values, in a single
1470 command invocation.</para>
<para>Irrespective of the value of this tunable, only logical
offsets can be, and are, used for ordering bulk write RPCs.</para>
1478 <literal>ost.OSS.ost_io.nrs_orr_supported</literal>
1481 <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable determines
1482 the type of RPCs that the ORR policy will handle. To read the types
1483 of supported RPCs by the ORR policy, run:</para>
1485 $ lctl get_param ost.OSS.ost_io.nrs_orr_supported
1486 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads
hp_supported:reads_and_writes
1490 <para>You can see that there is a separate supported 'RPC types'
1492 <literal>reg_supported</literal>) and high-priority (
1493 <literal>hp_supported</literal>) RPCs.</para>
1494 <para>To set the supported RPC types for the ORR policy, run:</para>
1496 $ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
1497 <replaceable>reads|writes|reads_and_writes</replaceable>
1499 <para>This will set the supported RPC types for both regular and
1500 high-priority RPCs, to the indicated value.</para>
1501 <para>You can also specify a different supported 'RPC types' value
1502 for regular and high-priority RPCs, by running:</para>
1504 $ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
1505 <replaceable>reg_supported|hp_supported</replaceable>:
1506 <replaceable>reads|writes|reads_and_writes</replaceable>
1508 <para>For example, to set the supported RPC types to bulk read and
1509 bulk write RPCs for regular requests, run:</para>
1512 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
1513 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
1516 <para>By using the last method, you can also set the supported RPC
1517 types for regular and high-priority RPC to different values, in a
1518 single command invocation.</para>
1525 <primary>tuning</primary>
1526 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1527 <tertiary>Target-based round-robin (TRR) policy</tertiary>
1528 </indexterm>Target-based Round-Robin (TRR) policy</title>
1529 <para>The target-based round-robin (TRR) policy performs batched
1530 round-robin scheduling of brw RPCs, with each batch consisting of RPCs
1531 that pertain to the same OST, as identified by its OST index.</para>
1532 <para>The TRR policy is identical to the object-based round-robin (ORR)
1533 policy, apart from using the brw RPC's target OST index instead of the
1534 backend-fs object's OST FID, for determining the RPC scheduling order.
1535 The goals of TRR are effectively the same as for ORR, and it uses the
1536 following tunables to adjust its behaviour:</para>
1540 <literal>ost.OSS.ost_io.nrs_trr_quantum</literal>
1542 <para>The purpose of this tunable is exactly the same as for the
1543 <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable for the ORR
1544 policy, and you can use it in exactly the same way.</para>
1548 <literal>ost.OSS.ost_io.nrs_trr_offset_type</literal>
1550 <para>The purpose of this tunable is exactly the same as for the
1551 <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable for the
1552 ORR policy, and you can use it in exactly the same way.</para>
1556 <literal>ost.OSS.ost_io.nrs_trr_supported</literal>
1558 <para>The purpose of this tunable is exactly the same as for the
1559 <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable for the
ORR policy, and you can use it in exactly the same way.</para>
1564 <section xml:id="dbdoclet.tbftuning" condition='l26'>
1567 <primary>tuning</primary>
1568 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1569 <tertiary>Token Bucket Filter (TBF) policy</tertiary>
1570 </indexterm>Token Bucket Filter (TBF) policy</title>
1571 <para>The TBF (Token Bucket Filter) is a Lustre NRS policy which enables
1572 Lustre services to enforce the RPC rate limit on clients/jobs for QoS
1573 (Quality of Service) purposes.</para>
1575 <title>The internal structure of TBF policy</title>
1578 <imagedata scalefit="1" width="100%"
1579 fileref="figures/TBF_policy.svg" />
1582 <phrase>The internal structure of TBF policy</phrase>
<para>When an RPC request arrives, the TBF policy puts it into a waiting
queue according to its classification. The classification of RPC requests
is based on either the NID or the JobID of the RPC, depending on how TBF
is configured. The TBF policy maintains multiple queues in the system, one
queue for each category in the classification of RPC requests. The
requests wait for tokens in the FIFO queue before they are handled, so as
to keep the RPC rates under the limits.</para>
<para>When Lustre services are too busy to handle all of the requests in
time, the specified rates of the queues will not all be satisfied. Nothing
bad happens except that some of the RPC rates are slower than configured.
In this case, a queue with a higher rate will have an advantage over
queues with lower rates, but none of them will be starved.</para>
<para>To manage the RPC rate of the queues, we do not need to set the
rate of each queue manually. Instead, we define rules which the TBF policy
matches to determine RPC rate limits. All of the defined rules are
organized as an ordered list. Whenever a queue is newly created, it goes
through the rule list and takes the first matching rule as its rule, so
that the queue knows its RPC token rate. A rule can be added to or removed
from the list at run time. Whenever the list of rules is changed, the
queues will update their matched rules.</para>
1610 <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>
<para>The format of the rule start command of the TBF policy is as follows:</para>
1615 $ lctl set_param x.x.x.nrs_tbf_rule=
1616 "[reg|hp] start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>..."
<replaceable>rule_name</replaceable>' argument is a string which
identifies a rule. The format of the '
<replaceable>arguments</replaceable>' varies according to the
type of the TBF policy. For the NID-based TBF policy, the format is as follows:</para>
1625 $ lctl set_param x.x.x.nrs_tbf_rule=
1626 "[reg|hp] start <replaceable>rule_name</replaceable> {<replaceable>nidlist</replaceable>} <replaceable>rate</replaceable>"
<para>The format of the '
<replaceable>nidlist</replaceable>' argument is the same as the
format used when configuring an LNet route. The '
<replaceable>rate</replaceable>' argument is the RPC rate of the
rule, meaning the upper limit on the number of requests per second.</para>
<para>The following commands are valid. Note that a newly started
rule takes precedence over older rules, so the order in which rules are
started is also critical.</para>
1637 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1638 "start other_clients {192.168.*.*@tcp} 50"
1641 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1642 "start loginnode {192.168.1.1@tcp} 100"
<para>A general rule can be replaced by two rules (reg and hp) as follows:</para>
1647 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1648 "reg start loginnode {192.168.1.1@tcp} 100"
1651 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1652 "hp start loginnode {192.168.1.1@tcp} 100"
1655 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1656 "start computes {192.168.1.[2-128]@tcp} 500"
<para>The above rules set an upper limit such that the servers will
process at most 5x as many RPCs from compute nodes as from login nodes.</para>
<para>For the JobID-based TBF policy (please see
<xref xmlns:xlink="http://www.w3.org/1999/xlink"
linkend="dbdoclet.jobstats" /> for more details), the
format is as follows:</para>
1665 $ lctl set_param x.x.x.nrs_tbf_rule=
1666 "[reg|hp] start <replaceable>name</replaceable> {<replaceable>jobid_list</replaceable>} <replaceable>rate</replaceable>"
<para>The following commands are valid:</para>
1670 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1671 "start user1 {iozone.500 dd.500} 100"
1674 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1675 "start iozone_user1 {iozone.500} 100"
<para>As with NID-based rules, the reg and hp rules can be used separately:</para>
1679 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1680 "hp start iozone_user1 {iozone.500} 100"
1683 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1684 "reg start iozone_user1 {iozone.500} 100"
<para>The format of the rule change command of the TBF policy is as follows:</para>
1689 $ lctl set_param x.x.x.nrs_tbf_rule=
1690 "[reg|hp] change <replaceable>rule_name</replaceable> <replaceable>rate</replaceable>"
<para>The following commands are valid:</para>
1694 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="change loginnode 200"
1697 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg change loginnode 200"
1700 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp change loginnode 200"
<para>The format of the rule stop command of the TBF policy is as follows:</para>
1705 $ lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop
1706 <replaceable>rule_name</replaceable>"
<para>The following commands are valid:</para>
1710 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
1713 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode"
1716 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"
1722 <section xml:id="dbdoclet.50438272_25884">
1725 <primary>tuning</primary>
1726 <secondary>lockless I/O</secondary>
1727 </indexterm>Lockless I/O Tunables</title>
1728 <para>The lockless I/O tunable feature allows servers to ask clients to do
1729 lockless I/O (the server does the locking on behalf of clients) for
1730 contended files to avoid lock ping-pong.</para>
1731 <para>The lockless I/O patch introduces these tunables:</para>
1735 <emphasis role="bold">OST-side:</emphasis>
1738 ldlm.namespaces.filter-<replaceable>fsname</replaceable>-*.
<literal>contended_locks</literal> - If the number of lock conflicts
found in the scan of the granted and waiting queues exceeds
<literal>contended_locks</literal>, the resource is considered to be contended.</para>
<literal>contention_seconds</literal> - The time (in seconds) that the
resource keeps itself in the contended state.</para>
<literal>max_nolock_bytes</literal> - Server-side locking is performed
only for requests smaller than the number of bytes set in the
<literal>max_nolock_bytes</literal> parameter. If this tunable is
set to zero (0), it disables server-side locking for read/write requests.</para>
1756 <emphasis role="bold">Client-side:</emphasis>
1759 /proc/fs/lustre/llite/lustre-*
<literal>contention_seconds</literal> - The
<literal>llite</literal> inode remembers its contended state for the
time specified in this parameter.</para>
1768 <emphasis role="bold">Client-side statistics:</emphasis>
1771 <literal>/proc/fs/lustre/llite/lustre-*/stats</literal> file has new
1772 rows for lockless I/O statistics.</para>
<literal>lockless_read_bytes</literal> and
<literal>lockless_write_bytes</literal> - These count the total bytes read
or written. The client makes its own decision based on the request
size: if the request size is smaller than
<literal>min_nolock_size</literal>, the client performs the I/O without
acquiring locks and without communicating with the server.</para>
1784 <section condition="l29">
1787 <primary>tuning</primary>
1788 <secondary>with lfs ladvise</secondary>
1790 Server-Side Advice and Hinting
1792 <section><title>Overview</title>
<para>Use the <literal>lfs ladvise</literal> command to give file access
advice or hints to servers.</para>
1795 <screen>lfs ladvise [--advice|-a ADVICE ] [--background|-b]
1796 [--start|-s START[kMGT]]
1797 {[--end|-e END[kMGT]] | [--length|-l LENGTH[kMGT]]}
1798 <emphasis>file</emphasis> ...
1801 <informaltable frame="all">
1803 <colspec colname="c1" colwidth="50*"/>
1804 <colspec colname="c2" colwidth="50*"/>
1808 <para><emphasis role="bold">Option</emphasis></para>
1811 <para><emphasis role="bold">Description</emphasis></para>
1818 <para><literal>-a</literal>, <literal>--advice=</literal>
1819 <literal>ADVICE</literal></para>
1822 <para>Give advice or hint of type <literal>ADVICE</literal>.
1823 Advice types are:</para>
1824 <para><literal>willread</literal> to prefetch data into server
1826 <para><literal>dontneed</literal> to cleanup data cache on
1832 <para><literal>-b</literal>, <literal>--background</literal>
<para>Enable the advice to be sent and handled asynchronously.
1842 <para><literal>-s</literal>, <literal>--start=</literal>
1843 <literal>START_OFFSET</literal></para>
1846 <para>File range starts from <literal>START_OFFSET</literal>
1852 <para><literal>-e</literal>, <literal>--end=</literal>
1853 <literal>END_OFFSET</literal></para>
1856 <para>File range ends at (not including)
1857 <literal>END_OFFSET</literal>. This option may not be
1858 specified at the same time as the <literal>-l</literal>
1864 <para><literal>-l</literal>, <literal>--length=</literal>
1865 <literal>LENGTH</literal></para>
1868 <para>File range has length of <literal>LENGTH</literal>.
1869 This option may not be specified at the same time as the
1870 <literal>-e</literal> option.</para>
<para>Typically, <literal>lfs ladvise</literal> forwards the advice to
Lustre servers without guaranteeing when and how the servers will react to
the advice. Actions may or may not be triggered when the advice is
received, depending on the type of the advice, as well as the real-time
decision of the affected server-side components.</para>
1882 <para>A typical usage of ladvise is to enable applications and users with
1883 external knowledge to intervene in server-side cache management. For
example, if many different clients are doing small random reads of a
file, prefetching pages into the OSS cache with large linear reads before
the random I/O occurs is a net benefit. Fetching that data into each client
cache with fadvise() may not be, due to much more data being sent to the clients.
1889 <para>The main difference between the Linux <literal>fadvise()</literal>
1890 system call and <literal>lfs ladvise</literal> is that
1891 <literal>fadvise()</literal> is only a client side mechanism that does
1892 not pass the advice to the filesystem, while <literal>ladvise</literal>
can send advice or hints to the Lustre server side.</para>
1895 <section><title>Examples</title>
1896 <para>The following example gives the OST(s) holding the first 1GB of
<literal>/mnt/lustre/file1</literal> a hint that the first 1GB of the
1898 file will be read soon.</para>
1899 <screen>client1$ lfs ladvise -a willread -s 0 -e 1048576000 /mnt/lustre/file1
1901 <para>The following example gives the OST(s) holding the first 1GB of
1902 <literal>/mnt/lustre/file1</literal> a hint that the first 1GB of file
1903 will not be read in the near future, thus the OST(s) could clear the
1904 cache of the file in the memory.</para>
1905 <screen>client1$ lfs ladvise -a dontneed -s 0 -e 1048576000 /mnt/lustre/file1
1909 <section condition="l29">
1912 <primary>tuning</primary>
1913 <secondary>Large Bulk IO</secondary>
1915 Large Bulk IO (16MB RPC)
1917 <section><title>Overview</title>
1918 <para>Beginning with Lustre 2.9, Lustre is extended to support RPCs up
1919 to 16MB in size. By enabling a larger RPC size, fewer RPCs will be
1920 required to transfer the same amount of data between clients and
1921 servers. With a larger RPC size, the OSS can submit more data to the
underlying disks at once, and therefore it can produce larger disk I/Os
1923 to fully utilize the increasing bandwidth of disks.</para>
<para>At client connection time, clients will negotiate with
servers the maximum RPC size that it is possible to use, but the
client can always send RPCs smaller than this maximum.</para>
1927 <para>The parameter <literal>brw_size</literal> is used on the OST
1928 to tell the client the maximum (preferred) IO size. All clients that
1929 talk to this target should never send an RPC greater than this size.
1930 Clients can individually set a smaller RPC size limit via the
1931 <literal>osc.*.max_pages_per_rpc</literal> tunable.
1934 <para>The smallest <literal>brw_size</literal> that can be set for
1935 ZFS OSTs is the <literal>recordsize</literal> of that dataset. This
1936 ensures that the client can always write a full ZFS file block if it
1937 has enough dirty data, and does not otherwise force it to do read-
1938 modify-write operations for every RPC.
1942 <section><title>Usage</title>
1943 <para>In order to enable a larger RPC size,
1944 <literal>brw_size</literal> must be changed to an IO size value up to
1945 16MB. To temporarily change <literal>brw_size</literal>, the
1946 following command should be run on the OSS:</para>
1947 <screen>oss# lctl set_param obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
1948 <para>To persistently change <literal>brw_size</literal>, one of the following
1949 commands should be run on the OSS:</para>
1950 <screen>oss# lctl set_param -P obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
1951 <screen>oss# lctl conf_param <replaceable>fsname</replaceable>-OST*.obdfilter.brw_size=16</screen>
1952 <para>When a client connects to an OST target, it will fetch
1953 <literal>brw_size</literal> from the target and pick the maximum value
1954 of <literal>brw_size</literal> and its local setting for
1955 <literal>max_pages_per_rpc</literal> as the actual RPC size.
1956 Therefore, the <literal>max_pages_per_rpc</literal> on the client side
1957 would have to be set to 16M, or 4096 if the PAGESIZE is 4KB, to enable
1958 a 16MB RPC. To temporarily make the change, the following command
1959 should be run on the client to set
1960 <literal>max_pages_per_rpc</literal>:</para>
1961 <screen>client$ lctl set_param osc.<replaceable>fsname</replaceable>-OST*.max_pages_per_rpc=16M</screen>
1962 <para>To persistently make this change, the following command should
1964 <screen>client$ lctl conf_param <replaceable>fsname</replaceable>-OST*.osc.max_pages_per_rpc=16M</screen>
1965 <caution><para>The <literal>brw_size</literal> of an OST can be
1966 changed on the fly. However, clients have to be remounted to
1967 renegotiate the new maximum RPC size.</para></caution>
1970 <section xml:id="dbdoclet.50438272_80545">
1973 <primary>tuning</primary>
1974 <secondary>for small files</secondary>
1975 </indexterm>Improving Lustre I/O Performance for Small Files</title>
1976 <para>An environment where an application writes small file chunks from
1977 many clients to a single file can result in poor I/O performance. To
1978 improve the performance of the Lustre file system with small files:</para>
<para>Have the application aggregate writes for some amount of time or
data before submitting them to the Lustre file system. By default, the Lustre
1983 software enforces POSIX coherency semantics, so it results in lock
1984 ping-pong between client nodes if they are all writing to the same
1985 file at one time.</para>
1986 <para>Using MPI-IO Collective Write functionality in
the Lustre ADIO driver is one way to achieve this in a
straightforward manner if the application is already using MPI-IO.</para>
1991 <para>Have the application do 4kB
1992 <literal>O_DIRECT</literal> sized I/O to the file and disable locking
1993 on the output file. This avoids partial-page IO submissions and, by
1994 disabling locking, you avoid contention between clients.</para>
1997 <para>Have the application write contiguous data.</para>
2000 <para>Add more disks or use SSD disks for the OSTs. This dramatically
2001 improves the IOPS rate. Consider creating larger OSTs rather than many
2002 smaller OSTs due to less overhead (journal, connections, etc).</para>
2005 <para>Use RAID-1+0 OSTs instead of RAID-5/6. There is RAID parity
2006 overhead for writing small chunks of data to disk.</para>
2010 <section xml:id="dbdoclet.50438272_45406">
2013 <primary>tuning</primary>
2014 <secondary>write performance</secondary>
2015 </indexterm>Understanding Why Write Performance is Better Than Read
2017 <para>Typically, the performance of write operations on a Lustre cluster is
2018 better than read operations. When doing writes, all clients are sending
2019 write RPCs asynchronously. The RPCs are allocated, and written to disk in
2020 the order they arrive. In many cases, this allows the back-end storage to
2021 aggregate writes efficiently.</para>
2022 <para>In the case of read operations, the reads from clients may come in a
2023 different order and need a lot of seeking to get read from the disk. This
2024 noticeably hampers the read throughput.</para>
2025 <para>Currently, there is no readahead on the OSTs themselves, though the
2026 clients do readahead. If there are lots of clients doing reads it would not
2027 be possible to do any readahead in any case because of memory consumption
2028 (consider that even a single RPC (1 MB) readahead for 1000 clients would
2029 consume 1 GB of RAM).</para>
2030 <para>For file systems that use socklnd (TCP, Ethernet) as interconnect,
2031 there is also additional CPU overhead because the client cannot receive
2032 data without copying it from the network buffers. In the write case, the
2033 client CAN send data without the additional data copy. This means that the
2034 client is more likely to become CPU-bound during reads than writes.</para>