1 <?xml version='1.0' encoding='utf-8'?>
2 <chapter xmlns="http://docbook.org/ns/docbook"
3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
5 <title xml:id="lustretuning.title">Tuning a Lustre File System</title>
6 <para>This chapter contains information about tuning a Lustre file system for
7 better performance.</para>
9 <para>Many options in the Lustre software are set by means of kernel module
10 parameters. These parameters are contained in the
11 <literal>/etc/modprobe.d/lustre.conf</literal> file.</para>
13 <section xml:id="dbdoclet.50438272_55226">
16 <primary>tuning</primary>
19 <primary>tuning</primary>
20 <secondary>service threads</secondary>
21 </indexterm>Optimizing the Number of Service Threads</title>
22 <para>An OSS can have a minimum of two service threads and a maximum of 512
23 service threads. The number of service threads is a function of how much
24 RAM and how many CPUs are on each OSS node (1 thread / 128MB * num_cpus).
25 If the load on the OSS node is high, new service threads will be started in
26 order to process more requests concurrently, up to 4x the initial number of
27 threads (subject to the maximum of 512). For a 2GB 2-CPU system, the
28 default thread count is 32 and the maximum thread count is 128.</para>
29 <para>Increasing the size of the thread pool may help when:</para>
32 <para>Several OSTs are exported from a single OSS</para>
35 <para>Back-end storage is running synchronously</para>
38 <para>I/O completions take excessive time due to slow storage</para>
41 <para>Decreasing the size of the thread pool may help if:</para>
44 <para>Clients are overwhelming the storage capacity</para>
47 <para>There are lots of "slow I/O" or similar messages</para>
50 <para>Increasing the number of I/O threads allows the kernel and storage to
51 aggregate many writes together for more efficient disk I/O. The OSS thread
52 pool is shared--each thread allocates approximately 1.5 MB (maximum RPC
53 size + 0.5 MB) for internal I/O buffers.</para>
54 <para>It is very important to consider memory consumption when increasing
55 the thread pool size. Drives are only able to sustain a certain amount of
56 parallel I/O activity before performance is degraded, due to the high
57 number of seeks and the OST threads just waiting for I/O. In this
58 situation, it may be advisable to decrease the load by decreasing the
59 number of OST threads.</para>
60 <para>Determining the optimum number of OSS threads is a process of trial
61 and error, and varies for each particular configuration. Variables include
62 the number of OSTs on each OSS, number and speed of disks, RAID
63 configuration, and available RAM. You may want to start with a number of
64 OST threads equal to the number of actual disk spindles on the node. If you
65 use RAID, subtract any dead spindles not used for actual data (e.g., 1 of N
66 of spindles for RAID5, 2 of N spindles for RAID6), and monitor the
67 performance of clients during usual workloads. If performance is degraded,
68 increase the thread count and see how that works until performance is
69 degraded again or you reach satisfactory performance.</para>
71 <para>If there are too many threads, the latency for individual I/O
72 requests can become very high and should be avoided. Set the desired
73 maximum thread count permanently using the method described above.</para>
78 <primary>tuning</primary>
79 <secondary>OSS threads</secondary>
80 </indexterm>Specifying the OSS Service Thread Count</title>
82 <literal>oss_num_threads</literal> parameter enables the number of OST
83 service threads to be specified at module load time on the OSS
86 options ost oss_num_threads={N}
88 <para>After startup, the minimum and maximum number of OSS thread counts
90 <literal>{service}.thread_{min,max,started}</literal> tunable. To change
91 the tunable at runtime, run:</para>
94 lctl {get,set}_param {service}.thread_{min,max,started}
98 This works in a similar fashion to
99 binding of threads on MDS. MDS thread tuning is covered in
100 <xref linkend="dbdoclet.mdsbinding" />.</para>
104 <literal>oss_cpts=[EXPRESSION]</literal> binds the default OSS service
106 <literal>[EXPRESSION]</literal>.</para>
110 <literal>oss_io_cpts=[EXPRESSION]</literal> binds the IO OSS service
112 <literal>[EXPRESSION]</literal>.</para>
115 <para>For further details, see
116 <xref linkend="dbdoclet.50438271_87260" />.</para>
118 <section xml:id="dbdoclet.mdstuning">
121 <primary>tuning</primary>
122 <secondary>MDS threads</secondary>
123 </indexterm>Specifying the MDS Service Thread Count</title>
125 <literal>mds_num_threads</literal> parameter enables the number of MDS
126 service threads to be specified at module load time on the MDS
128 <screen>options mds mds_num_threads={N}</screen>
129 <para>After startup, the minimum and maximum number of MDS thread counts
131 <literal>{service}.thread_{min,max,started}</literal> tunable. To change
132 the tunable at runtime, run:</para>
135 lctl {get,set}_param {service}.thread_{min,max,started}
138 <para>For details, see
139 <xref linkend="dbdoclet.50438271_87260" />.</para>
140 <para>The number of MDS service threads started depends on system size
141 and the load on the server, and has a default maximum of 64. The
142 maximum potential number of threads (<literal>MDS_MAX_THREADS</literal>)
145 <para>The OSS and MDS start two threads per service per CPT at mount
146 time, and dynamically increase the number of running service threads in
147 response to server load. Setting the <literal>*_num_threads</literal>
148 module parameter starts the specified number of threads for that
149 service immediately and disables automatic thread creation behavior.
152 <para>Parameters are available to provide administrators control
153 over the number of service threads.</para>
157 <literal>mds_rdpg_num_threads</literal> controls the number of threads
158 in providing the read page service. The read page service handles
159 file close and readdir operations.</para>
163 <literal>mds_attr_num_threads</literal> controls the number of threads
164 in providing the setattr service to clients running Lustre software
170 <section xml:id="dbdoclet.mdsbinding">
173 <primary>tuning</primary>
174 <secondary>MDS binding</secondary>
175 </indexterm>Binding MDS Service Thread to CPU Partitions</title>
176 <para>With the Node Affinity (<xref linkend="nodeaffdef" />) feature,
177 MDS threads can be bound to particular CPU partitions (CPTs) to improve CPU
178 cache usage and memory locality. Default values for CPT counts and CPU core
179 bindings are selected automatically to provide good overall performance for
180 a given CPU count. However, an administrator can deviate from these setting
181 if they choose. For details on specifying the mapping of CPU cores to
182 CPTs see <xref linkend="dbdoclet.libcfstuning"/>.
187 <literal>mds_num_cpts=[EXPRESSION]</literal> binds the default MDS
188 service threads to CPTs defined by
189 <literal>EXPRESSION</literal>. For example
190 <literal>mds_num_cpts=[0-3]</literal> will bind the MDS service threads
192 <literal>CPT[0,1,2,3]</literal>.</para>
196 <literal>mds_rdpg_num_cpts=[EXPRESSION]</literal> binds the read page
197 service threads to CPTs defined by
198 <literal>EXPRESSION</literal>. The read page service handles file close
199 and readdir requests. For example
200 <literal>mds_rdpg_num_cpts=[4]</literal> will bind the read page threads
202 <literal>CPT4</literal>.</para>
206 <literal>mds_attr_num_cpts=[EXPRESSION]</literal> binds the setattr
207 service threads to CPTs defined by
208 <literal>EXPRESSION</literal>.</para>
211 <para>Parameters must be set before module load in the file
212 <literal>/etc/modprobe.d/lustre.conf</literal>. For example:
213 <example><title>lustre.conf</title>
214 <screen>options lnet networks=tcp0(eth0)
215 options mdt mds_num_cpts=[0]</screen>
219 <section xml:id="dbdoclet.50438272_73839">
222 <primary>LNet</primary>
223 <secondary>tuning</secondary>
226 <primary>tuning</primary>
227 <secondary>LNet</secondary>
228 </indexterm>Tuning LNet Parameters</title>
229 <para>This section describes LNet tunables, the use of which may be
230 necessary on some systems to improve performance. To test the performance
231 of your Lustre network, see
232 <xref linkend='lnetselftest' />.</para>
234 <title>Transmit and Receive Buffer Size</title>
235 <para>The kernel allocates buffers for sending and receiving messages on
238 <literal>ksocklnd</literal> has separate parameters for the transmit and
239 receive buffers.</para>
241 options ksocklnd tx_buffer_size=0 rx_buffer_size=0
243 <para>If these parameters are left at the default value (0), the system
244 automatically tunes the transmit and receive buffer size. In almost every
245 case, this default produces the best performance. Do not attempt to tune
246 these parameters unless you are a network expert.</para>
249 <title>Hardware Interrupts (
250 <literal>enable_irq_affinity</literal>)</title>
251 <para>The hardware interrupts that are generated by network adapters may
252 be handled by any CPU in the system. In some cases, we would like network
253 traffic to remain local to a single CPU to help keep the processor cache
254 warm and minimize the impact of context switches. This is helpful when an
255 SMP system has more than one network interface and ideal when the number
256 of interfaces equals the number of CPUs. To enable the
257 <literal>enable_irq_affinity</literal> parameter, enter:</para>
259 options ksocklnd enable_irq_affinity=1
261 <para>In other cases, if you have an SMP platform with a single fast
262 interface such as 10 Gb Ethernet and more than two CPUs, you may see
263 performance improve by turning this parameter off.</para>
265 options ksocklnd enable_irq_affinity=0
267 <para>By default, this parameter is off. As always, you should test the
268 performance to compare the impact of changing this parameter.</para>
273 <primary>tuning</primary>
274 <secondary>Network interface binding</secondary>
275 </indexterm>Binding Network Interface Against CPU Partitions</title>
276 <para>Lustre allows enhanced network interface control. This means that
277 an administrator can bind an interface to one or more CPU partitions.
278 Bindings are specified as options to the LNet modules. For more
279 information on specifying module options, see
280 <xref linkend="dbdoclet.50438293_15350" /></para>
282 <literal>o2ib0(ib0)[0,1]</literal> will ensure that all messages for
283 <literal>o2ib0</literal> will be handled by LND threads executing on
284 <literal>CPT0</literal> and
285 <literal>CPT1</literal>. An additional example might be:
286 <literal>tcp1(eth0)[0]</literal>. Messages for
287 <literal>tcp1</literal> are handled by threads on
288 <literal>CPT0</literal>.</para>
293 <primary>tuning</primary>
294 <secondary>Network interface credits</secondary>
295 </indexterm>Network Interface Credits</title>
296 <para>Network interface (NI) credits are shared across all CPU partitions
297 (CPT). For example, if a machine has four CPTs and the number of NI
298 credits is 512, then each partition has 128 credits. If a large number of
299 CPTs exist on the system, LNet checks and validates the NI credits for
300 each CPT to ensure each CPT has a workable number of credits. For
301 example, if a machine has 16 CPTs and the number of NI credits is 256,
302 then each partition only has 16 credits. 16 NI credits is low and could
303 negatively impact performance. As a result, LNet automatically adjusts
305 <literal>peer_credits</literal>(
306 <literal>peer_credits</literal> is 8 by default), so each partition has 64
308 <para>Increasing the number of
309 <literal>credits</literal>/
310 <literal>peer_credits</literal> can improve the performance of high
311 latency networks (at the cost of consuming more memory) by enabling LNet
312 to send more inflight messages to a specific network/peer and keep the
313 pipeline saturated.</para>
314 <para>An administrator can modify the NI credit count using
315 <literal>ksoclnd</literal> or
316 <literal>ko2iblnd</literal>. In the example below, 256 credits are
317 applied to TCP connections.</para>
321 <para>Applying 256 credits to IB connections can be achieved with:</para>
326 <para>LNet may revalidate the NI credits, so the administrator's
327 request may not persist.</para>
333 <primary>tuning</primary>
334 <secondary>router buffers</secondary>
335 </indexterm>Router Buffers</title>
336 <para>When a node is set up as an LNet router, three pools of buffers are
337 allocated: tiny, small and large. These pools are allocated per CPU
338 partition and are used to buffer messages that arrive at the router to be
339 forwarded to the next hop. The three different buffer sizes accommodate
340 different size messages.</para>
341 <para>If a message arrives that can fit in a tiny buffer then a tiny
342 buffer is used, if a message doesn’t fit in a tiny buffer, but fits in a
343 small buffer, then a small buffer is used. Finally if a message does not
344 fit in either a tiny buffer or a small buffer, a large buffer is
346 <para>Router buffers are shared by all CPU partitions. For a machine with
347 a large number of CPTs, the router buffer number may need to be specified
348 manually for best performance. A low number of router buffers risks
349 starving the CPU partitions of resources.</para>
353 <literal>tiny_router_buffers</literal>: Zero payload buffers used for
354 signals and acknowledgements.</para>
358 <literal>small_router_buffers</literal>: 4 KB payload buffers for
359 small messages</para>
363 <literal>large_router_buffers</literal>: 1 MB maximum payload
364 buffers, corresponding to the recommended RPC size of 1 MB.</para>
367 <para>The default setting for router buffers typically results in
368 acceptable performance. LNet automatically sets a default value to reduce
369 the likelihood of resource starvation. The size of a router buffer can be
370 modified as shown in the example below. In this example, the size of the
371 large buffer is modified using the
372 <literal>large_router_buffers</literal> parameter.</para>
374 lnet large_router_buffers=8192
377 <para>LNet may revalidate the router buffer setting, so the
378 administrator's request may not persist.</para>
384 <primary>tuning</primary>
385 <secondary>portal round-robin</secondary>
386 </indexterm>Portal Round-Robin</title>
387 <para>Portal round-robin defines the policy LNet applies to deliver
388 events and messages to the upper layers. The upper layers are PLRPC
389 service or LNet selftest.</para>
390 <para>If portal round-robin is disabled, LNet will deliver messages to
391 CPTs based on a hash of the source NID. Hence, all messages from a
392 specific peer will be handled by the same CPT. This can reduce data
393 traffic between CPUs. However, for some workloads, this behavior may
394 result in poorly balancing loads across the CPU.</para>
395 <para>If portal round-robin is enabled, LNet will round-robin incoming
396 events across all CPTs. This may balance load better across the CPU but
397 can incur a cross CPU overhead.</para>
398 <para>The current policy can be changed by an administrator with
400 <replaceable>value</replaceable>>
401 /proc/sys/lnet/portal_rotor</literal>. There are four options for
403 <replaceable>value</replaceable>
408 <literal>OFF</literal>
410 <para>Disable portal round-robin on all incoming requests.</para>
414 <literal>ON</literal>
416 <para>Enable portal round-robin on all incoming requests.</para>
420 <literal>RR_RT</literal>
422 <para>Enable portal round-robin only for routed messages.</para>
426 <literal>HASH_RT</literal>
428 <para>Routed messages will be delivered to the upper layer by hash of
429 source NID (instead of NID of router.) This is the default
435 <title>LNet Peer Health</title>
436 <para>Two options are available to help determine peer health:
440 <literal>peer_timeout</literal>- The timeout (in seconds) before an
441 aliveness query is sent to a peer. For example, if
442 <literal>peer_timeout</literal> is set to
443 <literal>180sec</literal>, an aliveness query is sent to the peer
444 every 180 seconds. This feature only takes effect if the node is
445 configured as an LNet router.</para>
446 <para>In a routed environment, the
447 <literal>peer_timeout</literal> feature should always be on (set to a
448 value in seconds) on routers. If the router checker has been enabled,
449 the feature should be turned off by setting it to 0 on clients and
451 <para>For a non-routed scenario, enabling the
452 <literal>peer_timeout</literal> option provides health information
453 such as whether a peer is alive or not. For example, a client is able
454 to determine if an MGS or OST is up when it sends it a message. If a
455 response is received, the peer is alive; otherwise a timeout occurs
456 when the request is made.</para>
458 <literal>peer_timeout</literal> should be set to no less than the LND
459 timeout setting. For more information about LND timeouts, see
460 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
461 linkend="section_c24_nt5_dl" />.</para>
463 <literal>o2iblnd</literal>(IB) driver is used,
464 <literal>peer_timeout</literal> should be at least twice the value of
466 <literal>ko2iblnd</literal> keepalive option. for more information
467 about keepalive options, see
468 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
469 linkend="section_ngq_qhy_zl" />.</para>
473 <literal>avoid_asym_router_failure</literal>– When set to 1, the
474 router checker running on the client or a server periodically pings
475 all the routers corresponding to the NIDs identified in the routes
476 parameter setting on the node to determine the status of each router
477 interface. The default setting is 1. (For more information about the
478 LNet routes parameter, see
479 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
480 linkend="lnet_module_routes" /></para>
481 <para>A router is considered down if any of its NIDs are down. For
482 example, router X has three NIDs:
483 <literal>Xnid1</literal>,
484 <literal>Xnid2</literal>, and
485 <literal>Xnid3</literal>. A client is connected to the router via
486 <literal>Xnid1</literal>. The client has router checker enabled. The
487 router checker periodically sends a ping to the router via
488 <literal>Xnid1</literal>. The router responds to the ping with the
489 status of each of its NIDs. In this case, it responds with
490 <literal>Xnid1=up</literal>,
491 <literal>Xnid2=up</literal>,
492 <literal>Xnid3=down</literal>. If
493 <literal>avoid_asym_router_failure==1</literal>, the router is
494 considered down if any of its NIDs are down, so router X is
495 considered down and will not be used for routing messages. If
496 <literal>avoid_asym_router_failure==0</literal>, router X will
497 continue to be used for routing messages.</para>
499 </itemizedlist></para>
500 <para>The following router checker parameters must be set to the maximum
501 value of the corresponding setting for this option on any client or
506 <literal>dead_router_check_interval</literal>
511 <literal>live_router_check_interval</literal>
516 <literal>router_ping_timeout</literal>
519 </itemizedlist></para>
520 <para>For example, the
521 <literal>dead_router_check_interval</literal> parameter on any router must
525 <section xml:id="dbdoclet.libcfstuning">
528 <primary>tuning</primary>
529 <secondary>libcfs</secondary>
530 </indexterm>libcfs Tuning</title>
531 <para>Lustre allows binding service threads via CPU Partition Tables
532 (CPTs). This allows the system administrator to fine-tune on which CPU
533 cores the Lustre service threads are run, for both OSS and MDS services,
534 as well as on the client.
536 <para>CPTs are useful to reserve some cores on the OSS or MDS nodes for
537 system functions such as system monitoring, HA heartbeat, or similar
538 tasks. On the client it may be useful to restrict Lustre RPC service
539 threads to a small subset of cores so that they do not interfere with
540 computation, or because these cores are directly attached to the network
543 <para>By default, the Lustre software will automatically generate CPU
544 partitions (CPT) based on the number of CPUs in the system.
545 The CPT count can be explicitly set on the libcfs module using
546 <literal>cpu_npartitions=<replaceable>NUMBER</replaceable></literal>.
547 The value of <literal>cpu_npartitions</literal> must be an integer between
548 1 and the number of online CPUs.
550 <para condition='l29'>In Lustre 2.9 and later the default is to use
551 one CPT per NUMA node. In earlier versions of Lustre, by default there
552 was a single CPT if the online CPU core count was four or fewer, and
553 additional CPTs would be created depending on the number of CPU cores,
554 typically with 4-8 cores per CPT.
557 <para>Setting <literal>cpu_npartitions=1</literal> will disable most
558 of the SMP Node Affinity functionality.</para>
561 <title>CPU Partition String Patterns</title>
562 <para>CPU partitions can be described using string pattern notation.
563 If <literal>cpu_pattern=N</literal> is used, then there will be one
564 CPT for each NUMA node in the system, with each CPT mapping all of
565 the CPU cores for that NUMA node.
567 <para>It is also possible to explicitly specify the mapping between
568 CPU cores and CPTs, for example:</para>
572 <literal>cpu_pattern="0[2,4,6] 1[3,5,7]</literal>
574 <para>Create two CPTs, CPT0 contains cores 2, 4, and 6, while CPT1
575 contains cores 3, 5, 7. CPU cores 0 and 1 will not be used by Lustre
576 service threads, and could be used for node services such as
577 system monitoring, HA heartbeat threads, etc. The binding of
578 non-Lustre services to those CPU cores may be done in userspace
579 using <literal>numactl(8)</literal> or other application-specific
580 methods, but is beyond the scope of this document.</para>
584 <literal>cpu_pattern="N 0[0-3] 1[4-7]</literal>
586 <para>Create two CPTs, with CPT0 containing all CPUs in NUMA
587 node[0-3], while CPT1 contains all CPUs in NUMA node [4-7].</para>
590 <para>The current configuration of the CPU partition can be read via
591 <literal>lctl get_parm cpu_partition_table</literal>. For example,
592 a simple 4-core system has a single CPT with all four CPU cores:
593 <screen>$ lctl get_param cpu_partition_table
594 cpu_partition_table=0 : 0 1 2 3</screen>
595 while a larger NUMA system with four 12-core CPUs may have four CPTs:
596 <screen>$ lctl get_param cpu_partition_table
598 0 : 0 1 2 3 4 5 6 7 8 9 10 11
599 1 : 12 13 14 15 16 17 18 19 20 21 22 23
600 2 : 24 25 26 27 28 29 30 31 32 33 34 35
601 3 : 36 37 38 39 40 41 42 43 44 45 46 47
606 <section xml:id="dbdoclet.lndtuning">
609 <primary>tuning</primary>
610 <secondary>LND tuning</secondary>
611 </indexterm>LND Tuning</title>
612 <para>LND tuning allows the number of threads per CPU partition to be
613 specified. An administrator can set the threads for both
614 <literal>ko2iblnd</literal> and
615 <literal>ksocklnd</literal> using the
616 <literal>nscheds</literal> parameter. This adjusts the number of threads for
617 each partition, not the overall number of threads on the LND.</para>
619 <title>ko2iblnd Tuning</title>
620 <para>The following table outlines the ko2iblnd module parameters to be used
622 <informaltable frame="all">
624 <colspec colname="c1" colwidth="50*" />
625 <colspec colname="c2" colwidth="50*" />
626 <colspec colname="c3" colwidth="50*" />
631 <emphasis role="bold">Module Parameter</emphasis>
636 <emphasis role="bold">Default Value</emphasis>
641 <emphasis role="bold">Description</emphasis>
650 <literal>service</literal>
655 <literal>987</literal>
659 <para>Service number (within RDMA_PS_TCP).</para>
665 <literal>cksum</literal>
674 <para>Set non-zero to enable message (not RDMA) checksums.</para>
680 <literal>timeout</literal>
685 <literal>50</literal>
689 <para>Timeout in seconds.</para>
695 <literal>nscheds</literal>
704 <para>Number of threads in each scheduler pool (per CPT). Value of
705 zero means we derive the number from the number of cores.</para>
711 <literal>conns_per_peer</literal>
716 <literal>4 (OmniPath), 1 (Everything else)</literal>
720 <para>Introduced in 2.10. Number of connections to each peer. Messages
721 are sent round-robin over the connection pool. Provides significant
722 improvement with OmniPath.</para>
728 <literal>ntx</literal>
733 <literal>512</literal>
737 <para>Number of message descriptors allocated for each pool at
738 startup. Grows at runtime. Shared by all CPTs.</para>
744 <literal>credits</literal>
749 <literal>256</literal>
753 <para>Number of concurrent sends on network.</para>
759 <literal>peer_credits</literal>
768 <para>Number of concurrent sends to 1 peer. Related/limited by IB
775 <literal>peer_credits_hiw</literal>
784 <para>When eagerly to return credits.</para>
790 <literal>peer_buffer_credits</literal>
799 <para>Number per-peer router buffer credits.</para>
805 <literal>peer_timeout</literal>
810 <literal>180</literal>
814 <para>Seconds without aliveness news to declare peer dead (less than
815 or equal to 0 to disable).</para>
821 <literal>ipif_name</literal>
826 <literal>ib0</literal>
830 <para>IPoIB interface name.</para>
836 <literal>retry_count</literal>
845 <para>Retransmissions when no ACK received.</para>
851 <literal>rnr_retry_count</literal>
860 <para>RNR retransmissions.</para>
866 <literal>keepalive</literal>
871 <literal>100</literal>
875 <para>Idle time in seconds before sending a keepalive.</para>
881 <literal>ib_mtu</literal>
890 <para>IB MTU 256/512/1024/2048/4096.</para>
896 <literal>concurrent_sends</literal>
905 <para>Send work-queue sizing. If zero, derived from
906 <literal>map_on_demand</literal> and <literal>peer_credits</literal>.
913 <literal>map_on_demand</literal>
918 <literal>0 (pre-4.8 Linux) 1 (4.8 Linux onward) 32 (OmniPath)</literal>
922 <para>Number of fragments reserved for connection. If zero, use
923 global memory region (found to be security issue). If non-zero, use
924 FMR or FastReg for memory registration. Value needs to agree between
925 both peers of connection.</para>
931 <literal>fmr_pool_size</literal>
936 <literal>512</literal>
940 <para>Size of fmr pool on each CPT (>= ntx / 4). Grows at runtime.
947 <literal>fmr_flush_trigger</literal>
952 <literal>384</literal>
956 <para>Number dirty FMRs that triggers pool flush.</para>
962 <literal>fmr_cache</literal>
971 <para>Non-zero to enable FMR caching.</para>
977 <literal>dev_failover</literal>
986 <para>HCA failover for bonding (0 OFF, 1 ON, other values reserved).
993 <literal>require_privileged_port</literal>
1002 <para>Require privileged port when accepting connection.</para>
1008 <literal>use_privileged_port</literal>
1013 <literal>1</literal>
1017 <para>Use privileged port when initiating connection.</para>
1023 <literal>wrq_sge</literal>
1028 <literal>2</literal>
1032 <para>Introduced in 2.10. Number scatter/gather element groups per
1033 work request. Used to deal with fragmentations which can consume
1034 double the number of work requests.</para>
1042 <section xml:id="dbdoclet.nrstuning">
1045 <primary>tuning</primary>
1046 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1047 </indexterm>Network Request Scheduler (NRS) Tuning</title>
1048 <para>The Network Request Scheduler (NRS) allows the administrator to
1049 influence the order in which RPCs are handled at servers, on a per-PTLRPC
1050 service basis, by providing different policies that can be activated and
1051 tuned in order to influence the RPC ordering. The aim of this is to provide
1052 for better performance, and possibly discrete performance characteristics
1053 using future policies.</para>
1054 <para>The NRS policy state of a PTLRPC service can be read and set via the
1055 <literal>{service}.nrs_policies</literal> tunable. To read a PTLRPC
1056 service's NRS policy state, run:</para>
1058 lctl get_param {service}.nrs_policies
1060 <para>For example, to read the NRS policy state of the
1061 <literal>ost_io</literal> service, run:</para>
1063 $ lctl get_param ost.OSS.ost_io.nrs_policies
1064 ost.OSS.ost_io.nrs_policies=
1103 high_priority_requests:
1141 <para>NRS policy state is shown in either one or two sections, depending on
1142 the PTLRPC service being queried. The first section is named
1143 <literal>regular_requests</literal> and is available for all PTLRPC
1144 services, optionally followed by a second section which is named
1145 <literal>high_priority_requests</literal>. This is because some PTLRPC
1146 services are able to treat some types of RPCs as higher priority ones, such
1147 that they are handled by the server with higher priority compared to other,
1148 regular RPC traffic. For PTLRPC services that do not support high-priority
1149 RPCs, you will only see the
1150 <literal>regular_requests</literal> section.</para>
1151 <para>There is a separate instance of each NRS policy on each PTLRPC
1152 service for handling regular and high-priority RPCs (if the service
1153 supports high-priority RPCs). For each policy instance, the following
1154 fields are shown:</para>
1155 <informaltable frame="all">
1157 <colspec colname="c1" colwidth="50*" />
1158 <colspec colname="c2" colwidth="50*" />
1163 <emphasis role="bold">Field</emphasis>
1168 <emphasis role="bold">Description</emphasis>
1177 <literal>name</literal>
1181 <para>The name of the policy.</para>
1187 <literal>state</literal>
1191 <para>The state of the policy; this can be any of
1192 <literal>invalid, stopping, stopped, starting, started</literal>.
1193 A fully enabled policy is in the
1194 <literal>started</literal> state.</para>
1200 <literal>fallback</literal>
1204 <para>Whether the policy is acting as a fallback policy or not. A
1205 fallback policy is used to handle RPCs that other enabled
1206 policies fail to handle, or do not support the handling of. The
1208 <literal>no, yes</literal>. Currently, only the FIFO policy can
1209 act as a fallback policy.</para>
1215 <literal>queued</literal>
1219 <para>The number of RPCs that the policy has waiting to be
1226 <literal>active</literal>
1230 <para>The number of RPCs that the policy is currently
1237 <para>To enable an NRS policy on a PTLRPC service run:</para>
1239 lctl set_param {service}.nrs_policies=
1240 <replaceable>policy_name</replaceable>
1242 <para>This will enable the policy
1243 <replaceable>policy_name</replaceable>for both regular and high-priority
1244 RPCs (if the PLRPC service supports high-priority RPCs) on the given
1245 service. For example, to enable the CRR-N NRS policy for the ldlm_cbd
1246 service, run:</para>
1248 $ lctl set_param ldlm.services.ldlm_cbd.nrs_policies=crrn
1249 ldlm.services.ldlm_cbd.nrs_policies=crrn
1252 <para>For PTLRPC services that support high-priority RPCs, you can also
1254 <replaceable>reg|hp</replaceable>token, in order to enable an NRS policy
1255 for handling only regular or high-priority RPCs on a given PTLRPC service,
1258 lctl set_param {service}.nrs_policies="
1259 <replaceable>policy_name</replaceable>
1260 <replaceable>reg|hp</replaceable>"
1262 <para>For example, to enable the TRR policy for handling only regular, but
1263 not high-priority RPCs on the
1264 <literal>ost_io</literal> service, run:</para>
1266 $ lctl set_param ost.OSS.ost_io.nrs_policies="trr reg"
1267 ost.OSS.ost_io.nrs_policies="trr reg"
1271 <para>When enabling an NRS policy, the policy name must be given in
1272 lower-case characters, otherwise the operation will fail with an error
1278 <primary>tuning</primary>
1279 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1280 <tertiary>first in, first out (FIFO) policy</tertiary>
1281 </indexterm>First In, First Out (FIFO) policy</title>
1282 <para>The first in, first out (FIFO) policy handles RPCs in a service in
1283 the same order as they arrive from the LNet layer, so no special
1284 processing takes place to modify the RPC handling stream. FIFO is the
1285 default policy for all types of RPCs on all PTLRPC services, and is
1286 always enabled irrespective of the state of other policies, so that it
1287 can be used as a backup policy, in case a more elaborate policy that has
1288 been enabled fails to handle an RPC, or does not support handling a given
1290 <para>The FIFO policy has no tunables that adjust its behaviour.</para>
1295 <primary>tuning</primary>
1296 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1297 <tertiary>client round-robin over NIDs (CRR-N) policy</tertiary>
1298 </indexterm>Client Round-Robin over NIDs (CRR-N) policy</title>
1299 <para>The client round-robin over NIDs (CRR-N) policy performs batched
1300 round-robin scheduling of all types of RPCs, with each batch consisting
1301 of RPCs originating from the same client node, as identified by its NID.
1302 CRR-N aims to provide for better resource utilization across the cluster,
1303 and to help shorten completion times of jobs in some cases, by
1304 distributing available bandwidth more evenly across all clients.</para>
1305 <para>The CRR-N policy can be enabled on all types of PTLRPC services,
1306 and has the following tunable that can be used to adjust its
1311 <literal>{service}.nrs_crrn_quantum</literal>
1314 <literal>{service}.nrs_crrn_quantum</literal> tunable determines the
1315 maximum allowed size of each batch of RPCs; the unit of measure is in
1316 number of RPCs. To read the maximum allowed batch size of a CRR-N
1319 lctl get_param {service}.nrs_crrn_quantum
1321 <para>For example, to read the maximum allowed batch size of a CRR-N
1322 policy on the ost_io service, run:</para>
1324 $ lctl get_param ost.OSS.ost_io.nrs_crrn_quantum
1325 ost.OSS.ost_io.nrs_crrn_quantum=reg_quantum:16
1329 <para>You can see that there is a separate maximum allowed batch size
1331 <literal>reg_quantum</literal>) and high-priority (
1332 <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
1333 high-priority RPCs).</para>
1334 <para>To set the maximum allowed batch size of a CRR-N policy on a
1335 given service, run:</para>
1337 lctl set_param {service}.nrs_crrn_quantum=
1338 <replaceable>1-65535</replaceable>
1340 <para>This will set the maximum allowed batch size on a given
1341 service, for both regular and high-priority RPCs (if the PLRPC
1342 service supports high-priority RPCs), to the indicated value.</para>
1343 <para>For example, to set the maximum allowed batch size on the
1344 ldlm_canceld service to 16 RPCs, run:</para>
1346 $ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
1347 ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
1350 <para>For PTLRPC services that support high-priority RPCs, you can
1351 also specify a different maximum allowed batch size for regular and
1352 high-priority RPCs, by running:</para>
1354 $ lctl set_param {service}.nrs_crrn_quantum=
1355 <replaceable>reg_quantum|hp_quantum</replaceable>:
1356 <replaceable>1-65535</replaceable>"
1358 <para>For example, to set the maximum allowed batch size on the
1359 ldlm_canceld service, for high-priority RPCs to 32, run:</para>
1361 $ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum="hp_quantum:32"
1362 ldlm.services.ldlm_canceld.nrs_crrn_quantum=hp_quantum:32
1365 <para>By using the last method, you can also set the maximum regular
1366 and high-priority RPC batch sizes to different values, in a single
1367 command invocation.</para>
1374 <primary>tuning</primary>
1375 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1376 <tertiary>object-based round-robin (ORR) policy</tertiary>
1377 </indexterm>Object-based Round-Robin (ORR) policy</title>
1378 <para>The object-based round-robin (ORR) policy performs batched
1379 round-robin scheduling of bulk read write (brw) RPCs, with each batch
1380 consisting of RPCs that pertain to the same backend-file system object,
1381 as identified by its OST FID.</para>
1382 <para>The ORR policy is only available for use on the ost_io service. The
1383 RPC batches it forms can potentially consist of mixed bulk read and bulk
1384 write RPCs. The RPCs in each batch are ordered in an ascending manner,
1385 based on either the file offsets, or the physical disk offsets of each
1386 RPC (only applicable to bulk read RPCs).</para>
1387 <para>The aim of the ORR policy is to provide for increased bulk read
1388 throughput in some cases, by ordering bulk read RPCs (and potentially
1389 bulk write RPCs), and thus minimizing costly disk seek operations.
1390 Performance may also benefit from any resulting improvement in resource
1391 utilization, or by taking advantage of better locality of reference
1392 between RPCs.</para>
1393 <para>The ORR policy has the following tunables that can be used to
1394 adjust its behaviour:</para>
1398 <literal>ost.OSS.ost_io.nrs_orr_quantum</literal>
1401 <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable determines
1402 the maximum allowed size of each batch of RPCs; the unit of measure
1403 is in number of RPCs. To read the maximum allowed batch size of the
1404 ORR policy, run:</para>
1406 $ lctl get_param ost.OSS.ost_io.nrs_orr_quantum
1407 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:256
1411 <para>You can see that there is a separate maximum allowed batch size
1413 <literal>reg_quantum</literal>) and high-priority (
1414 <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
1415 high-priority RPCs).</para>
1416 <para>To set the maximum allowed batch size for the ORR policy,
1419 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
1420 <replaceable>1-65535</replaceable>
1422 <para>This will set the maximum allowed batch size for both regular
1423 and high-priority RPCs, to the indicated value.</para>
1424 <para>You can also specify a different maximum allowed batch size for
1425 regular and high-priority RPCs, by running:</para>
1427 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
1428 <replaceable>reg_quantum|hp_quantum</replaceable>:
1429 <replaceable>1-65535</replaceable>
1431 <para>For example, to set the maximum allowed batch size for regular
1432 RPCs to 128, run:</para>
1434 $ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
1435 ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
1438 <para>By using the last method, you can also set the maximum regular
1439 and high-priority RPC batch sizes to different values, in a single
1440 command invocation.</para>
1444 <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal>
1447 <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable
1448 determines whether the ORR policy orders RPCs within each batch based
1449 on logical file offsets or physical disk offsets. To read the offset
1450 type value for the ORR policy, run:</para>
1452 $ lctl get_param ost.OSS.ost_io.nrs_orr_offset_type
1453 ost.OSS.ost_io.nrs_orr_offset_type=reg_offset_type:physical
1454 hp_offset_type:logical
1457 <para>You can see that there is a separate offset type value for
1459 <literal>reg_offset_type</literal>) and high-priority (
1460 <literal>hp_offset_type</literal>) RPCs.</para>
1461 <para>To set the ordering type for the ORR policy, run:</para>
1463 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
1464 <replaceable>physical|logical</replaceable>
1466 <para>This will set the offset type for both regular and
1467 high-priority RPCs, to the indicated value.</para>
1468 <para>You can also specify a different offset type for regular and
1469 high-priority RPCs, by running:</para>
1471 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
1472 <replaceable>reg_offset_type|hp_offset_type</replaceable>:
1473 <replaceable>physical|logical</replaceable>
1475 <para>For example, to set the offset type for high-priority RPCs to
1476 physical disk offsets, run:</para>
1478 $ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
1479 ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
1481 <para>By using the last method, you can also set offset type for
1482 regular and high-priority RPCs to different values, in a single
1483 command invocation.</para>
1485 <para>Irrespective of the value of this tunable, only logical
1486 offsets can, and are used for ordering bulk write RPCs.</para>
1491 <literal>ost.OSS.ost_io.nrs_orr_supported</literal>
1494 <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable determines
1495 the type of RPCs that the ORR policy will handle. To read the types
1496 of supported RPCs by the ORR policy, run:</para>
1498 $ lctl get_param ost.OSS.ost_io.nrs_orr_supported
1499 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads
1500 hp_supported=reads_and_writes
1503 <para>You can see that there is a separate supported 'RPC types'
1505 <literal>reg_supported</literal>) and high-priority (
1506 <literal>hp_supported</literal>) RPCs.</para>
1507 <para>To set the supported RPC types for the ORR policy, run:</para>
1509 $ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
1510 <replaceable>reads|writes|reads_and_writes</replaceable>
1512 <para>This will set the supported RPC types for both regular and
1513 high-priority RPCs, to the indicated value.</para>
1514 <para>You can also specify a different supported 'RPC types' value
1515 for regular and high-priority RPCs, by running:</para>
1517 $ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
1518 <replaceable>reg_supported|hp_supported</replaceable>:
1519 <replaceable>reads|writes|reads_and_writes</replaceable>
1521 <para>For example, to set the supported RPC types to bulk read and
1522 bulk write RPCs for regular requests, run:</para>
1525 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
1526 ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
1529 <para>By using the last method, you can also set the supported RPC
1530 types for regular and high-priority RPC to different values, in a
1531 single command invocation.</para>
1538 <primary>tuning</primary>
1539 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1540 <tertiary>Target-based round-robin (TRR) policy</tertiary>
1541 </indexterm>Target-based Round-Robin (TRR) policy</title>
1542 <para>The target-based round-robin (TRR) policy performs batched
1543 round-robin scheduling of brw RPCs, with each batch consisting of RPCs
1544 that pertain to the same OST, as identified by its OST index.</para>
1545 <para>The TRR policy is identical to the object-based round-robin (ORR)
1546 policy, apart from using the brw RPC's target OST index instead of the
1547 backend-fs object's OST FID, for determining the RPC scheduling order.
1548 The goals of TRR are effectively the same as for ORR, and it uses the
1549 following tunables to adjust its behaviour:</para>
1553 <literal>ost.OSS.ost_io.nrs_trr_quantum</literal>
1555 <para>The purpose of this tunable is exactly the same as for the
1556 <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable for the ORR
1557 policy, and you can use it in exactly the same way.</para>
1561 <literal>ost.OSS.ost_io.nrs_trr_offset_type</literal>
1563 <para>The purpose of this tunable is exactly the same as for the
1564 <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable for the
1565 ORR policy, and you can use it in exactly the same way.</para>
1569 <literal>ost.OSS.ost_io.nrs_trr_supported</literal>
1571 <para>The purpose of this tunable is exactly the same as for the
1572 <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable for the
1573 ORR policy, and you can use it in exactly the sme way.</para>
1577 <section xml:id="dbdoclet.tbftuning" condition='l26'>
1580 <primary>tuning</primary>
1581 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1582 <tertiary>Token Bucket Filter (TBF) policy</tertiary>
1583 </indexterm>Token Bucket Filter (TBF) policy</title>
1584 <para>The TBF (Token Bucket Filter) is a Lustre NRS policy which enables
1585 Lustre services to enforce the RPC rate limit on clients/jobs for QoS
1586 (Quality of Service) purposes.</para>
1588 <title>The internal structure of TBF policy</title>
1591 <imagedata scalefit="1" width="50%"
1592 fileref="figures/TBF_policy.png" />
1595 <phrase>The internal structure of TBF policy</phrase>
1599 <para>When a RPC request arrives, TBF policy puts it to a waiting queue
1600 according to its classification. The classification of RPC requests is
1601 based on either NID or JobID of the RPC according to the configure of
1602 TBF. TBF policy maintains multiple queues in the system, one queue for
1603 each category in the classification of RPC requests. The requests waits
1604 for tokens in the FIFO queue before they have been handled so as to keep
1605 the RPC rates under the limits.</para>
1606 <para>When Lustre services are too busy to handle all of the requests in
1607 time, all of the specified rates of the queues will not be satisfied.
1608 Nothing bad will happen except some of the RPC rates are slower than
1609 configured. In this case, the queue with higher rate will have an
1610 advantage over the queues with lower rates, but none of them will be
1612 <para>To manage the RPC rate of queues, we don't need to set the rate of
1613 each queue manually. Instead, we define rules which TBF policy matches to
1614 determine RPC rate limits. All of the defined rules are organized as an
1615 ordered list. Whenever a queue is newly created, it goes though the rule
1616 list and takes the first matched rule as its rule, so that the queue
1617 knows its RPC token rate. A rule can be added to or removed from the list
1618 at run time. Whenever the list of rules is changed, the queues will
1619 update their matched rules.</para>
1620 <section remap="h4">
1621 <title>Enable TBF policy</title>
1622 <para>Command:</para>
1623 <screen>lctl set_param ost.OSS.ost_io.nrs_policies="tbf <<replaceable>policy</replaceable>>"
1625 <para>For now, the RPCs can be classified into the different types
1626 according to their NID, JOBID, OPCode and UID/GID. When enabling TBF
1627 policy, you can specify one of the types, or just use "tbf" to enable
1628 all of them to do a fine-grained RPC requests classification.</para>
1629 <para>Example:</para>
1630 <screen>$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf"
1631 $ lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
1632 $ lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
1633 $ lctl set_param ost.OSS.ost_io.nrs_policies="tbf opcode"
1634 $ lctl set_param ost.OSS.ost_io.nrs_policies="tbf uid"
1635 $ lctl set_param ost.OSS.ost_io.nrs_policies="tbf gid"</screen>
1637 <section remap="h4">
1638 <title>Start a TBF rule</title>
1639 <para>The TBF rule is defined in the parameter
1640 <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>.</para>
1641 <para>Command:</para>
1642 <screen>lctl set_param x.x.x.nrs_tbf_rule=
1643 "[reg|hp] start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>..."
1645 <para>'<replaceable>rule_name</replaceable>' is a string of the TBF
1646 policy rule's name and '<replaceable>arguments</replaceable>' is a
1647 string to specify the detailed rule according to the different types.
1650 <para>Next, the different types of TBF policies will be described.</para>
1652 <para><emphasis role="bold">NID based TBF policy</emphasis></para>
1653 <para>Command:</para>
1654 <screen>lctl set_param x.x.x.nrs_tbf_rule=
1655 "[reg|hp] start <replaceable>rule_name</replaceable> nid={<replaceable>nidlist</replaceable>} rate=<replaceable>rate</replaceable>"
1657 <para>'<replaceable>nidlist</replaceable>' uses the same format
1658 as configuring LNET route. '<replaceable>rate</replaceable>' is
1659 the (upper limit) RPC rate of the rule.</para>
1660 <para>Example:</para>
1661 <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1662 "start other_clients nid={192.168.*.*@tcp} rate=50"
1663 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1664 "start computes nid={192.168.1.[2-128]@tcp} rate=500"
1665 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1666 "start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
1667 <para>In this example, the rate of processing RPC requests from
1668 compute nodes is at most 5x as fast as those from login nodes.
1669 The output of <literal>ost.OSS.ost_io.nrs_tbf_rule</literal> is
1671 <screen>lctl get_param ost.OSS.ost_io.nrs_tbf_rule
1672 ost.OSS.ost_io.nrs_tbf_rule=
1675 loginnode {192.168.1.1@tcp} 100, ref 0
1676 computes {192.168.1.[2-128]@tcp} 500, ref 0
1677 other_clients {192.168.*.*@tcp} 50, ref 0
1678 default {*} 10000, ref 0
1679 high_priority_requests:
1681 loginnode {192.168.1.1@tcp} 100, ref 0
1682 computes {192.168.1.[2-128]@tcp} 500, ref 0
1683 other_clients {192.168.*.*@tcp} 50, ref 0
1684 default {*} 10000, ref 0</screen>
1685 <para>Also, the rule can be written in <literal>reg</literal> and
1686 <literal>hp</literal> formats:</para>
1687 <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1688 "reg start loginnode nid={192.168.1.1@tcp} rate=100"
1689 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1690 "hp start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
1693 <para><emphasis role="bold">JobID based TBF policy</emphasis></para>
1694 <para>For the JobID, please see
1695 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
1696 linkend="dbdoclet.jobstats" /> for more details.</para>
1697 <para>Command:</para>
1698 <screen>lctl set_param x.x.x.nrs_tbf_rule=
1699 "[reg|hp] start <replaceable>rule_name</replaceable> jobid={<replaceable>jobid_list</replaceable>} rate=<replaceable>rate</replaceable>"
1701 <para>Wildcard is supported in
1702 {<replaceable>jobid_list</replaceable>}.</para>
1703 <para>Example:</para>
1704 <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1705 "start iozone_user jobid={iozone.500} rate=100"
1706 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1707 "start dd_user jobid={dd.*} rate=50"
1708 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1709 "start user1 jobid={*.600} rate=10"
1710 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1711 "start user2 jobid={io*.10* *.500} rate=200"</screen>
1712 <para>Also, the rule can be written in <literal>reg</literal> and
1713 <literal>hp</literal> formats:</para>
1714 <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1715 "hp start iozone_user1 jobid={iozone.500} rate=100"
1716 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1717 "reg start iozone_user1 jobid={iozone.500} rate=100"</screen>
1720 <para><emphasis role="bold">Opcode based TBF policy</emphasis></para>
1721 <para>Command:</para>
1722 <screen>$ lctl set_param x.x.x.nrs_tbf_rule=
1723 "[reg|hp] start <replaceable>rule_name</replaceable> opcode={<replaceable>opcode_list</replaceable>} rate=<replaceable>rate</replaceable>"
1725 <para>Example:</para>
1726 <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1727 "start user1 opcode={ost_read} rate=100"
1728 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1729 "start iozone_user1 opcode={ost_read ost_write} rate=200"</screen>
1730 <para>Also, the rule can be written in <literal>reg</literal> and
1731 <literal>hp</literal> formats:</para>
1732 <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1733 "hp start iozone_user1 opcode={ost_read} rate=100"
1734 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1735 "reg start iozone_user1 opcode={ost_read} rate=100"</screen>
1738 <para><emphasis role="bold">UID/GID based TBF policy</emphasis></para>
1739 <para>Command:</para>
1740 <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
1741 "[reg][hp] start <replaceable>rule_name</replaceable> uid={<replaceable>uid</replaceable>} rate=<replaceable>rate</replaceable>"
1742 $ lctl set_param ost.OSS.*.nrs_tbf_rule=\
1743 "[reg][hp] start <replaceable>rule_name</replaceable> gid={<replaceable>gid</replaceable>} rate=<replaceable>rate</replaceable>"</screen>
1744 <para>Exapmle:</para>
1745 <para>Limit the rate of RPC requests of the uid 500</para>
1746 <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
1747 "start tbf_name uid={500} rate=100"</screen>
1748 <para>Limit the rate of RPC requests of the gid 500</para>
1749 <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
1750 "start tbf_name gid={500} rate=100"</screen>
1751 <para>Also, you can use the following rule to control all reqs
1753 <para>Start the tbf uid QoS on MDS:</para>
1754 <screen>$ lctl set_param mds.MDS.*.nrs_policies="tbf uid"</screen>
1755 <para>Limit the rate of RPC requests of the uid 500</para>
1756 <screen>$ lctl set_param mds.MDS.*.nrs_tbf_rule=\
1757 "start tbf_name uid={500} rate=100"</screen>
1760 <para><emphasis role="bold">Policy combination</emphasis></para>
1761 <para>To support TBF rules with complex expressions of conditions,
1762 TBF classifier is extented to classify RPC in a more fine-grained
1763 way. This feature supports logical conditional conjunction and
1764 disjunction operations among different types.
1766 "&" represents the conditional conjunction and
1767 "," represents the conditional disjunction.</para>
1768 <para>Example:</para>
1769 <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1770 "start comp_rule opcode={ost_write}&jobid={dd.0},\
1771 nid={192.168.1.[1-128]@tcp 0@lo} rate=100"</screen>
1772 <para>In this example, those RPCs whose <literal>opcode</literal> is
1773 ost_write and <literal>jobid</literal> is dd.0, or
1774 <literal>nid</literal> satisfies the condition of
1775 {192.168.1.[1-128]@tcp 0@lo} will be processed at the rate of 100
1777 The output of <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>is like:
1779 <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
1780 ost.OSS.ost_io.nrs_tbf_rule=
1783 comp_rule opcode={ost_write}&jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
1784 default * 10000, ref 0
1786 comp_rule opcode={ost_write}&jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
1787 default * 10000, ref 0
1788 high_priority_requests:
1790 comp_rule opcode={ost_write}&jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
1791 default * 10000, ref 0
1793 comp_rule opcode={ost_write}&jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
1794 default * 10000, ref 0</screen>
1795 <para>Example:</para>
1796 <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
1797 "start tbf_name uid={500}&gid={500} rate=100"</screen>
1798 <para>In this example, those RPC requests whose uid is 500 and
1799 gid is 500 will be processed at the rate of 100 req/sec.</para>
1803 <section remap="h4">
1804 <title>Change a TBF rule</title>
1805 <para>Command:</para>
1806 <screen>lctl set_param x.x.x.nrs_tbf_rule=
1807 "[reg|hp] change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable>"
1809 <para>Example:</para>
1810 <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1811 "change loginnode rate=200"
1812 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1813 "reg change loginnode rate=200"
1814 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1815 "hp change loginnode rate=200"
1818 <section remap="h4">
1819 <title>Stop a TBF rule</title>
1820 <para>Command:</para>
1821 <screen>lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop
1822 <replaceable>rule_name</replaceable>"</screen>
1823 <para>Example:</para>
1824 <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
1825 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode"
1826 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"</screen>
1828 <section remap="h4">
1829 <title>Rule options</title>
1830 <para>To support more flexible rule conditions, the following options
1834 <para><emphasis role="bold">Reordering of TBF rules</emphasis></para>
1835 <para>By default, a newly started rule is prior to the old ones,
1836 but by specifying the argument '<literal>rank=</literal>' when
1837 inserting a new rule with "<literal>start</literal>" command,
1838 the rank of the rule can be changed. Also, it can be changed by
1839 "<literal>change</literal>" command.
1841 <para>Command:</para>
1842 <screen>lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1843 "start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... rank=<replaceable>obj_rule_name</replaceable>"
1844 lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1845 "change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable> rank=<replaceable>obj_rule_name</replaceable>"
1847 <para>By specifying the existing rule
1848 '<replaceable>obj_rule_name</replaceable>', the new rule
1849 '<replaceable>rule_name</replaceable>' will be moved to the front of
1850 '<replaceable>obj_rule_name</replaceable>'.</para>
1851 <para>Example:</para>
1852 <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1853 "start computes nid={192.168.1.[2-128]@tcp} rate=500"
1854 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1855 "start user1 jobid={iozone.500 dd.500} rate=100"
1856 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
1857 "start iozone_user1 opcode={ost_read ost_write} rate=200 rank=computes"</screen>
1858 <para>In this example, rule "iozone_user1" is added to the front of
1859 rule "computes". We can see the order by the following command:
1861 <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
1862 ost.OSS.ost_io.nrs_tbf_rule=
1865 user1 jobid={iozone.500 dd.500} 100, ref 0
1866 iozone_user1 opcode={ost_read ost_write} 200, ref 0
1867 computes nid={192.168.1.[2-128]@tcp} 500, ref 0
1868 default * 10000, ref 0
1870 user1 jobid={iozone.500 dd.500} 100, ref 0
1871 iozone_user1 opcode={ost_read ost_write} 200, ref 0
1872 computes nid={192.168.1.[2-128]@tcp} 500, ref 0
1873 default * 10000, ref 0
1874 high_priority_requests:
1876 user1 jobid={iozone.500 dd.500} 100, ref 0
1877 iozone_user1 opcode={ost_read ost_write} 200, ref 0
1878 computes nid={192.168.1.[2-128]@tcp} 500, ref 0
1879 default * 10000, ref 0
1881 user1 jobid={iozone.500 dd.500} 100, ref 0
1882 iozone_user1 opcode={ost_read ost_write} 200, ref 0
1883 computes nid={192.168.1.[2-128]@tcp} 500, ref 0
1884 default * 10000, ref 0</screen>
1887 <para><emphasis role="bold">TBF realtime policies under congestion
1889 <para>During TBF evaluation, we find that when the sum of I/O
1890 bandwidth requirements for all classes exceeds the system capacity,
1891 the classes with the same rate limits get less bandwidth than if
1892 preconfigured evenly. The reason for this is the heavy load on a
1893 congested server will result in some missed deadlines for some
1894 classes. The number of the calculated tokens may be larger than 1
1895 during dequeuing. In the original implementation, all classes are
1896 equally handled to simply discard exceeding tokens.</para>
1897 <para>Thus, a Hard Token Compensation (HTC) strategy has been
1898 implemented. A class can be configured with the HTC feature by the
1899 rule it matches. This feature means that requests in this kind of
1900 class queues have high real-time requirements and that the bandwidth
1901 assignment must be satisfied as good as possible. When deadline
1902 misses happen, the class keeps the deadline unchanged and the time
1903 residue(the remainder of elapsed time divided by 1/r) is compensated
1904 to the next round. This ensures that the next idle I/O thread will
1905 always select this class to serve until all accumulated exceeding
1906 tokens are handled or there are no pending requests in the class
1908 <para>Command:</para>
1909 <para>A new command format is added to enable the realtime feature
1911 <screen>lctl set_param x.x.x.nrs_tbf_rule=\
1912 "start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... realtime=1</screen>
1913 <para>Example:</para>
1914 <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
1915 "start realjob jobid={dd.0} rate=100 realtime=1</screen>
1916 <para>This example rule means the RPC requests whose JobID is dd.0
1917 will be processed at the rate of 100req/sec in realtime.</para>
1922 <section xml:id="dbdoclet.delaytuning" condition='l2A'>
1925 <primary>tuning</primary>
1926 <secondary>Network Request Scheduler (NRS) Tuning</secondary>
1927 <tertiary>Delay policy</tertiary>
1928 </indexterm>Delay policy</title>
1929 <para>The NRS Delay policy seeks to perturb the timing of request
1930 processing at the PtlRPC layer, with the goal of simulating high server
1931 load, and finding and exposing timing related problems. When this policy
1932 is active, upon arrival of a request the policy will calculate an offset,
1933 within a defined, user-configurable range, from the request arrival
1934 time, to determine a time after which the request should be handled.
1935 The request is then stored using the cfs_binheap implementation,
1936 which sorts the request according to the assigned start time.
1937 Requests are removed from the binheap for handling once their start
1938 time has been passed.</para>
1939 <para>The Delay policy can be enabled on all types of PtlRPC services,
1940 and has the following tunables that can be used to adjust its behavior:
1945 <literal>{service}.nrs_delay_min</literal>
1948 <literal>{service}.nrs_delay_min</literal> tunable controls the
1949 minimum amount of time, in seconds, that a request will be delayed by
1950 this policy. The default is 5 seconds. To read this value run:</para>
1952 lctl get_param {service}.nrs_delay_min</screen>
1953 <para>For example, to read the minimum delay set on the ost_io
1954 service, run:</para>
1956 $ lctl get_param ost.OSS.ost_io.nrs_delay_min
1957 ost.OSS.ost_io.nrs_delay_min=reg_delay_min:5
1958 hp_delay_min:5</screen>
1959 <para>To set the minimum delay in RPC processing, run:</para>
1961 lctl set_param {service}.nrs_delay_min=<replaceable>0-65535</replaceable></screen>
1962 <para>This will set the minimum delay time on a given service, for both
1963 regular and high-priority RPCs (if the PtlRPC service supports
1964 high-priority RPCs), to the indicated value.</para>
1965 <para>For example, to set the minimum delay time on the ost_io service
1968 $ lctl set_param ost.OSS.ost_io.nrs_delay_min=10
1969 ost.OSS.ost_io.nrs_delay_min=10</screen>
1970 <para>For PtlRPC services that support high-priority RPCs, to set a
1971 different minimum delay time for regular and high-priority RPCs, run:
1974 lctl set_param {service}.nrs_delay_min=<replaceable>reg_delay_min|hp_delay_min</replaceable>:<replaceable>0-65535</replaceable>
1976 <para>For example, to set the minimum delay time on the ost_io service
1977 for high-priority RPCs to 3, run:</para>
1979 $ lctl set_param ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3
1980 ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3</screen>
1981 <para>Note, in all cases the minimum delay time cannot exceed the
1982 maximum delay time.</para>
1986 <literal>{service}.nrs_delay_max</literal>
1989 <literal>{service}.nrs_delay_max</literal> tunable controls the
1990 maximum amount of time, in seconds, that a request will be delayed by
1991 this policy. The default is 300 seconds. To read this value run:
1993 <screen>lctl get_param {service}.nrs_delay_max</screen>
1994 <para>For example, to read the maximum delay set on the ost_io
1995 service, run:</para>
1997 $ lctl get_param ost.OSS.ost_io.nrs_delay_max
1998 ost.OSS.ost_io.nrs_delay_max=reg_delay_max:300
1999 hp_delay_max:300</screen>
2000 <para>To set the maximum delay in RPC processing, run:</para>
2001 <screen>lctl set_param {service}.nrs_delay_max=<replaceable>0-65535</replaceable>
2003 <para>This will set the maximum delay time on a given service, for both
2004 regular and high-priority RPCs (if the PtlRPC service supports
2005 high-priority RPCs), to the indicated value.</para>
2006 <para>For example, to set the maximum delay time on the ost_io service
2009 $ lctl set_param ost.OSS.ost_io.nrs_delay_max=60
2010 ost.OSS.ost_io.nrs_delay_max=60</screen>
2011 <para>For PtlRPC services that support high-priority RPCs, to set a
2012 different maximum delay time for regular and high-priority RPCs, run:
2014 <screen>lctl set_param {service}.nrs_delay_max=<replaceable>reg_delay_max|hp_delay_max</replaceable>:<replaceable>0-65535</replaceable></screen>
2015 <para>For example, to set the maximum delay time on the ost_io service
2016 for high-priority RPCs to 30, run:</para>
2018 $ lctl set_param ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30
2019 ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30</screen>
2020 <para>Note, in all cases the maximum delay time cannot be less than the
2021 minimum delay time.</para>
2025 <literal>{service}.nrs_delay_pct</literal>
2028 <literal>{service}.nrs_delay_pct</literal> tunable controls the
2029 percentage of requests that will be delayed by this policy. The
2030 default is 100. Note, when a request is not selected for handling by
2031 the delay policy due to this variable then the request will be handled
2032 by whatever fallback policy is defined for that service. If no other
2033 fallback policy is defined then the request will be handled by the
2034 FIFO policy. To read this value run:</para>
2035 <screen>lctl get_param {service}.nrs_delay_pct</screen>
2036 <para>For example, to read the percentage of requests being delayed on
2037 the ost_io service, run:</para>
2039 $ lctl get_param ost.OSS.ost_io.nrs_delay_pct
2040 ost.OSS.ost_io.nrs_delay_pct=reg_delay_pct:100
2041 hp_delay_pct:100</screen>
2042 <para>To set the percentage of delayed requests, run:</para>
2044 lctl set_param {service}.nrs_delay_pct=<replaceable>0-100</replaceable></screen>
2045 <para>This will set the percentage of requests delayed on a given
2046 service, for both regular and high-priority RPCs (if the PtlRPC service
2047 supports high-priority RPCs), to the indicated value.</para>
2048 <para>For example, to set the percentage of delayed requests on the
2049 ost_io service to 50, run:</para>
2051 $ lctl set_param ost.OSS.ost_io.nrs_delay_pct=50
2052 ost.OSS.ost_io.nrs_delay_pct=50
2054 <para>For PtlRPC services that support high-priority RPCs, to set a
2055 different delay percentage for regular and high-priority RPCs, run:
2057 <screen>lctl set_param {service}.nrs_delay_pct=<replaceable>reg_delay_pct|hp_delay_pct</replaceable>:<replaceable>0-100</replaceable>
2059 <para>For example, to set the percentage of delayed requests on the
2060 ost_io service for high-priority RPCs to 5, run:</para>
2061 <screen>$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
2062 ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
2068 <section xml:id="dbdoclet.50438272_25884">
2071 <primary>tuning</primary>
2072 <secondary>lockless I/O</secondary>
2073 </indexterm>Lockless I/O Tunables</title>
2074 <para>The lockless I/O tunable feature allows servers to ask clients to do
2075 lockless I/O (the server does the locking on behalf of clients) for
2076 contended files to avoid lock ping-pong.</para>
2077 <para>The lockless I/O patch introduces these tunables:</para>
2081 <emphasis role="bold">OST-side:</emphasis>
2084 ldlm.namespaces.filter-<replaceable>fsname</replaceable>-*.
2087 <literal>contended_locks</literal>- If the number of lock conflicts in
2088 the scan of granted and waiting queues at contended_locks is exceeded,
2089 the resource is considered to be contended.</para>
2091 <literal>contention_seconds</literal>- The resource keeps itself in a
2092 contended state as set in the parameter.</para>
2094 <literal>max_nolock_bytes</literal>- Server-side locking set only for
2095 requests less than the blocks set in the
2096 <literal>max_nolock_bytes</literal> parameter. If this tunable is
2097 set to zero (0), it disables server-side locking for read/write
2102 <emphasis role="bold">Client-side:</emphasis>
2105 /proc/fs/lustre/llite/lustre-*
2108 <literal>contention_seconds</literal>-
2109 <literal>llite</literal> inode remembers its contended state for the
2110 time specified in this parameter.</para>
2114 <emphasis role="bold">Client-side statistics:</emphasis>
2117 <literal>/proc/fs/lustre/llite/lustre-*/stats</literal> file has new
2118 rows for lockless I/O statistics.</para>
2120 <literal>lockless_read_bytes</literal> and
2121 <literal>lockless_write_bytes</literal>- To count the total bytes read
2122 or written, the client makes its own decisions based on the request
2123 size. The client does not communicate with the server if the request
2124 size is smaller than the
2125 <literal>min_nolock_size</literal>, without acquiring locks by the
2130 <section condition="l29">
2133 <primary>tuning</primary>
2134 <secondary>with lfs ladvise</secondary>
2136 Server-Side Advice and Hinting
2138 <section><title>Overview</title>
2139 <para>Use the <literal>lfs ladvise</literal> command to give file access
2140 advices or hints to servers.</para>
2141 <screen>lfs ladvise [--advice|-a ADVICE ] [--background|-b]
2142 [--start|-s START[kMGT]]
2143 {[--end|-e END[kMGT]] | [--length|-l LENGTH[kMGT]]}
2144 <emphasis>file</emphasis> ...
2147 <informaltable frame="all">
2149 <colspec colname="c1" colwidth="50*"/>
2150 <colspec colname="c2" colwidth="50*"/>
2154 <para><emphasis role="bold">Option</emphasis></para>
2157 <para><emphasis role="bold">Description</emphasis></para>
2164 <para><literal>-a</literal>, <literal>--advice=</literal>
2165 <literal>ADVICE</literal></para>
2168 <para>Give advice or hint of type <literal>ADVICE</literal>.
2169 Advice types are:</para>
2170 <para><literal>willread</literal> to prefetch data into server
2172 <para><literal>dontneed</literal> to cleanup data cache on
2174 <para><literal>lockahead</literal> Request an LDLM extent lock
2175 of the given mode on the given byte range </para>
2176 <para><literal>noexpand</literal> Disable extent lock expansion
2177 behavior for I/O to this file descriptor</para>
2182 <para><literal>-b</literal>, <literal>--background</literal>
2186 <para>Enable the advices to be sent and handled asynchronously.
2192 <para><literal>-s</literal>, <literal>--start=</literal>
2193 <literal>START_OFFSET</literal></para>
2196 <para>File range starts from <literal>START_OFFSET</literal>
2202 <para><literal>-e</literal>, <literal>--end=</literal>
2203 <literal>END_OFFSET</literal></para>
2206 <para>File range ends at (not including)
2207 <literal>END_OFFSET</literal>. This option may not be
2208 specified at the same time as the <literal>-l</literal>
2214 <para><literal>-l</literal>, <literal>--length=</literal>
2215 <literal>LENGTH</literal></para>
2218 <para>File range has length of <literal>LENGTH</literal>.
2219 This option may not be specified at the same time as the
2220 <literal>-e</literal> option.</para>
2225 <para><literal>-m</literal>, <literal>--mode=</literal>
2226 <literal>MODE</literal></para>
2229 <para>Lockahead request mode <literal>{READ,WRITE}</literal>.
2230 Request a lock with this mode.</para>
2237 <para>Typically, <literal>lfs ladvise</literal> forwards the advice to
2238 Lustre servers without guaranteeing when and what servers will react to
2239 the advice. Actions may or may not triggered when the advices are
2240 recieved, depending on the type of the advice, as well as the real-time
2241 decision of the affected server-side components.</para>
2242 <para>A typical usage of ladvise is to enable applications and users with
2243 external knowledge to intervene in server-side cache management. For
2244 example, if a bunch of different clients are doing small random reads of a
2245 file, prefetching pages into OSS cache with big linear reads before the
2246 random IO is a net benefit. Fetching that data into each client cache with
2247 fadvise() may not be, due to much more data being sent to the client.
2250 <literal>ladvise lockahead</literal> is different in that it attempts to
2251 control LDLM locking behavior by explicitly requesting LDLM locks in
2252 advance of use. This does not directly affect caching behavior, instead
2253 it is used in special cases to avoid pathological results (lock exchange)
2254 from the normal LDLM locking behavior.
2257 Note that the <literal>noexpand</literal> advice works on a specific
2258 file descriptor, so using it via lfs has no effect. It must be used
2259 on a particular file descriptor which is used for i/o to have any effect.
2261 <para>The main difference between the Linux <literal>fadvise()</literal>
2262 system call and <literal>lfs ladvise</literal> is that
2263 <literal>fadvise()</literal> is only a client side mechanism that does
2264 not pass the advice to the filesystem, while <literal>ladvise</literal>
2265 can send advices or hints to the Lustre server side.</para>
2267 <section><title>Examples</title>
2268 <para>The following example gives the OST(s) holding the first 1GB of
2269 <literal>/mnt/lustre/file1</literal>a hint that the first 1GB of the
2270 file will be read soon.</para>
2271 <screen>client1$ lfs ladvise -a willread -s 0 -e 1048576000 /mnt/lustre/file1
2273 <para>The following example gives the OST(s) holding the first 1GB of
2274 <literal>/mnt/lustre/file1</literal> a hint that the first 1GB of file
2275 will not be read in the near future, thus the OST(s) could clear the
2276 cache of the file in the memory.</para>
2277 <screen>client1$ lfs ladvise -a dontneed -s 0 -e 1048576000 /mnt/lustre/file1
2279 <para>The following example requests an LDLM read lock on the first
2280 1 MiB of <literal>/mnt/lustre/file1</literal>. This will attempt to
2281 request a lock from the OST holding that region of the file.</para>
2282 <screen>client1$ lfs ladvise -a lockahead -m READ -s 0 -e 1M /mnt/lustre/file1
2284 <para>The following example requests an LDLM write lock on
2285 [3 MiB, 10 MiB] of <literal>/mnt/lustre/file1</literal>. This will
2286 attempt to request a lock from the OST holding that region of the
2288 <screen>client1$ lfs ladvise -a lockahead -m WRITE -s 3M -e 10M /mnt/lustre/file1
2292 <section condition="l29">
2295 <primary>tuning</primary>
2296 <secondary>Large Bulk IO</secondary>
2298 Large Bulk IO (16MB RPC)
2300 <section><title>Overview</title>
2301 <para>Beginning with Lustre 2.9, Lustre is extended to support RPCs up
2302 to 16MB in size. By enabling a larger RPC size, fewer RPCs will be
2303 required to transfer the same amount of data between clients and
2304 servers. With a larger RPC size, the OSS can submit more data to the
2305 underlying disks at once, therefore it can produce larger disk I/Os
2306 to fully utilize the increasing bandwidth of disks.</para>
2307 <para>At client connection time, clients will negotiate with
2308 servers what the maximum RPC size it is possible to use, but the
2309 client can always send RPCs smaller than this maximum.</para>
2310 <para>The parameter <literal>brw_size</literal> is used on the OST
2311 to tell the client the maximum (preferred) IO size. All clients that
2312 talk to this target should never send an RPC greater than this size.
2313 Clients can individually set a smaller RPC size limit via the
2314 <literal>osc.*.max_pages_per_rpc</literal> tunable.
2317 <para>The smallest <literal>brw_size</literal> that can be set for
2318 ZFS OSTs is the <literal>recordsize</literal> of that dataset. This
2319 ensures that the client can always write a full ZFS file block if it
2320 has enough dirty data, and does not otherwise force it to do read-
2321 modify-write operations for every RPC.
2325 <section><title>Usage</title>
2326 <para>In order to enable a larger RPC size,
2327 <literal>brw_size</literal> must be changed to an IO size value up to
2328 16MB. To temporarily change <literal>brw_size</literal>, the
2329 following command should be run on the OSS:</para>
2330 <screen>oss# lctl set_param obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
2331 <para>To persistently change <literal>brw_size</literal>, the
2332 following command should be run:</para>
2333 <screen>oss# lctl set_param -P obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
2334 <para>When a client connects to an OST target, it will fetch
2335 <literal>brw_size</literal> from the target and pick the maximum value
2336 of <literal>brw_size</literal> and its local setting for
2337 <literal>max_pages_per_rpc</literal> as the actual RPC size.
2338 Therefore, the <literal>max_pages_per_rpc</literal> on the client side
2339 would have to be set to 16M, or 4096 if the PAGESIZE is 4KB, to enable
2340 a 16MB RPC. To temporarily make the change, the following command
2341 should be run on the client to set
2342 <literal>max_pages_per_rpc</literal>:</para>
2343 <screen>client$ lctl set_param osc.<replaceable>fsname</replaceable>-OST*.max_pages_per_rpc=16M</screen>
2344 <para>To persistently make this change, the following command should
2346 <screen>client$ lctl set_param -P obdfilter.<replaceable>fsname</replaceable>-OST*.osc.max_pages_per_rpc=16M</screen>
2347 <caution><para>The <literal>brw_size</literal> of an OST can be
2348 changed on the fly. However, clients have to be remounted to
2349 renegotiate the new maximum RPC size.</para></caution>
2352 <section xml:id="dbdoclet.50438272_80545">
2355 <primary>tuning</primary>
2356 <secondary>for small files</secondary>
2357 </indexterm>Improving Lustre I/O Performance for Small Files</title>
2358 <para>An environment where an application writes small file chunks from
2359 many clients to a single file can result in poor I/O performance. To
2360 improve the performance of the Lustre file system with small files:</para>
2363 <para>Have the application aggregate writes some amount before
2364 submitting them to the Lustre file system. By default, the Lustre
2365 software enforces POSIX coherency semantics, so it results in lock
2366 ping-pong between client nodes if they are all writing to the same
2367 file at one time.</para>
2368 <para>Using MPI-IO Collective Write functionality in
2369 the Lustre ADIO driver is one way to achieve this in a straight
2370 forward manner if the application is already using MPI-IO.</para>
2373 <para>Have the application do 4kB
2374 <literal>O_DIRECT</literal> sized I/O to the file and disable locking
2375 on the output file. This avoids partial-page IO submissions and, by
2376 disabling locking, you avoid contention between clients.</para>
2379 <para>Have the application write contiguous data.</para>
2382 <para>Add more disks or use SSD disks for the OSTs. This dramatically
2383 improves the IOPS rate. Consider creating larger OSTs rather than many
2384 smaller OSTs due to less overhead (journal, connections, etc).</para>
2387 <para>Use RAID-1+0 OSTs instead of RAID-5/6. There is RAID parity
2388 overhead for writing small chunks of data to disk.</para>
2392 <section xml:id="dbdoclet.50438272_45406">
2395 <primary>tuning</primary>
2396 <secondary>write performance</secondary>
2397 </indexterm>Understanding Why Write Performance is Better Than Read
2399 <para>Typically, the performance of write operations on a Lustre cluster is
2400 better than read operations. When doing writes, all clients are sending
2401 write RPCs asynchronously. The RPCs are allocated, and written to disk in
2402 the order they arrive. In many cases, this allows the back-end storage to
2403 aggregate writes efficiently.</para>
2404 <para>In the case of read operations, the reads from clients may come in a
2405 different order and need a lot of seeking to get read from the disk. This
2406 noticeably hampers the read throughput.</para>
2407 <para>Currently, there is no readahead on the OSTs themselves, though the
2408 clients do readahead. If there are lots of clients doing reads it would not
2409 be possible to do any readahead in any case because of memory consumption
2410 (consider that even a single RPC (1 MB) readahead for 1000 clients would
2411 consume 1 GB of RAM).</para>
2412 <para>For file systems that use socklnd (TCP, Ethernet) as interconnect,
2413 there is also additional CPU overhead because the client cannot receive
2414 data without copying it from the network buffers. In the write case, the
2415 client CAN send data without the additional data copy. This means that the
2416 client is more likely to become CPU-bound during reads than writes.</para>
2420 vim:expandtab:shiftwidth=2:tabstop=8: