1 <?xml version='1.0' encoding='UTF-8'?>
2 <chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0"
3 xml:lang="en-US" xml:id="lustreproc">
4 <title xml:id="lustreproc.title">Lustre Parameters</title>
5 <para>The <literal>/proc</literal> and <literal>/sys</literal> file systems
6 act as an interface to internal data structures in the kernel. This chapter
7 describes parameters and tunables that are useful for optimizing and
8 monitoring aspects of a Lustre file system. It includes these sections:</para>
11 <para><xref linkend="dbdoclet.50438271_83523"/></para>
16 <title>Introduction to Lustre Parameters</title>
17 <para>Lustre parameters and statistics files provide an interface to
18 internal data structures in the kernel that enables monitoring and
19 tuning of many aspects of Lustre file system and application performance.
20 These data structures include settings and metrics for components such
21 as memory, networking, file systems, and kernel housekeeping routines,
22 which are available throughout the hierarchical file layout.
24 <para>Typically, metrics are accessed via <literal>lctl get_param</literal>
25 and settings are changed via <literal>lctl set_param</literal>.
26 Some data is server-only, some data is client-only, and some data is
27 exported from the client to the server and is thus duplicated in both
30 <para>In the examples in this chapter, <literal>#</literal> indicates
31 a command is entered as root. Lustre servers are named according to the
32 convention <literal><replaceable>fsname</replaceable>-<replaceable>MDT|OSTnumber</replaceable></literal>.
33 The standard UNIX wildcard designation (*) is used.</para>
35 <para>Some examples are shown below:</para>
38 <para> To obtain data from a Lustre client:</para>
39 <screen># lctl list_param osc.*
40 osc.testfs-OST0000-osc-ffff881071d5cc00
41 osc.testfs-OST0001-osc-ffff881071d5cc00
42 osc.testfs-OST0002-osc-ffff881071d5cc00
43 osc.testfs-OST0003-osc-ffff881071d5cc00
44 osc.testfs-OST0004-osc-ffff881071d5cc00
45 osc.testfs-OST0005-osc-ffff881071d5cc00
46 osc.testfs-OST0006-osc-ffff881071d5cc00
47 osc.testfs-OST0007-osc-ffff881071d5cc00
48 osc.testfs-OST0008-osc-ffff881071d5cc00</screen>
49 <para>In this example, information about OST connections available
50 on a client is displayed (indicated by "osc").</para>
55 <para> To see multiple levels of parameters, use multiple
56 wildcards:<screen># lctl list_param osc.*.*
57 osc.testfs-OST0000-osc-ffff881071d5cc00.active
58 osc.testfs-OST0000-osc-ffff881071d5cc00.blocksize
59 osc.testfs-OST0000-osc-ffff881071d5cc00.checksum_type
60 osc.testfs-OST0000-osc-ffff881071d5cc00.checksums
61 osc.testfs-OST0000-osc-ffff881071d5cc00.connect_flags
62 osc.testfs-OST0000-osc-ffff881071d5cc00.contention_seconds
63 osc.testfs-OST0000-osc-ffff881071d5cc00.cur_dirty_bytes
65 osc.testfs-OST0000-osc-ffff881071d5cc00.rpc_stats</screen></para>
70 <para> To view a specific file, use <literal>lctl get_param</literal>:
71 <screen># lctl get_param osc.lustre-OST0000*.rpc_stats</screen></para>
74 <para>For more information about using <literal>lctl</literal>, see <xref
75 xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438194_51490"/>.</para>
76 <para>Data can also be viewed using the <literal>cat</literal> command
77 with the full path to the file. The form of the <literal>cat</literal>
78 command is similar to that of the <literal>lctl get_param</literal>
79 command with some differences. Unfortunately, as the Linux kernel has
80 changed over the years, the location of statistics and parameter files
81 has also changed, which means that the Lustre parameter files may be
82 located in the <literal>/proc</literal> directory, the
83 <literal>/sys</literal> directory, or the
84 <literal>/sys/kernel/debug</literal> directory, depending on the kernel
85 version and the Lustre version being used. The <literal>lctl</literal>
86 command insulates scripts from these changes and is preferred over direct
87 file access, unless direct access is required as part of a high-performance monitoring system.
88 In the <literal>cat</literal> command:</para>
91 <para>Replace the dots in the path with slashes.</para>
94 <para>Prepend the path with the appropriate directory component:
95 <screen>/{proc,sys}/{fs,sys}/{lustre,lnet}</screen></para>
98 <para>For example, an <literal>lctl get_param</literal> command may look like
99 this:<screen># lctl get_param osc.*.uuid
100 osc.testfs-OST0000-osc-ffff881071d5cc00.uuid=594db456-0685-bd16-f59b-e72ee90e9819
101 osc.testfs-OST0001-osc-ffff881071d5cc00.uuid=594db456-0685-bd16-f59b-e72ee90e9819
103 <para>The equivalent <literal>cat</literal> command may look like this:
104 <screen># cat /proc/fs/lustre/osc/*/uuid
105 594db456-0685-bd16-f59b-e72ee90e9819
106 594db456-0685-bd16-f59b-e72ee90e9819
109 <screen># cat /sys/fs/lustre/osc/*/uuid
110 594db456-0685-bd16-f59b-e72ee90e9819
111 594db456-0685-bd16-f59b-e72ee90e9819
113 <para>The <literal>llstat</literal> utility can be used to monitor some
114 Lustre file system I/O activity over a specified time period. For more
116 <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438219_23232"/></para>
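<para>For example, assuming an OST named <literal>testfs-OST0000</literal>
whose statistics file is exported under <literal>/proc/fs/lustre</literal>,
a command such as the following might be used to sample the OST
<literal>stats</literal> file once per second:
<screen>oss# llstat -i 1 /proc/fs/lustre/obdfilter/testfs-OST0000/stats</screen></para>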
117 <para>Some data is imported from attached clients and is available in a
118 directory called <literal>exports</literal> located in the corresponding
119 per-service directory on a Lustre server. For example:
120 <screen>oss:/root# lctl list_param obdfilter.testfs-OST0000.exports.*
121 # hash ldlm_stats stats uuid</screen></para>
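<para>The statistics for an individual export can then be read with
<literal>lctl get_param</literal>; for example (the file system and OST
names below are illustrative):
<screen>oss# lctl get_param obdfilter.testfs-OST0000.exports.*.stats</screen></para>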
123 <title>Identifying Lustre File Systems and Servers</title>
124 <para>Several <literal>/proc</literal> files on the MGS list existing
125 Lustre file systems and file system servers. The examples below are for
126 a Lustre file system called
127 <literal>testfs</literal> with one MDT and three OSTs.</para>
130 <para> To view all known Lustre file systems, enter:</para>
131 <screen>mgs# lctl get_param mgs.*.filesystems
135 <para> To view the names of the servers in a file system in which at least one server is
137 enter:<screen>lctl get_param mgs.*.live.<replaceable><filesystem name></replaceable></screen></para>
138 <para>For example:</para>
139 <screen>mgs# lctl get_param mgs.*.live.testfs
147 Secure RPC Config Rules:
149 imperative_recovery_state:
153 notify_duration_total: 0.001000
154 notify_duration_max: 0.001000
155 notify_count: 4</screen>
158 <para>To view the names of all live servers in the file system as listed in
159 <literal>/proc/fs/lustre/devices</literal>, enter:</para>
160 <screen># lctl device_list
162 1 UP mgc MGC192.168.10.34@tcp 1f45bb57-d9be-2ddb-c0b0-5431a49226705
163 2 UP mdt MDS MDS_uuid 3
164 3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
165 4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 7
166 5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
167 6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
168 7 UP lov testfs-clilov-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa04
169 8 UP mdc testfs-MDT0000-mdc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05
170 9 UP osc testfs-OST0000-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05
171 10 UP osc testfs-OST0001-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05</screen>
172 <para>The information provided on each line includes:</para>
173 <para> - Device number</para>
174 <para> - Device status (UP, INactive, or STopping) </para>
175 <para> - Device name</para>
176 <para> - Device UUID</para>
177 <para> - Reference count (how many users this device has)</para>
180 <para>To display the name of any server, view the device
181 label:<screen>mds# e2label /dev/sda
182 testfs-MDT0000</screen></para>
188 <title>Tuning Multi-Block Allocation (mballoc)</title>
189 <para>Capabilities supported by <literal>mballoc</literal> include:</para>
192 <para> Pre-allocation for single files to help to reduce fragmentation.</para>
195 <para> Pre-allocation for a group of files to enable packing of small files into large,
196 contiguous chunks.</para>
199 <para> Stream allocation to help decrease the seek rate.</para>
202 <para>The following <literal>mballoc</literal> tunables are available:</para>
203 <informaltable frame="all">
205 <colspec colname="c1" colwidth="30*"/>
206 <colspec colname="c2" colwidth="70*"/>
210 <para><emphasis role="bold">Field</emphasis></para>
213 <para><emphasis role="bold">Description</emphasis></para>
221 <literal>mb_max_to_scan</literal></para>
224 <para>Maximum number of free chunks that <literal>mballoc</literal> examines before making a
225 final allocation decision, to avoid a livelock situation.</para>
231 <literal>mb_min_to_scan</literal></para>
234 <para>Minimum number of free chunks that <literal>mballoc</literal> searches before
235 picking the best chunk for allocation. This is useful for small requests to reduce
236 fragmentation of big free chunks.</para>
242 <literal>mb_order2_req</literal></para>
245 <para>For requests equal to 2^N, where N >= <literal>mb_order2_req</literal>, a
246 fast search is done using a base 2 buddy allocation service.</para>
252 <literal>mb_small_req</literal></para>
255 <para><literal>mb_small_req</literal> - Defines (in MB) the upper bound of "small
257 <para><literal>mb_large_req</literal> - Defines (in MB) the lower bound of "large
259 <para>Requests are handled differently based on size:<itemizedlist>
261 <para>< <literal>mb_small_req</literal> - Requests are packed together to
262 form large, aggregated requests.</para>
265 <para>> <literal>mb_small_req</literal> and < <literal>mb_large_req</literal>
266 - Requests are primarily allocated linearly.</para>
269 <para>> <literal>mb_large_req</literal> - Requests are allocated directly, since hard disk
270 seek time is less of a concern in this case.</para>
272 </itemizedlist></para>
273 <para>In general, small requests are combined to create larger requests, which are
274 then placed close to one another to minimize the number of seeks required to access
281 <literal>mb_large_req</literal></para>
287 <literal>mb_prealloc_table</literal></para>
290 <para>A table of values used to preallocate space when a new request is received. By
291 default, the table looks like
292 this:<screen>prealloc_table
293 4 8 16 32 64 128 256 512 1024 2048 </screen></para>
294 <para>When a new request is received, space is preallocated at the next higher
295 increment specified in the table. For example, for requests of less than 4 file
296 system blocks, 4 blocks of space are preallocated; for requests between 4 and 8, 8
297 blocks are preallocated; and so forth.</para>
298 <para>Although customized values can be entered in the table, the performance of
299 general usage file systems will not typically be improved by modifying the table (in
300 fact, in ext4 systems, the table values are fixed). However, for some specialized
301 workloads, tuning the <literal>prealloc_table</literal> values may result in smarter
302 preallocation decisions. </para>
308 <literal>mb_group_prealloc</literal></para>
311 <para>The amount of space (in kilobytes) preallocated for groups of small
318 <para>Buddy group cache information found in
319 <literal>/proc/fs/ldiskfs/<replaceable>disk_device</replaceable>/mb_groups</literal> may
320 be useful for assessing on-disk fragmentation. For
321 example:<screen>cat /proc/fs/ldiskfs/loop0/mb_groups
322 #group: free free frags first pa [ 2^0 2^1 2^2 2^3 2^4 2^5 2^6 2^7 2^8 2^9
324 #0 : 2936 2936 1 42 0 [ 0 0 0 1 1 1 1 2 0 1
325 2 0 0 0 ]</screen></para>
326 <para>In this example, the columns show:<itemizedlist>
328 <para>#group number</para>
331 <para>Available blocks in the group</para>
334 <para>Blocks free on a disk</para>
337 <para>Number of free fragments</para>
340 <para>First free block in the group</para>
343 <para>Number of preallocated chunks (not blocks)</para>
346 <para>A series of available chunks of different sizes</para>
348 </itemizedlist></para>
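<para>The <literal>mballoc</literal> tunables described above are typically
found in the same per-device directory as <literal>mb_groups</literal>
(the exact location depends on the kernel and Lustre versions in use). As
a sketch only, with a hypothetical OST device named
<literal>sda1</literal>, a tunable might be changed as follows:
<screen>oss# echo 200 > /proc/fs/ldiskfs/sda1/mb_max_to_scan</screen></para>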
351 <title>Monitoring Lustre File System I/O</title>
352 <para>A number of system utilities are provided to enable collection of data related to I/O
353 activity in a Lustre file system. In general, the data collected describes:</para>
356 <para> Data transfer rates and throughput of inputs and outputs external to the Lustre file
357 system, such as network requests or disk I/O operations performed</para>
360 <para> Data about the throughput or transfer rates of internal Lustre file system data, such
361 as locks or allocations. </para>
365 <para>It is highly recommended that you complete baseline testing for your Lustre file system
366 to determine normal I/O activity for your hardware, network, and system workloads. Baseline
367 data will allow you to easily determine when performance becomes degraded in your system.
368 Two particularly useful baseline statistics are:</para>
371 <para><literal>brw_stats</literal> – Histogram data characterizing I/O requests to the
372 OSTs. For more details, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
373 linkend="dbdoclet.50438271_55057"/>.</para>
376 <para><literal>rpc_stats</literal> – Histogram data showing information about RPCs made by
377 clients. For more details, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
378 linkend="MonitoringClientRCPStream"/>.</para>
382 <section remap="h3" xml:id="MonitoringClientRCPStream">
384 <primary>proc</primary>
385 <secondary>watching RPC</secondary>
386 </indexterm>Monitoring the Client RPC Stream</title>
387 <para>The <literal>rpc_stats</literal> file contains histogram data showing information about
388 remote procedure calls (RPCs) that have been made since this file was last cleared. The
389 histogram data can be cleared by writing any value into the <literal>rpc_stats</literal>
391 <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
392 <screen># lctl get_param osc.testfs-OST0000-osc-ffff810058d2f800.rpc_stats
393 snapshot_time: 1372786692.389858 (secs.usecs)
394 read RPCs in flight: 0
395 write RPCs in flight: 1
396 dio read RPCs in flight: 0
397 dio write RPCs in flight: 0
398 pending write pages: 256
399 pending read pages: 0
402 pages per rpc rpcs % cum % | rpcs % cum %
411 256: 850 100 100 | 18346 99 100
414 rpcs in flight rpcs % cum % | rpcs % cum %
415 0: 691 81 81 | 1740 9 9
416 1: 48 5 86 | 938 5 14
417 2: 29 3 90 | 1059 5 20
418 3: 17 2 92 | 1052 5 26
419 4: 13 1 93 | 920 5 31
420 5: 12 1 95 | 425 2 33
421 6: 10 1 96 | 389 2 35
422 7: 30 3 100 | 11373 61 97
423 8: 0 0 100 | 460 2 100
426 offset rpcs % cum % | rpcs % cum %
427 0: 850 100 100 | 18347 99 99
435 128: 0 0 100 | 4 0 100
438 <para>The header information includes:</para>
441 <para><literal>snapshot_time</literal> - UNIX epoch instant the file was read.</para>
444 <para><literal>read RPCs in flight</literal> - Number of read RPCs issued by the OSC, but
445 not complete at the time of the snapshot. This value should always be less than or equal
446 to <literal>max_rpcs_in_flight</literal>.</para>
449 <para><literal>write RPCs in flight</literal> - Number of write RPCs issued by the OSC,
450 but not complete at the time of the snapshot. This value should always be less than or
451 equal to <literal>max_rpcs_in_flight</literal>.</para>
454 <para><literal>dio read RPCs in flight</literal> - Direct I/O (as opposed to block I/O)
455 read RPCs issued but not completed at the time of the snapshot.</para>
458 <para><literal>dio write RPCs in flight</literal> - Direct I/O (as opposed to block I/O)
459 write RPCs issued but not completed at the time of the snapshot.</para>
462 <para><literal>pending write pages</literal> - Number of pending write pages that have
463 been queued for I/O in the OSC.</para>
466 <para><literal>pending read pages</literal> - Number of pending read pages that have been
467 queued for I/O in the OSC.</para>
470 <para>The tabular data is described in the table below. Each row in the table shows the number
471 of reads or writes (<literal>ios</literal>) occurring for the statistic, the relative
472 percentage (<literal>%</literal>) of total reads or writes, and the cumulative percentage
473 (<literal>cum %</literal>) to that point in the table for the statistic.</para>
474 <informaltable frame="all">
476 <colspec colname="c1" colwidth="40*"/>
477 <colspec colname="c2" colwidth="60*"/>
481 <para><emphasis role="bold">Field</emphasis></para>
484 <para><emphasis role="bold">Description</emphasis></para>
491 <para> pages per RPC</para>
494 <para>Shows cumulative RPC reads and writes organized according to the number of
495 pages in the RPC. A single page RPC increments the <literal>0:</literal>
501 <para> RPCs in flight</para>
504 <para> Shows the number of RPCs that are pending when an RPC is sent. When the first
505 RPC is sent, the <literal>0:</literal> row is incremented. If the first RPC is
506 sent while another RPC is pending, the <literal>1:</literal> row is incremented
515 <para> The page index of the first page read from or written to the object by the
522 <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
523 <para>This table provides a way to visualize the concurrency of the RPC stream. Ideally, you
524 will see a large clump around the <literal>max_rpcs_in_flight</literal> value, which shows
525 that the network is being kept busy.</para>
526 <para>For information about optimizing the client I/O RPC stream, see <xref
527 xmlns:xlink="http://www.w3.org/1999/xlink" linkend="TuningClientIORPCStream"/>.</para>
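<para>As noted above, the <literal>rpc_stats</literal> histogram can be
reset by writing a value into the file; for example (the OSC instance name
is illustrative):
<screen># lctl set_param osc.testfs-OST0000-osc-*.rpc_stats=0</screen></para>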
529 <section xml:id="lustreproc.clientstats" remap="h3">
531 <primary>proc</primary>
532 <secondary>client stats</secondary>
533 </indexterm>Monitoring Client Activity</title>
534 <para>The <literal>stats</literal> file maintains statistics accumulated during typical
535 operation of a client across the VFS interface of the Lustre file system. Only non-zero
536 parameters are displayed in the file. </para>
537 <para>Client statistics are enabled by default.</para>
539 <para>Statistics for all mounted file systems can be discovered by
540 entering:<screen>lctl get_param llite.*.stats</screen></para>
542 <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
543 <screen>client# lctl get_param llite.*.stats
544 snapshot_time 1308343279.169704 secs.usecs
545 dirty_pages_hits 14819716 samples [regs]
546 dirty_pages_misses 81473472 samples [regs]
547 read_bytes 36502963 samples [bytes] 1 26843582 55488794
548 write_bytes 22985001 samples [bytes] 0 125912 3379002
549 brw_read 2279 samples [pages] 1 1 2270
550 ioctl 186749 samples [regs]
551 open 3304805 samples [regs]
552 close 3331323 samples [regs]
553 seek 48222475 samples [regs]
554 fsync 963 samples [regs]
555 truncate 9073 samples [regs]
556 setxattr 19059 samples [regs]
557 getxattr 61169 samples [regs]
559 <para> The statistics can be cleared by echoing an empty string into the
560 <literal>stats</literal> file or by using the command:
561 <screen>lctl set_param llite.*.stats=0</screen></para>
562 <para>The statistics displayed are described in the table below.</para>
563 <informaltable frame="all">
565 <colspec colname="c1" colwidth="3*"/>
566 <colspec colname="c2" colwidth="7*"/>
570 <para><emphasis role="bold">Entry</emphasis></para>
573 <para><emphasis role="bold">Description</emphasis></para>
581 <literal>snapshot_time</literal></para>
584 <para>UNIX epoch instant the stats file was read.</para>
590 <literal>dirty_pages_hits</literal></para>
593 <para>The number of write operations that have been satisfied by the dirty page
594 cache. See <xref xmlns:xlink="http://www.w3.org/1999/xlink"
595 linkend="TuningClientIORPCStream"/> for more information about dirty cache
596 behavior in a Lustre file system.</para>
602 <literal>dirty_pages_misses</literal></para>
605 <para>The number of write operations that were not satisfied by the dirty page
612 <literal>read_bytes</literal></para>
615 <para>The number of read operations that have occurred. Three additional parameters
616 are displayed:</para>
621 <para>The minimum number of bytes read in a single request since the counter
628 <para>The maximum number of bytes read in a single request since the counter
635 <para>The accumulated sum of bytes of all read requests since the counter was
645 <literal>write_bytes</literal></para>
648 <para>The number of write operations that have occurred. Three additional parameters
649 are displayed:</para>
654 <para>The minimum number of bytes written in a single request since the
655 counter was reset.</para>
661 <para>The maximum number of bytes written in a single request since the
662 counter was reset.</para>
668 <para>The accumulated sum of bytes of all write requests since the counter was
678 <literal>brw_read</literal></para>
681 <para>The number of pages that have been read. Three additional parameters are
687 <para>The minimum number of bytes read in a single block read/write
688 (<literal>brw</literal>) read request since the counter was reset.</para>
694 <para>The maximum number of bytes read in a single <literal>brw</literal> read
695 request since the counter was reset.</para>
701 <para>The accumulated sum of bytes of all <literal>brw</literal> read requests
702 since the counter was reset.</para>
711 <literal>ioctl</literal></para>
714 <para>The number of combined file and directory <literal>ioctl</literal>
721 <literal>open</literal></para>
724 <para>The number of open operations that have succeeded.</para>
730 <literal>close</literal></para>
733 <para>The number of close operations that have succeeded.</para>
739 <literal>seek</literal></para>
742 <para>The number of times <literal>seek</literal> has been called.</para>
748 <literal>fsync</literal></para>
751 <para>The number of times <literal>fsync</literal> has been called.</para>
757 <literal>truncate</literal></para>
760 <para>The total number of calls to both locked and lockless
761 <literal>truncate</literal>.</para>
767 <literal>setxattr</literal></para>
770 <para>The number of times extended attributes have been set. </para>
776 <literal>getxattr</literal></para>
779 <para>The number of times value(s) of extended attributes have been fetched.</para>
785 <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
786 <para>Information is provided about the amount and type of I/O activity that is taking place on the
791 <primary>proc</primary>
792 <secondary>read/write survey</secondary>
793 </indexterm>Monitoring Client Read-Write Offset Statistics</title>
794 <para>When the <literal>offset_stats</literal> parameter is set, statistics are maintained for
795 occurrences of a series of read or write calls from a process that did not access the next
796 sequential location. The <literal>OFFSET</literal> field is reset to 0 (zero) whenever a
797 different file is read or written.</para>
799 <para>By default, statistics are not collected in the <literal>offset_stats</literal>,
800 <literal>extents_stats</literal>, and <literal>extents_stats_per_process</literal> files
801 to reduce monitoring overhead when this information is not needed. The collection of
802 statistics in all three of these files is activated by writing
803 anything, except for 0 (zero) and "disable", into any one of the
806 <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
807 <screen># lctl get_param llite.testfs-f57dee0.offset_stats
808 snapshot_time: 1155748884.591028 (secs.usecs)
809 RANGE RANGE SMALLEST LARGEST
810 R/W PID START END EXTENT EXTENT OFFSET
811 R 8385 0 128 128 128 0
812 R 8385 0 224 224 224 -128
813 W 8385 0 250 50 100 0
814 W 8385 100 1110 10 500 -150
815 W 8384 0 5233 5233 5233 0
816 R 8385 500 600 100 100 -610</screen>
817 <para>In this example, <literal>snapshot_time</literal> is the UNIX epoch instant the file was
818 read. The tabular data is described in the table below.</para>
819 <para>The <literal>offset_stats</literal> file can be cleared by
820 entering:<screen>lctl set_param llite.*.offset_stats=0</screen></para>
821 <informaltable frame="all">
823 <colspec colname="c1" colwidth="50*"/>
824 <colspec colname="c2" colwidth="50*"/>
828 <para><emphasis role="bold">Field</emphasis></para>
831 <para><emphasis role="bold">Description</emphasis></para>
841 <para>Indicates if the non-sequential call was a read or write.</para>
849 <para>Process ID of the process that made the read/write call.</para>
854 <para>RANGE START/RANGE END</para>
857 <para>Range in which the read/write calls were sequential.</para>
862 <para>SMALLEST EXTENT </para>
865 <para>Smallest single read/write in the corresponding range (in bytes).</para>
870 <para>LARGEST EXTENT </para>
873 <para>Largest single read/write in the corresponding range (in bytes).</para>
881 <para>Difference between the previous range end and the current range start.</para>
887 <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
888 <para>This data provides an indication of how contiguous or fragmented the data is. For
889 example, the fourth entry in the example above shows that the writes for this process were sequential
890 in the range 100 to 1110, with a minimum write of 10 bytes and a maximum write of 500 bytes.
891 The range started with an offset of -150 from the <literal>RANGE END</literal> of the
892 previous entry in the example.</para>
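<para>Remember that collection of these statistics is disabled by default.
As described above, it can be activated by writing a non-zero value into
the file, for example:
<screen>client# lctl set_param llite.*.offset_stats=1</screen></para>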
896 <primary>proc</primary>
897 <secondary>read/write survey</secondary>
898 </indexterm>Monitoring Client Read-Write Extent Statistics</title>
899 <para>For in-depth troubleshooting, client read-write extent statistics can be accessed to
900 obtain more detail about read/write I/O extents for the file system or for a particular
903 <para>By default, statistics are not collected in the <literal>offset_stats</literal>,
904 <literal>extents_stats</literal>, and <literal>extents_stats_per_process</literal> files
905 to reduce monitoring overhead when this information is not needed. The collection of
906 statistics in all three of these files is activated by writing
907 anything, except for 0 (zero) and "disable", into any one of the
911 <title>Client-Based I/O Extent Size Survey</title>
912 <para>The <literal>extents_stats</literal> histogram in the
913 <literal>llite</literal> directory shows the statistics for the sizes
914 of the read/write I/O extents. This file does not maintain per-process
915 statistics.</para>
916 <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
917 <screen># lctl get_param llite.testfs-*.extents_stats
918 snapshot_time: 1213828728.348516 (secs.usecs)
920 extents calls % cum% | calls % cum%
922 0K - 4K : 0 0 0 | 2 2 2
923 4K - 8K : 0 0 0 | 0 0 2
924 8K - 16K : 0 0 0 | 0 0 2
925 16K - 32K : 0 0 0 | 20 23 26
926 32K - 64K : 0 0 0 | 0 0 26
927 64K - 128K : 0 0 0 | 51 60 86
928 128K - 256K : 0 0 0 | 0 0 86
929 256K - 512K : 0 0 0 | 0 0 86
930 512K - 1024K : 0 0 0 | 0 0 86
931 1M - 2M : 0 0 0 | 11 13 100</screen>
932 <para>In this example, <literal>snapshot_time</literal> is the UNIX epoch instant the file
933 was read. The table shows cumulative extents organized according to size with statistics
934 provided separately for reads and writes. Each row in the table shows the number of RPCs
935 for reads and writes respectively (<literal>calls</literal>), the relative percentage of
936 total calls (<literal>%</literal>), and the cumulative percentage to
937 that point in the table of calls (<literal>cum %</literal>). </para>
938 <para> The file can be cleared by issuing the following command:
939 <screen># lctl set_param llite.testfs-*.extents_stats=1</screen></para>
942 <title>Per-Process Client I/O Statistics</title>
943 <para>The <literal>extents_stats_per_process</literal> file maintains the I/O extent size
944 statistics on a per-process basis.</para>
945 <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
946 <screen># lctl get_param llite.testfs-*.extents_stats_per_process
947 snapshot_time: 1213828762.204440 (secs.usecs)
949 extents calls % cum% | calls % cum%
952 0K - 4K : 0 0 0 | 0 0 0
953 4K - 8K : 0 0 0 | 0 0 0
954 8K - 16K : 0 0 0 | 0 0 0
955 16K - 32K : 0 0 0 | 0 0 0
956 32K - 64K : 0 0 0 | 0 0 0
957 64K - 128K : 0 0 0 | 0 0 0
958 128K - 256K : 0 0 0 | 0 0 0
959 256K - 512K : 0 0 0 | 0 0 0
960 512K - 1024K : 0 0 0 | 0 0 0
961 1M - 2M : 0 0 0 | 10 100 100
964 0K - 4K : 0 0 0 | 0 0 0
965 4K - 8K : 0 0 0 | 0 0 0
966 8K - 16K : 0 0 0 | 0 0 0
967 16K - 32K : 0 0 0 | 20 100 100
970 0K - 4K : 0 0 0 | 0 0 0
971 4K - 8K : 0 0 0 | 0 0 0
972 8K - 16K : 0 0 0 | 0 0 0
973 16K - 32K : 0 0 0 | 0 0 0
974 32K - 64K : 0 0 0 | 0 0 0
975 64K - 128K : 0 0 0 | 16 100 100
978 0K - 4K : 0 0 0 | 1 100 100
981 0K - 4K : 0 0 0 | 1 100 100
984 <para>This table shows cumulative extents organized according to size for each process ID
985 (PID) with statistics provided separately for reads and writes. Each row in the table
986 shows the number of RPCs for reads and writes respectively (<literal>calls</literal>), the
987 relative percentage of total calls (<literal>%</literal>), and the cumulative percentage
988 to that point in the table of calls (<literal>cum %</literal>). </para>
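<para>By analogy with <literal>extents_stats</literal>, this file can
normally be cleared (while keeping collection enabled) by writing a
non-zero value into it, for example:
<screen># lctl set_param llite.testfs-*.extents_stats_per_process=1</screen></para>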
991 <section xml:id="dbdoclet.50438271_55057">
993 <primary>proc</primary>
994 <secondary>block I/O</secondary>
995 </indexterm>Monitoring the OST Block I/O Stream</title>
996 <para>The <literal>brw_stats</literal> file in the <literal>obdfilter</literal> directory
997 contains histogram data showing statistics for the number of I/O requests sent to the disk,
998 their size, and whether they are contiguous on the disk or not.</para>
999 <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
1000 <para>Enter on the OSS:</para>
1001 <screen># lctl get_param obdfilter.testfs-OST0000.brw_stats
1002 snapshot_time: 1372775039.769045 (secs.usecs)
1004 pages per bulk r/w rpcs % cum % | rpcs % cum %
1005 1: 108 100 100 | 39 0 0
1010 32: 0 0 100 | 17 0 0
1011 64: 0 0 100 | 12 0 0
1012 128: 0 0 100 | 24 0 0
1013 256: 0 0 100 | 23142 99 100
1016 discontiguous pages rpcs % cum % | rpcs % cum %
1017 0: 108 100 100 | 23245 100 100
1020 discontiguous blocks rpcs % cum % | rpcs % cum %
1021 0: 108 100 100 | 23243 99 99
1022 1: 0 0 100 | 2 0 100
1025 disk fragmented I/Os ios % cum % | ios % cum %
1027 1: 14 12 100 | 23243 99 99
1028 2: 0 0 100 | 2 0 100
1031 disk I/Os in flight ios % cum % | ios % cum %
1032 1: 14 100 100 | 20896 89 89
1033 2: 0 0 100 | 1071 4 94
1034 3: 0 0 100 | 573 2 96
1035 4: 0 0 100 | 300 1 98
1036 5: 0 0 100 | 166 0 98
1037 6: 0 0 100 | 108 0 99
1038 7: 0 0 100 | 81 0 99
1039 8: 0 0 100 | 47 0 99
1040 9: 0 0 100 | 5 0 100
1043 I/O time (1/1000s) ios % cum % | ios % cum %
1046 4: 14 12 100 | 27 0 0
1048 16: 0 0 100 | 31 0 0
1049 32: 0 0 100 | 38 0 0
1050 64: 0 0 100 | 18979 81 82
1051 128: 0 0 100 | 943 4 86
1052 256: 0 0 100 | 1233 5 91
1053 512: 0 0 100 | 1825 7 99
1054 1K: 0 0 100 | 99 0 99
1055 2K: 0 0 100 | 0 0 99
1056 4K: 0 0 100 | 0 0 99
1057 8K: 0 0 100 | 49 0 100
1060 disk I/O size ios % cum % | ios % cum %
1061 4K: 14 100 100 | 41 0 0
1063 16K: 0 0 100 | 1 0 0
1064 32K: 0 0 100 | 0 0 0
1065 64K: 0 0 100 | 4 0 0
1066 128K: 0 0 100 | 17 0 0
1067 256K: 0 0 100 | 12 0 0
1068 512K: 0 0 100 | 24 0 0
1069 1M: 0 0 100 | 23142 99 100
1071 <para>The tabular data is described in the table below. Each row in the table shows the number
1072 of reads and writes occurring for the statistic (<literal>ios</literal>), the relative
1073 percentage of total reads or writes (<literal>%</literal>), and the cumulative percentage to
1074 that point in the table for the statistic (<literal>cum %</literal>). </para>
1075 <informaltable frame="all">
1077 <colspec colname="c1" colwidth="40*"/>
1078 <colspec colname="c2" colwidth="60*"/>
1082 <para><emphasis role="bold">Field</emphasis></para>
1085 <para><emphasis role="bold">Description</emphasis></para>
1093 <literal>pages per bulk r/w</literal></para>
1096 <para>Number of pages per RPC request, which should match aggregate client
1097 <literal>rpc_stats</literal> (see <xref
1098 xmlns:xlink="http://www.w3.org/1999/xlink" linkend="MonitoringClientRCPStream"
1105 <literal>discontiguous pages</literal></para>
1108 <para>Number of discontinuities in the logical file offset of each page in a single
1115 <literal>discontiguous blocks</literal></para>
1118 <para>Number of discontinuities in the physical block allocation in the file system
1119 for a single RPC.</para>
1124 <para><literal>disk fragmented I/Os</literal></para>
1127 <para>Number of I/Os that were not written entirely sequentially.</para>
1132 <para><literal>disk I/Os in flight</literal></para>
1135 <para>Number of disk I/Os currently pending.</para>
1140 <para><literal>I/O time (1/1000s)</literal></para>
1143 <para>Amount of time for each I/O operation to complete.</para>
1148 <para><literal>disk I/O size</literal></para>
1151 <para>Size of each I/O operation.</para>
1157 <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
1158 <para>This data provides an indication of extent size and distribution in the file
1163 <title>Tuning Lustre File System I/O</title>
1164 <para>Each OSC has its own tree of tunables. For example:</para>
1165 <screen>$ lctl list_param osc.*.*
1166 osc.myth-OST0000-osc-ffff8804296c2800.active
1167 osc.myth-OST0000-osc-ffff8804296c2800.blocksize
1168 osc.myth-OST0000-osc-ffff8804296c2800.checksum_dump
1169 osc.myth-OST0000-osc-ffff8804296c2800.checksum_type
1170 osc.myth-OST0000-osc-ffff8804296c2800.checksums
1171 osc.myth-OST0000-osc-ffff8804296c2800.connect_flags
1174 osc.myth-OST0000-osc-ffff8804296c2800.state
1175 osc.myth-OST0000-osc-ffff8804296c2800.stats
1176 osc.myth-OST0000-osc-ffff8804296c2800.timeouts
1177 osc.myth-OST0000-osc-ffff8804296c2800.unstable_stats
1178 osc.myth-OST0000-osc-ffff8804296c2800.uuid
1179 osc.myth-OST0001-osc-ffff8804296c2800.active
1180 osc.myth-OST0001-osc-ffff8804296c2800.blocksize
1181 osc.myth-OST0001-osc-ffff8804296c2800.checksum_dump
1182 osc.myth-OST0001-osc-ffff8804296c2800.checksum_type
1186 <para>The following sections describe some of the parameters that can
1187 be tuned in a Lustre file system.</para>
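<para>For example, a tunable can be inspected and changed on a client with
<literal>lctl</literal>; the file system name and value below are
illustrative only:
<screen>client# lctl get_param osc.testfs-*.max_rpcs_in_flight
osc.testfs-OST0000-osc-ffff881071d5cc00.max_rpcs_in_flight=8
client# lctl set_param osc.testfs-*.max_rpcs_in_flight=16</screen></para>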
1188 <section remap="h3" xml:id="TuningClientIORPCStream">
1190 <primary>proc</primary>
1191 <secondary>RPC tunables</secondary>
1192 </indexterm>Tuning the Client I/O RPC Stream</title>
1193 <para>Ideally, an optimal amount of data is packed into each I/O RPC
1194 and a consistent number of issued RPCs are in progress at any time.
1195 To help optimize the client I/O RPC stream, several tuning variables
1196 are provided to adjust behavior according to network conditions and
1197 cluster size. For information about monitoring the client I/O RPC
1199 xmlns:xlink="http://www.w3.org/1999/xlink" linkend="MonitoringClientRCPStream"/>.</para>
1200 <para>RPC stream tunables include:</para>
1204 <para><literal>osc.<replaceable>osc_instance</replaceable>.checksums</literal>
1205 - Controls whether the client will calculate data integrity
1206 checksums for the bulk data transferred to the OST. Data
1207 integrity checksums are enabled by default. The algorithm used
1208 can be set using the <literal>checksum_type</literal> parameter.
1212 <para><literal>osc.<replaceable>osc_instance</replaceable>.checksum_type</literal>
1213 - Controls the data integrity checksum algorithm used by the
1214 client. The available algorithms are determined by the set of
1215 algorithms supported by both the client and the OST. The checksum algorithm used by default is determined
1216 by first selecting the fastest algorithms available on the OST,
1217 and then selecting the fastest of those algorithms on the client,
1218 which depends on available optimizations in the CPU hardware and
1219 kernel. The default algorithm can be overridden by writing the
1220 algorithm name into the <literal>checksum_type</literal>
1221 parameter. Available checksum types can be seen on the client by
1222 reading the <literal>checksum_type</literal> parameter. Currently
1223 supported checksum types are:
1224 <literal>adler</literal>,
1225 <literal>crc32</literal>,
1226 <literal>crc32c</literal>
1228 <para condition="l2C">
1229 In Lustre release 2.12 additional checksum types were added to
1230 allow end-to-end checksum integration with T10-PI capable
1231 hardware. The client will compute the appropriate checksum
1232 type, based on the checksum type used by the storage, for the
1233 RPC checksum, which will be verified by the server and passed
1234 on to the storage. The T10-PI checksum types are:
1235 <literal>t10ip512</literal>,
1236 <literal>t10ip4K</literal>,
1237 <literal>t10crc512</literal>,
1238 <literal>t10crc4K</literal>
1242 <para><literal>osc.<replaceable>osc_instance</replaceable>.max_dirty_mb</literal>
1243 - Controls how many MiB of dirty data can be written into the
1244 client pagecache for writes by <emphasis>each</emphasis> OSC.
1245 When this limit is reached, additional writes block until
1246 previously-cached data is written to the server. This may be
1247 changed by the <literal>lctl set_param</literal> command. Only
1248 values larger than 0 and smaller than the lesser of 2048 MiB or
1249 1/4 of client RAM are valid. Performance can suffer if the
1250 client cannot aggregate enough data per OSC to form a full RPC
1251 (as set by the <literal>max_pages_per_rpc</literal> parameter),
1252 unless the application is doing very large writes itself.
1254 <para>To maximize performance, the value for
1255 <literal>max_dirty_mb</literal> is recommended to be at least
1256 4 * <literal>max_pages_per_rpc</literal> *
1257 <literal>max_rpcs_in_flight</literal>.
1261 <para><literal>osc.<replaceable>osc_instance</replaceable>.cur_dirty_bytes</literal>
1262 - A read-only value that returns the current number of bytes
1263 written and cached by this OSC.
1267 <para><literal>osc.<replaceable>osc_instance</replaceable>.max_pages_per_rpc</literal>
1268 - The maximum number of pages that will be sent in a single RPC
1269 request to the OST. The minimum value is one page and the maximum
1270 value is 16 MiB (4096 pages on systems with a <literal>PAGE_SIZE</literal>
1271 of 4 KiB), with the default value of 4 MiB in one RPC. The upper
1272 limit may also be constrained by <literal>ofd.*.brw_size</literal>
1273 setting on the OSS, and applies to all clients connected to that
1274 OST. It is also possible to specify a units suffix (e.g.
1275 <literal>max_pages_per_rpc=4M</literal>), so the RPC size can be
1276 set independently of the client <literal>PAGE_SIZE</literal>.
1280 <para><literal>osc.<replaceable>osc_instance</replaceable>.max_rpcs_in_flight</literal>
1281 - The maximum number of concurrent RPCs in flight from an OSC to
1282 its OST. If the OSC tries to initiate an RPC but finds that it
1283 already has the same number of RPCs outstanding, it will wait to
1284 issue further RPCs until some complete. The minimum setting is 1
1285 and maximum setting is 256. The default value is 8 RPCs.
1287 <para>To improve small file I/O performance, increase the
1288 <literal>max_rpcs_in_flight</literal> value.
1292 <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_cache_mb</literal>
1293 - Maximum amount of inactive data cached by the client. The
1294 default value is 3/4 of the client RAM.
1300 <para>The values for <literal><replaceable>osc_instance</replaceable></literal>
1301 and <literal><replaceable>fsname_instance</replaceable></literal>
1302 are unique to each mount point to allow associating osc, mdc, lov,
1303 lmv, and llite parameters with the same mount point. However, it is
1304 common for scripts to use a wildcard <literal>*</literal> or a
1305 filesystem-specific wildcard
1306 <literal><replaceable>fsname-*</replaceable></literal> to specify
1307 the parameter settings uniformly on all clients. For example:
1309 client$ lctl get_param osc.testfs-OST0000*.rpc_stats
1310 osc.testfs-OST0000-osc-ffff88107412f400.rpc_stats=
1311 snapshot_time: 1375743284.337839 (secs.usecs)
1312 read RPCs in flight: 0
1313 write RPCs in flight: 0
1317 <section remap="h3" xml:id="TuningClientReadahead">
1319 <primary>proc</primary>
1320 <secondary>readahead</secondary>
1321 </indexterm>Tuning File Readahead and Directory Statahead</title>
1322 <para>File readahead and directory statahead enable reading of data
1323 into memory before a process requests the data. File readahead prefetches
1324 file content data into memory for <literal>read()</literal> related
1325 calls, while directory statahead fetches file metadata into memory for
1326 <literal>readdir()</literal> and <literal>stat()</literal> related
1327 calls. When readahead and statahead work well, a process that accesses
1328 data finds that the information it needs is available immediately in
1329 memory on the client when requested without the delay of network I/O.
1331 <section remap="h4">
1332 <title>Tuning File Readahead</title>
1333 <para>File readahead is triggered when two or more sequential reads
1334 by an application fail to be satisfied by data in the Linux buffer
1335 cache. The size of the initial readahead is determined by the RPC
1336 size and the file stripe size, but will typically be at least 1 MiB.
1337 Additional readaheads grow linearly and increment until the per-file
1338 or per-system readahead cache limit on the client is reached.</para>
1339 <para>Readahead tunables include:</para>
1342 <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_read_ahead_mb</literal>
1343 - Controls the maximum amount of data readahead on a file.
1344 Files are read ahead in RPC-sized chunks (4 MiB, or the size of
1345 the <literal>read()</literal> call, if larger) after the second
1346 sequential read on a file descriptor. Random reads are done at
1347 the size of the <literal>read()</literal> call only (no
1348 readahead). Reads to non-contiguous regions of the file reset
1349 the readahead algorithm, and readahead is not triggered until
1350 sequential reads take place again.
1353 This is the global limit for all files and cannot be larger than
1354 1/2 of the client RAM. To disable readahead, set
1355 <literal>max_read_ahead_mb=0</literal>.
1359 <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_read_ahead_per_file_mb</literal>
1360 - Controls the maximum number of megabytes (MiB) of data that
1361 should be prefetched by the client when sequential reads are
1362 detected on a file. This is the per-file readahead limit and
1363 cannot be larger than <literal>max_read_ahead_mb</literal>.
1367 <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_read_ahead_whole_mb</literal>
1368 - Controls the maximum size of a file in MiB that is read in its
1369 entirety upon access, regardless of the size of the
1370 <literal>read()</literal> call. This avoids multiple small read
1371 RPCs on relatively small files, when it is not possible to
1372 efficiently detect a sequential read pattern before the whole
1375 <para>The default value is the greater of 2 MiB or the size of one
1376 RPC, as given by <literal>max_pages_per_rpc</literal>.
1382 <title>Tuning Directory Statahead and AGL</title>
1383 <para>Many system commands, such as <literal>ls -l</literal>,
1384 <literal>du</literal>, and <literal>find</literal>, traverse a
1385 directory sequentially. To make these commands run efficiently, the
1386 directory statahead can be enabled to improve the performance of
1387 directory traversal.</para>
1388 <para>The statahead tunables are:</para>
1391 <para><literal>statahead_max</literal> -
1392 Controls the maximum number of file attributes that will be
1393 prefetched by the statahead thread. By default, statahead is
1394 enabled and <literal>statahead_max</literal> is 32 files.</para>
1395 <para>To disable statahead, set <literal>statahead_max</literal>
1396 to zero via the following command on the client:</para>
1397 <screen>lctl set_param llite.*.statahead_max=0</screen>
1398 <para>To change the maximum statahead window size on a client:</para>
1399 <screen>lctl set_param llite.*.statahead_max=<replaceable>n</replaceable></screen>
1400 <para>The maximum <literal>statahead_max</literal> is 8192 files.
1402 <para>The directory statahead thread will also prefetch the file
1403 size/block attributes from the OSTs, so that all file attributes
1404 are available on the client when requested by an application.
1405 This is controlled by the asynchronous glimpse lock (AGL) setting.
1406 The AGL behaviour can be disabled by setting:</para>
1407 <screen>lctl set_param llite.*.statahead_agl=0</screen>
1410 <para><literal>statahead_stats</literal> -
1411 A read-only interface that provides current statahead and AGL
1412 statistics, such as how many times statahead/AGL has been triggered
1413 since the last mount, how many statahead/AGL failures have occurred
1414 due to an incorrect prediction or other causes.</para>
1416 <para>AGL behaviour is affected by statahead since the inodes
1417 processed by AGL are built by the statahead thread. If
1418 statahead is disabled, then AGL is also disabled.</para>
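<para>The current statahead and AGL statistics can be read on the client
with, for example:
<screen>client# lctl get_param llite.*.statahead_stats</screen></para>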
1424 <section remap="h3">
1426 <primary>proc</primary>
1427 <secondary>read cache</secondary>
1428 </indexterm>Tuning OSS Read Cache</title>
1429 <para>The OSS read cache feature provides read-only caching of data on an OSS. This
1430 functionality uses the Linux page cache to store the data and uses as much physical memory
1431 as is allocated.</para>
1432 <para>OSS read cache improves Lustre file system performance in these situations:</para>
1435 <para>Many clients are accessing the same data set (as in HPC applications or when
1436 diskless clients boot from the Lustre file system).</para>
1439 <para>One client is storing data while another client is reading it (i.e., clients are
1440 exchanging data via the OST).</para>
1443 <para>A client has very limited caching of its own.</para>
1446 <para>OSS read cache offers these benefits:</para>
1449 <para>Allows OSTs to cache read data more frequently.</para>
1452 <para>Improves repeated reads to match network speeds instead of disk speeds.</para>
1455 <para>Provides the building blocks for OST write cache (small-write aggregation).</para>
1458 <section remap="h4">
1459 <title>Using OSS Read Cache</title>
1460 <para>OSS read cache is implemented on the OSS, and does not require any special support on
1461 the client side. Since OSS read cache uses the memory available in the Linux page cache,
1462 the appropriate amount of memory for the cache should be determined based on I/O patterns;
1463 if the data is mostly reads, then more cache is required than would be needed for mostly
1465 <para>OSS read cache is managed using the following tunables:</para>
1468 <para><literal>read_cache_enable</literal> - Controls whether data read from disk during
1469 a read request is kept in memory and available for later read requests for the same
1470 data, without having to re-read it from disk. By default, read cache is enabled
1471 (<literal>read_cache_enable=1</literal>).</para>
1472 <para>When the OSS receives a read request from a client, it reads data from disk into
1473 its memory and sends the data as a reply to the request. If read cache is enabled,
1474 this data stays in memory after the request from the client has been fulfilled. When
1475 subsequent read requests for the same data are received, the OSS skips reading data
1476 from disk and the request is fulfilled from the cached data. The read cache is managed
1477 by the Linux kernel globally across all OSTs on that OSS so that the least recently
1478 used cache pages are dropped from memory when the amount of free memory is running
1480 <para>If read cache is disabled (<literal>read_cache_enable=0</literal>), the OSS
1481 discards the data after a read request from the client is serviced and, for subsequent
1482 read requests, the OSS again reads the data from disk.</para>
1483 <para>To disable read cache on all the OSTs of an OSS, run:</para>
1484 <screen>root@oss1# lctl set_param obdfilter.*.read_cache_enable=0</screen>
1485 <para>To re-enable read cache on one OST, run:</para>
1486 <screen>root@oss1# lctl set_param obdfilter.{OST_name}.read_cache_enable=1</screen>
1487 <para>To check if read cache is enabled on all OSTs on an OSS, run:</para>
1488 <screen>root@oss1# lctl get_param obdfilter.*.read_cache_enable</screen>
1491 <para><literal>writethrough_cache_enable</literal> - Controls whether data sent to the
1492 OSS as a write request is kept in the read cache and available for later reads, or if
1493 it is discarded from cache when the write is completed. By default, the writethrough
1494 cache is enabled (<literal>writethrough_cache_enable=1</literal>).</para>
1495 <para>When the OSS receives write requests from a client, it receives data from the
1496 client into its memory and writes the data to disk. If the writethrough cache is
1497 enabled, this data stays in memory after the write request is completed, allowing the
1498 OSS to skip reading this data from disk if a later read request, or partial-page write
1499 request, for the same data is received.</para>
1500 <para>If the writethrough cache is disabled
1501 (<literal>writethrough_cache_enabled=0</literal>), the OSS discards the data after
1502 the write request from the client is completed. For subsequent read requests, or
1503 partial-page write requests, the OSS must re-read the data from disk.</para>
1504 <para>Enabling writethrough cache is advisable if clients are doing small or unaligned
1505 writes that would cause partial-page updates, or if the files written by one node are
1506 immediately being accessed by other nodes. Some examples where enabling writethrough
1507 cache might be useful include producer-consumer I/O models or shared-file writes with
1508 a different node doing I/O not aligned on 4096-byte boundaries. </para>
1509 <para>Disabling the writethrough cache is advisable when files are mostly written to the
1510 file system but are not re-read within a short time period, or files are only written
1511 and re-read by the same node, regardless of whether the I/O is aligned or not.</para>
1512 <para>To disable the writethrough cache on all OSTs of an OSS, run:</para>
1513 <screen>root@oss1# lctl set_param obdfilter.*.writethrough_cache_enable=0</screen>
1514 <para>To re-enable the writethrough cache on one OST, run:</para>
1515 <screen>root@oss1# lctl set_param obdfilter.{OST_name}.writethrough_cache_enable=1</screen>
1516 <para>To check if the writethrough cache is enabled, run:</para>
1517 <screen>root@oss1# lctl get_param obdfilter.*.writethrough_cache_enable</screen>
1520 <para><literal>readcache_max_filesize</literal> - Controls the maximum size of a file
1521 that both the read cache and writethrough cache will try to keep in memory. Files
1522 larger than <literal>readcache_max_filesize</literal> will not be kept in cache for
1523 either reads or writes.</para>
1524 <para>Setting this tunable can be useful for workloads where relatively small files are
1525 repeatedly accessed by many clients, such as job startup files, executables, log
1526 files, etc., but large files are read or written only once. By not putting the larger
1527 files into the cache, it is much more likely that more of the smaller files will
1528 remain in cache for a longer time.</para>
1529 <para>When setting <literal>readcache_max_filesize</literal>, the input value can be
1530 specified in bytes, or can have a suffix to indicate other binary units such as
1531 <literal>K</literal> (kilobytes), <literal>M</literal> (megabytes),
1532 <literal>G</literal> (gigabytes), <literal>T</literal> (terabytes), or
1533 <literal>P</literal> (petabytes).</para>
1534 <para>To limit the maximum cached file size to 32 MB on all OSTs of an OSS, run:</para>
1535 <screen>root@oss1# lctl set_param obdfilter.*.readcache_max_filesize=32M</screen>
1536 <para>To disable the maximum cached file size on an OST, run:</para>
1537 <screen>root@oss1# lctl set_param obdfilter.{OST_name}.readcache_max_filesize=-1</screen>
1538 <para>To check the current maximum cached file size on all OSTs of an OSS, run:</para>
1539 <screen>root@oss1# lctl get_param obdfilter.*.readcache_max_filesize</screen>
1546 <primary>proc</primary>
1547 <secondary>OSS journal</secondary>
1548 </indexterm>Enabling OSS Asynchronous Journal Commit</title>
1549 <para>The OSS asynchronous journal commit feature asynchronously writes data to disk without
1550 forcing a journal flush. This reduces the number of seeks and significantly improves
1551 performance on some hardware.</para>
1553 <para>Asynchronous journal commit cannot work with direct I/O-originated writes
1554 (<literal>O_DIRECT</literal> flag set). In this case, a journal flush is forced. </para>
1556 <para>When the asynchronous journal commit feature is enabled, client nodes keep data in the
1557 page cache (a page reference). Lustre clients monitor the last committed transaction number
1558 (<literal>transno</literal>) in messages sent from the OSS to the clients. When a client
1559 sees that the last committed <literal>transno</literal> reported by the OSS is at least
1560 equal to the bulk write <literal>transno</literal>, it releases the reference on the
1561 corresponding pages. To avoid page references being held for too long on clients after a
1562 bulk write, a 7 second ping request is scheduled (the default OSS file system commit time
1563 interval is 5 seconds) after the bulk write reply is received, so the OSS has an opportunity
1564 to report the last committed <literal>transno</literal>.</para>
1565 <para>If the OSS crashes before the journal commit occurs, then intermediate data is lost.
1566 However, OSS recovery functionality incorporated into the asynchronous journal commit
1567 feature causes clients to replay their write requests and compensate for the missing disk
1568 updates by restoring the state of the file system.</para>
1569 <para>By default, <literal>sync_journal</literal> is enabled
1570 (<literal>sync_journal=1</literal>), so that journal entries are committed synchronously.
1571 To enable asynchronous journal commit, set the <literal>sync_journal</literal> parameter to
1572 <literal>0</literal> by entering: </para>
1573 <screen>$ lctl set_param obdfilter.*.sync_journal=0
1574 obdfilter.lol-OST0001.sync_journal=0</screen>
1575 <para>An associated <literal>sync-on-lock-cancel</literal> feature (enabled by default)
1576 addresses a data consistency issue that can result if an OSS crashes after multiple clients
1577 have written data into intersecting regions of an object, and then one of the clients also
1578 crashes. A condition is created in which the POSIX requirement for continuous writes is
1579 violated along with a potential for corrupted data. With
1580 <literal>sync-on-lock-cancel</literal> enabled, if a cancelled lock has any volatile
1581 writes attached to it, the OSS synchronously writes the journal to disk on lock
1582 cancellation. Disabling the <literal>sync-on-lock-cancel</literal> feature may enhance
1583 performance for concurrent write workloads, but it is recommended that you not disable this
1585 <para> The <literal>sync_on_lock_cancel</literal> parameter can be set to the following
1589 <para><literal>always</literal> - Always force a journal flush on lock cancellation
1590 (default when <literal>async_journal</literal> is enabled).</para>
1593 <para><literal>blocking</literal> - Force a journal flush only when the local cancellation
1594 is due to a blocking callback.</para>
1597 <para><literal>never</literal> - Do not force any journal flush (default when
1598 <literal>async_journal</literal> is disabled).</para>
1601 <para>For example, to set <literal>sync_on_lock_cancel</literal> to not force a journal
1602 flush, use a command similar to:</para>
1603 <screen>$ lctl set_param obdfilter.*.sync_on_lock_cancel=never
1604 obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
1606 <section xml:id="dbdoclet.TuningModRPCs" condition='l28'>
1609 <primary>proc</primary>
1610 <secondary>client metadata performance</secondary>
1612 Tuning the Client Metadata RPC Stream
1614 <para>The client metadata RPC stream represents the metadata RPCs issued
1615 in parallel by a client to an MDT target. The metadata RPCs can be split
1616 into two categories: the requests that do not modify the file system
1617 (like getattr operation), and the requests that do modify the file system
1618 (like create, unlink, setattr operations). To help optimize the client
1619 metadata RPC stream, several tuning variables are provided to adjust
1620 behavior according to network conditions and cluster size.</para>
1621 <para>Note that increasing the number of metadata RPCs issued in parallel
1622 might improve the performance of metadata-intensive parallel applications,
1623 but as a consequence it will consume more memory on the client and on
1626 <title>Configuring the Client Metadata RPC Stream</title>
1627 <para>The MDC <literal>max_rpcs_in_flight</literal> parameter defines
1628 the maximum number of metadata RPCs, both modifying and
1629 non-modifying RPCs, that can be sent in parallel by a client to a MDT
target. This includes all file system metadata operations, such as
file or directory stat, creation, and unlink. The default setting is 8,
the minimum setting is 1, and the maximum setting is 256.</para>
1633 <para>To set the <literal>max_rpcs_in_flight</literal> parameter, run
1634 the following command on the Lustre client:</para>
1635 <screen>client$ lctl set_param mdc.*.max_rpcs_in_flight=16</screen>
1636 <para>The MDC <literal>max_mod_rpcs_in_flight</literal> parameter
1637 defines the maximum number of file system modifying RPCs that can be
1638 sent in parallel by a client to a MDT target. For example, the Lustre
1639 client sends modify RPCs when it performs file or directory creation,
1640 unlink, access permission modification or ownership modification. The
default setting is 7 and the minimum setting is 1; the maximum value is
constrained by the <literal>max_rpcs_in_flight</literal> limit, which it must
remain strictly below, as described below.</para>
1643 <para>To set the <literal>max_mod_rpcs_in_flight</literal> parameter,
1644 run the following command on the Lustre client:</para>
1645 <screen>client$ lctl set_param mdc.*.max_mod_rpcs_in_flight=12</screen>
<para>The <literal>max_mod_rpcs_in_flight</literal> value must be
strictly less than the <literal>max_rpcs_in_flight</literal> value.
It must also be less than or equal to the MDT
<literal>max_mod_rpcs_per_client</literal> value. If either of these
conditions is not met, the setting fails and an explicit message
is written to the Lustre log.</para>
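<para>To confirm that these conditions hold on a client, both limits can be queried in a
single command; the device name and values shown here are illustrative:</para>
<screen>client$ lctl get_param mdc.*.max_rpcs_in_flight mdc.*.max_mod_rpcs_in_flight
mdc.testfs-MDT0000-mdc-ffff881071d5cc00.max_rpcs_in_flight=16
mdc.testfs-MDT0000-mdc-ffff881071d5cc00.max_mod_rpcs_in_flight=12</screen>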
1652 <para>The MDT <literal>max_mod_rpcs_per_client</literal> parameter is a
1653 tunable of the kernel module <literal>mdt</literal> that defines the
1654 maximum number of file system modifying RPCs in flight allowed per
client. The parameter can be updated at runtime, but the change takes
effect for new client connections only. The default setting is 8.
1658 <para>To set the <literal>max_mod_rpcs_per_client</literal> parameter,
1659 run the following command on the MDS:</para>
1660 <screen>mds$ echo 12 > /sys/module/mdt/parameters/max_mod_rpcs_per_client</screen>
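<para>To read the value currently in effect on the MDS, the same module parameter file can
be examined directly (the value shown is illustrative):</para>
<screen>mds$ cat /sys/module/mdt/parameters/max_mod_rpcs_per_client
12</screen>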
1663 <title>Monitoring the Client Metadata RPC Stream</title>
1664 <para>The <literal>rpc_stats</literal> file contains histogram data
1665 showing information about modify metadata RPCs. It can be helpful to
1666 identify the level of parallelism achieved by an application doing
1667 modify metadata operations.</para>
1668 <para><emphasis role="bold">Example:</emphasis></para>
1669 <screen>client$ lctl get_param mdc.*.rpc_stats
1670 snapshot_time: 1441876896.567070 (secs.usecs)
1671 modify_RPCs_in_flight: 0
1674 rpcs in flight rpcs % cum %
1687 12: 4540 18 100</screen>
1688 <para>The file information includes:</para>
1691 <para><literal>snapshot_time</literal> - UNIX epoch instant the
1692 file was read.</para>
1695 <para><literal>modify_RPCs_in_flight</literal> - Number of modify
1696 RPCs issued by the MDC, but not completed at the time of the
1697 snapshot. This value should always be less than or equal to
1698 <literal>max_mod_rpcs_in_flight</literal>.</para>
1701 <para><literal>rpcs in flight</literal> - Number of modify RPCs
that are pending when an RPC is sent, the relative percentage
1703 (<literal>%</literal>) of total modify RPCs, and the cumulative
1704 percentage (<literal>cum %</literal>) to that point.</para>
<para>If a large proportion of modify metadata RPCs are issued while the
number of pending metadata RPCs is close to the
<literal>max_mod_rpcs_in_flight</literal> value, it indicates that increasing the
<literal>max_mod_rpcs_in_flight</literal> value could
improve modify metadata performance.</para>
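<para>As an illustrative adjustment based on such a histogram, both client limits can be
raised together, keeping <literal>max_mod_rpcs_in_flight</literal> strictly below
<literal>max_rpcs_in_flight</literal> and no larger than the MDT
<literal>max_mod_rpcs_per_client</literal> value; the numbers below are examples only:</para>
<screen>client$ lctl set_param mdc.*.max_rpcs_in_flight=17
client$ lctl set_param mdc.*.max_mod_rpcs_in_flight=16</screen>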
1716 <title>Configuring Timeouts in a Lustre File System</title>
1717 <para>In a Lustre file system, RPC timeouts are set using an adaptive timeouts mechanism, which
1718 is enabled by default. Servers track RPC completion times and then report back to clients
1719 estimates for completion times for future RPCs. Clients use these estimates to set RPC
1720 timeout values. If the processing of server requests slows down for any reason, the server
1721 estimates for RPC completion increase, and clients then revise RPC timeout values to allow
1722 more time for RPC completion.</para>
1723 <para>If the RPCs queued on the server approach the RPC timeout specified by the client, to
1724 avoid RPC timeouts and disconnect/reconnect cycles, the server sends an "early reply" to the
1725 client, telling the client to allow more time. Conversely, as server processing speeds up, RPC
timeout values decrease, resulting in faster detection of a non-responsive server
and quicker connection to its failover partner.</para>
1730 <primary>proc</primary>
1731 <secondary>configuring adaptive timeouts</secondary>
1732 </indexterm><indexterm>
1733 <primary>configuring</primary>
1734 <secondary>adaptive timeouts</secondary>
1735 </indexterm><indexterm>
1736 <primary>proc</primary>
1737 <secondary>adaptive timeouts</secondary>
1738 </indexterm>Configuring Adaptive Timeouts</title>
1739 <para>The adaptive timeout parameters in the table below can be set persistently system-wide
1740 using <literal>lctl conf_param</literal> on the MGS. For example, the following command sets
1741 the <literal>at_max</literal> value for all servers and clients associated with the file
1743 <literal>testfs</literal>:<screen>lctl conf_param testfs.sys.at_max=1500</screen></para>
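<para>The same <literal>conf_param</literal> syntax applies to the other adaptive timeout
parameters listed in the table below, for example (values
illustrative):<screen>lctl conf_param testfs.sys.at_min=40
lctl conf_param testfs.sys.at_history=300</screen></para>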
1745 <para>Clients that access multiple Lustre file systems must use the same parameter values
1746 for all file systems.</para>
1748 <informaltable frame="all">
1750 <colspec colname="c1" colwidth="30*"/>
1751 <colspec colname="c2" colwidth="80*"/>
1755 <para><emphasis role="bold">Parameter</emphasis></para>
1758 <para><emphasis role="bold">Description</emphasis></para>
1766 <literal> at_min </literal></para>
1769 <para>Minimum adaptive timeout (in seconds). The default value is 0. The
1770 <literal>at_min</literal> parameter is the minimum processing time that a server
will report. Ideally, <literal>at_min</literal> should be set to its default
value. Clients base their timeouts on this value, but they do not use this value
directly.</para>
<para>If, for unknown reasons (usually due to temporary network outages), the
1775 adaptive timeout value is too short and clients time out their RPCs, you can
1776 increase the <literal>at_min</literal> value to compensate for this.</para>
1782 <literal> at_max </literal></para>
1785 <para>Maximum adaptive timeout (in seconds). The <literal>at_max</literal> parameter
is an upper limit on the service time estimate. If <literal>at_max</literal> is
1787 reached, an RPC request times out.</para>
1788 <para>Setting <literal>at_max</literal> to 0 causes adaptive timeouts to be disabled
1789 and a fixed timeout method to be used instead (see <xref
xmlns:xlink="http://www.w3.org/1999/xlink" linkend="section_c24_nt5_dl"/>).</para>
1792 <para>If slow hardware causes the service estimate to increase beyond the default
1793 value of <literal>at_max</literal>, increase <literal>at_max</literal> to the
1794 maximum time you are willing to wait for an RPC completion.</para>
1801 <literal> at_history </literal></para>
1804 <para>Time period (in seconds) within which adaptive timeouts remember the slowest
1805 event that occurred. The default is 600.</para>
1811 <literal> at_early_margin </literal></para>
1814 <para>Amount of time before the Lustre server sends an early reply (in seconds).
1815 Default is 5.</para>
1821 <literal> at_extra </literal></para>
1824 <para>Incremental amount of time that a server requests with each early reply (in
1825 seconds). The server does not know how much time the RPC will take, so it asks for
a fixed value. The default is 30, which provides a balance between sending too
many early replies for the same RPC and overestimating the actual completion
time.</para>
<para>When a server finds a queued request about to time out and needs to send an
1830 early reply out, the server adds the <literal>at_extra</literal> value. If the
1831 time expires, the Lustre server drops the request, and the client enters recovery
1832 status and reconnects to restore the connection to normal status.</para>
1833 <para>If you see multiple early replies for the same RPC asking for 30-second
1834 increases, change the <literal>at_extra</literal> value to a larger number to cut
1835 down on early replies sent and, therefore, network load.</para>
1841 <literal> ldlm_enqueue_min </literal></para>
<para>Minimum lock enqueue time (in seconds). The default is 100. The time allowed
to enqueue a lock, <literal>ldlm_enqueue</literal>, is the larger of the measured
enqueue estimate (influenced by the <literal>at_min</literal> and
<literal>at_max</literal> parameters) multiplied by a weighting factor, and the
value of <literal>ldlm_enqueue_min</literal>.</para>
1849 <para>Lustre Distributed Lock Manager (LDLM) lock enqueues have a dedicated minimum
1850 value for <literal>ldlm_enqueue_min</literal>. Lock enqueue timeouts increase as
1851 the measured enqueue times increase (similar to adaptive timeouts).</para>
1858 <title>Interpreting Adaptive Timeout Information</title>
1859 <para>Adaptive timeout information can be obtained from the <literal>timeouts</literal>
1860 files in <literal>/proc/fs/lustre/*/</literal> on each server and client using the
1861 <literal>lctl</literal> command. To read information from a <literal>timeouts</literal>
1862 file, enter a command similar to:</para>
1863 <screen># lctl get_param -n ost.*.ost_io.timeouts
1864 service : cur 33 worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2</screen>
1865 <para>In this example, the <literal>ost_io</literal> service on this node is currently
1866 reporting an estimated RPC service time of 33 seconds. The worst RPC service time was 34
1867 seconds, which occurred 26 minutes ago.</para>
1868 <para>The output also provides a history of service times. Four "bins" of adaptive
1869 timeout history are shown, with the maximum RPC time in each bin reported. In both the
1870 0-150s bin and the 150-300s bin, the maximum RPC time was 1. The 300-450s bin shows the
worst (maximum) RPC time of 33 seconds, and the 450-600s bin shows a maximum RPC time
of 2 seconds. The estimated service time is the maximum value across the four bins (33
1873 seconds in this example).</para>
1874 <para>Service times (as reported by the servers) are also tracked in the client OBDs, as
1875 shown in this example:</para>
1876 <screen># lctl get_param osc.*.timeouts
1877 last reply : 1193428639, 0d0h00m00s ago
1878 network : cur 1 worst 2 (at 1193427053, 0d0h26m26s ago) 1 1 1 1
1879 portal 6 : cur 33 worst 34 (at 1193427052, 0d0h26m27s ago) 33 33 33 2
1880 portal 28 : cur 1 worst 1 (at 1193426141, 0d0h41m38s ago) 1 1 1 1
1881 portal 7 : cur 1 worst 1 (at 1193426141, 0d0h41m38s ago) 1 0 1 1
1882 portal 17 : cur 1 worst 1 (at 1193426177, 0d0h41m02s ago) 1 0 0 1
1884 <para>In this example, portal 6, the <literal>ost_io</literal> service portal, shows the
1885 history of service estimates reported by the portal.</para>
1886 <para>Server statistic files also show the range of estimates including min, max, sum, and
1887 sumsq. For example:</para>
1888 <screen># lctl get_param mdt.*.mdt.stats
1890 req_timeout 6 samples [sec] 1 10 15 105
1895 <section xml:id="section_c24_nt5_dl">
1896 <title>Setting Static Timeouts<indexterm>
1897 <primary>proc</primary>
1898 <secondary>static timeouts</secondary>
1899 </indexterm></title>
1900 <para>The Lustre software provides two sets of static (fixed) timeouts, LND timeouts and
1901 Lustre timeouts, which are used when adaptive timeouts are not enabled.</para>
1905 <para><emphasis role="italic"><emphasis role="bold">LND timeouts</emphasis></emphasis> -
1906 LND timeouts ensure that point-to-point communications across a network complete in a
finite time in the presence of failures, such as lost packets or broken connections.
1908 LND timeout parameters are set for each individual LND.</para>
1909 <para>LND timeouts are logged with the <literal>S_LND</literal> flag set. They are not
1910 printed as console messages, so check the Lustre log for <literal>D_NETERROR</literal>
1911 messages or enable printing of <literal>D_NETERROR</literal> messages to the console
1912 using:<screen>lctl set_param printk=+neterror</screen></para>
1913 <para>Congested routers can be a source of spurious LND timeouts. To avoid this
1914 situation, increase the number of LNet router buffers to reduce back-pressure and/or
1915 increase LND timeouts on all nodes on all connected networks. Also consider increasing
1916 the total number of LNet router nodes in the system so that the aggregate router
1917 bandwidth matches the aggregate server bandwidth.</para>
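<para>As a sketch of one such adjustment, the router buffer counts are LNet module
parameters that are typically set in a modprobe configuration file on the router nodes;
the file name and values below are illustrative and should be adapted to the installed
Lustre release:</para>
<screen># /etc/modprobe.d/lustre.conf on LNet router nodes
options lnet tiny_router_buffers=2048 small_router_buffers=16384 large_router_buffers=1024</screen>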
1920 <para><emphasis role="italic"><emphasis role="bold">Lustre timeouts
1921 </emphasis></emphasis>- Lustre timeouts ensure that Lustre RPCs complete in a finite
1922 time in the presence of failures when adaptive timeouts are not enabled. Adaptive
1923 timeouts are enabled by default. To disable adaptive timeouts at run time, set
1924 <literal>at_max</literal> to 0 by running on the
1925 MGS:<screen># lctl conf_param <replaceable>fsname</replaceable>.sys.at_max=0</screen></para>
1927 <para>Changing the status of adaptive timeouts at runtime may cause a transient client
1928 timeout, recovery, and reconnection.</para>
1930 <para>Lustre timeouts are always printed as console messages. </para>
1931 <para>If Lustre timeouts are not accompanied by LND timeouts, increase the Lustre
1932 timeout on both servers and clients. Lustre timeouts are set using a command such as
1933 the following:<screen># lctl set_param timeout=30</screen></para>
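<para>Assuming the same <literal><replaceable>fsname</replaceable>.sys</literal> syntax
shown earlier for adaptive timeout parameters also applies here, the Lustre timeout can be
set persistently from the MGS, for
example:<screen># lctl conf_param testfs.sys.timeout=40</screen></para>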
1934 <para>Lustre timeout parameters are described in the table below.</para>
1937 <informaltable frame="all">
1939 <colspec colname="c1" colnum="1" colwidth="30*"/>
1940 <colspec colname="c2" colnum="2" colwidth="70*"/>
1943 <entry>Parameter</entry>
1944 <entry>Description</entry>
1949 <entry><literal>timeout</literal></entry>
1951 <para>The time that a client waits for a server to complete an RPC (default 100s).
1952 Servers wait half this time for a normal client RPC to complete and a quarter of
1953 this time for a single bulk request (read or write of up to 4 MB) to complete.
1954 The client pings recoverable targets (MDS and OSTs) at one quarter of the
1955 timeout, and the server waits one and a half times the timeout before evicting a
1956 client for being "stale."</para>
<para>The Lustre client sends periodic 'ping' messages to servers with which
it has had no communication for the specified period of time. Any network
activity between a client and a server in the file system also serves as a
ping.</para>
1964 <entry><literal>ldlm_timeout</literal></entry>
1966 <para>The time that a server waits for a client to reply to an initial AST (lock
1967 cancellation request). The default is 20s for an OST and 6s for an MDS. If the
1968 client replies to the AST, the server will give it a normal timeout (half the
1969 client timeout) to flush any dirty data and release the lock.</para>
1973 <entry><literal>fail_loc</literal></entry>
<para>An internal debugging failure hook. The default value of
<literal>0</literal> means that no failure is triggered or
simulated.</para>
1981 <entry><literal>dump_on_timeout</literal></entry>
1983 <para>Triggers a dump of the Lustre debug log when a timeout occurs. The default
1984 value of <literal>0</literal> (zero) means a dump of the Lustre debug log will
1985 not be triggered.</para>
1989 <entry><literal>dump_on_eviction</literal></entry>
1991 <para>Triggers a dump of the Lustre debug log when an eviction occurs. The default
1992 value of <literal>0</literal> (zero) means a dump of the Lustre debug log will
1993 not be triggered. </para>
2002 <section remap="h3">
2004 <primary>proc</primary>
2005 <secondary>LNet</secondary>
2006 </indexterm><indexterm>
2007 <primary>LNet</primary>
2008 <secondary>proc</secondary>
2009 </indexterm>Monitoring LNet</title>
2010 <para>LNet information is located in <literal>/proc/sys/lnet</literal> in these files:<itemizedlist>
2012 <para><literal>peers</literal> - Shows all NIDs known to this node and provides
2013 information on the queue state.</para>
2014 <para>Example:</para>
2015 <screen># lctl get_param peers
2016 nid refs state max rtr min tx min queue
2017 0@lo 1 ~rtr 0 0 0 0 0 0
2018 192.168.10.35@tcp 1 ~rtr 8 8 8 8 6 0
2019 192.168.10.36@tcp 1 ~rtr 8 8 8 8 6 0
2020 192.168.10.37@tcp 1 ~rtr 8 8 8 8 6 0</screen>
2021 <para>The fields are explained in the table below:</para>
2022 <informaltable frame="all">
2024 <colspec colname="c1" colwidth="30*"/>
2025 <colspec colname="c2" colwidth="80*"/>
2029 <para><emphasis role="bold">Field</emphasis></para>
2032 <para><emphasis role="bold">Description</emphasis></para>
2040 <literal>refs</literal>
2044 <para>A reference count. </para>
2050 <literal>state</literal>
<para>If the node is a router, indicates the state of the router. Possible
values are:</para>
<para><literal>NA</literal> - Indicates the node is not a router.</para>
<para><literal>up/down</literal> - Indicates whether the node (router) is up or
down.</para>
2070 <literal>max </literal></para>
2073 <para>Maximum number of concurrent sends from this peer.</para>
2079 <literal>rtr </literal></para>
2082 <para>Number of routing buffer credits.</para>
2088 <literal>min </literal></para>
2091 <para>Minimum number of routing buffer credits seen.</para>
2097 <literal>tx </literal></para>
2100 <para>Number of send credits.</para>
2106 <literal>min </literal></para>
2109 <para>Minimum number of send credits seen.</para>
2115 <literal>queue </literal></para>
2118 <para>Total bytes in active/queued sends.</para>
<para>Credits are initialized to allow a certain number of operations (in the example
above, eight, as shown in the <literal>max</literal> column). LNet keeps track
of the minimum number of credits ever seen over time, showing the peak congestion that
has occurred during the time monitored. Fewer available credits indicate a more
congested resource.</para>
2129 <para>The number of credits currently in flight (number of transmit credits) is shown in
2130 the <literal>tx</literal> column. The maximum number of send credits available is shown
2131 in the <literal>max</literal> column and never changes. The number of router buffers
2132 available for consumption by a peer is shown in the <literal>rtr</literal>
2134 <para>Therefore, <literal>rtr</literal> – <literal>tx</literal> is the number of transmits
2135 in flight. Typically, <literal>rtr == max</literal>, although a configuration can be set
2136 such that <literal>max >= rtr</literal>. The ratio of routing buffer credits to send
2137 credits (<literal>rtr/tx</literal>) that is less than <literal>max</literal> indicates
2138 operations are in progress. If the ratio <literal>rtr/tx</literal> is greater than
2139 <literal>max</literal>, operations are blocking.</para>
2140 <para>LNet also limits concurrent sends and number of router buffers allocated to a single
2141 peer so that no peer can occupy all these resources.</para>
2144 <para><literal>nis</literal> - Shows the current queue health on this node.</para>
2145 <para>Example:</para>
2146 <screen># lctl get_param nis
2147 nid refs peer max tx min
2149 192.168.10.34@tcp 4 8 256 256 252
2151 <para> The fields are explained in the table below.</para>
2152 <informaltable frame="all">
2154 <colspec colname="c1" colwidth="30*"/>
2155 <colspec colname="c2" colwidth="80*"/>
2159 <para><emphasis role="bold">Field</emphasis></para>
2162 <para><emphasis role="bold">Description</emphasis></para>
2170 <literal> nid </literal></para>
2173 <para>Network interface.</para>
2179 <literal> refs </literal></para>
2182 <para>Internal reference counter.</para>
2188 <literal> peer </literal></para>
2191 <para>Number of peer-to-peer send credits on this NID. Credits are used to size
2192 buffer pools.</para>
2198 <literal> max </literal></para>
2201 <para>Total number of send credits on this NID.</para>
2207 <literal> tx </literal></para>
2210 <para>Current number of send credits available on this NID.</para>
2216 <literal> min </literal></para>
2219 <para>Lowest number of send credits available on this NID.</para>
2225 <literal> queue </literal></para>
2228 <para>Total bytes in active/queued sends.</para>
2234 <para><emphasis role="bold"><emphasis role="italic">Analysis:</emphasis></emphasis></para>
<para>Subtracting <literal>tx</literal> from <literal>max</literal>
(<literal>max</literal> - <literal>tx</literal>) yields the number of sends currently
active. A large or increasing number of active sends may indicate a problem.</para>
2239 </itemizedlist></para>
2241 <section remap="h3" xml:id="dbdoclet.balancing_free_space">
2243 <primary>proc</primary>
2244 <secondary>free space</secondary>
2245 </indexterm>Allocating Free Space on OSTs</title>
2246 <para>Free space is allocated using either a round-robin or a weighted
2247 algorithm. The allocation method is determined by the maximum amount of
2248 free-space imbalance between the OSTs. When free space is relatively
2249 balanced across OSTs, the faster round-robin allocator is used, which
2250 maximizes network balancing. The weighted allocator is used when any two
2251 OSTs are out of balance by more than a specified threshold.</para>
2252 <para>Free space distribution can be tuned using these two
2253 <literal>/proc</literal> tunables:</para>
2256 <para><literal>qos_threshold_rr</literal> - The threshold at which
2257 the allocation method switches from round-robin to weighted is set
2258 in this file. The default is to switch to the weighted algorithm when
2259 any two OSTs are out of balance by more than 17 percent.</para>
2262 <para><literal>qos_prio_free</literal> - The weighting priority used
2263 by the weighted allocator can be adjusted in this file. Increasing the
2264 value of <literal>qos_prio_free</literal> puts more weighting on the
2265 amount of free space available on each OST and less on how stripes are
2266 distributed across OSTs. The default value is 91 percent weighting for
2267 free space rebalancing and 9 percent for OST balancing. When the
2268 free space priority is set to 100, weighting is based entirely on free
2269 space and location is no longer used by the striping algorithm.</para>
<para condition="l29"><literal>reserved_mb_low</literal> - The low
watermark used to stop object allocation when available space drops below
this value. The default is 0.1 percent of total OST size.</para>
<para condition="l29"><literal>reserved_mb_high</literal> - The high watermark used to resume
object allocation once available space rises above this value. The default is 0.2 percent of total
OST size.</para>
2282 <para>For more information about monitoring and managing free space, see <xref
2283 xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438209_10424"/>.</para>
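<para>As an illustrative sketch, these tunables are adjusted on the MDS with
<literal>lctl set_param</literal>. The device component of the parameter path is an
assumption here and varies by Lustre release (<literal>lov</literal> on older releases,
<literal>lod</literal> on newer ones), so the pattern below must be adapted to the local
installation:</para>
<screen># lctl set_param lod.*.qos_threshold_rr=25
# lctl set_param lod.*.qos_prio_free=95</screen>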
2285 <section remap="h3">
2287 <primary>proc</primary>
2288 <secondary>locking</secondary>
2289 </indexterm>Configuring Locking</title>
2290 <para>The <literal>lru_size</literal> parameter is used to control the number of client-side
2291 locks in an LRU cached locks queue. LRU size is dynamic, based on load to optimize the number
2292 of locks available to nodes that have different workloads (e.g., login/build nodes vs. compute
2293 nodes vs. backup nodes).</para>
2294 <para>The total number of locks available is a function of the server RAM. The default limit is
2295 50 locks/1 MB of RAM. If memory pressure is too high, the LRU size is shrunk. The number of
2296 locks on the server is limited to <emphasis role="italic">the number of OSTs per
2297 server</emphasis> * <emphasis role="italic">the number of clients</emphasis> * <emphasis
2298 role="italic">the value of the</emphasis>
2299 <literal>lru_size</literal>
2300 <emphasis role="italic">setting on the client</emphasis> as follows: </para>
2303 <para>To enable automatic LRU sizing, set the <literal>lru_size</literal> parameter to 0. In
2304 this case, the <literal>lru_size</literal> parameter shows the current number of locks
2305 being used on the export. LRU sizing is enabled by default.</para>
2308 <para>To specify a maximum number of locks, set the <literal>lru_size</literal> parameter to
2309 a value other than zero but, normally, less than 100 * <emphasis role="italic">number of
2310 CPUs in client</emphasis>. It is recommended that you only increase the LRU size on a
2311 few login nodes where users access the file system interactively.</para>
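<para>For example, on an interactive login node the limit might be raised to a fixed
value; the number below is purely illustrative:</para>
<screen>$ lctl set_param ldlm.namespaces.*.lru_size=2000</screen>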
2314 <para>To clear the LRU on a single client, and, as a result, flush client cache without changing
2315 the <literal>lru_size</literal> value, run:</para>
2316 <screen>$ lctl set_param ldlm.namespaces.<replaceable>osc_name|mdc_name</replaceable>.lru_size=clear</screen>
2317 <para>If the LRU size is set to be less than the number of existing unused locks, the unused
2318 locks are canceled immediately. Use <literal>echo clear</literal> to cancel all locks without
2319 changing the value.</para>
2321 <para>The <literal>lru_size</literal> parameter can only be set temporarily using
2322 <literal>lctl set_param</literal>; it cannot be set permanently.</para>
<para>To disable LRU sizing, run the following on the Lustre clients:</para>
<screen>$ lctl set_param ldlm.namespaces.*osc*.lru_size=$((<replaceable>NR_CPU</replaceable>*100))</screen>
<para>Replace <literal><replaceable>NR_CPU</replaceable></literal> with the number of CPUs on
the client node.</para>
2328 <para>To determine the number of locks being granted, run:</para>
2329 <screen>$ lctl get_param ldlm.namespaces.*.pool.limit</screen>
2331 <section xml:id="dbdoclet.50438271_87260">
2333 <primary>proc</primary>
2334 <secondary>thread counts</secondary>
2335 </indexterm>Setting MDS and OSS Thread Counts</title>
<para>The MDS and OSS thread count tunables can be used to set the minimum and maximum thread counts
or to get the current number of running threads for the services listed in the table
below.</para>
2339 <informaltable frame="all">
2341 <colspec colname="c1" colwidth="50*"/>
2342 <colspec colname="c2" colwidth="50*"/>
2347 <emphasis role="bold">Service</emphasis></para>
2351 <emphasis role="bold">Description</emphasis></para>
2356 <literal> mds.MDS.mdt </literal>
2359 <para>Main metadata operations service</para>
2364 <literal> mds.MDS.mdt_readpage </literal>
2367 <para>Metadata <literal>readdir</literal> service</para>
2372 <literal> mds.MDS.mdt_setattr </literal>
2375 <para>Metadata <literal>setattr/close</literal> operations service </para>
2380 <literal> ost.OSS.ost </literal>
2383 <para>Main data operations service</para>
2388 <literal> ost.OSS.ost_io </literal>
2391 <para>Bulk data I/O services</para>
2396 <literal> ost.OSS.ost_create </literal>
2399 <para>OST object pre-creation service</para>
2404 <literal> ldlm.services.ldlm_canceld </literal>
2407 <para>DLM lock cancel service</para>
2412 <literal> ldlm.services.ldlm_cbd </literal>
2415 <para>DLM lock grant service</para>
2421 <para>For each service, an entry as shown below is
2422 created:<screen>/proc/fs/lustre/<replaceable>service</replaceable>/*/threads_<replaceable>min|max|started</replaceable></screen></para>
2425 <para>To temporarily set this tunable, run:</para>
2426 <screen># lctl <replaceable>get|set</replaceable>_param <replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable> </screen>
2429 <para>To permanently set this tunable, run:</para>
2430 <screen># lctl conf_param <replaceable>obdname|fsname.obdtype</replaceable>.threads_<replaceable>min|max|started</replaceable> </screen>
2431 <para condition='l25'>For version 2.5 or later, run:
2432 <screen># lctl set_param -P <replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></screen></para>
2435 <para>The following examples show how to set thread counts and get the number of running threads
2436 for the service <literal>ost_io</literal> using the tunable
2437 <literal><replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></literal>.</para>
2440 <para>To get the number of running threads, run:</para>
2441 <screen># lctl get_param ost.OSS.ost_io.threads_started
2442 ost.OSS.ost_io.threads_started=128</screen>
<para>To view the current maximum thread count (512 in this example), run:</para>
2446 <screen># lctl get_param ost.OSS.ost_io.threads_max
2447 ost.OSS.ost_io.threads_max=512</screen>
<para>To set the maximum thread count to 256 instead of 512 (for example, to avoid
overloading the storage array with requests), run:</para>
2452 <screen># lctl set_param ost.OSS.ost_io.threads_max=256
2453 ost.OSS.ost_io.threads_max=256</screen>
2456 <para>To set the maximum thread count to 256 instead of 512 permanently, run:</para>
2457 <screen># lctl conf_param testfs.ost.ost_io.threads_max=256</screen>
2458 <para condition='l25'>For version 2.5 or later, run:
2459 <screen># lctl set_param -P ost.OSS.ost_io.threads_max=256
2460 ost.OSS.ost_io.threads_max=256 </screen> </para>
2463 <para> To check if the <literal>threads_max</literal> setting is active, run:</para>
2464 <screen># lctl get_param ost.OSS.ost_io.threads_max
2465 ost.OSS.ost_io.threads_max=256</screen>
<para>If the number of service threads is changed while the file system is running, the change
may not take effect until the file system is stopped and restarted. If the number of service
threads in use exceeds the new <literal>threads_max</literal> value setting, service threads
that are already running will not be stopped.</para>
2474 <para>See also <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="lustretuning"/></para>
2476 <section xml:id="dbdoclet.50438271_83523">
2478 <primary>proc</primary>
2479 <secondary>debug</secondary>
2480 </indexterm>Enabling and Interpreting Debugging Logs</title>
2481 <para>By default, a detailed log of all operations is generated to aid in debugging. Flags that
2482 control debugging are found in <literal>/proc/sys/lnet/debug</literal>. </para>
<para>The overhead of debugging can affect the performance of a Lustre file system. Therefore, to
minimize the impact on performance, the debug level can be lowered, which affects the amount
of debugging information kept in the internal log buffer but does not alter the amount of
information that goes to syslog. You can raise the debug level when you need to collect logs
to debug problems.</para>
2488 <para>The debugging mask can be set using "symbolic names". The symbolic format is
2489 shown in the examples below.<itemizedlist>
2491 <para>To verify the debug level used, examine the <literal>sysctl</literal> that controls
2492 debugging by running:</para>
2493 <screen># sysctl lnet.debug
2494 lnet.debug = ioctl neterror warning error emerg ha config console</screen>
2497 <para>To turn off debugging (except for network error debugging), run the following
2498 command on all nodes concerned:</para>
2499 <screen># sysctl -w lnet.debug="neterror"
2500 lnet.debug = neterror</screen>
2502 </itemizedlist><itemizedlist>
<para>To turn off debugging completely, run the following command on all nodes
concerned:</para>
<screen># sysctl -w lnet.debug=0
2507 lnet.debug = 0</screen>
2510 <para>To set an appropriate debug level for a production environment, run:</para>
2511 <screen># sysctl -w lnet.debug="warning dlmtrace error emerg ha rpctrace vfstrace"
2512 lnet.debug = warning dlmtrace error emerg ha rpctrace vfstrace</screen>
2513 <para>The flags shown in this example collect enough high-level information to aid
2514 debugging, but they do not cause any serious performance impact.</para>
2516 </itemizedlist><itemizedlist>
2518 <para>To clear all flags and set new flags, run:</para>
2519 <screen># sysctl -w lnet.debug="warning"
2520 lnet.debug = warning</screen>
2522 </itemizedlist><itemizedlist>
2524 <para>To add new flags to flags that have already been set, precede each one with a
2525 "<literal>+</literal>":</para>
2526 <screen># sysctl -w lnet.debug="+neterror +ha"
2527 lnet.debug = +neterror +ha
2529 lnet.debug = neterror warning ha</screen>
2532 <para>To remove individual flags, precede them with a
2533 "<literal>-</literal>":</para>
2534 <screen># sysctl -w lnet.debug="-ha"
2537 lnet.debug = neterror warning</screen>
<para>To verify or change the debug level, run commands such as the following:</para>
2541 <screen># lctl get_param debug
2544 # lctl set_param debug=+ha
2545 # lctl get_param debug
2548 # lctl set_param debug=-warning
2549 # lctl get_param debug
2551 neterror ha</screen>
2553 </itemizedlist></para>
2554 <para>Debugging parameters include:</para>
2557 <para><literal>subsystem_debug</literal> - Controls the debug logs for subsystems.</para>
2560 <para><literal>debug_path</literal> - Indicates the location where the debug log is dumped
2561 when triggered automatically or manually. The default path is
2562 <literal>/tmp/lustre-log</literal>.</para>
2565 <para>These parameters are also set using:<screen>sysctl -w lnet.debug={value}</screen></para>
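<para>Independently of the automatic dump triggers described above, the current contents
of the kernel debug buffer can be saved to a file at any time for later inspection; the
file name below is illustrative:</para>
<screen># lctl debug_kernel /tmp/lustre-debug.log</screen>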
2566 <para>Additional useful parameters: <itemizedlist>
<para><literal>panic_on_lbug</literal> - Causes <literal>panic</literal> to be called
2569 when the Lustre software detects an internal problem (an <literal>LBUG</literal> log
2570 entry); panic crashes the node. This is particularly useful when a kernel crash dump
2571 utility is configured. The crash dump is triggered when the internal inconsistency is
2572 detected by the Lustre software. </para>
2575 <para><literal>upcall</literal> - Allows you to specify the path to the binary which will
2576 be invoked when an <literal>LBUG</literal> log entry is encountered. This binary is
2577 called with four parameters:</para>
<para> - The string <literal>LBUG</literal>.</para>
<para> - The file where the <literal>LBUG</literal> occurred.</para>
<para> - The function name.</para>
<para> - The line number in the file.</para>
2583 </itemizedlist></para>
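<para>A minimal sketch of an <literal>upcall</literal> handler is shown below; it simply
records the four arguments described above to syslog. The script path and logging
destination are site-specific choices, not requirements:</para>
<screen>#!/bin/sh
# Illustrative LBUG upcall handler.
# Arguments passed by Lustre: $1 = "LBUG", $2 = source file,
# $3 = function name, $4 = line number.
logger -t lustre-upcall "$1 in $3() at $2:$4"</screen>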
2585 <title>Interpreting OST Statistics</title>
2587 <para>See also <xref linkend="dbdoclet.50438219_84890"/> (<literal>llobdstat</literal>) and
2588 <xref linkend="dbdoclet.50438273_80593"/> (<literal>collectl</literal>).</para>
2590 <para>OST <literal>stats</literal> files can be used to provide statistics showing activity
2591 for each OST. For example:</para>
2592 <screen># lctl get_param osc.testfs-OST0000-osc.stats
2593 snapshot_time 1189732762.835363
2598 obd_ping 212</screen>
2599 <para>Use the <literal>llstat</literal> utility to monitor statistics over time.</para>
2600 <para>To clear the statistics, use the <literal>-c</literal> option to
2601 <literal>llstat</literal>. To specify how frequently the statistics should be reported (in
2602 seconds), use the <literal>-i</literal> option. In the example below, the
2603 <literal>-c</literal> option clears the statistics and <literal>-i10</literal> option
2604 reports statistics every 10 seconds:</para>
2605 <screen role="smaller">$ llstat -c -i10 /proc/fs/lustre/ost/OSS/ost_io/stats
2607 /usr/bin/llstat: STATS on 06/06/07
2608 /proc/fs/lustre/ost/OSS/ost_io/ stats on 192.168.16.35@tcp
2609 snapshot_time 1181074093.276072
2611 /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074103.284895
2613 Count Rate Events Unit last min avg max stddev
2614 req_waittime 8 0 8 [usec] 2078 34 259.75 868 317.49
2615 req_qdepth 8 0 8 [reqs] 1 0 0.12 1 0.35
2616 req_active 8 0 8 [reqs] 11 1 1.38 2 0.52
2617 reqbuf_avail 8 0 8 [bufs] 511 63 63.88 64 0.35
2618 ost_write 8 0 8 [bytes] 169767 72914 212209.62 387579 91874.29
2620 /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074113.290180
2622 Count Rate Events Unit last min avg max stddev
2623 req_waittime 31 3 39 [usec] 30011 34 822.79 12245 2047.71
2624 req_qdepth 31 3 39 [reqs] 0 0 0.03 1 0.16
2625 req_active 31 3 39 [reqs] 58 1 1.77 3 0.74
2626 reqbuf_avail 31 3 39 [bufs] 1977 63 63.79 64 0.41
2627 ost_write 30 3 38 [bytes] 1028467 15019 315325.16 910694 197776.51
2629 /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074123.325560
2631 Count Rate Events Unit last min avg max stddev
2632 req_waittime 21 2 60 [usec] 14970 34 784.32 12245 1878.66
2633 req_qdepth 21 2 60 [reqs] 0 0 0.02 1 0.13
2634 req_active 21 2 60 [reqs] 33 1 1.70 3 0.70
2635 reqbuf_avail 21 2 60 [bufs] 1341 63 63.82 64 0.39
2636 ost_write 21 2 59 [bytes] 7648424 15019 332725.08 910694 180397.87
2638 <para>The columns in this example are described in the table below.</para>
2639 <informaltable frame="all">
2641 <colspec colname="c1" colwidth="50*"/>
2642 <colspec colname="c2" colwidth="50*"/>
2646 <para><emphasis role="bold">Parameter</emphasis></para>
2649 <para><emphasis role="bold">Description</emphasis></para>
2655 <entry><literal>Name</literal></entry>
2656 <entry>Name of the service event. See the tables below for descriptions of service
2657 events that are tracked.</entry>
2662 <literal>Cur. Count </literal></para>
2665 <para>Number of events of each type sent in the last interval.</para>
2671 <literal>Cur. Rate </literal></para>
2674 <para>Number of events per second in the last interval.</para>
2680 <literal> # Events </literal></para>
2683 <para>Total number of such events since the events have been cleared.</para>
2689 <literal> Unit </literal></para>
<para>Unit of measurement for that statistic (microseconds, requests, buffers, and so on).</para>
2699 <literal> last </literal></para>
<para>Average rate of these events (in units/event) for the last interval during
which they arrived. For instance, for an
<literal>ost_destroy</literal> line reporting 400 object destroys in the previous
10 seconds, a value of 736 would mean an average of 736 microseconds per destroy.</para>
2711 <literal> min </literal></para>
2714 <para>Minimum rate (in units/events) since the service started.</para>
2720 <literal> avg </literal></para>
2723 <para>Average rate.</para>
2729 <literal> max </literal></para>
2732 <para>Maximum rate.</para>
2738 <literal> stddev </literal></para>
2741 <para>Standard deviation (not measured in some cases)</para>
2747 <para>Events common to all services are shown in the table below.</para>
2748 <informaltable frame="all">
2750 <colspec colname="c1" colwidth="50*"/>
2751 <colspec colname="c2" colwidth="50*"/>
2755 <para><emphasis role="bold">Parameter</emphasis></para>
2758 <para><emphasis role="bold">Description</emphasis></para>
2766 <literal> req_waittime </literal></para>
2769 <para>Amount of time a request waited in the queue before being handled by an
2770 available server thread.</para>
2776 <literal> req_qdepth </literal></para>
2779 <para>Number of requests waiting to be handled in the queue for this service.</para>
2785 <literal> req_active </literal></para>
2788 <para>Number of requests currently being handled.</para>
2794 <literal> reqbuf_avail </literal></para>
<para>Number of unsolicited LNet request buffers for this service.</para>
2803 <para>Some service-specific events of interest are described in the table below.</para>
2804 <informaltable frame="all">
2806 <colspec colname="c1" colwidth="50*"/>
2807 <colspec colname="c2" colwidth="50*"/>
2811 <para><emphasis role="bold">Parameter</emphasis></para>
2814 <para><emphasis role="bold">Description</emphasis></para>
2822 <literal> ldlm_enqueue </literal></para>
2825 <para>Time it takes to enqueue a lock (this includes file open on the MDS)</para>
2831 <literal> mds_reint </literal></para>
2834 <para>Time it takes to process an MDS modification record (includes
2835 <literal>create</literal>, <literal>mkdir</literal>, <literal>unlink</literal>,
2836 <literal>rename</literal> and <literal>setattr</literal>)</para>
2844 <title>Interpreting MDT Statistics</title>
2846 <para>See also <xref linkend="dbdoclet.50438219_84890"/> (<literal>llobdstat</literal>) and
2847 <xref linkend="dbdoclet.50438273_80593"/> (<literal>collectl</literal>).</para>
2849 <para>MDT <literal>stats</literal> files can be used to track MDT
2850 statistics for the MDS. The example below shows sample output from an
2851 MDT <literal>stats</literal> file.</para>
2852 <screen># lctl get_param mds.*-MDT0000.stats
2853 snapshot_time 1244832003.676892 secs.usecs
2854 open 2 samples [reqs]
2855 close 1 samples [reqs]
2856 getxattr 3 samples [reqs]
2857 process_config 1 samples [reqs]
2858 connect 2 samples [reqs]
2859 disconnect 2 samples [reqs]
2860 statfs 3 samples [reqs]
2861 setattr 1 samples [reqs]
2862 getattr 3 samples [reqs]
2863 llog_init 6 samples [reqs]
2864 notify 16 samples [reqs]</screen>