X-Git-Url: https://git.whamcloud.com/?a=blobdiff_plain;f=LustreProc.xml;h=64b3f9f2887b86259eb0a6f5902c43667f0357e3;hb=5b911534777ba9fe568cd837bf25af611fed2af3;hp=2347368aa58efcf44f05c88a06685d4a5867f62b;hpb=ac892fc8bedc81a600b17e221a74f489b82315cc;p=doc%2Fmanual.git

diff --git a/LustreProc.xml b/LustreProc.xml
index 2347368..64b3f9f 100644
--- a/LustreProc.xml
+++ b/LustreProc.xml
@@ -1,10 +1,11 @@
-
  LustreProc
-  The /proc file system acts as an interface to internal data structures in
-  the kernel. This chapter describes entries in /proc that are useful for
-  tuning and monitoring aspects of a Lustre file system. It includes these sections:
+  Lustre Parameters
+  The /proc and /sys file systems
+  act as an interface to internal data structures in the kernel. This chapter
+  describes parameters and tunables that are useful for optimizing and
+  monitoring aspects of a Lustre file system. It includes these sections:
   
     
       
@@ -12,26 +13,30 @@
 
-  Introduction to <literal>/proc</literal>
-  The /proc directory provides an interface to internal data structures
-  in the kernel that enables monitoring and tuning of many aspects of Lustre file system and
-  application performance These data structures include settings and metrics for components such
-  as memory, networking, file systems, and kernel housekeeping routines, which are available
-  throughout the hierarchical file layout in /proc.
+  Introduction to Lustre Parameters
+  Lustre parameters and statistics files provide an interface to
+  internal data structures in the kernel that enables monitoring and
+  tuning of many aspects of a Lustre file system and application performance.
+  These data structures include settings and metrics for components such
+  as memory, networking, file systems, and kernel housekeeping routines,
+  which are available throughout the hierarchical file layout.
 
-  Typically, metrics are accessed by reading from /proc files and
-  settings are changed by writing to /proc files. Some data is server-only,
-  some data is client-only, and some data is exported from the client to the server and is thus
-  duplicated in both locations.
+  Typically, metrics are accessed via lctl get_param
+  and settings are changed via lctl set_param.
+  While it is possible to access parameters in /proc
+  and /sys directly, the location of these parameters may
+  change between releases, so it is recommended to always use
+  lctl to access the parameters from userspace scripts.
+  Some data is server-only, some data is client-only, and some data is
+  exported from the client to the server and is thus duplicated in both
+  locations.
 
-  In the examples in this chapter, # indicates a command is entered as
-  root.  Servers are named according to the convention
-  fsname-MDT|OSTnumber.
+  In the examples in this chapter, # indicates
+  a command is entered as root. Lustre servers are named according to the
+  convention fsname-MDT|OSTnumber.
     The standard UNIX wildcard designation (*) is used.
 
-  In most cases, information is accessed using the lctl get_param command
-  and settings are changed using the lctl set_param command. Some examples
-  are shown below:
+  Some examples are shown below:
     
       To obtain data from a Lustre client:
@@ -45,8 +50,8 @@ osc.testfs-OST0005-osc-ffff881071d5cc00
 osc.testfs-OST0006-osc-ffff881071d5cc00
 osc.testfs-OST0007-osc-ffff881071d5cc00
 osc.testfs-OST0008-osc-ffff881071d5cc00
-  In this example, information about OST connections available on a client is displayed
-  (indicated by "osc").
+  In this example, information about OST connections available
+  on a client is displayed (indicated by "osc").
     
 
     
@@ -66,22 +71,32 @@ osc.testfs-OST0000-osc-ffff881071d5cc00.rpc_stats
 
-  To view a specific file, use lctl get_param
-  :# lctl get_param osc.lustre-OST0000-osc-ffff881071d5cc00.rpc_stats
+  To view a specific file, use lctl get_param:
+  # lctl get_param osc.lustre-OST0000*.rpc_stats
 
   For more information about using lctl, see .
-  Data can also be viewed using the cat command with the full path to the
-  file. The form of the cat command is similar to that of the lctl
-  get_param command with these differences. In the cat command:
+  Data can also be viewed using the cat command
+  with the full path to the file. The form of the cat
+  command is similar to that of the lctl get_param
+  command with some differences. 
Unfortunately, as the Linux kernel has
+  changed over the years, the location of statistics and parameter files
+  has also changed, which means that the Lustre parameter files may be
+  located in the /proc directory, the
+  /sys directory, or the
+  /sys/kernel/debug directory, depending on the kernel
+  version and the Lustre version being used. The lctl
+  command insulates scripts from these changes and is preferred over direct
+  file access, except when used as part of a high-performance monitoring
+  system.
+  In the cat command:
     
-  Replace the dots in the path with slashes.
+  Replace the dots in the path with slashes.
     
     
-  Prepend the path with the following as
-  appropriate:/proc/{fs,sys}/{lustre,lnet}
+  Prepend the path with the appropriate directory component:
+  /{proc,sys}/{fs,sys}/{lustre,lnet}
     
   For example, an lctl get_param command may look like
@@ -89,23 +104,30 @@ osc.testfs-OST0000-osc-ffff881071d5cc00.rpc_stats
 osc.testfs-OST0000-osc-ffff881071d5cc00.uuid=594db456-0685-bd16-f59b-e72ee90e9819
 osc.testfs-OST0001-osc-ffff881071d5cc00.uuid=594db456-0685-bd16-f59b-e72ee90e9819
 ...
-  The equivalent cat command looks like
-  this:# cat /proc/fs/lustre/osc/*/uuid
+  The equivalent cat command may look like this:
+  # cat /proc/fs/lustre/osc/*/uuid
 594db456-0685-bd16-f59b-e72ee90e9819
 594db456-0685-bd16-f59b-e72ee90e9819
 ...
-  The llstat utility can be used to monitor some Lustre file system I/O
-  activity over a specified time period. For more details, see 
-  Some data is imported from attached clients and is available in a directory called
-  exports located in the corresponding per-service directory on a Lustre
-  server. For
-  example:# ls /proc/fs/lustre/obdfilter/testfs-OST0000/exports/192.168.124.9\@o2ib1/
+  or like this:
+  # cat /sys/fs/lustre/osc/*/uuid
+594db456-0685-bd16-f59b-e72ee90e9819
+594db456-0685-bd16-f59b-e72ee90e9819
+...
+  The llstat utility can be used to monitor some
+  Lustre file system I/O activity over a specified time period. For more
+  details, see 
+  
+  Some data is imported from attached clients and is available in a
+  directory called exports located in the corresponding
+  per-service directory on a Lustre server. For example:
+  oss:/root# lctl list_param obdfilter.testfs-OST0000.exports.*
+# hash ldlm_stats stats uuid
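+  For example, to read the statistics that an OST keeps for a single
+  client export, the per-export stats file listed above
+  can be read directly. The NID, file system name, and counter values
+  below are illustrative only:
+  oss# lctl get_param obdfilter.testfs-OST0000.exports.192.168.124.9@o2ib1.stats
+obdfilter.testfs-OST0000.exports.192.168.124.9@o2ib1.stats=
+snapshot_time             1589909641.123456 secs.usecs
+read_bytes                875 samples [bytes] 4096 1048576 780140544
+write_bytes               412 samples [bytes] 4096 1048576 376438784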
Identifying Lustre File Systems and Servers - Several /proc files on the MGS list existing Lustre file systems and - file system servers. The examples below are for a Lustre file system called + Several parameter files on the MGS list existing + Lustre file systems and file system servers. The examples below are for + a Lustre file system called testfs with one MDT and three OSTs. @@ -137,8 +159,7 @@ imperative_recovery_state: notify_count: 4 - To view the names of all live servers in the file system as listed in - /proc/fs/lustre/devices, enter: + To list all configured devices on the local node, enter: # lctl device_list 0 UP mgs MGS MGS 11 1 UP mgc MGC192.168.10.34@tcp 1f45bb57-d9be-2ddb-c0b0-5431a49226705 @@ -266,7 +287,7 @@ testfs-MDT0000 - mb_prealloc_table + prealloc_table A table of values used to preallocate space when a new request is received. By @@ -297,9 +318,40 @@ testfs-MDT0000 + Buddy group cache information found in + /sys/fs/ldiskfs/disk_device/mb_groups may + be useful for assessing on-disk fragmentation. For + example:cat /proc/fs/ldiskfs/loop0/mb_groups +#group: free free frags first pa [ 2^0 2^1 2^2 2^3 2^4 2^5 2^6 2^7 2^8 2^9 + 2^10 2^11 2^12 2^13] +#0 : 2936 2936 1 42 0 [ 0 0 0 1 1 1 1 2 0 1 + 2 0 0 0 ] + In this example, the columns show: + + #group number + + + Available blocks in the group + + + Blocks free on a disk + + + Number of free fragments + + + First free block in the group + + + Number of preallocated chunks (not blocks) + + + A series of available chunks of different sizes + +
- Monitoring Lustre File System I/O + Monitoring Lustre File System I/O A number of system utilities are provided to enable collection of data related to I/O activity in a Lustre file system. In general, the data collected describes: @@ -389,7 +441,7 @@ offset rpcs % cum % | rpcs % cum % The header information includes: - snapshot_time - UNIX* epoch instant the file was read. + snapshot_time - UNIX epoch instant the file was read. read RPCs in flight - Number of read RPCs issued by the OSC, but @@ -477,105 +529,6 @@ offset rpcs % cum % | rpcs % cum % For information about optimizing the client I/O RPC stream, see .
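+  When analyzing a specific workload, it can be helpful to reset these
+  counters immediately before the test so that the histograms reflect
+  only the I/O of interest. A minimal sketch (the file system name is
+  illustrative; writing a value into rpc_stats is
+  expected to clear the accumulated counters):
+  client# lctl set_param osc.testfs-OST*.rpc_stats=0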
-
- <indexterm> - <primary>proc</primary> - <secondary>read/write survey</secondary> - </indexterm>Monitoring Client Read-Write Offset Statistics - The offset_stats parameter maintains statistics for occurrences of a - series of read or write calls from a process that did not access the next sequential - location. The OFFSET field is reset to 0 (zero) whenever a different file - is read or written. - Read/write offset statistics are "off" by default. The statistics can be activated by - writing anything into the offset_stats file. - The offset_stats file can be cleared by - entering:lctl set_param llite.*.offset_stats=0 - Example: - # lctl get_param llite.testfs-f57dee0.offset_stats -snapshot_time: 1155748884.591028 (secs.usecs) - RANGE RANGE SMALLEST LARGEST -R/W PID START END EXTENT EXTENT OFFSET -R 8385 0 128 128 128 0 -R 8385 0 224 224 224 -128 -W 8385 0 250 50 100 0 -W 8385 100 1110 10 500 -150 -W 8384 0 5233 5233 5233 0 -R 8385 500 600 100 100 -610 - In this example, snapshot_time is the UNIX epoch instant the file was - read. The tabular data is described in the table below. - - - - - - - - Field - - - Description - - - - - - - R/W - - - Indicates if the non-sequential call was a read or write - - - - - PID - - - Process ID of the process that made the read/write call. - - - - - RANGE START/RANGE END - - - Range in which the read/write calls were sequential. - - - - - SMALLEST EXTENT - - - Smallest single read/write in the corresponding range (in bytes). - - - - - LARGEST EXTENT - - - Largest single read/write in the corresponding range (in bytes). - - - - - OFFSET - - - Difference between the previous range end and the current range start. - - - - - - Analysis: - This data provides an indication of how contiguous or fragmented the data is. For - example, the fourth entry in the example above shows the writes for this RPC were sequential - in the range 100 to 1110 with the minimum write 10 bytes and the maximum write 500 bytes. - The range started with an offset of -150 from the RANGE END of the - previous entry in the example. -
<indexterm> <primary>proc</primary> @@ -584,9 +537,7 @@ R 8385 500 600 100 100 -610</screen> <para>The <literal>stats</literal> file maintains statistics accumulate during typical operation of a client across the VFS interface of the Lustre file system. Only non-zero parameters are displayed in the file. </para> - <para>Client statistics are enabled by default. The statistics can be cleared by echoing an - empty string into the <literal>stats</literal> file or by using the command: - <screen>lctl set_param llite.*.stats=0</screen></para> + <para>Client statistics are enabled by default.</para> <note> <para>Statistics for all mounted file systems can be discovered by entering:<screen>lctl get_param llite.*.stats</screen></para> @@ -608,6 +559,9 @@ truncate 9073 samples [regs] setxattr 19059 samples [regs] getxattr 61169 samples [regs] </screen> + <para> The statistics can be cleared by echoing an empty string into the + <literal>stats</literal> file or by using the command: + <screen>lctl set_param llite.*.stats=0</screen></para> <para>The statistics displayed are described in the table below.</para> <informaltable frame="all"> <tgroup cols="2"> @@ -839,22 +793,135 @@ getxattr 61169 samples [regs] <title><indexterm> <primary>proc</primary> <secondary>read/write survey</secondary> + </indexterm>Monitoring Client Read-Write Offset Statistics + When the offset_stats parameter is set, statistics are maintained for + occurrences of a series of read or write calls from a process that did not access the next + sequential location. The OFFSET field is reset to 0 (zero) whenever a + different file is read or written. + + By default, statistics are not collected in the offset_stats, + extents_stats, and extents_stats_per_process files + to reduce monitoring overhead when this information is not needed. The collection of + statistics in all three of these files is activated by writing + anything, except for 0 (zero) and "disable", into any one of the + files. + + Example: + # lctl get_param llite.testfs-f57dee0.offset_stats +snapshot_time: 1155748884.591028 (secs.usecs) + RANGE RANGE SMALLEST LARGEST +R/W PID START END EXTENT EXTENT OFFSET +R 8385 0 128 128 128 0 +R 8385 0 224 224 224 -128 +W 8385 0 250 50 100 0 +W 8385 100 1110 10 500 -150 +W 8384 0 5233 5233 5233 0 +R 8385 500 600 100 100 -610 + In this example, snapshot_time is the UNIX epoch instant the file was + read. The tabular data is described in the table below. + The offset_stats file can be cleared by + entering:lctl set_param llite.*.offset_stats=0 + + + + + + + + Field + + + Description + + + + + + + R/W + + + Indicates if the non-sequential call was a read or write + + + + + PID + + + Process ID of the process that made the read/write call. + + + + + RANGE START/RANGE END + + + Range in which the read/write calls were sequential. + + + + + SMALLEST EXTENT + + + Smallest single read/write in the corresponding range (in bytes). + + + + + LARGEST EXTENT + + + Largest single read/write in the corresponding range (in bytes). + + + + + OFFSET + + + Difference between the previous range end and the current range start. + + + + + + Analysis: + This data provides an indication of how contiguous or fragmented the data is. For + example, the fourth entry in the example above shows the writes for this RPC were sequential + in the range 100 to 1110 with the minimum write 10 bytes and the maximum write 500 bytes. + The range started with an offset of -150 from the RANGE END of the + previous entry in the example. +
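+  To start collecting these statistics on a client, write a value other
+  than 0 and "disable" into the file, as described in the note above.
+  For example (the file system name is illustrative):
+  client# lctl set_param llite.testfs-*.offset_stats=1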
+
+ <indexterm> + <primary>proc</primary> + <secondary>read/write survey</secondary> </indexterm>Monitoring Client Read-Write Extent Statistics For in-depth troubleshooting, client read-write extent statistics can be accessed to obtain more detail about read/write I/O extents for the file system or for a particular process. + + By default, statistics are not collected in the offset_stats, + extents_stats, and extents_stats_per_process files + to reduce monitoring overhead when this information is not needed. The collection of + statistics in all three of these files is activated by writing + anything, except for 0 (zero) and "disable", into any one of the + files. +
Client-Based I/O Extent Size Survey - The rw_extent_stats histogram in the llite - directory shows the statistics for the sizes of the read?write I/O extents. This file does - not maintain the per-process statistics. The file can be cleared by issuing the following - command:# lctl set_param llite.testfs-*.extents_stats=0 + The extents_stats histogram in the + llite directory shows the statistics for the sizes + of the read/write I/O extents. This file does not maintain the per + process statistics. Example: # lctl get_param llite.testfs-*.extents_stats snapshot_time: 1213828728.348516 (secs.usecs) read | write extents calls % cum% | calls % cum% - + 0K - 4K : 0 0 0 | 2 2 2 4K - 8K : 0 0 0 | 0 0 2 8K - 16K : 0 0 0 | 0 0 2 @@ -869,8 +936,10 @@ extents calls % cum% | calls % cum% was read. The table shows cumulative extents organized according to size with statistics provided separately for reads and writes. Each row in the table shows the number of RPCs for reads and writes respectively (calls), the relative percentage of - total calls (%), and the cumulative percentage to that point in the - table of calls (cum %). + total calls (%), and the cumulative percentage to + that point in the table of calls (cum %). + The file can be cleared by issuing the following command: + # lctl set_param llite.testfs-*.extents_stats=1
Per-Process Client I/O Statistics @@ -1095,82 +1164,152 @@ disk I/O size ios % cum % | ios % cum %
Tuning Lustre File System I/O
-  Each OSC has its own tree of  tunables. For example:
-  $ ls -d /proc/fs/testfs/osc/OSC_client_ost1_MNT_client_2 /localhost
-/proc/fs/testfs/osc/OSC_uml0_ost1_MNT_localhost
-/proc/fs/testfs/osc/OSC_uml0_ost2_MNT_localhost
-/proc/fs/testfs/osc/OSC_uml0_ost3_MNT_localhost
-
-$ ls /proc/fs/testfs/osc/OSC_uml0_ost1_MNT_localhost
-blocksizefilesfree max_dirty_mb ost_server_uuid stats
-
-...
-  The following sections describe some of the parameters that can be tuned in a Lustre file
-  system.
+  Each OSC has its own tree of tunables. For example:
+  $ lctl list_param osc.*.*
+osc.myth-OST0000-osc-ffff8804296c2800.active
+osc.myth-OST0000-osc-ffff8804296c2800.blocksize
+osc.myth-OST0000-osc-ffff8804296c2800.checksum_dump
+osc.myth-OST0000-osc-ffff8804296c2800.checksum_type
+osc.myth-OST0000-osc-ffff8804296c2800.checksums
+osc.myth-OST0000-osc-ffff8804296c2800.connect_flags
+:
+:
+osc.myth-OST0000-osc-ffff8804296c2800.state
+osc.myth-OST0000-osc-ffff8804296c2800.stats
+osc.myth-OST0000-osc-ffff8804296c2800.timeouts
+osc.myth-OST0000-osc-ffff8804296c2800.unstable_stats
+osc.myth-OST0000-osc-ffff8804296c2800.uuid
+osc.myth-OST0001-osc-ffff8804296c2800.active
+osc.myth-OST0001-osc-ffff8804296c2800.blocksize
+osc.myth-OST0001-osc-ffff8804296c2800.checksum_dump
+osc.myth-OST0001-osc-ffff8804296c2800.checksum_type
+:
+:
+
+  The following sections describe some of the parameters that can
+  be tuned in a Lustre file system.
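+  For example, the current values of a few commonly tuned OSC parameters
+  can be read on a client with a single command using wildcards. The
+  file system name, instance identifier, and values shown below are
+  illustrative only:
+client$ lctl get_param osc.testfs-OST0000*.max_pages_per_rpc osc.testfs-OST0000*.max_rpcs_in_flight osc.testfs-OST0000*.max_dirty_mb
+osc.testfs-OST0000-osc-ffff88107412f400.max_pages_per_rpc=1024
+osc.testfs-OST0000-osc-ffff88107412f400.max_rpcs_in_flight=8
+osc.testfs-OST0000-osc-ffff88107412f400.max_dirty_mb=32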
<indexterm>
    <primary>proc</primary>
    <secondary>RPC tunables</secondary>
  </indexterm>Tuning the Client I/O RPC Stream
-  Ideally, an optimal amount of data is packed into each I/O RPC and a consistent number
-  of issued RPCs are in progress at any time. To help optimize the client I/O RPC stream,
-  several tuning variables are provided to adjust behavior according to network conditions and
-  cluster size. For information about monitoring the client I/O RPC stream, see 
+  Ideally, an optimal amount of data is packed into each I/O RPC
+  and a consistent number of issued RPCs are in progress at any time.
+  To help optimize the client I/O RPC stream, several tuning variables
+  are provided to adjust behavior according to network conditions and
+  cluster size. For information about monitoring the client I/O RPC
+  stream, see .
   RPC stream tunables include:
     
-  osc.osc_instance.max_dirty_mb -
-  Controls how many MBs of dirty data can be written and queued up in the OSC. POSIX
-  file writes that are cached contribute to this count. When the limit is reached,
-  additional writes stall until previously-cached writes are written to the server. This
-  may be changed by writing a single ASCII integer to the file. Only values between 0
-  and 2048 or 1/4 of RAM are allowable. If 0 is specified, no writes are cached.
-  Performance suffers noticeably unless you use large writes (1 MB or more).
-  To maximize performance, the value for max_dirty_mb is
-  recommended to be 4 * max_pages_per_rpc *
-  max_rpcs_in_flight.
+  osc.osc_instance.checksums
+  - Controls whether the client will calculate data integrity
+  checksums for the bulk data transferred to the OST. Data
+  integrity checksums are enabled by default. The algorithm used
+  can be set using the checksum_type parameter.
+  
+  
+  osc.osc_instance.checksum_type
+  - Controls the data integrity checksum algorithm used by the
+  client. The available algorithms are determined by the set of
+  algorithms supported by both the client and the OST. The checksum
+  algorithm used by default is determined
+  by first selecting the fastest algorithms available on the OST,
+  and then selecting the fastest of those algorithms on the client,
+  which depends on available optimizations in the CPU hardware and
+  kernel. The default algorithm can be overridden by writing the
+  algorithm name into the checksum_type
+  parameter. Available checksum types can be seen on the client by
+  reading the checksum_type parameter. Currently
+  supported checksum types are:
+  adler,
+  crc32,
+  crc32c
+  
+  
+  In Lustre release 2.12 additional checksum types were added to
+  allow end-to-end checksum integration with T10-PI capable
+  hardware. The client will compute the appropriate checksum
+  type, based on the checksum type used by the storage, for the
+  RPC checksum, which will be verified by the server and passed
+  on to the storage. The T10-PI checksum types are:
+  t10ip512,
+  t10ip4K,
+  t10crc512,
+  t10crc4K
+  
     
-  osc.osc_instance.cur_dirty_bytes - A
-  read-only value that returns the current number of bytes written and cached on this
-  OSC.
+  osc.osc_instance.max_dirty_mb
+  - Controls how many MiB of dirty data can be written into the
+  client pagecache for writes by each OSC.
+  When this limit is reached, additional writes block until
+  previously-cached data is written to the server. This may be
+  changed by the lctl set_param command. Only
+  values larger than 0 and smaller than the lesser of 2048 MiB or
+  1/4 of client RAM are valid. 
Performance can suffer if the
+  client cannot aggregate enough data per OSC to form a full RPC
+  (as set by the max_pages_per_rpc parameter),
+  unless the application is doing very large writes itself.
+  
+  To maximize performance, the value for
+  max_dirty_mb is recommended to be at least
+  4 * max_pages_per_rpc *
+  max_rpcs_in_flight. A worked example is
+  shown at the end of this section.
+  
     
-  osc.osc_instance.cur_dirty_bytes - A
-  read-only value that returns the current number of bytes written and cached on this
-  OSC.
+  osc.osc_instance.cur_dirty_bytes
+  - A read-only value that returns the current number of bytes
+  written and cached by this OSC.
+  
     
-  osc.osc_instance.max_pages_per_rpc -
-  The maximum number of pages that will undergo I/O in a single RPC to the OST. The
-  minimum setting is a single page and the maximum setting is 1024 (for systems with a
-  PAGE_SIZE of 4 KB), with the default maximum of 1 MB in the RPC.
-  It is also possible to specify a units suffix (e.g. 4M), so that
-  the RPC size can be specified independently of the client
-  PAGE_SIZE.
+  osc.osc_instance.max_pages_per_rpc
+  - The maximum number of pages that will be sent in a single RPC
+  request to the OST. The minimum value is one page and the maximum
+  value is 16 MiB (4096 on systems with PAGE_SIZE
+  of 4 KiB), with the default value of 4 MiB in one RPC. The upper
+  limit may also be constrained by the ofd.*.brw_size
+  setting on the OSS, and applies to all clients connected to that
+  OST. It is also possible to specify a units suffix (e.g.
+  max_pages_per_rpc=4M), so the RPC size can be
+  set independently of the client PAGE_SIZE.
+  
     
   osc.osc_instance.max_rpcs_in_flight
-  - The maximum number of concurrent RPCs in flight from an OSC to its OST. If the OSC
-  tries to initiate an RPC but finds that it already has the same number of RPCs
-  outstanding, it will wait to issue further RPCs until some complete. The minimum
-  setting is 1 and maximum setting is 256.
+  - The maximum number of concurrent RPCs in flight from an OSC to
+  its OST. If the OSC tries to initiate an RPC but finds that it
+  already has the same number of RPCs outstanding, it will wait to
+  issue further RPCs until some complete. The minimum setting is 1
+  and maximum setting is 256. The default value is 8 RPCs.
+  
   To improve small file I/O performance, increase the
-  max_rpcs_in_flight value.
+  max_rpcs_in_flight value.
+  
     
-  llite.fsname-instance/max_cache_mb -
-  Maximum amount of inactive data cached by the client (default is 3/4 of RAM). For
-  example:
-  # lctl get_param llite.testfs-ce63ca00.max_cached_mb
-128
+  llite.fsname_instance.max_cache_mb
+  - Maximum amount of inactive data cached by the client. The
+  default value is 3/4 of the client RAM.
+  
     
-  The value for osc_instance is typically
-  fsname-OSTost_index-osc-mountpoint_instance,
-  where the value for mountpoint_instance is
-  unique to each mount point to allow associating osc, mdc, lov, lmv, and llite parameters
-  with the same mount point. For
-  example:lctl get_param osc.testfs-OST0000-osc-ffff88107412f400.rpc_stats
+  The values for osc_instance
+  and fsname_instance
+  are unique to each mount point to allow associating osc, mdc, lov,
+  lmv, and llite parameters with the same mount point. However, it is
+  common for scripts to use a wildcard * or a
+  filesystem-specific wildcard
+  fsname-* to specify
+  the parameter settings uniformly on all clients. For example:
+
+client$ lctl get_param osc.testfs-OST0000*.rpc_stats
 osc.testfs-OST0000-osc-ffff88107412f400.rpc_stats=
 snapshot_time: 1375743284.337839 (secs.usecs)
 read RPCs in flight: 0
@@ -1178,90 +1317,108 @@ write RPCs in flight: 0
 
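+  As a worked example of the guideline above, a client configured for
+  4 MiB RPCs (max_pages_per_rpc=1024 with 4 KiB
+  pages) and 16 RPCs in flight would want
+  max_dirty_mb of at least 4 * 4 MiB * 16 = 256 MiB
+  per OSC. The commands below are only a sketch and should be adjusted
+  to the actual workload and available client RAM:
+client# lctl set_param osc.testfs-OST*.max_pages_per_rpc=1024
+client# lctl set_param osc.testfs-OST*.max_rpcs_in_flight=16
+client# lctl set_param osc.testfs-OST*.max_dirty_mb=256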
-
+
<indexterm> <primary>proc</primary> <secondary>readahead</secondary> </indexterm>Tuning File Readahead and Directory Statahead - File readahead and directory statahead enable reading of data into memory before a - process requests the data. File readahead reads file content data into memory and directory - statahead reads metadata into memory. When readahead and statahead work well, a process that - accesses data finds that the information it needs is available immediately when requested in - memory without the delay of network I/O. - In Lustre release 2.2.0, the directory statahead feature was improved to - enhance directory traversal performance. The improvements primarily addressed two - issues: - - - A race condition existed between the statahead thread and other VFS operations while - processing asynchronous getattr RPC replies, causing duplicate - entries in dcache. This issue was resolved by using statahead local dcache. - - - File size/block attributes pre-fetching was not supported, so the traversing thread - had to send synchronous glimpse size RPCs to OST(s). This issue was resolved by using - asynchronous glimpse lock (AGL) RPCs to pre-fetch file size/block attributes from - OST(s). - - + File readahead and directory statahead enable reading of data + into memory before a process requests the data. File readahead prefetches + file content data into memory for read() related + calls, while directory statahead fetches file metadata into memory for + readdir() and stat() related + calls. When readahead and statahead work well, a process that accesses + data finds that the information it needs is available immediately in + memory on the client when requested without the delay of network I/O. +
Tuning File Readahead - File readahead is triggered when two or more sequential reads by an application fail - to be satisfied by data in the Linux buffer cache. The size of the initial readahead is 1 - MB. Additional readaheads grow linearly and increment until the readahead cache on the - client is full at 40 MB. + File readahead is triggered when two or more sequential reads + by an application fail to be satisfied by data in the Linux buffer + cache. The size of the initial readahead is determined by the RPC + size and the file stripe size, but will typically be at least 1 MiB. + Additional readaheads grow linearly and increment until the per-file + or per-system readahead cache limit on the client is reached. Readahead tunables include: - llite.fsname-instance.max_read_ahead_mb - - Controls the maximum amount of data readahead on a file. Files are read ahead in - RPC-sized chunks (1 MB or the size of the read() call, if larger) - after the second sequential read on a file descriptor. Random reads are done at the - size of the read() call only (no readahead). Reads to - non-contiguous regions of the file reset the readahead algorithm, and readahead is not - triggered again until sequential reads take place again. - To disable readahead, set this tunable to 0. The default value is 40 MB. + llite.fsname_instance.max_read_ahead_mb + - Controls the maximum amount of data readahead on a file. + Files are read ahead in RPC-sized chunks (4 MiB, or the size of + the read() call, if larger) after the second + sequential read on a file descriptor. Random reads are done at + the size of the read() call only (no + readahead). Reads to non-contiguous regions of the file reset + the readahead algorithm, and readahead is not triggered until + sequential reads take place again. + + + This is the global limit for all files and cannot be larger than + 1/2 of the client RAM. To disable readahead, set + max_read_ahead_mb=0. + - llite.fsname-instance.max_read_ahead_whole_mb - - Controls the maximum size of a file that is read in its entirety, regardless of the - size of the read(). + llite.fsname_instance.max_read_ahead_per_file_mb + - Controls the maximum number of megabytes (MiB) of data that + should be prefetched by the client when sequential reads are + detected on a file. This is the per-file readahead limit and + cannot be larger than max_read_ahead_mb. + + + + llite.fsname_instance.max_read_ahead_whole_mb + - Controls the maximum size of a file in MiB that is read in its + entirety upon access, regardless of the size of the + read() call. This avoids multiple small read + RPCs on relatively small files, when it is not possible to + efficiently detect a sequential read pattern before the whole + file has been read. + + The default value is the greater of 2 MiB or the size of one + RPC, as given by max_pages_per_rpc. +
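+  For example, on a client with ample RAM the readahead limits might be
+  raised as shown below. The values are illustrative only and must stay
+  within the limits described above:
+client# lctl set_param llite.testfs-*.max_read_ahead_mb=256
+client# lctl set_param llite.testfs-*.max_read_ahead_per_file_mb=64
+client# lctl set_param llite.testfs-*.max_read_ahead_whole_mb=64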
Tuning Directory Statahead and AGL - Many system commands, such as ls –l, du, and - find, traverse a directory sequentially. To make these commands run - efficiently, the directory statahead and asynchronous glimpse lock (AGL) can be enabled to - improve the performance of traversing. + Many system commands, such as ls –l, + du, and find, traverse a + directory sequentially. To make these commands run efficiently, the + directory statahead can be enabled to improve the performance of + directory traversal. The statahead tunables are: - statahead_max - Controls whether directory statahead is enabled - and the maximum statahead window size (i.e., how many files can be pre-fetched by the - statahead thread). By default, statahead is enabled and the value of - statahead_max is 32. - To disable statahead, run: + statahead_max - + Controls the maximum number of file attributes that will be + prefetched by the statahead thread. By default, statahead is + enabled and statahead_max is 32 files. + To disable statahead, set statahead_max + to zero via the following command on the client: lctl set_param llite.*.statahead_max=0 - To set the maximum statahead window size (n), - run: + To change the maximum statahead window size on a client: lctl set_param llite.*.statahead_max=n - The maximum value of n is 8192. - The AGL can be controlled by entering: - lctl set_param llite.*.statahead_agl=n - The default value for n is 1, which enables the AGL. If - n is 0, the AGL is disabled. + The maximum statahead_max is 8192 files. + + The directory statahead thread will also prefetch the file + size/block attributes from the OSTs, so that all file attributes + are available on the client when requested by an application. + This is controlled by the asynchronous glimpse lock (AGL) setting. + The AGL behaviour can be disabled by setting: + lctl set_param llite.*.statahead_agl=0 - statahead_stats - A read-only interface that indicates the - current statahead and AGL statistics, such as how many times statahead/AGL has been - triggered since the last mount, how many statahead/AGL failures have occurred due to - an incorrect prediction or other causes. + statahead_stats - + A read-only interface that provides current statahead and AGL + statistics, such as how many times statahead/AGL has been triggered + since the last mount, how many statahead/AGL failures have occurred + due to an incorrect prediction or other causes. - The AGL is affected by statahead because the inodes processed by AGL are built - by the statahead thread, which means the statahead thread is the input of the AGL - pipeline. So if statahead is disabled, then the AGL is disabled by force. + AGL behaviour is affected by statahead since the inodes + processed by AGL are built by the statahead thread. If + statahead is disabled, then AGL is also disabled. @@ -1360,7 +1517,7 @@ write RPCs in flight: 0 To re-enable the writethrough cache on one OST, run: root@oss1# lctl set_param obdfilter.{OST_name}.writethrough_cache_enable=1 To check if the writethrough cache is enabled, run: - root@oss1# lctl set_param obdfilter.*.writethrough_cache_enable=1 + root@oss1# lctl get_param obdfilter.*.writethrough_cache_enable readcache_max_filesize - Controls the maximum size of a file @@ -1449,6 +1606,114 @@ obdfilter.lol-OST0001.sync_journal=0 $ lctl get_param obdfilter.*.sync_on_lock_cancel obdfilter.lol-OST0001.sync_on_lock_cancel=never
+
+
    
      <indexterm>
        <primary>proc</primary>
        <secondary>client metadata performance</secondary>
      </indexterm>
      Tuning the Client Metadata RPC Stream
    
    The client metadata RPC stream represents the metadata RPCs issued
    in parallel by a client to an MDT target. The metadata RPCs can be split
    into two categories: the requests that do not modify the file system
    (like the getattr operation), and the requests that do modify the file
    system (like the create, unlink, and setattr operations). To help
    optimize the client metadata RPC stream, several tuning variables are
    provided to adjust behavior according to network conditions and cluster
    size.
    Note that increasing the number of metadata RPCs issued in parallel
    might improve the performance of metadata-intensive parallel
    applications, but as a consequence it will consume more memory on the
    client and on the MDS.
Configuring the Client Metadata RPC Stream
  The MDC max_rpcs_in_flight parameter defines
  the maximum number of metadata RPCs, both modifying and
  non-modifying RPCs, that can be sent in parallel by a client to an MDT
  target. This includes all file system metadata operations, such as
  file or directory stat, creation, and unlink. The default setting is 8,
  minimum setting is 1 and maximum setting is 256.
  To set the max_rpcs_in_flight parameter, run
  the following command on the Lustre client:
  client$ lctl set_param mdc.*.max_rpcs_in_flight=16
  The MDC max_mod_rpcs_in_flight parameter
  defines the maximum number of file system modifying RPCs that can be
  sent in parallel by a client to an MDT target. For example, the Lustre
  client sends modify RPCs when it performs file or directory creation,
  unlink, access permission modification or ownership modification. The
  default setting is 7, minimum setting is 1 and maximum setting is
  256.
  To set the max_mod_rpcs_in_flight parameter,
  run the following command on the Lustre client:
  client$ lctl set_param mdc.*.max_mod_rpcs_in_flight=12
  The max_mod_rpcs_in_flight value must be
  strictly less than the max_rpcs_in_flight value.
  It must also be less than or equal to the MDT
  max_mod_rpcs_per_client value. If either of these
  conditions is not met, the setting fails and an explicit message
  is written in the Lustre log.
  The MDT max_mod_rpcs_per_client parameter is a
  tunable of the kernel module mdt that defines the
  maximum number of file system modifying RPCs in flight allowed per
  client. The parameter can be updated at runtime, but the change is
  effective for new client connections only. The default setting is 8.
  
  To set the max_mod_rpcs_per_client parameter,
  run the following command on the MDS:
  mds$ echo 12 > /sys/module/mdt/parameters/max_mod_rpcs_per_client
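+  To verify the values in effect, the parameters can be read back on the
+  client, and the module parameter checked on the MDS. The file system
+  name, instance identifier, and values below are illustrative only:
+client$ lctl get_param mdc.testfs-MDT0000*.max_rpcs_in_flight mdc.testfs-MDT0000*.max_mod_rpcs_in_flight
+mdc.testfs-MDT0000-mdc-ffff8804296c2800.max_rpcs_in_flight=16
+mdc.testfs-MDT0000-mdc-ffff8804296c2800.max_mod_rpcs_in_flight=12
+mds$ cat /sys/module/mdt/parameters/max_mod_rpcs_per_client
+12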
+
+ Monitoring the Client Metadata RPC Stream + The rpc_stats file contains histogram data + showing information about modify metadata RPCs. It can be helpful to + identify the level of parallelism achieved by an application doing + modify metadata operations. + Example: + client$ lctl get_param mdc.*.rpc_stats +snapshot_time: 1441876896.567070 (secs.usecs) +modify_RPCs_in_flight: 0 + + modify +rpcs in flight rpcs % cum % +0: 0 0 0 +1: 56 0 0 +2: 40 0 0 +3: 70 0 0 +4 41 0 0 +5: 51 0 1 +6: 88 0 1 +7: 366 1 2 +8: 1321 5 8 +9: 3624 15 23 +10: 6482 27 50 +11: 7321 30 81 +12: 4540 18 100 + The file information includes: + + + snapshot_time - UNIX epoch instant the + file was read. + + + modify_RPCs_in_flight - Number of modify + RPCs issued by the MDC, but not completed at the time of the + snapshot. This value should always be less than or equal to + max_mod_rpcs_in_flight. + + + rpcs in flight - Number of modify RPCs + that are pending when a RPC is sent, the relative percentage + (%) of total modify RPCs, and the cumulative + percentage (cum %) to that point. + + + If a large proportion of modify metadata RPCs are issued with a + number of pending metadata RPCs close to the + max_mod_rpcs_in_flight value, it means the + max_mod_rpcs_in_flight value could be increased to + improve the modify metadata performance. +
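+  The per-operation counters in the MDC stats file can
+  complement this histogram by showing which metadata operations dominate
+  the workload. The output below is only a sketch; the operation names and
+  values reported will vary:
+client$ lctl get_param mdc.testfs-MDT0000*.stats
+mdc.testfs-MDT0000-mdc-ffff8804296c2800.stats=
+snapshot_time             1441876896.567070 secs.usecs
+req_waittime              921342 samples [usec] 42 486921 169721398
+mds_getattr               36342 samples [usec] 55 28882 1894172
+mds_close                 45221 samples [usec] 31 19104 1573903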
+
Configuring Timeouts in a Lustre File System @@ -1594,23 +1859,26 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never
Interpreting Adaptive Timeout Information - Adaptive timeout information can be obtained from the timeouts - files in /proc/fs/lustre/*/ on each server and client using the - lctl command. To read information from a timeouts - file, enter a command similar to: + Adaptive timeout information can be obtained via + lctl get_param {osc,mdc}.*.timeouts files on each + client and lctl get_param {ost,mds}.*.*.timeouts + on each server. To read information from a + timeouts file, enter a command similar to: # lctl get_param -n ost.*.ost_io.timeouts -service : cur 33 worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2 - In this example, the ost_io service on this node is currently - reporting an estimated RPC service time of 33 seconds. The worst RPC service time was 34 - seconds, which occurred 26 minutes ago. - The output also provides a history of service times. Four "bins" of adaptive - timeout history are shown, with the maximum RPC time in each bin reported. In both the - 0-150s bin and the 150-300s bin, the maximum RPC time was 1. The 300-450s bin shows the - worst (maximum) RPC time at 33 seconds, and the 450-600s bin shows a maximum of RPC time - of 2 seconds. The estimated service time is the maximum value across the four bins (33 - seconds in this example). - Service times (as reported by the servers) are also tracked in the client OBDs, as - shown in this example: +service : cur 33 worst 34 (at 1193427052, 1600s ago) 1 1 33 2 + In this example, the ost_io service on this + node is currently reporting an estimated RPC service time of 33 + seconds. The worst RPC service time was 34 seconds, which occurred + 26 minutes ago. + The output also provides a history of service times. + Four "bins" of adaptive timeout history are shown, with the + maximum RPC time in each bin reported. In both the 0-150s bin and the + 150-300s bin, the maximum RPC time was 1. The 300-450s bin shows the + worst (maximum) RPC time at 33 seconds, and the 450-600s bin shows a + maximum of RPC time of 2 seconds. The estimated service time is the + maximum value in the four bins (33 seconds in this example). + Service times (as reported by the servers) are also tracked in + the client OBDs, as shown in this example: # lctl get_param osc.*.timeouts last reply : 1193428639, 0d0h00m00s ago network : cur 1 worst 2 (at 1193427053, 0d0h26m26s ago) 1 1 1 1 @@ -1619,10 +1887,11 @@ portal 28 : cur 1 worst 1 (at 1193426141, 0d0h41m38s ago) 1 1 1 1 portal 7 : cur 1 worst 1 (at 1193426141, 0d0h41m38s ago) 1 0 1 1 portal 17 : cur 1 worst 1 (at 1193426177, 0d0h41m02s ago) 1 0 0 1 - In this example, portal 6, the ost_io service portal, shows the - history of service estimates reported by the portal. - Server statistic files also show the range of estimates including min, max, sum, and - sumsq. For example: + In this example, portal 6, the ost_io service + portal, shows the history of service estimates reported by the portal. + + Server statistic files also show the range of estimates including + min, max, sum, and sum-squared. For example: # lctl get_param mdt.*.mdt.stats ... req_timeout 6 samples [sec] 1 10 15 105 @@ -1649,9 +1918,9 @@ req_timeout 6 samples [sec] 1 10 15 105 messages or enable printing of D_NETERROR messages to the console using:lctl set_param printk=+neterror Congested routers can be a source of spurious LND timeouts. 
To avoid this - situation, increase the number of LNET router buffers to reduce back-pressure and/or + situation, increase the number of LNet router buffers to reduce back-pressure and/or increase LND timeouts on all nodes on all connected networks. Also consider increasing - the total number of LNET router nodes in the system so that the aggregate router + the total number of LNet router nodes in the system so that the aggregate router bandwidth matches the aggregate server bandwidth. @@ -1740,15 +2009,17 @@ req_timeout 6 samples [sec] 1 10 15 105
<indexterm> <primary>proc</primary> - <secondary>LNET</secondary> + <secondary>LNet</secondary> </indexterm><indexterm> - <primary>LNET</primary> + <primary>LNet</primary> <secondary>proc</secondary> - </indexterm>Monitoring LNET - LNET information is located in /proc/sys/lnet in these files: + Monitoring LNet + LNet information is located via lctl get_param + in these parameters: + - peers - Shows all NIDs known to this node and provides - information on the queue state. + peers - Shows all NIDs known to this node + and provides information on the queue state. Example: # lctl get_param peers nid refs state max rtr min tx min queue @@ -1860,7 +2131,7 @@ nid refs state max rtr min tx min queue Credits are initialized to allow a certain number of operations (in the example - above the table, eight as shown in the max column. LNET keeps track + above the table, eight as shown in the max column. LNet keeps track of the minimum number of credits ever seen over time showing the peak congestion that has occurred during the time monitored. Fewer available credits indicates a more congested resource. @@ -1875,7 +2146,7 @@ nid refs state max rtr min tx min queue credits (rtr/tx) that is less than max indicates operations are in progress. If the ratio rtr/tx is greater than max, operations are blocking. - LNET also limits concurrent sends and number of router buffers allocated to a single + LNet also limits concurrent sends and number of router buffers allocated to a single peer so that no peer can occupy all these resources. @@ -1976,31 +2247,45 @@ nid refs peer max tx min
-
+
<indexterm> <primary>proc</primary> <secondary>free space</secondary> </indexterm>Allocating Free Space on OSTs - Free space is allocated using either a round-robin or a weighted algorithm. The allocation - method is determined by the maximum amount of free-space imbalance between the OSTs. When free - space is relatively balanced across OSTs, the faster round-robin allocator is used, which - maximizes network balancing. The weighted allocator is used when any two OSTs are out of - balance by more than a specified threshold. - Free space distribution can be tuned using these two /proc - tunables: + Free space is allocated using either a round-robin or a weighted + algorithm. The allocation method is determined by the maximum amount of + free-space imbalance between the OSTs. When free space is relatively + balanced across OSTs, the faster round-robin allocator is used, which + maximizes network balancing. The weighted allocator is used when any two + OSTs are out of balance by more than a specified threshold. + Free space distribution can be tuned using these two + tunable parameters: - qos_threshold_rr - The threshold at which the allocation method - switches from round-robin to weighted is set in this file. The default is to switch to the - weighted algorithm when any two OSTs are out of balance by more than 17 percent. + lod.*.qos_threshold_rr - The threshold at which + the allocation method switches from round-robin to weighted is set + in this file. The default is to switch to the weighted algorithm when + any two OSTs are out of balance by more than 17 percent. - qos_prio_free - The weighting priority used by the weighted - allocator can be adjusted in this file. Increasing the value of - qos_prio_free puts more weighting on the amount of free space - available on each OST and less on how stripes are distributed across OSTs. The default - value is 91 percent. When the free space priority is set to 100, weighting is based - entirely on free space and location is no longer used by the striping algorthm. + lod.*.qos_prio_free - The weighting priority + used by the weighted allocator can be adjusted in this file. Increasing + the value of qos_prio_free puts more weighting on the + amount of free space available on each OST and less on how stripes are + distributed across OSTs. The default value is 91 percent weighting for + free space rebalancing and 9 percent for OST balancing. When the + free space priority is set to 100, weighting is based entirely on free + space and location is no longer used by the striping algorithm. + + + osp.*.reserved_mb_low + - The low watermark used to stop object allocation if available space + is less than this. The default is 0.1% of total OST size. + + + osp.*.reserved_mb_high + - The high watermark used to start object allocation if available + space is more than this. The default is 0.2% of total OST size. For more information about monitoring and managing free space, see proc locking Configuring Locking - The lru_size parameter is used to control the number of client-side - locks in an LRU cached locks queue. LRU size is dynamic, based on load to optimize the number - of locks available to nodes that have different workloads (e.g., login/build nodes vs. compute - nodes vs. backup nodes). - The total number of locks available is a function of the server RAM. The default limit is - 50 locks/1 MB of RAM. If memory pressure is too high, the LRU size is shrunk. 
The number of
-  locks on the server is limited to the number of OSTs per
-  server * the number of clients * the value of the
-  lru_size
-  setting on the client as follows:
+  The lru_size parameter is used to control the
+  number of client-side locks in the LRU cached locks queue. LRU size is
+  normally dynamic, based on load to optimize the number of locks cached
+  on nodes that have different workloads (e.g., login/build nodes vs.
+  compute nodes vs. backup nodes).
+  The total number of locks available is a function of the server RAM.
+  The default limit is 50 locks/1 MB of RAM. If memory pressure is too high,
+  the LRU size is shrunk. The number of locks on the server is limited to
+  num_osts_per_oss * num_clients * lru_size
+  as follows:
     
-  To enable automatic LRU sizing, set the lru_size parameter to 0. In
-  this case, the lru_size parameter shows the current number of locks
-  being used on the export. LRU sizing is enabled by default.
+  To enable automatic LRU sizing, set the
+  lru_size parameter to 0. In this case, the
+  lru_size parameter shows the current number of locks
+  being used on the client. Dynamic LRU resizing is enabled by default.
+  
     
-  To specify a maximum number of locks, set the lru_size parameter to
-  a value other than zero but, normally, less than 100 * number of
-  CPUs in client. It is recommended that you only increase the LRU size on a
-  few login nodes where users access the file system interactively.
+  To specify a maximum number of locks, set the
+  lru_size parameter to a value other than zero.
+  A good default value for compute nodes is around
+  100 * num_cpus.
+  It is recommended that you only set lru_size
+  to be significantly larger on a few login nodes where multiple
+  users access the file system interactively.
     
-  To clear the LRU on a single client, and, as a result, flush client cache without changing
-  the lru_size value, run:
-  $ lctl set_param ldlm.namespaces.osc_name|mdc_name.lru_size=clear
-  If the LRU size is set to be less than the number of existing unused locks, the unused
-  locks are canceled immediately. Use echo clear to cancel all locks without
-  changing the value.
+  To clear the LRU on a single client, and, as a result, flush client
+  cache without changing the lru_size value, run:
+  # lctl set_param ldlm.namespaces.osc_name|mdc_name.lru_size=clear
+  If the LRU size is set lower than the number of existing locks,
+  unused locks are canceled immediately. Use
+  clear to cancel all locks without changing the value.
+  
     
-  The lru_size parameter can only be set temporarily using
-  lctl set_param; it cannot be set permanently.
+  The lru_size parameter can only be set
+  temporarily using lctl set_param; it cannot be set
+  permanently.
     
-  To disable LRU sizing, on the Lustre clients, run:
-  $ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))
-  Replace NR_CPU with the number of CPUs on
-  the node.
-  To determine the number of locks being granted, run:
+  To disable dynamic LRU resizing on the clients, run for example:
+  
+  # lctl set_param ldlm.namespaces.*osc*.lru_size=5000
+  To determine the number of locks being granted with dynamic LRU
+  resizing, run:
   $ lctl get_param ldlm.namespaces.*.pool.limit
+  The lru_max_age parameter is used to control the
+  age of client-side locks in the LRU cached locks queue. This limits how
+  long unused locks are cached on the client, and prevents idle clients
+  from holding locks for an excessive time, which reduces memory usage on
+  both the client and server, as well as reducing work during server
+  recovery. 
+ + The lru_max_age is set and printed in milliseconds, + and by default is 3900000 ms (65 minutes). + Since Lustre 2.11, in addition to setting the + maximum lock age in milliseconds, it can also be set using a suffix of + s or ms to indicate seconds or + milliseconds, respectively. For example to set the client's maximum + lock age to 15 minutes (900s) run: + + +# lctl set_param ldlm.namespaces.*MDT*.lru_max_age=900s +# lctl get_param ldlm.namespaces.*MDT*.lru_max_age +ldlm.namespaces.myth-MDT0000-mdc-ffff8804296c2800.lru_max_age=900000 +
<indexterm> @@ -2077,7 +2387,7 @@ nid refs peer max tx min </row> <row> <entry> - <literal> mdt.MDS.mds </literal> + <literal> mds.MDS.mdt </literal> </entry> <entry> <para>Main metadata operations service</para> @@ -2085,7 +2395,7 @@ nid refs peer max tx min </row> <row> <entry> - <literal> mdt.MDS.mds_readpage </literal> + <literal> mds.MDS.mdt_readpage </literal> </entry> <entry> <para>Metadata <literal>readdir</literal> service</para> @@ -2093,7 +2403,7 @@ nid refs peer max tx min </row> <row> <entry> - <literal> mdt.MDS.mds_setattr </literal> + <literal> mds.MDS.mdt_setattr </literal> </entry> <entry> <para>Metadata <literal>setattr/close</literal> operations service </para> @@ -2142,15 +2452,23 @@ nid refs peer max tx min </tbody> </tgroup> </informaltable> - <para>For each service, an entry as shown below is - created:<screen>/proc/fs/lustre/<replaceable>{service}</replaceable>/*/thread_<replaceable>{min,max,started}</replaceable></screen></para> - <para>To temporarily set this tunable, run:</para> - <screen># lctl <replaceable>{get,set}</replaceable>_param <replaceable>{service}</replaceable>.thread_<replaceable>{min,max,started}</replaceable> </screen> - <para>To permanently set this tunable, run:</para> - <screen># lctl conf_param <replaceable>{service}</replaceable>.thread_<replaceable>{min,max,started}</replaceable> </screen> - <para>The following examples show how to set thread counts and get the number of running threads - for the service <literal>ost_io</literal> using the tunable - <literal>{service}.thread_{min,max,started}</literal>.</para> + <para>For each service, tunable parameters as shown below are available. + </para> + <itemizedlist> + <listitem> + <para>To temporarily set these tunables, run:</para> + <screen># lctl set_param <replaceable>service</replaceable>.threads_<replaceable>min|max|started=num</replaceable> </screen> + </listitem> + <listitem> + <para>To permanently set this tunable, run:</para> + <screen># lctl conf_param <replaceable>obdname|fsname.obdtype</replaceable>.threads_<replaceable>min|max|started</replaceable> </screen> + <para condition='l25'>For version 2.5 or later, run: + <screen># lctl set_param -P <replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></screen></para> + </listitem> + </itemizedlist> + <para>The following examples show how to set thread counts and get the number of running threads + for the service <literal>ost_io</literal> using the tunable + <literal><replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></literal>.</para> <itemizedlist> <listitem> <para>To get the number of running threads, run:</para> @@ -2169,6 +2487,13 @@ ost.OSS.ost_io.threads_max=512</screen> ost.OSS.ost_io.threads_max=256</screen> </listitem> <listitem> + <para>To set the maximum thread count to 256 instead of 512 permanently, run:</para> + <screen># lctl conf_param testfs.ost.ost_io.threads_max=256</screen> + <para condition='l25'>For version 2.5 or later, run: + <screen># lctl set_param -P ost.OSS.ost_io.threads_max=256 +ost.OSS.ost_io.threads_max=256 </screen> </para> + </listitem> + <listitem> <para> To check if the <literal>threads_max</literal> setting is active, run:</para> <screen># lctl get_param ost.OSS.ost_io.threads_max ost.OSS.ost_io.threads_max=256</screen> @@ -2187,79 +2512,69 @@ ost.OSS.ost_io.threads_max=256</screen> <primary>proc</primary> <secondary>debug</secondary> </indexterm>Enabling and Interpreting Debugging Logs - By default, a detailed log of all operations is 
generated to aid in debugging. Flags that
-  control debugging are found in /proc/sys/lnet/debug.
-  The overhead of debugging can affect the performance of Lustre file system. Therefore, to
-  minimize the impact on performance, the debug level can be lowered, which affects the amount
-  of debugging information kept in the internal log buffer but does not alter the amount of
-  information to goes into syslog. You can raise the debug level when you need to collect logs
-  to debug problems.
-  The debugging mask can be set using "symbolic names". The symbolic format is
-  shown in the examples below.
+  By default, a detailed log of all operations is generated to aid in
+  debugging. Flags that control debugging are found via
+  lctl get_param debug.
+  The overhead of debugging can affect the performance of a Lustre file
+  system. Therefore, to minimize the impact on performance, the debug level
+  can be lowered, which affects the amount of debugging information kept in
+  the internal log buffer but does not alter the amount of information that
+  goes into syslog. You can raise the debug level when you need to collect
+  logs to debug problems.
+  The debugging mask can be set using "symbolic names". The
+  symbolic format is shown in the examples below.
+  
     
-  To verify the debug level used, examine the sysctl that controls
-  debugging by running:
-  # sysctl lnet.debug
-lnet.debug = ioctl neterror warning error emerg ha config console
+  To verify the debug level used, examine the parameter that
+  controls debugging by running:
+  # lctl get_param debug
+debug=
+ioctl neterror warning error emerg ha config console
     
     
-  To turn off debugging (except for network error debugging), run the following
-  command on all nodes concerned:
-  # sysctl -w lnet.debug="neterror"
-lnet.debug = neterror
+  To turn off debugging except for network error debugging, run
+  the following command on all nodes concerned:
+  # lctl set_param debug=neterror
+debug=neterror
     
-  
+  
+  
     
-  To turn off debugging completely, run the following command on all nodes
+  To turn off debugging completely (except for the minimum error
+  reporting to the console), run the following command on all nodes
   concerned:
-  # sysctl -w lnet.debug=0
-lnet.debug = 0
+  # lctl set_param debug=0
+debug=0
     
     
-  To set an appropriate debug level for a production environment, run:
-  # sysctl -w lnet.debug="warning dlmtrace error emerg ha rpctrace vfstrace"
-lnet.debug = warning dlmtrace error emerg ha rpctrace vfstrace
-  The flags shown in this example collect enough high-level information to aid
-  debugging, but they do not cause any serious performance impact.
+  To set an appropriate debug level for a production environment,
+  run:
+  # lctl set_param debug="warning dlmtrace error emerg ha rpctrace vfstrace"
+debug=warning dlmtrace error emerg ha rpctrace vfstrace
+  The flags shown in this example collect enough high-level
+  information to aid debugging, but they do not cause any serious
+  performance impact. 
- - - To clear all flags and set new flags, run: - # sysctl -w lnet.debug="warning" -lnet.debug = warning - - + + - To add new flags to flags that have already been set, precede each one with a - "+": - # sysctl -w lnet.debug="+neterror +ha" -lnet.debug = +neterror +ha -# sysctl lnet.debug -lnet.debug = neterror warning ha + To add new flags to flags that have already been set, + precede each one with a "+": + # lctl set_param debug="+neterror +ha" +debug=+neterror +ha +# lctl get_param debug +debug=neterror warning error emerg ha console To remove individual flags, precede them with a "-": - # sysctl -w lnet.debug="-ha" -lnet.debug = -ha -# sysctl lnet.debug -lnet.debug = neterror warning + # lctl set_param debug="-ha" +debug=-ha +# lctl get_param debug +debug=neterror warning error emerg console - - To verify or change the debug level, run commands such as the following: : - # lctl get_param debug -debug= -neterror warning -# lctl set_param debug=+ha -# lctl get_param debug -debug= -neterror warning ha -# lctl set_param debug=-warning -# lctl get_param debug -debug= -neterror ha - - + + Debugging parameters include: @@ -2271,7 +2586,7 @@ neterror ha /tmp/lustre-log. - These parameters are also set using:sysctl -w lnet.debug={value} + These parameters can also be set using:sysctl -w lnet.debug={value} Additional useful parameters: panic_on_lbug - Causes ''panic'' to be called @@ -2307,11 +2622,12 @@ ost_set_info 1 obd_ping 212 Use the llstat utility to monitor statistics over time. To clear the statistics, use the -c option to - llstat. To specify how frequently the statistics should be reported (in - seconds), use the -i option. In the example below, the - -c option clears the statistics and -i10 option - reports statistics every 10 seconds: - $ llstat -c -i10 /proc/fs/lustre/ost/OSS/ost_io/stats + llstat. To specify how frequently the statistics + should be reported (in seconds), use the -i option. + In the example below, the -c option clears the + statistics and -i10 option reports statistics every + 10 seconds: +$ llstat -c -i10 ost_io /usr/bin/llstat: STATS on 06/06/07 /proc/fs/lustre/ost/OSS/ost_io/ stats on 192.168.16.35@tcp @@ -2555,8 +2871,9 @@ ost_write 21 2 59 [bytes] 7648424 15019 332725.08 910694 180397.87 See also (llobdstat) and (collectl). - MDT stats files can be used to track MDT statistics for the MDS. The - example below shows sample output from an MDT stats file. + MDT stats files can be used to track MDT + statistics for the MDS. The example below shows sample output from an + MDT stats file. # lctl get_param mds.*-MDT0000.stats snapshot_time 1244832003.676892 secs.usecs open 2 samples [reqs]