<?xml version='1.0' encoding='UTF-8'?>
-<chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0"
- xml:lang="en-US" xml:id="lustreproc">
+<chapter xmlns="http://docbook.org/ns/docbook"
+ xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
+ xml:id="lustreproc">
<title xml:id="lustreproc.title">Lustre Parameters</title>
- <para>The <literal>/proc</literal> and <literal>/sys</literal> file systems
- acts as an interface to internal data structures in the kernel. This chapter
- describes parameters and tunables that are useful for optimizing and
- monitoring aspects of a Lustre file system. It includes these sections:</para>
+  <para>Lustre has many parameters that can be used to tune client and server
+  performance, change the behavior of the system, and report statistics about
+  various subsystems. This chapter describes the various parameters and
+ tunables that are useful for optimizing and monitoring aspects of a Lustre
+ file system. It includes these sections:</para>
<itemizedlist>
<listitem>
<para><xref linkend="dbdoclet.50438271_83523"/></para>
</para>
  <para>Typically, metrics are accessed via <literal>lctl get_param</literal>
  and settings are changed via <literal>lctl set_param</literal>.
+ They allow getting and setting multiple parameters with a single command,
+  through the use of wildcards in one or more parts of the parameter name.
+ While each of these parameters maps to files in <literal>/proc</literal>
+ and <literal>/sys</literal> directly, the location of these parameters may
+ change between Lustre releases, so it is recommended to always use
+ <literal>lctl</literal> to access the parameters from userspace scripts.
Some data is server-only, some data is client-only, and some data is
exported from the client to the server and is thus duplicated in both
locations.</para>
<para>In the examples in this chapter, <literal>#</literal> indicates
a command is entered as root. Lustre servers are named according to the
convention <literal><replaceable>fsname</replaceable>-<replaceable>MDT|OSTnumber</replaceable></literal>.
- The standard UNIX wildcard designation (*) is used.</para>
+ The standard UNIX wildcard designation (*) is used to represent any
+ part of a single component of the parameter name, excluding
+ "<literal>.</literal>" and "<literal>/</literal>".
+  It is also possible to use brace <literal>{}</literal> expansion
+ to specify a list of parameter names efficiently.</para>
</note>
<para>Some examples are shown below:</para>
<itemizedlist>
<listitem>
- <para> To obtain data from a Lustre client:</para>
- <screen># lctl list_param osc.*
-osc.testfs-OST0000-osc-ffff881071d5cc00
-osc.testfs-OST0001-osc-ffff881071d5cc00
-osc.testfs-OST0002-osc-ffff881071d5cc00
-osc.testfs-OST0003-osc-ffff881071d5cc00
-osc.testfs-OST0004-osc-ffff881071d5cc00
-osc.testfs-OST0005-osc-ffff881071d5cc00
-osc.testfs-OST0006-osc-ffff881071d5cc00
-osc.testfs-OST0007-osc-ffff881071d5cc00
-osc.testfs-OST0008-osc-ffff881071d5cc00</screen>
+ <para> To list available OST targets on a Lustre client:</para>
+ <screen># lctl list_param -F osc.*
+osc.testfs-OST0000-osc-ffff881071d5cc00/
+osc.testfs-OST0001-osc-ffff881071d5cc00/
+osc.testfs-OST0002-osc-ffff881071d5cc00/
+osc.testfs-OST0003-osc-ffff881071d5cc00/
+osc.testfs-OST0004-osc-ffff881071d5cc00/
+osc.testfs-OST0005-osc-ffff881071d5cc00/
+osc.testfs-OST0006-osc-ffff881071d5cc00/
+osc.testfs-OST0007-osc-ffff881071d5cc00/
+osc.testfs-OST0008-osc-ffff881071d5cc00/</screen>
<para>In this example, information about OST connections available
- on a client is displayed (indicated by "osc").</para>
+ on a client is displayed (indicated by "osc"). Each of these
+ connections may have numerous sub-parameters as well.</para>
</listitem>
</itemizedlist>
<itemizedlist>
</itemizedlist>
<itemizedlist>
<listitem>
+ <para> To see a specific subset of parameters, use braces, like:
+<screen># lctl list_param osc.*.{checksum,connect}*
+osc.testfs-OST0000-osc-ffff881071d5cc00.checksum_type
+osc.testfs-OST0000-osc-ffff881071d5cc00.checksums
+osc.testfs-OST0000-osc-ffff881071d5cc00.connect_flags
+</screen></para>
+ </listitem>
+ </itemizedlist>
+ <itemizedlist>
+ <listitem>
<para> To view a specific file, use <literal>lctl get_param</literal>:
<screen># lctl get_param osc.lustre-OST0000*.rpc_stats</screen></para>
</listitem>
version and the Lustre version being used. The <literal>lctl</literal>
command insulates scripts from these changes and is preferred over direct
  file access, except when used as part of a high-performance monitoring system.
- In the <literal>cat</literal> command:</para>
- <itemizedlist>
- <listitem>
- <para>Replace the dots in the path with slashes.</para>
- </listitem>
- <listitem>
- <para>Prepend the path with the following as appropriate:
- <screen>/{proc,sys}/{fs,sys}/{lustre,lnet}</screen></para>
- </listitem>
- </itemizedlist>
- <para>For example, an <literal>lctl get_param</literal> command may look like
- this:<screen># lctl get_param osc.*.uuid
-osc.testfs-OST0000-osc-ffff881071d5cc00.uuid=594db456-0685-bd16-f59b-e72ee90e9819
-osc.testfs-OST0001-osc-ffff881071d5cc00.uuid=594db456-0685-bd16-f59b-e72ee90e9819
-...</screen></para>
- <para>The equivalent <literal>cat</literal> command may look like this:
- <screen># cat /proc/fs/lustre/osc/*/uuid
-594db456-0685-bd16-f59b-e72ee90e9819
-594db456-0685-bd16-f59b-e72ee90e9819
-...</screen></para>
- <para>or like this:
- <screen># cat /sys/fs/lustre/osc/*/uuid
-594db456-0685-bd16-f59b-e72ee90e9819
-594db456-0685-bd16-f59b-e72ee90e9819
-...</screen></para>
+ </para>
+  <note condition='l2c'><para>Starting in Lustre 2.12, the
+  <literal>lctl get_param</literal> and <literal>lctl set_param</literal>
+  commands can provide <emphasis>tab completion</emphasis> when using an
+ interactive shell with <literal>bash-completion</literal> installed.
+ This simplifies the use of <literal>get_param</literal> significantly,
+ since it provides an interactive list of available parameters.
+ </para></note>
<para>The <literal>llstat</literal> utility can be used to monitor some
Lustre file system I/O activity over a specified time period. For more
details, see
# hash ldlm_stats stats uuid</screen></para>
<section remap="h3">
<title>Identifying Lustre File Systems and Servers</title>
- <para>Several <literal>/proc</literal> files on the MGS list existing
+ <para>Several parameter files on the MGS list existing
Lustre file systems and file system servers. The examples below are for
a Lustre file system called
<literal>testfs</literal> with one MDT and three OSTs.</para>
notify_count: 4</screen>
</listitem>
<listitem>
- <para>To view the names of all live servers in the file system as listed in
- <literal>/proc/fs/lustre/devices</literal>, enter:</para>
+ <para>To list all configured devices on the local node, enter:</para>
<screen># lctl device_list
0 UP mgs MGS MGS 11
1 UP mgc MGC192.168.10.34@tcp 1f45bb57-d9be-2ddb-c0b0-5431a49226705
<row>
<entry>
<para>
- <literal>mb_prealloc_table</literal></para>
+ <literal>prealloc_table</literal></para>
</entry>
<entry>
<para>A table of values used to preallocate space when a new request is received. By
</tgroup>
</informaltable>
<para>Buddy group cache information found in
- <literal>/proc/fs/ldiskfs/<replaceable>disk_device</replaceable>/mb_groups</literal> may
+ <literal>/sys/fs/ldiskfs/<replaceable>disk_device</replaceable>/mb_groups</literal> may
be useful for assessing on-disk fragmentation. For
example:<screen>cat /sys/fs/ldiskfs/loop0/mb_groups
#group: free free frags first pa [ 2^0 2^1 2^2 2^3 2^4 2^5 2^6 2^7 2^8 2^9
</itemizedlist></para>
</section>
<section>
- <title>Monitoring Lustre File System I/O</title>
+ <title>Monitoring Lustre File System I/O</title>
<para>A number of system utilities are provided to enable collection of data related to I/O
activity in a Lustre file system. In general, the data collected describes:</para>
<itemizedlist>
<primary>proc</primary>
<secondary>block I/O</secondary>
</indexterm>Monitoring the OST Block I/O Stream</title>
- <para>The <literal>brw_stats</literal> file in the <literal>obdfilter</literal> directory
- contains histogram data showing statistics for number of I/O requests sent to the disk,
- their size, and whether they are contiguous on the disk or not.</para>
+ <para>The <literal>brw_stats</literal> parameter file below the
+ <literal>osd-ldiskfs</literal> or <literal>osd-zfs</literal> directory
+ contains histogram data showing statistics for number of I/O requests
+ sent to the disk, their size, and whether they are contiguous on the
+ disk or not.</para>
<para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
- <para>Enter on the OSS:</para>
- <screen># lctl get_param obdfilter.testfs-OST0000.brw_stats
+ <para>Enter on the OSS or MDS:</para>
+ <screen>oss# lctl get_param osd-*.*.brw_stats
snapshot_time: 1372775039.769045 (secs.usecs)
read | write
pages per bulk r/w rpcs % cum % | rpcs % cum %
512K: 0 0 100 | 24 0 0
1M: 0 0 100 | 23142 99 100
</screen>
- <para>The tabular data is described in the table below. Each row in the table shows the number
- of reads and writes occurring for the statistic (<literal>ios</literal>), the relative
- percentage of total reads or writes (<literal>%</literal>), and the cumulative percentage to
- that point in the table for the statistic (<literal>cum %</literal>). </para>
+ <para>The tabular data is described in the table below. Each row in the
+ table shows the number of reads and writes occurring for the statistic
+ (<literal>ios</literal>), the relative percentage of total reads or
+ writes (<literal>%</literal>), and the cumulative percentage to that
+ point in the table for the statistic (<literal>cum %</literal>). </para>
<informaltable frame="all">
<tgroup cols="2">
<colspec colname="c1" colwidth="40*"/>
</section>
<section>
<title>Tuning Lustre File System I/O</title>
- <para>Each OSC has its own tree of tunables. For example:</para>
- <screen>$ ls -d /proc/fs/testfs/osc/OSC_client_ost1_MNT_client_2 /localhost
-/proc/fs/testfs/osc/OSC_uml0_ost1_MNT_localhost
-/proc/fs/testfs/osc/OSC_uml0_ost2_MNT_localhost
-/proc/fs/testfs/osc/OSC_uml0_ost3_MNT_localhost
-
-$ ls /proc/fs/testfs/osc/OSC_uml0_ost1_MNT_localhost
-blocksizefilesfree max_dirty_mb ost_server_uuid stats
-
-...</screen>
- <para>The following sections describe some of the parameters that can be tuned in a Lustre file
- system.</para>
+ <para>Each OSC has its own tree of tunables. For example:</para>
+  <screen>$ lctl list_param osc.*.*
+osc.myth-OST0000-osc-ffff8804296c2800.active
+osc.myth-OST0000-osc-ffff8804296c2800.blocksize
+osc.myth-OST0000-osc-ffff8804296c2800.checksum_dump
+osc.myth-OST0000-osc-ffff8804296c2800.checksum_type
+osc.myth-OST0000-osc-ffff8804296c2800.checksums
+osc.myth-OST0000-osc-ffff8804296c2800.connect_flags
+:
+:
+osc.myth-OST0000-osc-ffff8804296c2800.state
+osc.myth-OST0000-osc-ffff8804296c2800.stats
+osc.myth-OST0000-osc-ffff8804296c2800.timeouts
+osc.myth-OST0000-osc-ffff8804296c2800.unstable_stats
+osc.myth-OST0000-osc-ffff8804296c2800.uuid
+osc.myth-OST0001-osc-ffff8804296c2800.active
+osc.myth-OST0001-osc-ffff8804296c2800.blocksize
+osc.myth-OST0001-osc-ffff8804296c2800.checksum_dump
+osc.myth-OST0001-osc-ffff8804296c2800.checksum_type
+:
+:
+</screen>
+ <para>The following sections describe some of the parameters that can
+ be tuned in a Lustre file system.</para>
<section remap="h3" xml:id="TuningClientIORPCStream">
<title><indexterm>
<primary>proc</primary>
<secondary>RPC tunables</secondary>
</indexterm>Tuning the Client I/O RPC Stream</title>
- <para>Ideally, an optimal amount of data is packed into each I/O RPC and a consistent number
- of issued RPCs are in progress at any time. To help optimize the client I/O RPC stream,
- several tuning variables are provided to adjust behavior according to network conditions and
- cluster size. For information about monitoring the client I/O RPC stream, see <xref
+ <para>Ideally, an optimal amount of data is packed into each I/O RPC
+ and a consistent number of issued RPCs are in progress at any time.
+ To help optimize the client I/O RPC stream, several tuning variables
+ are provided to adjust behavior according to network conditions and
+ cluster size. For information about monitoring the client I/O RPC
+ stream, see <xref
xmlns:xlink="http://www.w3.org/1999/xlink" linkend="MonitoringClientRCPStream"/>.</para>
<para>RPC stream tunables include:</para>
<para>
<itemizedlist>
<listitem>
- <para><literal>osc.<replaceable>osc_instance</replaceable>.max_dirty_mb</literal> -
- Controls how many MBs of dirty data can be written and queued up in the OSC. POSIX
- file writes that are cached contribute to this count. When the limit is reached,
- additional writes stall until previously-cached writes are written to the server. This
- may be changed by writing a single ASCII integer to the file. Only values between 0
- and 2048 or 1/4 of RAM are allowable. If 0 is specified, no writes are cached.
- Performance suffers noticeably unless you use large writes (1 MB or more).</para>
- <para>To maximize performance, the value for <literal>max_dirty_mb</literal> is
- recommended to be 4 * <literal>max_pages_per_rpc </literal>*
- <literal>max_rpcs_in_flight</literal>.</para>
+ <para><literal>osc.<replaceable>osc_instance</replaceable>.checksums</literal>
+ - Controls whether the client will calculate data integrity
+ checksums for the bulk data transferred to the OST. Data
+ integrity checksums are enabled by default. The algorithm used
+ can be set using the <literal>checksum_type</literal> parameter.
+ </para>
+ </listitem>
+ <listitem>
+ <para><literal>osc.<replaceable>osc_instance</replaceable>.checksum_type</literal>
+ - Controls the data integrity checksum algorithm used by the
+          client. The available algorithms are determined by the set of
+          algorithms supported by both the client and the OST.
+          The checksum algorithm used by default is determined
+ by first selecting the fastest algorithms available on the OST,
+ and then selecting the fastest of those algorithms on the client,
+ which depends on available optimizations in the CPU hardware and
+ kernel. The default algorithm can be overridden by writing the
+ algorithm name into the <literal>checksum_type</literal>
+ parameter. Available checksum types can be seen on the client by
+ reading the <literal>checksum_type</literal> parameter. Currently
+ supported checksum types are:
+ <literal>adler</literal>,
+ <literal>crc32</literal>,
+ <literal>crc32c</literal>
+ </para>
+ <para condition="l2C">
+ In Lustre release 2.12 additional checksum types were added to
+ allow end-to-end checksum integration with T10-PI capable
+ hardware. The client will compute the appropriate checksum
+ type, based on the checksum type used by the storage, for the
+ RPC checksum, which will be verified by the server and passed
+ on to the storage. The T10-PI checksum types are:
+ <literal>t10ip512</literal>,
+ <literal>t10ip4K</literal>,
+ <literal>t10crc512</literal>,
+ <literal>t10crc4K</literal>
+ </para>
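+          <para>For example, to see which checksum algorithms a client
+          supports and which one is currently selected (shown in square
+          brackets), and then to switch to the <literal>adler</literal>
+          algorithm, commands similar to the following could be used.
+          The file system name <literal>testfs</literal> and the output
+          shown are illustrative only:</para>
+          <screen>client# lctl get_param osc.testfs-OST0000*.checksum_type
+osc.testfs-OST0000-osc-ffff881071d5cc00.checksum_type=crc32 adler [crc32c]
+client# lctl set_param osc.testfs-*.checksum_type=adler</screen>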
+ </listitem>
+ <listitem>
+ <para><literal>osc.<replaceable>osc_instance</replaceable>.max_dirty_mb</literal>
+ - Controls how many MiB of dirty data can be written into the
+ client pagecache for writes by <emphasis>each</emphasis> OSC.
+ When this limit is reached, additional writes block until
+ previously-cached data is written to the server. This may be
+ changed by the <literal>lctl set_param</literal> command. Only
+ values larger than 0 and smaller than the lesser of 2048 MiB or
+          1/4 of client RAM are valid. Performance can suffer if the
+          client cannot aggregate enough data per OSC to form a full RPC
+          (as set by the <literal>max_pages_per_rpc</literal> parameter),
+          unless the application is doing very large writes itself.
+ </para>
+ <para>To maximize performance, the value for
+ <literal>max_dirty_mb</literal> is recommended to be at least
+ 4 * <literal>max_pages_per_rpc</literal> *
+ <literal>max_rpcs_in_flight</literal>.
+ </para>
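+          <para>For example, with the default
+          <literal>max_pages_per_rpc=4M</literal> and
+          <literal>max_rpcs_in_flight=8</literal>, this guideline suggests
+          at least 4 * 4 MiB * 8 = 128 MiB of dirty data per OSC. The
+          file system name <literal>testfs</literal> below is illustrative
+          only:</para>
+          <screen>client# lctl set_param osc.testfs-OST*.max_dirty_mb=128</screen>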
</listitem>
<listitem>
- <para><literal>osc.<replaceable>osc_instance</replaceable>.cur_dirty_bytes</literal> - A
- read-only value that returns the current number of bytes written and cached on this
- OSC.</para>
+ <para><literal>osc.<replaceable>osc_instance</replaceable>.cur_dirty_bytes</literal>
+ - A read-only value that returns the current number of bytes
+ written and cached by this OSC.
+ </para>
</listitem>
<listitem>
- <para><literal>osc.<replaceable>osc_instance</replaceable>.max_pages_per_rpc</literal> -
- The maximum number of pages that will undergo I/O in a single RPC to the OST. The
- minimum setting is a single page and the maximum setting is 1024 (for systems with a
- <literal>PAGE_SIZE</literal> of 4 KB), with the default maximum of 1 MB in the RPC.
- It is also possible to specify a units suffix (e.g. <literal>4M</literal>), so that
- the RPC size can be specified independently of the client
- <literal>PAGE_SIZE</literal>.</para>
+ <para><literal>osc.<replaceable>osc_instance</replaceable>.max_pages_per_rpc</literal>
+ - The maximum number of pages that will be sent in a single RPC
+ request to the OST. The minimum value is one page and the maximum
+ value is 16 MiB (4096 on systems with <literal>PAGE_SIZE</literal>
+ of 4 KiB), with the default value of 4 MiB in one RPC. The upper
+          limit may also be constrained by the <literal>ofd.*.brw_size</literal>
+          setting on the OSS, which applies to all clients connected to that
+ OST. It is also possible to specify a units suffix (e.g.
+ <literal>max_pages_per_rpc=4M</literal>), so the RPC size can be
+ set independently of the client <literal>PAGE_SIZE</literal>.
+ </para>
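+          <para>As an illustration only (assuming a file system named
+          <literal>testfs</literal> and an OSS <literal>brw_size</literal>
+          large enough to allow it), the current RPC size could be checked
+          and then raised to 16 MiB with commands similar to:</para>
+          <screen>client# lctl get_param osc.testfs-OST0000*.max_pages_per_rpc
+osc.testfs-OST0000-osc-ffff881071d5cc00.max_pages_per_rpc=1024
+client# lctl set_param osc.testfs-OST*.max_pages_per_rpc=16M</screen>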
</listitem>
<listitem>
<para><literal>osc.<replaceable>osc_instance</replaceable>.max_rpcs_in_flight</literal>
- - The maximum number of concurrent RPCs in flight from an OSC to its OST. If the OSC
- tries to initiate an RPC but finds that it already has the same number of RPCs
- outstanding, it will wait to issue further RPCs until some complete. The minimum
- setting is 1 and maximum setting is 256. </para>
+ - The maximum number of concurrent RPCs in flight from an OSC to
+ its OST. If the OSC tries to initiate an RPC but finds that it
+ already has the same number of RPCs outstanding, it will wait to
+ issue further RPCs until some complete. The minimum setting is 1
+ and maximum setting is 256. The default value is 8 RPCs.
+ </para>
<para>To improve small file I/O performance, increase the
- <literal>max_rpcs_in_flight</literal> value.</para>
+ <literal>max_rpcs_in_flight</literal> value.
+ </para>
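+          <para>For example, to allow 32 concurrent RPCs to each OST of a
+          hypothetical <literal>testfs</literal> file system, a command
+          similar to the following could be used:</para>
+          <screen>client# lctl set_param osc.testfs-OST*.max_rpcs_in_flight=32</screen>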
</listitem>
<listitem>
- <para><literal>llite.<replaceable>fsname-instance</replaceable>/max_cache_mb</literal> -
- Maximum amount of inactive data cached by the client (default is 3/4 of RAM). For
- example:</para>
- <screen># lctl get_param llite.testfs-ce63ca00.max_cached_mb
-128</screen>
+ <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_cached_mb</literal>
+ - Maximum amount of read+write data cached by the client. The
+ default value is 1/2 of the client RAM.
+ </para>
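+          <para>For example, to limit the Lustre page cache on a client to
+          16 GiB for a hypothetical <literal>testfs</literal> file system,
+          a command similar to the following could be used:</para>
+          <screen>client# lctl set_param llite.testfs-*.max_cached_mb=16384</screen>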
</listitem>
</itemizedlist>
</para>
<note>
- <para>The value for <literal><replaceable>osc_instance</replaceable></literal> is typically
- <literal><replaceable>fsname</replaceable>-OST<replaceable>ost_index</replaceable>-osc-<replaceable>mountpoint_instance</replaceable></literal>,
- where the value for <literal><replaceable>mountpoint_instance</replaceable></literal> is
- unique to each mount point to allow associating osc, mdc, lov, lmv, and llite parameters
- with the same mount point. For
- example:<screen>lctl get_param osc.testfs-OST0000-osc-ffff88107412f400.rpc_stats
+ <para>The value for <literal><replaceable>osc_instance</replaceable></literal>
+ and <literal><replaceable>fsname_instance</replaceable></literal>
+ are unique to each mount point to allow associating osc, mdc, lov,
+ lmv, and llite parameters with the same mount point. However, it is
+ common for scripts to use a wildcard <literal>*</literal> or a
+ filesystem-specific wildcard
+ <literal><replaceable>fsname-*</replaceable></literal> to specify
+ the parameter settings uniformly on all clients. For example:
+<screen>
+client$ lctl get_param osc.testfs-OST0000*.rpc_stats
osc.testfs-OST0000-osc-ffff88107412f400.rpc_stats=
snapshot_time: 1375743284.337839 (secs.usecs)
read RPCs in flight: 0
</screen></para>
</note>
</section>
- <section remap="h3">
+ <section remap="h3" xml:id="TuningClientReadahead">
<title><indexterm>
<primary>proc</primary>
<secondary>readahead</secondary>
<section remap="h4">
<title>Tuning File Readahead</title>
<para>File readahead is triggered when two or more sequential reads
- by an application fail to be satisfied by data in the Linux buffer
- cache. The size of the initial readahead is 1 MB. Additional
- readaheads grow linearly and increment until the readahead cache on
- the client is full at 40 MB.</para>
+ by an application fail to be satisfied by data in the Linux buffer
+ cache. The size of the initial readahead is determined by the RPC
+ size and the file stripe size, but will typically be at least 1 MiB.
+ Additional readaheads grow linearly and increment until the per-file
+ or per-system readahead cache limit on the client is reached.</para>
<para>Readahead tunables include:</para>
<itemizedlist>
<listitem>
- <para><literal>llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_mb</literal> -
- Controls the maximum amount of data readahead on a file.
- Files are read ahead in RPC-sized chunks (1 MB or the size of
+ <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_read_ahead_mb</literal>
+ - Controls the maximum amount of data readahead on a file.
+ Files are read ahead in RPC-sized chunks (4 MiB, or the size of
the <literal>read()</literal> call, if larger) after the second
sequential read on a file descriptor. Random reads are done at
the size of the <literal>read()</literal> call only (no
readahead). Reads to non-contiguous regions of the file reset
- the readahead algorithm, and readahead is not triggered again
- until sequential reads take place again.
+ the readahead algorithm, and readahead is not triggered until
+ sequential reads take place again.
+ </para>
+ <para>
+ This is the global limit for all files and cannot be larger than
+ 1/2 of the client RAM. To disable readahead, set
+ <literal>max_read_ahead_mb=0</literal>.
</para>
- <para>To disable readahead, set
- <literal>max_read_ahead_mb=0</literal>. The default value is 40 MB.
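+            <para>For example, to limit the total readahead cache on a
+            client to 256 MiB for a hypothetical <literal>testfs</literal>
+            file system, a command similar to the following could be used:
+            </para>
+            <screen>client# lctl set_param llite.testfs-*.max_read_ahead_mb=256</screen>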
+ </listitem>
+ <listitem>
+ <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_read_ahead_per_file_mb</literal>
+ - Controls the maximum number of megabytes (MiB) of data that
+ should be prefetched by the client when sequential reads are
+ detected on a file. This is the per-file readahead limit and
+ cannot be larger than <literal>max_read_ahead_mb</literal>.
</para>
</listitem>
<listitem>
- <para><literal>llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_whole_mb</literal> -
- Controls the maximum size of a file that is read in its entirety,
- regardless of the size of the <literal>read()</literal>. This
- avoids multiple small read RPCs on relatively small files, when
- it is not possible to efficiently detect a sequential read
- pattern before the whole file has been read.
+ <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_read_ahead_whole_mb</literal>
+ - Controls the maximum size of a file in MiB that is read in its
+ entirety upon access, regardless of the size of the
+ <literal>read()</literal> call. This avoids multiple small read
+ RPCs on relatively small files, when it is not possible to
+ efficiently detect a sequential read pattern before the whole
+ file has been read.
+ </para>
+ <para>The default value is the greater of 2 MiB or the size of one
+ RPC, as given by <literal>max_pages_per_rpc</literal>.
</para>
</listitem>
</itemizedlist>
<title><indexterm>
<primary>proc</primary>
<secondary>read cache</secondary>
- </indexterm>Tuning OSS Read Cache</title>
- <para>The OSS read cache feature provides read-only caching of data on an OSS. This
- functionality uses the Linux page cache to store the data and uses as much physical memory
+ </indexterm>Tuning Server Read Cache</title>
+ <para>The server read cache feature provides read-only caching of file
+ data on an OSS or MDS (for Data-on-MDT). This functionality uses the
+ Linux page cache to store the data and uses as much physical memory
as is allocated.</para>
- <para>OSS read cache improves Lustre file system performance in these situations:</para>
+  <para>The server read cache can improve Lustre file system performance
+ in these situations:</para>
<itemizedlist>
<listitem>
- <para>Many clients are accessing the same data set (as in HPC applications or when
- diskless clients boot from the Lustre file system).</para>
+ <para>Many clients are accessing the same data set (as in HPC
+ applications or when diskless clients boot from the Lustre file
+ system).</para>
</listitem>
<listitem>
- <para>One client is storing data while another client is reading it (i.e., clients are
- exchanging data via the OST).</para>
+ <para>One client is writing data while another client is reading
+ it (i.e., clients are exchanging data via the filesystem).</para>
</listitem>
<listitem>
<para>A client has very limited caching of its own.</para>
</listitem>
</itemizedlist>
- <para>OSS read cache offers these benefits:</para>
+ <para>The server read cache offers these benefits:</para>
<itemizedlist>
<listitem>
- <para>Allows OSTs to cache read data more frequently.</para>
+ <para>Allows servers to cache read data more frequently.</para>
</listitem>
<listitem>
- <para>Improves repeated reads to match network speeds instead of disk speeds.</para>
+ <para>Improves repeated reads to match network speeds instead of
+ storage speeds.</para>
</listitem>
<listitem>
- <para>Provides the building blocks for OST write cache (small-write aggregation).</para>
+ <para>Provides the building blocks for server write cache
+ (small-write aggregation).</para>
</listitem>
</itemizedlist>
<section remap="h4">
- <title>Using OSS Read Cache</title>
- <para>OSS read cache is implemented on the OSS, and does not require any special support on
- the client side. Since OSS read cache uses the memory available in the Linux page cache,
- the appropriate amount of memory for the cache should be determined based on I/O patterns;
- if the data is mostly reads, then more cache is required than would be needed for mostly
- writes.</para>
- <para>OSS read cache is managed using the following tunables:</para>
+ <title>Using Server Read Cache</title>
+ <para>The server read cache is implemented on the OSS and MDS, and does
+ not require any special support on the client side. Since the server
+ read cache uses the memory available in the Linux page cache, the
+ appropriate amount of memory for the cache should be determined based
+ on I/O patterns. If the data is mostly reads, then more cache is
+ beneficial on the server than would be needed for mostly writes.
+ </para>
+ <para>The server read cache is managed using the following tunables.
+ Many tunables are available for both <literal>osd-ldiskfs</literal>
+ and <literal>osd-zfs</literal>, but in some cases the implementation
+ of <literal>osd-zfs</literal> prevents their use.</para>
<itemizedlist>
<listitem>
- <para><literal>read_cache_enable</literal> - Controls whether data read from disk during
- a read request is kept in memory and available for later read requests for the same
- data, without having to re-read it from disk. By default, read cache is enabled
- (<literal>read_cache_enable=1</literal>).</para>
- <para>When the OSS receives a read request from a client, it reads data from disk into
- its memory and sends the data as a reply to the request. If read cache is enabled,
- this data stays in memory after the request from the client has been fulfilled. When
- subsequent read requests for the same data are received, the OSS skips reading data
- from disk and the request is fulfilled from the cached data. The read cache is managed
- by the Linux kernel globally across all OSTs on that OSS so that the least recently
- used cache pages are dropped from memory when the amount of free memory is running
- low.</para>
- <para>If read cache is disabled (<literal>read_cache_enable=0</literal>), the OSS
- discards the data after a read request from the client is serviced and, for subsequent
- read requests, the OSS again reads the data from disk.</para>
- <para>To disable read cache on all the OSTs of an OSS, run:</para>
- <screen>root@oss1# lctl set_param obdfilter.*.read_cache_enable=0</screen>
- <para>To re-enable read cache on one OST, run:</para>
- <screen>root@oss1# lctl set_param obdfilter.{OST_name}.read_cache_enable=1</screen>
- <para>To check if read cache is enabled on all OSTs on an OSS, run:</para>
- <screen>root@oss1# lctl get_param obdfilter.*.read_cache_enable</screen>
+ <para><literal>read_cache_enable</literal> - High-level control of
+ whether data read from storage during a read request is kept in
+ memory and available for later read requests for the same data,
+ without having to re-read it from storage. By default, read cache
+ is enabled (<literal>read_cache_enable=1</literal>) for HDD OSDs
+ and automatically disabled for flash OSDs
+ (<literal>nonrotational=1</literal>).
+ The read cache cannot be disabled for <literal>osd-zfs</literal>,
+ and as a result this parameter is unavailable for that backend.
+ </para>
+ <para>When the server receives a read request from a client,
+ it reads data from storage into its memory and sends the data
+ to the client. If read cache is enabled for the target,
+          and the RPC and object size also meet the other criteria below,
+ this data may stay in memory after the client request has
+          completed. If later read requests for the same data are received
+          and the data is still in cache, the server skips reading it from
+ storage. The cache is managed by the Linux kernel globally
+ across all targets on that server so that the infrequently used
+ cache pages are dropped from memory when the free memory is
+ running low.</para>
+ <para>If read cache is disabled
+ (<literal>read_cache_enable=0</literal>), or the read or object
+ is large enough that it will not benefit from caching, the server
+ discards the data after the read request from the client is
+ completed. For subsequent read requests the server again reads
+ the data from storage.</para>
+ <para>To disable read cache on all targets of a server, run:</para>
+ <screen>
+ oss1# lctl set_param osd-*.*.read_cache_enable=0
+ </screen>
+ <para>To re-enable read cache on one target, run:</para>
+ <screen>
+ oss1# lctl set_param osd-*.{target_name}.read_cache_enable=1
+ </screen>
+ <para>To check if read cache is enabled on targets on a server, run:
+ </para>
+ <screen>
+ oss1# lctl get_param osd-*.*.read_cache_enable
+ </screen>
</listitem>
<listitem>
- <para><literal>writethrough_cache_enable</literal> - Controls whether data sent to the
- OSS as a write request is kept in the read cache and available for later reads, or if
- it is discarded from cache when the write is completed. By default, the writethrough
- cache is enabled (<literal>writethrough_cache_enable=1</literal>).</para>
- <para>When the OSS receives write requests from a client, it receives data from the
- client into its memory and writes the data to disk. If the writethrough cache is
- enabled, this data stays in memory after the write request is completed, allowing the
- OSS to skip reading this data from disk if a later read request, or partial-page write
- request, for the same data is received.</para>
+ <para><literal>writethrough_cache_enable</literal> - High-level
+ control of whether data sent to the server as a write request is
+ kept in the read cache and available for later reads, or if it is
+ discarded when the write completes. By default, writethrough
+ cache is enabled (<literal>writethrough_cache_enable=1</literal>)
+ for HDD OSDs and automatically disabled for flash OSDs
+ (<literal>nonrotational=1</literal>).
+ The write cache cannot be disabled for <literal>osd-zfs</literal>,
+ and as a result this parameter is unavailable for that backend.
+ </para>
+ <para>When the server receives write requests from a client, it
+ fetches data from the client into its memory and writes the data
+ to storage. If the writethrough cache is enabled for the target,
+          and the RPC and object size meet the other criteria below,
+ this data may stay in memory after the write request has
+          completed. If later read or partial-block write requests for this
+          same data are received and the data is still in cache, the server
+ skips reading it from storage.
+ </para>
<para>If the writethrough cache is disabled
- (<literal>writethrough_cache_enabled=0</literal>), the OSS discards the data after
- the write request from the client is completed. For subsequent read requests, or
- partial-page write requests, the OSS must re-read the data from disk.</para>
- <para>Enabling writethrough cache is advisable if clients are doing small or unaligned
- writes that would cause partial-page updates, or if the files written by one node are
- immediately being accessed by other nodes. Some examples where enabling writethrough
- cache might be useful include producer-consumer I/O models or shared-file writes with
- a different node doing I/O not aligned on 4096-byte boundaries. </para>
- <para>Disabling the writethrough cache is advisable when files are mostly written to the
- file system but are not re-read within a short time period, or files are only written
- and re-read by the same node, regardless of whether the I/O is aligned or not.</para>
- <para>To disable the writethrough cache on all OSTs of an OSS, run:</para>
- <screen>root@oss1# lctl set_param obdfilter.*.writethrough_cache_enable=0</screen>
+ (<literal>writethrough_cache_enabled=0</literal>), or the
+ write or object is large enough that it will not benefit from
+ caching, the server discards the data after the write request
+ from the client is completed. For subsequent read requests, or
+ partial-page write requests, the server must re-read the data
+ from storage.</para>
+ <para>Enabling writethrough cache is advisable if clients are doing
+ small or unaligned writes that would cause partial-page updates,
+ or if the files written by one node are immediately being read by
+ other nodes. Some examples where enabling writethrough cache
+ might be useful include producer-consumer I/O models or
+ shared-file writes that are not aligned on 4096-byte boundaries.
+ </para>
+ <para>Disabling the writethrough cache is advisable when files are
+ mostly written to the file system but are not re-read within a
+ short time period, or files are only written and re-read by the
+ same node, regardless of whether the I/O is aligned or not.</para>
+ <para>To disable writethrough cache on all targets on a server, run:
+ </para>
+ <screen>
+ oss1# lctl set_param osd-*.*.writethrough_cache_enable=0
+ </screen>
<para>To re-enable the writethrough cache on one OST, run:</para>
- <screen>root@oss1# lctl set_param obdfilter.{OST_name}.writethrough_cache_enable=1</screen>
+ <screen>
+ oss1# lctl set_param osd-*.{OST_name}.writethrough_cache_enable=1
+ </screen>
<para>To check if the writethrough cache is enabled, run:</para>
- <screen>root@oss1# lctl get_param obdfilter.*.writethrough_cache_enable</screen>
+ <screen>
+ oss1# lctl get_param osd-*.*.writethrough_cache_enable
+ </screen>
+ </listitem>
+ <listitem>
+ <para><literal>readcache_max_filesize</literal> - Controls the
+ maximum size of an object that both the read cache and
+ writethrough cache will try to keep in memory. Objects larger
+ than <literal>readcache_max_filesize</literal> will not be kept
+ in cache for either reads or writes regardless of the
+ <literal>read_cache_enable</literal> or
+ <literal>writethrough_cache_enable</literal> settings.</para>
+ <para>Setting this tunable can be useful for workloads where
+ relatively small objects are repeatedly accessed by many clients,
+ such as job startup objects, executables, log objects, etc., but
+ large objects are read or written only once. By not putting the
+ larger objects into the cache, it is much more likely that more
+ of the smaller objects will remain in cache for a longer time.
+ </para>
+ <para>When setting <literal>readcache_max_filesize</literal>,
+ the input value can be specified in bytes, or can have a suffix
+ to indicate other binary units such as
+ <literal>K</literal> (kibibytes),
+ <literal>M</literal> (mebibytes),
+ <literal>G</literal> (gibibytes),
+ <literal>T</literal> (tebibytes), or
+ <literal>P</literal> (pebibytes).</para>
+ <para>
+ To limit the maximum cached object size to 64 MiB on all OSTs of
+ a server, run:
+ </para>
+ <screen>
+ oss1# lctl set_param osd-*.*.readcache_max_filesize=64M
+ </screen>
+ <para>To disable the maximum cached object size on all targets, run:
+ </para>
+ <screen>
+ oss1# lctl set_param osd-*.*.readcache_max_filesize=-1
+ </screen>
+ <para>
+ To check the current maximum cached object size on all targets of
+ a server, run:
+ </para>
+ <screen>
+ oss1# lctl get_param osd-*.*.readcache_max_filesize
+ </screen>
+ </listitem>
+ <listitem>
+ <para><literal>readcache_max_io_mb</literal> - Controls the maximum
+ size of a single read IO that will be cached in memory. Reads
+ larger than <literal>readcache_max_io_mb</literal> will be read
+ directly from storage and bypass the page cache completely.
+ This avoids significant CPU overhead at high IO rates.
+ The read cache cannot be disabled for <literal>osd-zfs</literal>,
+ and as a result this parameter is unavailable for that backend.
+ </para>
+ <para>When setting <literal>readcache_max_io_mb</literal>, the
+ input value can be specified in mebibytes, or can have a suffix
+ to indicate other binary units such as
+ <literal>K</literal> (kibibytes),
+ <literal>M</literal> (mebibytes),
+ <literal>G</literal> (gibibytes),
+ <literal>T</literal> (tebibytes), or
+ <literal>P</literal> (pebibytes).</para>
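+          <para>For example, to cache only individual reads of 8 MiB or
+          less on all targets of a server (an illustrative value only),
+          run:</para>
+          <screen>oss1# lctl set_param osd-*.*.readcache_max_io_mb=8</screen>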
</listitem>
<listitem>
- <para><literal>readcache_max_filesize</literal> - Controls the maximum size of a file
- that both the read cache and writethrough cache will try to keep in memory. Files
- larger than <literal>readcache_max_filesize</literal> will not be kept in cache for
- either reads or writes.</para>
- <para>Setting this tunable can be useful for workloads where relatively small files are
- repeatedly accessed by many clients, such as job startup files, executables, log
- files, etc., but large files are read or written only once. By not putting the larger
- files into the cache, it is much more likely that more of the smaller files will
- remain in cache for a longer time.</para>
- <para>When setting <literal>readcache_max_filesize</literal>, the input value can be
- specified in bytes, or can have a suffix to indicate other binary units such as
- <literal>K</literal> (kilobytes), <literal>M</literal> (megabytes),
- <literal>G</literal> (gigabytes), <literal>T</literal> (terabytes), or
- <literal>P</literal> (petabytes).</para>
- <para>To limit the maximum cached file size to 32 MB on all OSTs of an OSS, run:</para>
- <screen>root@oss1# lctl set_param obdfilter.*.readcache_max_filesize=32M</screen>
- <para>To disable the maximum cached file size on an OST, run:</para>
- <screen>root@oss1# lctl set_param obdfilter.{OST_name}.readcache_max_filesize=-1</screen>
- <para>To check the current maximum cached file size on all OSTs of an OSS, run:</para>
- <screen>root@oss1# lctl get_param obdfilter.*.readcache_max_filesize</screen>
+ <para><literal>writethrough_max_io_mb</literal> - Controls the
+          maximum size of a single write IO that will be cached in memory.
+ Writes larger than <literal>writethrough_max_io_mb</literal> will
+ be written directly to storage and bypass the page cache entirely.
+ This avoids significant CPU overhead at high IO rates.
+ The write cache cannot be disabled for <literal>osd-zfs</literal>,
+ and as a result this parameter is unavailable for that backend.
+ </para>
+ <para>When setting <literal>writethrough_max_io_mb</literal>, the
+ input value can be specified in mebibytes, or can have a suffix
+ to indicate other binary units such as
+ <literal>K</literal> (kibibytes),
+ <literal>M</literal> (mebibytes),
+ <literal>G</literal> (gibibytes),
+ <literal>T</literal> (tebibytes), or
+ <literal>P</literal> (pebibytes).</para>
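+          <para>Similarly, to cache only individual writes of 8 MiB or
+          less on all targets of a server (again an illustrative value),
+          run:</para>
+          <screen>oss1# lctl set_param osd-*.*.writethrough_max_io_mb=8</screen>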
</listitem>
</itemizedlist>
</section>
</informaltable>
<section>
<title>Interpreting Adaptive Timeout Information</title>
- <para>Adaptive timeout information can be obtained from the <literal>timeouts</literal>
- files in <literal>/proc/fs/lustre/*/</literal> on each server and client using the
- <literal>lctl</literal> command. To read information from a <literal>timeouts</literal>
- file, enter a command similar to:</para>
+      <para>Adaptive timeout information can be obtained via
+      <literal>lctl get_param {osc,mdc}.*.timeouts</literal> on each
+      client and <literal>lctl get_param {ost,mds}.*.*.timeouts</literal>
+      on each server. To read information from a
+      <literal>timeouts</literal> parameter, enter a command similar to:</para>
<screen># lctl get_param -n ost.*.ost_io.timeouts
-service : cur 33 worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2</screen>
- <para>In this example, the <literal>ost_io</literal> service on this node is currently
- reporting an estimated RPC service time of 33 seconds. The worst RPC service time was 34
- seconds, which occurred 26 minutes ago.</para>
- <para>The output also provides a history of service times. Four "bins" of adaptive
- timeout history are shown, with the maximum RPC time in each bin reported. In both the
- 0-150s bin and the 150-300s bin, the maximum RPC time was 1. The 300-450s bin shows the
- worst (maximum) RPC time at 33 seconds, and the 450-600s bin shows a maximum of RPC time
- of 2 seconds. The estimated service time is the maximum value across the four bins (33
- seconds in this example).</para>
- <para>Service times (as reported by the servers) are also tracked in the client OBDs, as
- shown in this example:</para>
+service : cur 33 worst 34 (at 1193427052, 1600s ago) 1 1 33 2</screen>
+ <para>In this example, the <literal>ost_io</literal> service on this
+ node is currently reporting an estimated RPC service time of 33
+ seconds. The worst RPC service time was 34 seconds, which occurred
+ 26 minutes ago.</para>
+ <para>The output also provides a history of service times.
+ Four "bins" of adaptive timeout history are shown, with the
+ maximum RPC time in each bin reported. In both the 0-150s bin and the
+ 150-300s bin, the maximum RPC time was 1. The 300-450s bin shows the
+ worst (maximum) RPC time at 33 seconds, and the 450-600s bin shows a
+      maximum RPC time of 2 seconds. The estimated service time is the
+ maximum value in the four bins (33 seconds in this example).</para>
+ <para>Service times (as reported by the servers) are also tracked in
+ the client OBDs, as shown in this example:</para>
<screen># lctl get_param osc.*.timeouts
last reply : 1193428639, 0d0h00m00s ago
network : cur 1 worst 2 (at 1193427053, 0d0h26m26s ago) 1 1 1 1
portal 7 : cur 1 worst 1 (at 1193426141, 0d0h41m38s ago) 1 0 1 1
portal 17 : cur 1 worst 1 (at 1193426177, 0d0h41m02s ago) 1 0 0 1
</screen>
- <para>In this example, portal 6, the <literal>ost_io</literal> service portal, shows the
- history of service estimates reported by the portal.</para>
- <para>Server statistic files also show the range of estimates including min, max, sum, and
- sumsq. For example:</para>
+ <para>In this example, portal 6, the <literal>ost_io</literal> service
+ portal, shows the history of service estimates reported by the portal.
+ </para>
+ <para>Server statistic files also show the range of estimates including
+ min, max, sum, and sum-squared. For example:</para>
<screen># lctl get_param mdt.*.mdt.stats
...
req_timeout 6 samples [sec] 1 10 15 105
<primary>LNet</primary>
<secondary>proc</secondary>
</indexterm>Monitoring LNet</title>
- <para>LNet information is located in <literal>/proc/sys/lnet</literal> in these files:<itemizedlist>
+  <para>LNet information can be obtained via <literal>lctl get_param</literal>
+    from these parameters:
+ <itemizedlist>
<listitem>
- <para><literal>peers</literal> - Shows all NIDs known to this node and provides
- information on the queue state.</para>
+ <para><literal>peers</literal> - Shows all NIDs known to this node
+ and provides information on the queue state.</para>
<para>Example:</para>
<screen># lctl get_param peers
nid refs state max rtr min tx min queue
<literal>rtr </literal></para>
</entry>
<entry>
- <para>Number of routing buffer credits.</para>
+ <para>Number of available routing buffer credits.</para>
</entry>
</row>
<row>
<literal>tx </literal></para>
</entry>
<entry>
- <para>Number of send credits.</para>
+ <para>Number of available send credits.</para>
</entry>
</row>
<row>
</tbody>
</tgroup>
</informaltable>
- <para>Credits are initialized to allow a certain number of operations (in the example
- above the table, eight as shown in the <literal>max</literal> column. LNet keeps track
- of the minimum number of credits ever seen over time showing the peak congestion that
- has occurred during the time monitored. Fewer available credits indicates a more
- congested resource. </para>
- <para>The number of credits currently in flight (number of transmit credits) is shown in
- the <literal>tx</literal> column. The maximum number of send credits available is shown
- in the <literal>max</literal> column and never changes. The number of router buffers
- available for consumption by a peer is shown in the <literal>rtr</literal>
- column.</para>
- <para>Therefore, <literal>rtr</literal> – <literal>tx</literal> is the number of transmits
- in flight. Typically, <literal>rtr == max</literal>, although a configuration can be set
- such that <literal>max >= rtr</literal>. The ratio of routing buffer credits to send
- credits (<literal>rtr/tx</literal>) that is less than <literal>max</literal> indicates
- operations are in progress. If the ratio <literal>rtr/tx</literal> is greater than
- <literal>max</literal>, operations are blocking.</para>
- <para>LNet also limits concurrent sends and number of router buffers allocated to a single
- peer so that no peer can occupy all these resources.</para>
+ <para>Credits are initialized to allow a certain number of operations
+ (in the example above the table, eight as shown in the
+  <literal>max</literal> column). LNet keeps track of the minimum
+ number of credits ever seen over time showing the peak congestion
+ that has occurred during the time monitored. Fewer available credits
+ indicates a more congested resource. </para>
+ <para>The number of credits currently available is shown in the
+ <literal>tx</literal> column. The maximum number of send credits is
+ shown in the <literal>max</literal> column and never changes. The
+ number of currently active transmits can be derived by
+ <literal>(max - tx)</literal>, as long as
+ <literal>tx</literal> is greater than or equal to 0. Once
+      <literal>tx</literal> is less than 0, its absolute value indicates
+      the number of transmits on that peer queued for lack of credits.
+ </para>
+ <para>The number of router buffer credits available for consumption
+      by a peer is shown in the <literal>rtr</literal> column. The number of
+ routing credits can be configured separately at the LND level or at
+ the LNet level by using the <literal>peer_buffer_credits</literal>
+      module parameter for the appropriate module. If the routing credits
+      are not set explicitly, they default to the maximum transmit credits
+      defined by the <literal>peer_credits</literal> module parameter.
+ Whenever a gateway routes a message from a peer, it decrements the
+ number of available routing credits for that peer. If that value
+ goes to zero, then messages will be queued. Negative values show the
+      number of queued messages waiting to be routed. The number of
+ messages which are currently being routed from a peer can be derived
+ by <literal>(max_rtr_credits - rtr)</literal>.</para>
+ <para>LNet also limits concurrent sends and number of router buffers
+ allocated to a single peer so that no peer can occupy all resources.
+ </para>
</listitem>
<listitem>
- <para><literal>nis</literal> - Shows the current queue health on this node.</para>
+ <para><literal>nis</literal> - Shows current queue health on the node.
+ </para>
<para>Example:</para>
<screen># lctl get_param nis
nid refs peer max tx min
</listitem>
</itemizedlist></para>
</section>
- <section remap="h3">
+ <section remap="h3" xml:id="dbdoclet.balancing_free_space">
<title><indexterm>
<primary>proc</primary>
<secondary>free space</secondary>
</indexterm>Allocating Free Space on OSTs</title>
- <para>Free space is allocated using either a round-robin or a weighted algorithm. The allocation
- method is determined by the maximum amount of free-space imbalance between the OSTs. When free
- space is relatively balanced across OSTs, the faster round-robin allocator is used, which
- maximizes network balancing. The weighted allocator is used when any two OSTs are out of
- balance by more than a specified threshold.</para>
- <para>Free space distribution can be tuned using these two <literal>/proc</literal>
- tunables:</para>
+ <para>Free space is allocated using either a round-robin or a weighted
+ algorithm. The allocation method is determined by the maximum amount of
+ free-space imbalance between the OSTs. When free space is relatively
+ balanced across OSTs, the faster round-robin allocator is used, which
+ maximizes network balancing. The weighted allocator is used when any two
+ OSTs are out of balance by more than a specified threshold.</para>
+ <para>Free space distribution can be tuned using these two
+ tunable parameters:</para>
<itemizedlist>
<listitem>
- <para><literal>qos_threshold_rr</literal> - The threshold at which the allocation method
- switches from round-robin to weighted is set in this file. The default is to switch to the
- weighted algorithm when any two OSTs are out of balance by more than 17 percent.</para>
+ <para><literal>lod.*.qos_threshold_rr</literal> - The threshold at which
+ the allocation method switches from round-robin to weighted is set
+ in this file. The default is to switch to the weighted algorithm when
+ any two OSTs are out of balance by more than 17 percent.</para>
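+      <para>For example, to switch to the weighted allocator only when any
+      two OSTs are more than 25 percent out of balance, a command similar
+      to the following could be run on the MDS:</para>
+      <screen>mds# lctl set_param lod.*.qos_threshold_rr=25</screen>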
</listitem>
<listitem>
- <para><literal>qos_prio_free</literal> - The weighting priority used by the weighted
- allocator can be adjusted in this file. Increasing the value of
- <literal>qos_prio_free</literal> puts more weighting on the amount of free space
- available on each OST and less on how stripes are distributed across OSTs. The default
- value is 91 percent. When the free space priority is set to 100, weighting is based
- entirely on free space and location is no longer used by the striping algorithm.</para>
+ <para><literal>lod.*.qos_prio_free</literal> - The weighting priority
+ used by the weighted allocator can be adjusted in this file. Increasing
+ the value of <literal>qos_prio_free</literal> puts more weighting on the
+ amount of free space available on each OST and less on how stripes are
+ distributed across OSTs. The default value is 91 percent weighting for
+ free space rebalancing and 9 percent for OST balancing. When the
+ free space priority is set to 100, weighting is based entirely on free
+ space and location is no longer used by the striping algorithm.</para>
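+      <para>For example, to weight allocations almost entirely by free
+      space (98 percent is an illustrative value only), a command similar
+      to the following could be run on the MDS:</para>
+      <screen>mds# lctl set_param lod.*.qos_prio_free=98</screen>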
</listitem>
<listitem>
- <para condition="l29"><literal>reserved_mb_low</literal> - The low watermark used to stop
- object allocation if available space is less than it. The default is 0.1 percent of total
- OST size.</para>
+ <para condition="l29"><literal>osp.*.reserved_mb_low</literal>
+ - The low watermark used to stop object allocation if available space
+ is less than this. The default is 0.1% of total OST size.</para>
</listitem>
<listitem>
- <para condition="l29"><literal>reserved_mb_high</literal> - The high watermark used to start
- object allocation if available space is more than it. The default is 0.2 percent of total
- OST size.</para>
+ <para condition="l29"><literal>osp.*.reserved_mb_high</literal>
+ - The high watermark used to start object allocation if available
+ space is more than this. The default is 0.2% of total OST size.</para>
</listitem>
</itemizedlist>
- <para>For more information about monitoring and managing free space, see <xref
- xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438209_10424"/>.</para>
+ <para>For more information about monitoring and managing free space, see
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="file_striping.managing_free_space"/>.</para>
</section>
<section remap="h3">
<title><indexterm>
<primary>proc</primary>
<secondary>locking</secondary>
</indexterm>Configuring Locking</title>
- <para>The <literal>lru_size</literal> parameter is used to control the number of client-side
- locks in an LRU cached locks queue. LRU size is dynamic, based on load to optimize the number
- of locks available to nodes that have different workloads (e.g., login/build nodes vs. compute
- nodes vs. backup nodes).</para>
- <para>The total number of locks available is a function of the server RAM. The default limit is
- 50 locks/1 MB of RAM. If memory pressure is too high, the LRU size is shrunk. The number of
- locks on the server is limited to <emphasis role="italic">the number of OSTs per
- server</emphasis> * <emphasis role="italic">the number of clients</emphasis> * <emphasis
- role="italic">the value of the</emphasis>
- <literal>lru_size</literal>
- <emphasis role="italic">setting on the client</emphasis> as follows: </para>
+ <para>The <literal>lru_size</literal> parameter is used to control the
+ number of client-side locks in the LRU cached locks queue. LRU size is
+ normally dynamic, based on load to optimize the number of locks cached
+ on nodes that have different workloads (e.g., login/build nodes vs.
+ compute nodes vs. backup nodes).</para>
+ <para>The total number of locks available is a function of the server RAM.
+ The default limit is 50 locks/1 MB of RAM. If memory pressure is too high,
+ the LRU size is shrunk. The number of locks on the server is limited to
+ <replaceable>num_osts_per_oss * num_clients * lru_size</replaceable>
+ as follows: </para>
<itemizedlist>
<listitem>
- <para>To enable automatic LRU sizing, set the <literal>lru_size</literal> parameter to 0. In
- this case, the <literal>lru_size</literal> parameter shows the current number of locks
- being used on the export. LRU sizing is enabled by default.</para>
+ <para>To enable automatic LRU sizing, set the
+ <literal>lru_size</literal> parameter to 0. In this case, the
+ <literal>lru_size</literal> parameter shows the current number of locks
+ being used on the client. Dynamic LRU resizing is enabled by default.
+ </para>
</listitem>
<listitem>
- <para>To specify a maximum number of locks, set the <literal>lru_size</literal> parameter to
- a value other than zero but, normally, less than 100 * <emphasis role="italic">number of
- CPUs in client</emphasis>. It is recommended that you only increase the LRU size on a
- few login nodes where users access the file system interactively.</para>
+ <para>To specify a maximum number of locks, set the
+ <literal>lru_size</literal> parameter to a value other than zero.
+ A good default value for compute nodes is around
+ <literal>100 * <replaceable>num_cpus</replaceable></literal>.
+ It is recommended that you only set <literal>lru_size</literal>
+      to be significantly larger on a few login nodes where multiple
+ users access the file system interactively.</para>
</listitem>
</itemizedlist>
- <para>To clear the LRU on a single client, and, as a result, flush client cache without changing
- the <literal>lru_size</literal> value, run:</para>
- <screen>$ lctl set_param ldlm.namespaces.<replaceable>osc_name|mdc_name</replaceable>.lru_size=clear</screen>
- <para>If the LRU size is set to be less than the number of existing unused locks, the unused
- locks are canceled immediately. Use <literal>echo clear</literal> to cancel all locks without
- changing the value.</para>
+ <para>To clear the LRU on a single client, and, as a result, flush client
+ cache without changing the <literal>lru_size</literal> value, run:</para>
+ <screen># lctl set_param ldlm.namespaces.<replaceable>osc_name|mdc_name</replaceable>.lru_size=clear</screen>
+ <para>If the LRU size is set lower than the number of existing locks,
+ <emphasis>unused</emphasis> locks are canceled immediately. Use
+ <literal>clear</literal> to cancel all locks without changing the value.
+ </para>
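+ <para>For example, to flush the cached locks for all OSC devices on a
+ client, run:</para>
+ <screen># lctl set_param ldlm.namespaces.*osc*.lru_size=clear</screen>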
<note>
- <para>The <literal>lru_size</literal> parameter can only be set temporarily using
- <literal>lctl set_param</literal>; it cannot be set permanently.</para>
+ <para>The <literal>lru_size</literal> parameter can only be set
+ temporarily using <literal>lctl set_param</literal>; it cannot be set
+ permanently.</para>
</note>
- <para>To disable LRU sizing, on the Lustre clients, run:</para>
- <screen>$ lctl set_param ldlm.namespaces.*osc*.lru_size=$((<replaceable>NR_CPU</replaceable>*100))</screen>
- <para>Replace <literal><replaceable>NR_CPU</replaceable></literal> with the number of CPUs on
- the node.</para>
- <para>To determine the number of locks being granted, run:</para>
+ <para>To disable dynamic LRU resizing on the clients, set
+ <literal>lru_size</literal> to a fixed non-zero value, for example:
+ </para>
+ <screen># lctl set_param ldlm.namespaces.*osc*.lru_size=5000</screen>
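+ <para>To re-enable dynamic LRU resizing, set the
+ <literal>lru_size</literal> parameter back to 0:</para>
+ <screen># lctl set_param ldlm.namespaces.*osc*.lru_size=0</screen>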
+ <para>To determine the number of locks that may be granted with dynamic
+ LRU resizing, run:</para>
<screen>$ lctl get_param ldlm.namespaces.*.pool.limit</screen>
+ <para>The <literal>lru_max_age</literal> parameter is used to control the
+ age of client-side locks in the LRU cached locks queue. This limits how
+ long unused locks are cached on the client, and prevents idle clients
+ from holding locks for an excessive time, which reduces memory usage on
+ both the client and server, as well as reducing work during server
+ recovery.
+ </para>
+ <para>The <literal>lru_max_age</literal> parameter is set and printed in
+ milliseconds, and is 3900000 ms (65 minutes) by default.</para>
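+ <para>For example, to reduce the maximum lock age on all OSC namespaces
+ to 10 minutes (600000 ms; the value is chosen for illustration only),
+ run:</para>
+ <screen># lctl set_param ldlm.namespaces.*osc*.lru_max_age=600000</screen>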
+ <para condition='l2B'>Since Lustre 2.11, in addition to setting the
+ maximum lock age in milliseconds, it can also be set using a suffix of
+ <literal>s</literal> or <literal>ms</literal> to indicate seconds or
+ milliseconds, respectively. For example, to set the client's maximum
+ lock age to 15 minutes (900s), run:
+ </para>
+ <screen># lctl set_param ldlm.namespaces.*MDT*.lru_max_age=900s
+# lctl get_param ldlm.namespaces.*MDT*.lru_max_age
+ldlm.namespaces.myth-MDT0000-mdc-ffff8804296c2800.lru_max_age=900000</screen>
</section>
<section xml:id="dbdoclet.50438271_87260">
<title><indexterm>
</tbody>
</tgroup>
</informaltable>
- <para>For each service, an entry as shown below is
- created:<screen>/proc/fs/lustre/<replaceable>service</replaceable>/*/threads_<replaceable>min|max|started</replaceable></screen></para>
+ <para>For each service, the <literal>threads_min</literal>,
+ <literal>threads_max</literal>, and <literal>threads_started</literal>
+ tunable parameters are available.
+ </para>
<itemizedlist>
<listitem>
- <para>To temporarily set this tunable, run:</para>
- <screen># lctl <replaceable>get|set</replaceable>_param <replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable> </screen>
- </listitem>
+ <para>To temporarily set these tunables, run:</para>
+ <screen># lctl set_param <replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable>=<replaceable>num</replaceable></screen>
+ </listitem>
<listitem>
- <para>To permanently set this tunable, run:</para>
- <screen># lctl conf_param <replaceable>obdname|fsname.obdtype</replaceable>.threads_<replaceable>min|max|started</replaceable> </screen>
- <para condition='l25'>For version 2.5 or later, run:
- <screen># lctl set_param -P <replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></screen></para>
+ <para>To permanently set these tunables, run the following command on
+ the MGS:
+ <screen>mgs# lctl set_param -P <replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable>=<replaceable>num</replaceable></screen></para>
+ <para condition='l25'>For Lustre 2.5 or earlier, run:
+ <screen>mgs# lctl conf_param <replaceable>obdname|fsname.obdtype</replaceable>.threads_<replaceable>min|max|started</replaceable>=<replaceable>num</replaceable></screen>
+ </para>
</listitem>
</itemizedlist>
- <para>The following examples show how to set thread counts and get the number of running threads
- for the service <literal>ost_io</literal> using the tunable
- <literal><replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></literal>.</para>
+ <para>The following examples show how to set thread counts and get the
+ number of running threads for the service <literal>ost_io</literal>
+ using the tunable
+ <literal><replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></literal>.</para>
<itemizedlist>
<listitem>
<para>To get the number of running threads, run:</para>
<listitem>
<para>To set the maximum thread count to 256 instead of 512 permanently, run:</para>
<screen># lctl conf_param testfs.ost.ost_io.threads_max=256</screen>
- <para condition='l25'>For version 2.5 or later, run:
- <screen># lctl set_param -P ost.OSS.ost_io.threads_max=256
+ <para condition='l25'>For version 2.5 or later, run:
+ <screen># lctl set_param -P ost.OSS.ost_io.threads_max=256
ost.OSS.ost_io.threads_max=256 </screen> </para>
</listitem>
<listitem>
<primary>proc</primary>
<secondary>debug</secondary>
</indexterm>Enabling and Interpreting Debugging Logs</title>
- <para>By default, a detailed log of all operations is generated to aid in debugging. Flags that
- control debugging are found in <literal>/proc/sys/lnet/debug</literal>. </para>
- <para>The overhead of debugging can affect the performance of Lustre file system. Therefore, to
- minimize the impact on performance, the debug level can be lowered, which affects the amount
- of debugging information kept in the internal log buffer but does not alter the amount of
- information to goes into syslog. You can raise the debug level when you need to collect logs
- to debug problems. </para>
- <para>The debugging mask can be set using "symbolic names". The symbolic format is
- shown in the examples below.<itemizedlist>
+ <para>By default, a detailed log of all operations is generated to aid in
+ debugging. Flags that control debugging are found via
+ <literal>lctl get_param debug</literal>.</para>
+ <para>The overhead of debugging can affect the performance of the Lustre
+ file system. Therefore, to minimize the impact on performance, the debug
+ level can be lowered, which affects the amount of debugging information
+ kept in the internal log buffer but does not alter the amount of
+ information that goes into syslog. You can raise the debug level when
+ you need to collect logs to debug problems. </para>
+ <para>The debugging mask can be set using "symbolic names". The
+ symbolic format is shown in the examples below.
+ <itemizedlist>
<listitem>
- <para>To verify the debug level used, examine the <literal>sysctl</literal> that controls
- debugging by running:</para>
- <screen># sysctl lnet.debug
-lnet.debug = ioctl neterror warning error emerg ha config console</screen>
+ <para>To verify the debug level used, examine the parameter that
+ controls debugging by running:</para>
+ <screen># lctl get_param debug
+debug=
+ioctl neterror warning error emerg ha config console</screen>
</listitem>
<listitem>
- <para>To turn off debugging (except for network error debugging), run the following
- command on all nodes concerned:</para>
+ <para>To turn off debugging except for network error debugging, run
+ the following command on all nodes concerned:</para>
- <screen># sysctl -w lnet.debug="neterror"
-lnet.debug = neterror</screen>
+ <screen># lctl set_param debug=neterror
+debug=neterror</screen>
</listitem>
- </itemizedlist><itemizedlist>
+ </itemizedlist>
+ <itemizedlist>
<listitem>
- <para>To turn off debugging completely, run the following command on all nodes
+ <para>To turn off debugging completely (except for the minimum error
+ reporting to the console), run the following command on all nodes
concerned:</para>
- <screen># sysctl -w lnet.debug=0
-lnet.debug = 0</screen>
+ <screen># lctl set_param debug=0
+debug=0</screen>
</listitem>
<listitem>
- <para>To set an appropriate debug level for a production environment, run:</para>
- <screen># sysctl -w lnet.debug="warning dlmtrace error emerg ha rpctrace vfstrace"
-lnet.debug = warning dlmtrace error emerg ha rpctrace vfstrace</screen>
- <para>The flags shown in this example collect enough high-level information to aid
- debugging, but they do not cause any serious performance impact.</para>
+ <para>To set an appropriate debug level for a production environment,
+ run:</para>
+ <screen># lctl set_param debug="warning dlmtrace error emerg ha rpctrace vfstrace"
+debug=warning dlmtrace error emerg ha rpctrace vfstrace</screen>
+ <para>The flags shown in this example collect enough high-level
+ information to aid debugging, but they do not cause any serious
+ performance impact.</para>
</listitem>
- </itemizedlist><itemizedlist>
- <listitem>
- <para>To clear all flags and set new flags, run:</para>
- <screen># sysctl -w lnet.debug="warning"
-lnet.debug = warning</screen>
- </listitem>
- </itemizedlist><itemizedlist>
+ </itemizedlist>
+ <itemizedlist>
<listitem>
- <para>To add new flags to flags that have already been set, precede each one with a
- "<literal>+</literal>":</para>
- <screen># sysctl -w lnet.debug="+neterror +ha"
-lnet.debug = +neterror +ha
-# sysctl lnet.debug
-lnet.debug = neterror warning ha</screen>
+ <para>To add new flags to flags that have already been set,
+ precede each one with a "<literal>+</literal>":</para>
+ <screen># lctl set_param debug="+neterror +ha"
+debug=+neterror +ha
+# lctl get_param debug
+debug=neterror warning error emerg ha console</screen>
</listitem>
<listitem>
<para>To remove individual flags, precede them with a
"<literal>-</literal>":</para>
- <screen># sysctl -w lnet.debug="-ha"
-lnet.debug = -ha
-# sysctl lnet.debug
-lnet.debug = neterror warning</screen>
+ <screen># lctl set_param debug="-ha"
+debug=-ha
+# lctl get_param debug
+debug=neterror warning error emerg console</screen>
</listitem>
- <listitem>
- <para>To verify or change the debug level, run commands such as the following: :</para>
- <screen># lctl get_param debug
-debug=
-neterror warning
-# lctl set_param debug=+ha
-# lctl get_param debug
-debug=
-neterror warning ha
-# lctl set_param debug=-warning
-# lctl get_param debug
-debug=
-neterror ha</screen>
- </listitem>
- </itemizedlist></para>
+ </itemizedlist>
+ </para>
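+ <para>To clear all existing flags and set a new set of flags in a single
+ step, assign the flags directly, without a leading
+ "<literal>+</literal>" or "<literal>-</literal>", for example:</para>
+ <screen># lctl set_param debug="warning"
+debug=warning</screen>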
<para>Debugging parameters include:</para>
<itemizedlist>
<listitem>
<literal>/tmp/lustre-log</literal>.</para>
</listitem>
</itemizedlist>
- <para>These parameters are also set using:<screen>sysctl -w lnet.debug={value}</screen></para>
+ <para>These parameters can also be set using:<screen>sysctl -w lnet.debug={value}</screen></para>
<para>Additional useful parameters: <itemizedlist>
<listitem>
<para><literal>panic_on_lbug</literal> - Causes ''panic'' to be called
<section>
<title>Interpreting OST Statistics</title>
<note>
- <para>See also <xref linkend="dbdoclet.50438219_84890"/> (<literal>llobdstat</literal>) and
+ <para>See also
<xref linkend="dbdoclet.50438273_80593"/> (<literal>collectl</literal>).</para>
</note>
<para>OST <literal>stats</literal> files can be used to provide statistics showing activity
obd_ping 212</screen>
<para>Use the <literal>llstat</literal> utility to monitor statistics over time.</para>
<para>To clear the statistics, use the <literal>-c</literal> option to
- <literal>llstat</literal>. To specify how frequently the statistics should be reported (in
- seconds), use the <literal>-i</literal> option. In the example below, the
- <literal>-c</literal> option clears the statistics and <literal>-i10</literal> option
- reports statistics every 10 seconds:</para>
- <screen role="smaller">$ llstat -c -i10 /proc/fs/lustre/ost/OSS/ost_io/stats
+ <literal>llstat</literal>. To specify how frequently the statistics
+ should be reported (in seconds), use the <literal>-i</literal> option.
+ In the example below, the <literal>-c</literal> option clears the
+ statistics and <literal>-i10</literal> option reports statistics every
+ 10 seconds:</para>
+<screen role="smaller">$ llstat -c -i10 ost_io
/usr/bin/llstat: STATS on 06/06/07
/proc/fs/lustre/ost/OSS/ost_io/ stats on 192.168.16.35@tcp
<section>
<title>Interpreting MDT Statistics</title>
<note>
- <para>See also <xref linkend="dbdoclet.50438219_84890"/> (<literal>llobdstat</literal>) and
+ <para>See also
<xref linkend="dbdoclet.50438273_80593"/> (<literal>collectl</literal>).</para>
</note>
<para>MDT <literal>stats</literal> files can be used to track MDT
</section>
</section>
</chapter>
+<!--
+ vim:expandtab:shiftwidth=2:tabstop=8:
+ -->