utilization of many OSTs. For more information about wide striping, see
<xref xmlns:xlink="http://www.w3.org/1999/xlink"
- linkend="section_syy_gcl_qk" />.</para>
+ linkend="wide_striping" />.</para>
</glossdef>
</glossentry>
</glossdiv>
<primary>proc</primary>
<secondary>readahead</secondary>
</indexterm>Tuning File Readahead and Directory Statahead</title>
- <para>File readahead and directory statahead enable reading of data into memory before a
- process requests the data. File readahead reads file content data into memory and directory
- statahead reads metadata into memory. When readahead and statahead work well, a process that
- accesses data finds that the information it needs is available immediately when requested in
- memory without the delay of network I/O.</para>
- <para condition="l22">In Lustre software release 2.2.0, the directory statahead feature was
- improved to enhance directory traversal performance. The improvements primarily addressed
- two issues: <orderedlist>
- <listitem>
- <para>A race condition existed between the statahead thread and other VFS operations
- while processing asynchronous <literal>getattr</literal> RPC replies, causing
- duplicate entries in dcache. This issue was resolved by using statahead local dcache.
- </para>
- </listitem>
- <listitem>
- <para>File size/block attributes pre-fetching was not supported, so the traversing
- thread had to send synchronous glimpse size RPCs to OST(s). This issue was resolved by
- using asynchronous glimpse lock (AGL) RPCs to pre-fetch file size/block attributes
- from OST(s).</para>
- </listitem>
- </orderedlist>
+ <para>File readahead and directory statahead enable reading of data
+ into memory before a process requests the data. File readahead prefetches
+ file content data into memory for <literal>read()</literal> related
+ calls, while directory statahead fetches file metadata into memory for
+ <literal>readdir()</literal> and <literal>stat()</literal> related
+ calls. When readahead and statahead work well, a process that accesses
+ data finds that the information it needs is available immediately in
+ memory on the client when requested without the delay of network I/O.
</para>
<section remap="h4">
<title>Tuning File Readahead</title>
- <para>File readahead is triggered when two or more sequential reads by an application fail
- to be satisfied by data in the Linux buffer cache. The size of the initial readahead is 1
- MB. Additional readaheads grow linearly and increment until the readahead cache on the
- client is full at 40 MB.</para>
+ <para>File readahead is triggered when two or more sequential reads
+ by an application fail to be satisfied by data in the Linux buffer
+ cache. The size of the initial readahead is 1 MB. Subsequent
+ readaheads grow linearly until the readahead cache on the client
+ is full at 40 MB.</para>
<para>Readahead tunables include:</para>
<itemizedlist>
<listitem>
- <para><literal>llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_mb</literal>
- - Controls the maximum amount of data readahead on a file. Files are read ahead in
- RPC-sized chunks (1 MB or the size of the <literal>read()</literal> call, if larger)
- after the second sequential read on a file descriptor. Random reads are done at the
- size of the <literal>read()</literal> call only (no readahead). Reads to
- non-contiguous regions of the file reset the readahead algorithm, and readahead is not
- triggered again until sequential reads take place again. </para>
- <para>To disable readahead, set this tunable to 0. The default value is 40 MB.</para>
+ <para><literal>llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_mb</literal> -
+ Controls the maximum amount of data readahead on a file.
+ Files are read ahead in RPC-sized chunks (1 MB or the size of
+ the <literal>read()</literal> call, if larger) after the second
+ sequential read on a file descriptor. Random reads are done at
+ the size of the <literal>read()</literal> call only (no
+ readahead). Reads to non-contiguous regions of the file reset
+ the readahead algorithm, and readahead is not triggered again
+ until sequential reads take place again.
+ </para>
+ <para>To disable readahead, set
+ <literal>max_read_ahead_mb=0</literal>. The default value is 40 MB.
+ </para>
</listitem>
<listitem>
- <para><literal>llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_whole_mb</literal>
- - Controls the maximum size of a file that is read in its entirety, regardless of the
- size of the <literal>read()</literal>.</para>
+ <para><literal>llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_whole_mb</literal> -
+ Controls the maximum size of a file that is read in its entirety,
+ regardless of the size of the <literal>read()</literal>. This
+ avoids multiple small read RPCs on relatively small files, when
+ it is not possible to efficiently detect a sequential read
+ pattern before the whole file has been read.
+ </para>
</listitem>
</itemizedlist>
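+ <para>As an illustrative sketch only (the value 64 below is an arbitrary
+ example, not a recommendation), the current readahead settings can be
+ inspected and changed on a client with <literal>lctl</literal>:</para>
+ <screen>lctl get_param llite.*.max_read_ahead_mb
+ lctl get_param llite.*.max_read_ahead_whole_mb
+ lctl set_param llite.*.max_read_ahead_mb=64</screen>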
</section>
<section>
<title>Tuning Directory Statahead and AGL</title>
- <para>Many system commands, such as <literal>ls –l</literal>, <literal>du</literal>, and
- <literal>find</literal>, traverse a directory sequentially. To make these commands run
- efficiently, the directory statahead and asynchronous glimpse lock (AGL) can be enabled to
- improve the performance of traversing.</para>
+ <para>Many system commands, such as <literal>ls -l</literal>,
+ <literal>du</literal>, and <literal>find</literal>, traverse a
+ directory sequentially. To make these commands run efficiently, the
+ directory statahead can be enabled to improve the performance of
+ directory traversal.</para>
<para>The statahead tunables are:</para>
<itemizedlist>
<listitem>
- <para><literal>statahead_max</literal> - Controls whether directory statahead is enabled
- and the maximum statahead window size (i.e., how many files can be pre-fetched by the
- statahead thread). By default, statahead is enabled and the value of
- <literal>statahead_max</literal> is 32.</para>
- <para>To disable statahead, run:</para>
+ <para><literal>statahead_max</literal> -
+ Controls the maximum number of file attributes that will be
+ prefetched by the statahead thread. By default, statahead is
+ enabled and <literal>statahead_max</literal> is 32 files.</para>
+ <para>To disable statahead, set <literal>statahead_max</literal>
+ to zero via the following command on the client:</para>
<screen>lctl set_param llite.*.statahead_max=0</screen>
- <para>To set the maximum statahead window size (<replaceable>n</replaceable>),
- run:</para>
+ <para>To change the maximum statahead window size on a client:</para>
<screen>lctl set_param llite.*.statahead_max=<replaceable>n</replaceable></screen>
- <para>The maximum value of <replaceable>n</replaceable> is 8192.</para>
- <para>The AGL can be controlled by entering:</para>
- <screen>lctl set_param llite.*.statahead_agl=<replaceable>n</replaceable></screen>
- <para>The default value for <replaceable>n</replaceable> is 1, which enables the AGL. If
- <replaceable>n</replaceable> is 0, the AGL is disabled.</para>
+ <para>The maximum <literal>statahead_max</literal> is 8192 files.
+ </para>
+ <para>The directory statahead thread will also prefetch the file
+ size/block attributes from the OSTs, so that all file attributes
+ are available on the client when requested by an application.
+ This is controlled by the asynchronous glimpse lock (AGL) setting.
+ The AGL behaviour can be disabled by setting:</para>
+ <screen>lctl set_param llite.*.statahead_agl=0</screen>
</listitem>
<listitem>
- <para><literal>statahead_stats</literal> - A read-only interface that indicates the
- current statahead and AGL statistics, such as how many times statahead/AGL has been
- triggered since the last mount, how many statahead/AGL failures have occurred due to
- an incorrect prediction or other causes.</para>
+ <para><literal>statahead_stats</literal> -
+ A read-only interface that provides current statahead and AGL
+ statistics, such as how many times statahead/AGL has been triggered
+ since the last mount, how many statahead/AGL failures have occurred
+ due to an incorrect prediction or other causes.</para>
<note>
- <para>The AGL is affected by statahead because the inodes processed by AGL are built
- by the statahead thread, which means the statahead thread is the input of the AGL
- pipeline. So if statahead is disabled, then the AGL is disabled by force.</para>
+ <para>AGL behaviour is affected by statahead since the inodes
+ processed by AGL are built by the statahead thread. If
+ statahead is disabled, then AGL is also disabled.</para>
</note>
</listitem>
</itemizedlist>
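+ <para>For example, the statahead and AGL counters can be read on a
+ client as follows. This is a minimal sketch; the exact fields reported
+ may vary between Lustre releases:</para>
+ <screen>lctl get_param llite.*.statahead_stats</screen>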
<para><xref linkend="dbdoclet.50438209_10424"/></para>
</listitem>
<listitem>
- <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="section_syy_gcl_qk"/></para>
+ <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="wide_striping"/></para>
</listitem>
</itemizedlist>
<section xml:id="dbdoclet.50438209_79324">
objects on OSTs with more free space. (This can reduce I/O performance until space usage is
rebalanced again.) For a more detailed description of how striping is allocated, see <xref
linkend="dbdoclet.50438209_10424"/>.</para>
- <para condition="l22">Files can only be striped over a finite number of OSTs. Prior to Lustre
- software release 2.2, the maximum number of OSTs that a file could be striped across was
- limited to 160. As of Lustre software release 2.2, the maximum number of OSTs is 2000. For
- more information, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
- linkend="section_syy_gcl_qk"/>.</para>
+ <para>Files can only be striped over a finite number of OSTs, based on the
+ maximum size of the attributes that can be stored on the MDT. If the MDT
+ is ldiskfs-based without the <literal>ea_inode</literal> feature, a file
+ can be striped across at most 160 OSTs. With a ZFS-based MDT, or if the
+ <literal>ea_inode</literal> feature is enabled for an ldiskfs-based MDT,
+ a file can be striped across up to 2000 OSTs. For more information, see
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="wide_striping"/>.
+ </para>
</section>
<section xml:id="dbdoclet.50438209_48033">
<title><indexterm>
</note>
</section>
</section>
- <section xml:id="section_syy_gcl_qk">
+ <section xml:id="wide_striping">
<title><indexterm>
<primary>striping</primary>
<secondary>wide striping</secondary>
</indexterm><indexterm>
<primary>wide striping</primary>
</indexterm>Lustre Striping Internals</title>
- <para>For Lustre releases prior to Lustre software release 2.2, files can be striped across a
- maximum of 160 OSTs. Lustre inodes use an extended attribute to record the location of each
- object (the object ID and the number of the OST on which it is stored). The size of the
- extended attribute limits the maximum stripe count to 160 objects.</para>
- <para condition="l22">In Lustre software release 2.2 and subsequent releases, the maximum number
- of OSTs over which files can be striped has been raised to 2000 by allocating a new block on
- which to store the extended attribute that holds the object information. This feature, known
- as "wide striping," only allocates the additional extended attribute data block if the file is
- striped with a stripe count greater than 160. The file layout (object ID, OST number) is
- stored on the new data block with a pointer to this block stored in the original Lustre inode
- for the file. For files smaller than 160 objects, the Lustre inode is used to store the file
- layout.</para>
+ <para>Individual files can only be striped over a finite number of OSTs,
+ based on the maximum size of the attributes that can be stored on the MDT.
+ If the MDT is ldiskfs-based without the <literal>ea_inode</literal>
+ feature, a file can be striped across at most 160 OSTs. With ZFS-based
+ MDTs, or if the <literal>ea_inode</literal> feature is enabled for an
+ ldiskfs-based MDT, a file can be striped across up to 2000 OSTs.
+ </para>
+ <para>Lustre inodes use an extended attribute to record on which OST each
+ object is located, and the identifier of each object on that OST. The size
+ of the extended attribute is a function of the number of stripes.</para>
+ <para>If using an ldiskfs-based MDT, the maximum number of OSTs over which
+ files can be striped can be raised to 2000 by enabling the
+ <literal>ea_inode</literal> feature on the MDT:
+ <screen>tune2fs -O ea_inode /dev/<replaceable>mdtdev</replaceable></screen>
+ </para>
+ <note><para>The maximum stripe count for a single file does not limit the
+ maximum number of OSTs that are in the filesystem as a whole, only the
+ maximum possible size and maximum aggregate bandwidth for the file.
+ </para></note>
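+ <para>As a hedged example, assuming the MDT device is
+ <literal>/dev/<replaceable>mdtdev</replaceable></literal> (a placeholder
+ name), the enabled ldiskfs features can be listed to check whether
+ <literal>ea_inode</literal> appears in the feature list:</para>
+ <screen>dumpe2fs -h /dev/<replaceable>mdtdev</replaceable> | grep -i features</screen>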
</section>
</chapter>
</tgroup>
</table>
<para> </para>
- <note>
- <para condition="l22">In Lustre software releases prior to version 2.2,
- the maximum stripe count for a single file was limited to 160 OSTs.
- In version 2.2, the wide striping feature was added to support files
- striped over up to 2000 OSTs. In order to store the large layout for
- such files in ldiskfs, the <literal>ea_inode</literal> feature must
- be enabled on the MDT, but no similar tunable is needed for ZFS MDTs.
- This feature is disabled by default at
- <literal>mkfs.lustre</literal> time. In order to enable this feature,
- specify <literal>--mkfsoptions="-O ea_inode"</literal> at MDT format
- time, or use <literal>tune2fs -O ea_inode</literal> to enable it after
- the MDT has been formatted. Using either the deprecated
- <literal>large_xattr</literal> or preferred <literal>ea_inode</literal>
- feature name results in <literal>ea_inode</literal> being shown in
- the file system feature list.</para>
+ <note><para>By default, for ldiskfs MDTs, the maximum stripe count for a
+ <emphasis>single file</emphasis> is limited to 160 OSTs. To increase
+ the maximum file stripe count, enable the <literal>ea_inode</literal>
+ feature, either with <literal>--mkfsoptions="-O ea_inode"</literal>
+ when formatting the MDT, or with <literal>tune2fs -O ea_inode</literal>
+ after the MDT has been formatted.</para>
</note>
</section>
</section>
provider.</para>
</note>
<note>
- <para condition="l22">In Lustre software release 2.2, a feature has been
- added that allows striping across up to 2000 OSTs. By default, this "wide
- striping" feature is disabled. It is activated by setting the
- <literal>large_xattr</literal> or
- <literal>ea_inode</literal> option on the MDT using either
- <literal>mkfs.lustre</literal> or
- <literal>tune2fs</literal>. For example after upgrading an existing file
- system to Lustre software release 2.2 or later, wide striping can be
- enabled by running the following command on the MDT device before
- mounting it:
+ <para>In Lustre software release 2.2, a feature has been added for
+ ldiskfs-based MDTs that allows striping a single file across up to 2000
+ OSTs. By default, this "wide striping" feature is disabled. It is
+ activated by setting the <literal>ea_inode</literal> option on the MDT
+ using either <literal>mkfs.lustre</literal> or <literal>tune2fs</literal>.
+ For example, after upgrading an existing file system to Lustre software
+ release 2.2 or later, wide striping can be enabled by running the
+ following command on the MDT device before mounting it:
<screen>tune2fs -O large_xattr</screen>
Once the wide striping feature is enabled and in use on the MDT, it is
not possible to directly downgrade the MDT file system to an earlier
disable wide striping:
<orderedlist>
<listitem>
- <para>Delete all wide-striped files.</para>
- <para>OR</para>
- <para>Use
- <literal>lfs_migrate</literal> with the option
- <literal>-c</literal>
- <replaceable>stripe_count</replaceable>(set
- <replaceable>stripe_count</replaceable>to 160) to move the files to
- another location.</para>
+ <para>Delete all wide-striped files, <emphasis>OR</emphasis>
+ use <literal>lfs_migrate -c 160</literal> (or fewer stripes)
+ to migrate the files to use fewer OSTs. This does not affect the
+ total number of OSTs that the whole filesystem can access.</para>
</listitem>
<listitem>
<para>Unmount the MDT.</para>
<listitem>
<para>(Optional) For upgrades to Lustre software release 2.2 or higher,
to enable wide striping on an existing MDT, run the following command
- on the MDT :
- <screen>mdt# tune2fs -O large_xattr <replaceable>device</replaceable></screen></para>
+ on the MDT:
+ <screen>tune2fs -O ea_inode /dev/<replaceable>mdtdev</replaceable></screen>
+ </para>
<para>For more information about wide striping, see
- <xref xmlns:xlink="http://www.w3.org/1999/xlink"
- linkend="section_syy_gcl_qk" />.</para>
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="wide_striping" />.</para>
</listitem>
<listitem>
<para>(Optional) For upgrades to Lustre software release 2.4 or higher,
<xsl:template name="condition-decorator">
<xsl:param name='content'/>
<xsl:choose>
- <xsl:when test="@condition = 'l21'">
- <xsl:call-template name='textdecoration-1'>
- <xsl:with-param name='version' select="'Introduced in Lustre 2.1'"/>
- <xsl:with-param name='content' select="$content"/>
- </xsl:call-template>
- </xsl:when>
- <xsl:when test="@condition = 'l22'">
- <xsl:call-template name='textdecoration-1'>
- <xsl:with-param name='version' select="'Introduced in Lustre 2.2'"/>
- <xsl:with-param name='content' select="$content"/>
- </xsl:call-template>
- </xsl:when>
<xsl:when test="@condition = 'l23'">
<xsl:call-template name='textdecoration-1'>
<xsl:with-param name='version' select="'Introduced in Lustre 2.3'"/>
<xsl:with-param name='content' select="$content"/>
</xsl:call-template>
</xsl:when>
+ <xsl:when test="@condition = 'l2A'">
+ <xsl:call-template name='textdecoration-1'>
+ <xsl:with-param name='version' select="'Introduced in Lustre 2.10'"/>
+ <xsl:with-param name='content' select="$content"/>
+ </xsl:call-template>
+ </xsl:when>
+ <xsl:when test="@condition = 'l2B'">
+ <xsl:call-template name='textdecoration-1'>
+ <xsl:with-param name='version' select="'Introduced in Lustre 2.11'"/>
+ <xsl:with-param name='content' select="$content"/>
+ </xsl:call-template>
+ </xsl:when>
<xsl:when test="@condition != ''">
<xsl:call-template name='textdecoration-1'>
- <xsl:with-param name='version' select="'Introduced in Lustre 2.9'"/>
+ <xsl:with-param name='version' select="'Introduced in Lustre 2.10'"/>
<xsl:with-param name='content' select="$content"/>
</xsl:call-template>
</xsl:when>
<xsl:param name='condition'/>
<!-- add another span to hold the lustre version annotation -->
<xsl:choose>
- <xsl:when test="$condition = 'l21'">
- <span class='floatright'>L 2.1 </span>
- </xsl:when>
- <xsl:when test="$condition = 'l22'">
- <span class='floatright'>L 2.2 </span>
- </xsl:when>
<xsl:when test="$condition = 'l23'">
<span class='floatright'>L 2.3 </span>
</xsl:when>
<xsl:when test="$condition = 'l29'">
<span class='floatright'>L 2.9 </span>
</xsl:when>
+ <xsl:when test="$condition = 'l2A'">
+ <span class='floatright'>L 2.10 </span>
+ </xsl:when>
+ <xsl:when test="$condition = 'l2B'">
+ <span class='floatright'>L 2.11 </span>
+ </xsl:when>
<xsl:when test="$condition != ''">
<span class='floatright'>L ?.? </span>
</xsl:when>
</xsl:variable>
<xsl:variable name="versionstr">
<xsl:choose>
- <xsl:when test="@condition = 'l21'">Introduced in Lustre 2.1</xsl:when>
- <xsl:when test="@condition = 'l22'">Introduced in Lustre 2.2</xsl:when>
<xsl:when test="@condition = 'l23'">Introduced in Lustre 2.3</xsl:when>
<xsl:when test="@condition = 'l24'">Introduced in Lustre 2.4</xsl:when>
<xsl:when test="@condition = 'l25'">Introduced in Lustre 2.5</xsl:when>
<xsl:when test="@condition = 'l27'">Introduced in Lustre 2.7</xsl:when>
<xsl:when test="@condition = 'l28'">Introduced in Lustre 2.8</xsl:when>
<xsl:when test="@condition = 'l29'">Introduced in Lustre 2.9</xsl:when>
+ <xsl:when test="@condition = 'l2A'">Introduced in Lustre 2.10</xsl:when>
+ <xsl:when test="@condition = 'l2B'">Introduced in Lustre 2.11</xsl:when>
<xsl:otherwise>Documentation Error: unrecognised condition attribute</xsl:otherwise>
</xsl:choose>
</xsl:variable>
</xsl:variable>
<xsl:variable name="lustrecond">
<xsl:choose>
- <xsl:when test="@condition='l21'">L 2.1</xsl:when>
- <xsl:when test="@condition='l22'">L 2.2</xsl:when>
<xsl:when test="@condition='l23'">L 2.3</xsl:when>
<xsl:when test="@condition='l24'">L 2.4</xsl:when>
<xsl:when test="@condition='l25'">L 2.5</xsl:when>
<xsl:when test="@condition='l27'">L 2.7</xsl:when>
<xsl:when test="@condition='l28'">L 2.8</xsl:when>
<xsl:when test="@condition='l29'">L 2.9</xsl:when>
+ <xsl:when test="@condition='l2A'">L 2.10</xsl:when>
+ <xsl:when test="@condition='l2B'">L 2.11</xsl:when>
<xsl:otherwise></xsl:otherwise>
</xsl:choose>
</xsl:variable>