<?xml version='1.0' encoding='UTF-8'?>
<chapter xmlns="http://docbook.org/ns/docbook"
 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
 xml:id="configuringstorage">
<title xml:id="configuringstorage.title">Configuring Storage on a Lustre File System</title>
<para>This chapter describes best practices for storage selection and file system options to optimize performance on RAID, and includes the following sections:</para>
<itemizedlist>
</listitem>
<listitem>
<para>
        <xref linkend="dbdoclet.ldiskfs_raid_opts"/>
</para>
</listitem>
<listitem>
</section>
<section xml:id="dbdoclet.50438208_40705">
<title><indexterm><primary>storage</primary><secondary>performance tradeoffs</secondary></indexterm>Performance Tradeoffs</title>
    <para>A writeback cache in a RAID storage controller can dramatically
    increase write performance on many types of RAID arrays if the writes
    are not done at full stripe width. Unfortunately, unless the RAID array
    has battery-backed cache (a feature only found in some higher-priced
    hardware RAID arrays), interrupting the power to the array may result in
    out-of-sequence or lost writes, and corruption of RAID parity and/or
    filesystem metadata, resulting in data loss.
    </para>
    <para>Having a read or writeback cache onboard a PCI adapter card installed
    in an MDS or OSS is <emphasis>NOT SAFE</emphasis> in a high-availability
    (HA) failover configuration, as this will result in inconsistencies between
    nodes and immediate or eventual filesystem corruption. Such devices should
    not be used, or should have the onboard cache disabled.</para>
    <para>If the writeback cache is enabled, a file system check is required
    after the array loses power. Data may also be lost because of this.</para>
    <para>Therefore, we recommend against the use of writeback cache when
    data integrity is critical. You should carefully consider whether the
    benefits of using writeback cache outweigh the risks.</para>
</section>
  <section xml:id="dbdoclet.ldiskfs_raid_opts">
    <title>
      <indexterm>
        <primary>storage</primary>
        <secondary>configuring</secondary>
        <tertiary>RAID options</tertiary>
      </indexterm>Formatting Options for ldiskfs RAID Devices</title>
    <para>When formatting an ldiskfs file system on a RAID device, it can be
    beneficial to ensure that I/O requests are aligned with the underlying
    RAID geometry. This ensures that Lustre RPCs do not generate unnecessary
    disk operations, which may reduce performance dramatically. Use the
    <literal>--mkfsoptions</literal> parameter to specify additional
    parameters when formatting the OST or MDT.</para>
    <para>For RAID 5, RAID 6, or RAID 1+0 storage, specifying the following
    option to the <literal>--mkfsoptions</literal> parameter improves the
    layout of the file system metadata, ensuring that no single disk contains
    all of the allocation bitmaps:</para>
    <screen>-E stride=<replaceable>chunk_blocks</replaceable></screen>
    <para>The <literal><replaceable>chunk_blocks</replaceable></literal>
    variable is in units of 4096-byte blocks and represents the amount of
    contiguous data written to a single disk before moving to the next disk.
    This is alternately referred to as the RAID stripe size. This is
    applicable to both MDT and OST file systems.</para>
    <para>For more information on how to override the defaults while formatting
    MDT or OST file systems, see
    <xref linkend="dbdoclet.ldiskfs_mkfs_opts"/>.</para>
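    <para>As a worked sketch of the stride calculation above (the RAID 6
    geometry of 10 disks with a 128 KB per-disk chunk is an assumed example,
    not a recommendation), the values can be computed with shell
    arithmetic:</para>
    <screen># Hypothetical RAID 6 array: 10 disks total, 2 parity, so 8 data disks
chunk_kb=128                          # per-disk chunk size in KB
data_disks=8
stride=$((chunk_kb * 1024 / 4096))    # chunk size in 4096-byte blocks
stripe_width=$((stride * data_disks)) # full RAID stripe in 4096-byte blocks
echo "stride=${stride} stripe_width=${stripe_width}"</screen>
    <para>With these assumed values, <literal>stride=32</literal> and
    <literal>stripe_width=256</literal> (256 * 4096 bytes = 1 MB, matching
    the 1 MB Lustre RPC size), which would be passed as
    <literal>-E stride=32,stripe_width=256</literal> inside
    <literal>--mkfsoptions</literal>.</para>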
<section remap="h3">
<title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>for mkfs</tertiary></indexterm>Computing file system parameters for mkfs</title>
      <para>For best results, use RAID 5 with 5 or 9 disks or RAID 6 with 6
      or 10 disks, each on a different controller. The stripe width is the
      optimal minimum I/O size. Ideally, the RAID configuration should allow
      1 MB Lustre RPCs to fit evenly on a single RAID stripe without an
      expensive read-modify-write cycle. Use this formula to determine the
      stripe width, where <replaceable>number_of_data_disks</replaceable>
      does not include the RAID parity disks (1 for RAID 5 and 2 for
      RAID 6):</para>
      <screen><replaceable>stripe_width_blocks</replaceable> = <replaceable>chunk_blocks</replaceable> * <replaceable>number_of_data_disks</replaceable> = 1 MB</screen>
</section>
<section remap="h3">
<title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>external journal</tertiary></indexterm>Choosing Parameters for an External Journal</title>
      <para>If you have configured a RAID array and use it directly as an OST,
      it contains both data and metadata. For better performance, we
      recommend putting the OST journal on a separate device, by creating a
      small RAID 1 array and using it as an external journal for the OST.
      </para>
      <para>In a typical Lustre file system, the default OST journal size is
      up to 1 GB, and the default MDT journal size is up to 4 GB, in order to
      handle a high transaction rate without blocking on journal flushes.
      Additionally, a copy of the journal is kept in RAM. Therefore, make
      sure you have enough RAM on the servers to hold copies of all journals.
      </para>
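      <para>As a rough sketch of this memory requirement (the OST count and
      journal size below are assumed values for illustration), the RAM
      consumed by in-memory journal copies on one OSS can be estimated with
      shell arithmetic:</para>
      <screen># Hypothetical OSS serving 8 OSTs, each formatted with a 1 GB journal
num_osts=8
journal_mb=1024                      # journal size per OST, in MB
total_mb=$((num_osts * journal_mb))  # RAM needed for journal copies alone
echo "journal RAM required: ${total_mb} MB"</screen>
      <para>In this assumed configuration, the server would need roughly 8 GB
      of RAM for journal copies alone, in addition to memory for caches and
      service threads.</para>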
<para>The file system journal options are specified to <literal>mkfs.lustre</literal> using
the <literal>--mkfsoptions</literal> parameter. For example:</para>
<screen>--mkfsoptions "<replaceable>other_options</replaceable> -j -J device=/dev/mdJ" </screen>
<listitem>
<para>Create the OST.</para>
          <para>In this example, where <literal>/dev/sdc</literal> is the
          RAID 6 device to be used as the OST, run:</para>
          <screen>[oss#] mkfs.lustre --ost ... \
        --mkfsoptions="-J device=/dev/sdb1" /dev/sdc</screen>
</listitem>
<listitem>
</itemizedlist>
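      <para>After formatting, the external journal association can be checked
      by inspecting the ldiskfs superblock with <literal>dumpe2fs</literal>
      from e2fsprogs (<literal>/dev/sdc</literal> is the example OST device
      from above):</para>
      <screen>[oss#] dumpe2fs -h /dev/sdc | grep -i journal</screen>
      <para>For an OST formatted with an external journal, the output should
      report the journal UUID and device rather than an internal journal
      inode.</para>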
</section>
</chapter>
<!--
  vim:expandtab:shiftwidth=2:tabstop=8:
  -->