- <informaltable frame="none">
- <tgroup cols="1">
- <colspec colname="c1" colwidth="100*"/>
- <tbody>
- <row>
- <entry><para><emphasis role="bold">Note -</emphasis><anchor xml:id="dbdoclet.50438208_pgfId-1291564" xreflabel=""/><emphasis role="bold">It is strongly recommended that hardware RAID be used with Lustre.</emphasis> Lustre currently does not support any redundancy at the file system level and RAID is required to protect agains disk failure.</para></entry>
- </row>
- </tbody>
- </tgroup>
- </informaltable>
- <section remap="h2">
- <title><anchor xml:id="dbdoclet.50438208_pgfId-1291568" xreflabel=""/></title>
- <section remap="h2">
- <title>6.1 <anchor xml:id="dbdoclet.50438208_60972" xreflabel=""/><anchor xml:id="dbdoclet.50438208_72075" xreflabel=""/>Selecting <anchor xml:id="dbdoclet.50438208_marker-1291567" xreflabel=""/>Storage for the MDT and OSTs</title>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291569" xreflabel=""/>The Lustre architecture allows the use of any kind of block device as backend storage. The characteristics of such devices, particularly in the case of failures, vary significantly and have an impact on configuration choices.</para>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291570" xreflabel=""/>This section describes issues and recommendations regarding backend storage.</para>
- <section remap="h3">
- <title><anchor xml:id="dbdoclet.50438208_pgfId-1291571" xreflabel=""/>6.1.1 Metadata Target (MDT)</title>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291572" xreflabel=""/>I/O on the MDT is typically mostly reads and writes of small amounts of data. For this reason, we recommend that you use RAID 1 for MDT storage. If you require more capacity for an MDT than one disk provides, we recommend RAID 1 + 0 or RAID 10.</para>
- </section>
- <section remap="h3">
- <title><anchor xml:id="dbdoclet.50438208_pgfId-1291573" xreflabel=""/>6.1.2 Object Storage Server (OST)</title>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291574" xreflabel=""/>A quick calculation makes it clear that without further redundancy, RAID 6 is required for large clusters and RAID 5 is not acceptable:</para>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291575" xreflabel=""/>For a 2 PB file system (2,000 disks of 1 TB capacity) assume the mean time to failure (MTTF) of a disk is about 1,000 days. This means that the expected failure rate is 2000/1000 = 2 disks per day. Repair time at 10% of disk bandwidth is 1000 GB at 10MB/sec = 100,000 sec, or about 1 day.</para>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291576" xreflabel=""/>For a RAID 5 stripe that is 10 disks wide, during 1 day of rebuilding, the chance that a second disk in the same array will fail is about 9/1000 or about 1% per day. After 50 days, you have a 50% chance of a double failure in a RAID 5 array leading to data loss.</para>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291577" xreflabel=""/>Therefore, RAID 6 or another double parity algorithm is needed to provide sufficient redundancy for OST storage.</para>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291578" xreflabel=""/>For better performance, we recommend that you create RAID sets with 4 or 8 data disks plus one or two parity disks. Using larger RAID sets will negatively impact performance compared to having multiple independent RAID sets.</para>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291579" xreflabel=""/>To maximize performance for small I/O request sizes, storage configured as RAID 1+0 can yield much better results but will increase cost or reduce capacity.</para>
- </section>
- </section>
- <section remap="h2">
- <title>6.2 <anchor xml:id="dbdoclet.50438208_23285" xreflabel=""/>Reliability <anchor xml:id="dbdoclet.50438208_marker-1291581" xreflabel=""/>Best Practices</title>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291583" xreflabel=""/>RAID monitoring software is recommended to quickly detect faulty disks and allow them to be replaced to avoid double failures and data loss. Hot spare disks are recommended so that rebuilds happen without delays.</para>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291587" xreflabel=""/>Backups of the metadata file systems are recommended. For details, see <link xl:href="BackupAndRestore.html#50438207_37220">Chapter 17</link>: <link xl:href="BackupAndRestore.html#50438207_66186">Backing Up and Restoring a File System</link>.</para>
- </section>
- <section remap="h2">
- <title>6.3 <anchor xml:id="dbdoclet.50438208_40705" xreflabel=""/>Performance <anchor xml:id="dbdoclet.50438208_marker-1291593" xreflabel=""/>Tradeoffs</title>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291595" xreflabel=""/>A writeback cache can dramatically increase write performance on many types of RAID arrays if the writes are not done at full stripe width. Unfortunately, unless the RAID array has battery-backed cache (a feature only found in some higher-priced hardware RAID arrays), interrupting the power to the array may result in out-of-sequence writes or corruption of RAID parity and future data loss.</para>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291596" xreflabel=""/>If writeback cache is enabled, a file system check is required after the array loses power. Data may also be lost because of this.</para>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1291597" xreflabel=""/>Therefore, we recommend against the use of writeback cache when data integrity is critical. You should carefully consider whether the benefits of using writeback cache outweigh the risks.</para>
- </section>
- <section remap="h2">
- <title>6.4 <anchor xml:id="dbdoclet.50438208_51921" xreflabel=""/>Formatting Options for <anchor xml:id="dbdoclet.50438208_marker-1291599" xreflabel=""/>RAID Devices</title>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1289920" xreflabel=""/>When formatting a file system on a RAID device, it is beneficial to ensure that I/O requests are aligned with the underlying RAID geometry. This ensures that the Lustre RPCs do not generate unnecessary disk operations which may reduce performance dramatically. Use the --mkfsoptions parameter to specify additional parameters when formatting the OST or MDT.</para>
- <para><anchor xml:id="dbdoclet.50438208_pgfId-1289921" xreflabel=""/>For RAID 5, RAID 6, or RAID 1+0 storage, specifying the following option to the --mkfsoptions parameter option improves the layout of the file system metadata, ensuring that no single disk contains all of the allocation bitmaps:</para>
- <screen><anchor xml:id="dbdoclet.50438208_pgfId-1290699" xreflabel=""/>-Estride=<chunk_blocks>
+
+<note><para><emphasis role="bold">It is strongly recommended that hardware RAID be used with Lustre.</emphasis> Lustre currently does not support any redundancy at the file system level, and RAID is required to protect against disk failure.</para></note>
+
+
+<section xml:id="dbdoclet.50438208_60972">
+ <title>6.1 Selecting Storage for the MDT and OSTs</title>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291569" xreflabel=""/>The Lustre architecture allows the use of any kind of block device as backend storage. The characteristics of such devices, particularly in the case of failures, vary significantly and have an impact on configuration choices.</para>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291570" xreflabel=""/>This section describes issues and recommendations regarding backend storage.</para>
+ <section remap="h3">
+ <title><anchor xml:id="dbdoclet.50438208_pgfId-1291571" xreflabel=""/>6.1.1 Metadata Target (MDT)</title>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291572" xreflabel=""/>I/O on the MDT is typically mostly reads and writes of small amounts of data. For this reason, we recommend that you use RAID 1 for MDT storage. If you require more capacity for an MDT than one disk provides, we recommend RAID 1 + 0 or RAID 10.</para>
+ </section>
+ <section remap="h3">
+ <title><anchor xml:id="dbdoclet.50438208_pgfId-1291573" xreflabel=""/>6.1.2 Object Storage Server (OST)</title>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291574" xreflabel=""/>A quick calculation makes it clear that without further redundancy, RAID 6 is required for large clusters and RAID 5 is not acceptable:</para>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291575" xreflabel=""/>For a 2 PB file system (2,000 disks of 1 TB capacity) assume the mean time to failure (MTTF) of a disk is about 1,000 days. This means that the expected failure rate is 2000/1000 = 2 disks per day. Repair time at 10% of disk bandwidth is 1000 GB at 10MB/sec = 100,000 sec, or about 1 day.</para>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291576" xreflabel=""/>For a RAID 5 stripe that is 10 disks wide, during 1 day of rebuilding, the chance that a second disk in the same array will fail is about 9/1000 or about 1% per day. After 50 days, you have a 50% chance of a double failure in a RAID 5 array leading to data loss.</para>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291577" xreflabel=""/>Therefore, RAID 6 or another double parity algorithm is needed to provide sufficient redundancy for OST storage.</para>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291578" xreflabel=""/>For better performance, we recommend that you create RAID sets with 4 or 8 data disks plus one or two parity disks. Using larger RAID sets will negatively impact performance compared to having multiple independent RAID sets.</para>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291579" xreflabel=""/>To maximize performance for small I/O request sizes, storage configured as RAID 1+0 can yield much better results but will increase cost or reduce capacity.</para>
+ </section>
+</section>
+<section xml:id="dbdoclet.50438208_23285">
+ <title>6.2 Reliability Best Practices</title>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291583" xreflabel=""/>RAID monitoring software is recommended to quickly detect faulty disks and allow them to be replaced to avoid double failures and data loss. Hot spare disks are recommended so that rebuilds happen without delays.</para>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291587" xreflabel=""/>Backups of the metadata file systems are recommended. For details, see <xref linkend='backupandrestore'/>.</para>
+</section>
+<section xml:id="dbdoclet.50438208_40705">
+ <title>6.3 Performance Tradeoffs</title>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291595" xreflabel=""/>A writeback cache can dramatically increase write performance on many types of RAID arrays if the writes are not done at full stripe width. Unfortunately, unless the RAID array has battery-backed cache (a feature only found in some higher-priced hardware RAID arrays), interrupting the power to the array may result in out-of-sequence writes or corruption of RAID parity and future data loss.</para>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291596" xreflabel=""/>If writeback cache is enabled, a file system check is required after the array loses power. Data may also be lost because of this.</para>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1291597" xreflabel=""/>Therefore, we recommend against the use of writeback cache when data integrity is critical. You should carefully consider whether the benefits of using writeback cache outweigh the risks.</para>
+</section>
+<section xml:id="dbdoclet.50438208_51921">
+ <title>6.4 Formatting Options for RAID Devices</title>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1289920" xreflabel=""/>When formatting a file system on a RAID device, it is beneficial to ensure that I/O requests are aligned with the underlying RAID geometry. This ensures that the Lustre RPCs do not generate unnecessary disk operations which may reduce performance dramatically. Use the --mkfsoptions parameter to specify additional parameters when formatting the OST or MDT.</para>
+ <para><anchor xml:id="dbdoclet.50438208_pgfId-1289921" xreflabel=""/>For RAID 5, RAID 6, or RAID 1+0 storage, specifying the following option to the --mkfsoptions parameter option improves the layout of the file system metadata, ensuring that no single disk contains all of the allocation bitmaps:</para>
+ <screen><anchor xml:id="dbdoclet.50438208_pgfId-1290699" xreflabel=""/>-Estride=&lt;chunk_blocks&gt;