LU-11638 admin: debug space management/missing objects

diff --git a/ConfiguringStorage.xml b/ConfiguringStorage.xml
index 4e3e6a0..00e7bee 100644
--- a/ConfiguringStorage.xml
+++ b/ConfiguringStorage.xml
@@ -1,9 +1,6 @@
-<?xml version='1.0' encoding='UTF-8'?>
-<!-- This document was created with Syntext Serna Free. --><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="configuringstorage">
-  <info>
-    <title xml:id="configuringstorage.title">Configuring Storage on a Lustre File System</title>
-  </info>
-  <para>This chapter describes best practices for storage selection and file system options to optimize perforance on RAID, and includes the following sections:</para>
+<?xml version='1.0' encoding='UTF-8'?><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="configuringstorage">
+  <title xml:id="configuringstorage.title">Configuring Storage on a Lustre File System</title>
+  <para>This chapter describes best practices for storage selection and file system options to optimize performance on RAID, and includes the following sections:</para>
    <itemizedlist>
      <listitem>
        <para>
@@ -22,7 +19,7 @@
      </listitem>
      <listitem>
        <para>
-            <xref linkend="dbdoclet.50438208_51921"/>
+            <xref linkend="dbdoclet.ldiskfs_raid_opts"/>
          </para>
      </listitem>
      <listitem>
@@ -32,18 +29,22 @@
      </listitem>
    </itemizedlist>
    <note>
-    <para><emphasis role="bold">It is strongly recommended that hardware RAID be used with Lustre.</emphasis> Lustre currently does not support any redundancy at the file system level and RAID is required to protect agains disk failure.</para>
+    <para><emphasis role="bold">It is strongly recommended that storage used in a Lustre file system
+        be configured with hardware RAID.</emphasis> The Lustre software does not support redundancy
+      at the file system level and RAID is required to protect against disk failure.</para>
    </note>
    <section xml:id="dbdoclet.50438208_60972">
-    <title>6.1 Selecting Storage for the MDT and OSTs</title>
+      <title>
+          <indexterm><primary>storage</primary><secondary>configuring</secondary></indexterm>
+          Selecting Storage for the MDT and OSTs</title>
      <para>The Lustre architecture allows the use of any kind of block device as backend storage. The characteristics of such devices, particularly in the case of failures, vary significantly and have an impact on configuration choices.</para>
      <para>This section describes issues and recommendations regarding backend storage.</para>
      <section remap="h3">
-      <title>6.1.1 Metadata Target (MDT)</title>
+        <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>MDT</tertiary></indexterm>Metadata Target (MDT)</title>
        <para>I/O on the MDT is typically mostly reads and writes of small amounts of data. For this reason, we recommend that you use RAID 1 for MDT storage. If you require more capacity for an MDT than one disk provides, we recommend RAID 1 + 0 or RAID 10.</para>
      </section>
      <section remap="h3">
-      <title>6.1.2 Object Storage Server (OST)</title>
+      <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>OST</tertiary></indexterm>Object Storage Target (OST)</title>
        <para>A quick calculation makes it clear that without further redundancy, RAID 6 is required for large clusters and RAID 5 is not acceptable:</para>
        <blockquote>
          <para>For a 2 PB file system (2,000 disks of 1 TB capacity) assume the mean time to failure (MTTF) of a disk is about 1,000 days. This means that the expected failure rate is 2000/1000 = 2 disks per day. Repair time at 10% of disk bandwidth is 1000 GB at 10MB/sec = 100,000 sec, or about 1 day.</para>
@@ -55,63 +56,80 @@
      </section>
    </section>
    <section xml:id="dbdoclet.50438208_23285">
-    <title>6.2 Reliability Best Practices</title>
+    <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>for best practice</tertiary></indexterm>Reliability Best Practices</title>
      <para>RAID monitoring software is recommended to quickly detect faulty disks and allow them to be replaced to avoid double failures and data loss. Hot spare disks are recommended so that rebuilds happen without delays.</para>
      <para>Backups of the metadata file systems are recommended. For details, see <xref linkend="backupandrestore"/>.</para>
    </section>
    <section xml:id="dbdoclet.50438208_40705">
-    <title>6.3 Performance Tradeoffs</title>
+    <title><indexterm><primary>storage</primary><secondary>performance tradeoffs</secondary></indexterm>Performance Tradeoffs</title>
      <para>A writeback cache can dramatically increase write performance on many types of RAID arrays if the writes are not done at full stripe width. Unfortunately, unless the RAID array has battery-backed cache (a feature only found in some higher-priced hardware RAID arrays), interrupting the power to the array may result in out-of-sequence writes or corruption of RAID parity and future data loss.</para>
      <para>If writeback cache is enabled, a file system check is required after the array loses power. Data may also be lost because of this.</para>
      <para>Therefore, we recommend against the use of writeback cache when data integrity is critical. You should carefully consider whether the benefits of using writeback cache outweigh the risks.</para>
    </section>
-  <section xml:id="dbdoclet.50438208_51921">
-    <title>6.4 Formatting Options for RAID Devices</title>
-    <para>When formatting a file system on a RAID device, it is beneficial to ensure that I/O requests are aligned with the underlying RAID geometry. This ensures that the Lustre RPCs do not generate unnecessary disk operations which may reduce performance dramatically. Use the <literal>--mkfsoptions</literal> parameter to specify additional parameters when formatting the OST or MDT.</para>
-    <para>For RAID 5, RAID 6, or RAID 1+0 storage, specifying the following option to the <literal>--mkfsoptions</literal> parameter option improves the layout of the file system metadata, ensuring that no single disk contains all of the allocation bitmaps:</para>
-    <screen>-E stride = &lt;chunk_blocks&gt; 
-</screen>
-    <para>The <literal>&lt;chunk_blocks&gt;</literal> variable is in units of 4096-byte blocks and represents the amount of contiguous data written to a single disk before moving to the next disk. This is alternately referred to as the RAID stripe size. This is applicable to both MDT and OST file systems.</para>
-    <para>For more information on how to override the defaults while formatting MDT or OST file systems, see <xref linkend="dbdoclet.50438256_84701"/>.</para>
+  <section xml:id="dbdoclet.ldiskfs_raid_opts">
+    <title>
+      <indexterm>
+        <primary>storage</primary>
+        <secondary>configuring</secondary>
+       <tertiary>RAID options</tertiary>
+      </indexterm>Formatting Options for ldiskfs RAID Devices</title>
+    <para>When formatting an ldiskfs file system on a RAID device, it can be
+    beneficial to ensure that I/O requests are aligned with the underlying
+    RAID geometry. This ensures that Lustre RPCs do not generate unnecessary
+    disk operations which may reduce performance dramatically. Use the
+    <literal>--mkfsoptions</literal> parameter to specify additional parameters
+    when formatting the OST or MDT.</para>
+    <para>For RAID 5, RAID 6, or RAID 1+0 storage, passing the following
+    option through the <literal>--mkfsoptions</literal> parameter improves
+    the layout of the file system metadata, ensuring that no single disk
+    contains all of the allocation bitmaps:</para>
+    <screen>-E stride=<replaceable>chunk_blocks</replaceable></screen>
+    <para>The <literal><replaceable>chunk_blocks</replaceable></literal>
+    variable is in units of 4096-byte blocks and represents the amount of
+    contiguous data written to a single disk before moving to the next disk.
+    This value is also referred to as the RAID stripe size. It is
+    applicable to both MDT and OST file systems.</para>
+    <para>For more information on how to override the defaults while formatting
+    MDT or OST file systems, see <xref linkend="dbdoclet.ldiskfs_mkfs_opts"/>.</para>
      <section remap="h3">
-      <title>6.4.1 Computing file system parameters for mkfs</title>
-      <para>For best results, use RAID 5 with 5 or 9 disks or RAID 6 with 6 or 10 disks, each on a different controller. The stripe width is the optimal minimum I/O size. Ideally, the RAID configuration should allow 1 MB Lustre RPCs to fit evenly on a single RAID stripe without an expensive read-modify-write cycle. Use this formula to determine the <emphasis>
-          <literal>&lt;stripe_width&gt;</literal>
-        </emphasis>, where <emphasis>
-          <literal>&lt;number_of_data_disks&gt;</literal>
-        </emphasis> does <emphasis>not</emphasis> include the RAID parity disks (1 for RAID 5 and 2 for RAID 6):</para>
-      <screen><emphasis>&lt;stripe_width_blocks&gt;</emphasis> = <emphasis>&lt;chunk_blocks&gt;</emphasis> * <emphasis>&lt;number_of_data_disks&gt;</emphasis> = 1 MB </screen>
-      <para>If the RAID configuration does not allow <emphasis>
-          <literal>&lt;chunk_blocks&gt;</literal>
-        </emphasis> to fit evenly into 1 MB, select <emphasis>
-          <literal>&lt;chunkblocks&gt;</literal>
-        </emphasis><emphasis>
-          <literal>&lt;stripe_width_blocks&gt;</literal>
-        </emphasis>, such that  is close to 1 MB, but not larger.</para>
-      <para>The <emphasis>
-          <literal>&lt;stripe_width_blocks&gt;</literal>
-        </emphasis> value must equal <emphasis>
-          <literal>&lt;chunk_blocks&gt;*&lt;number_of_data_disks&gt;</literal>
-        </emphasis>. Specifying the <emphasis>
-          <literal>&lt;stripe_width_blocks&gt;</literal>
-        </emphasis> parameter is only relevant for RAID 5 or RAID 6, and is not needed for RAID 1 plus 0.</para>
+      <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>for mkfs</tertiary></indexterm>Computing file system parameters for mkfs</title>
+      <para>For best results, use RAID 5 with 5 or 9 disks or RAID 6 with 6 or 10 disks, each on a different controller. The stripe width is the optimal minimum I/O size. Ideally, the RAID configuration should allow 1 MB Lustre RPCs to fit evenly on a single RAID stripe without an expensive read-modify-write cycle. Use this formula to determine the
+          <literal><replaceable>stripe_width_blocks</replaceable></literal>, where
+          <literal><replaceable>number_of_data_disks</replaceable></literal>
+        does <emphasis>not</emphasis> include the RAID parity disks (1 for RAID 5 and 2 for RAID 6):</para>
+      <screen><replaceable>stripe_width_blocks = chunk_blocks * number_of_data_disks</replaceable> = 1 MB </screen>
+      <para>If the RAID configuration does not allow
+          <literal><replaceable>chunk_blocks</replaceable></literal>
+        to fit evenly into 1 MB, select a chunk size such that
+          <literal><replaceable>stripe_width_blocks</replaceable></literal>
+        is close to 1 MB, but not larger.</para>
+      <para>The 
+          <literal><replaceable>stripe_width_blocks</replaceable></literal>
+        value must equal
+          <literal><replaceable>chunk_blocks</replaceable> * <replaceable>number_of_data_disks</replaceable></literal>.
+        Specifying the
+          <literal><replaceable>stripe_width_blocks</replaceable></literal>
+        parameter is only relevant for RAID 5 or RAID 6, and is not needed for RAID 1+0.</para>
        <para>Run <literal>--reformat</literal> on the file system device (<literal>/dev/sdc</literal>), specifying the RAID geometry to the underlying ldiskfs file system, where:</para>
-      <screen>--mkfsoptions &quot;<emphasis>&lt;other options&gt;</emphasis> -E stride=<emphasis>&lt;chunk_blocks&gt;</emphasis>, stripe_width=<emphasis>&lt;stripe_width_blocks&gt;</emphasis>&quot;
-</screen>
-<informalexample><para>A RAID 6 configuration with 6 disks has 4 data and 2 parity disks. The <emphasis>
-            <literal>&lt;chunk_blocks&gt;</literal>
-</emphasis> &lt;= 1024KB/4 = 256KB.</para></informalexample>
+      <screen>--mkfsoptions &quot;<replaceable>other_options</replaceable> -E stride=<replaceable>chunk_blocks</replaceable>,stripe_width=<replaceable>stripe_width_blocks</replaceable>&quot;</screen>
+      <informalexample>
+        <para>A RAID 6 configuration with 6 disks has 4 data and 2 parity disks. The chunk size
+          is therefore &lt;= 1024 KB / 4 = 256 KB, which corresponds to
+          <literal><replaceable>chunk_blocks</replaceable></literal> &lt;= 256 KB / 4 KB = 64.</para>
+      </informalexample>
        <para>Because the number of data disks is equal to the power of 2, the stripe width is equal to 1 MB.</para>
-      <screen>--mkfsoptions &quot;<emphasis>&lt;other options&gt;</emphasis> -E stride=<emphasis>&lt;chunk_blocks&gt;</emphasis>, stripe_width=<emphasis>&lt;stripe_width_blocks&gt;</emphasis>&quot;...
-</screen>
+      <screen>--mkfsoptions &quot;<replaceable>other_options</replaceable> -E stride=<replaceable>chunk_blocks</replaceable>,stripe_width=<replaceable>stripe_width_blocks</replaceable>&quot;...</screen>
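+      <para>As a hedged illustration of the RAID 6 example above, assuming a 256 KB chunk size and 4096-byte blocks (so 64 blocks per chunk, with 4 data disks), the geometry options would work out as follows. The other options remain elided, as in the text:</para>
+      <screen>--mkfsoptions &quot;<replaceable>other_options</replaceable> -E stride=64,stripe_width=256&quot;</screen>
+      <para>Here 64 * 4 = 256 blocks, and 256 * 4 KB = 1 MB, which matches the stripe width stated above.</para>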
      </section>
      <section remap="h3">
-      <title>6.4.2 Choosing Parameters for an External <anchor xml:id="dbdoclet.50438208_marker-1289927" xreflabel=""/>Journal</title>
+      <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>external journal</tertiary></indexterm>Choosing Parameters for an External Journal</title>
        <para>If you have configured a RAID array and use it directly as an OST, it contains both data and metadata. For better performance, we recommend putting the OST journal on a separate device, by creating a small RAID 1 array and using it as an external journal for the OST.</para>
-      <para>Lustre&apos;s default journal size is 400 MB. A journal size of up to 1 GB has shown increased performance but diminishing returns are seen for larger journals. Additionally, a copy of the journal is kept in RAM. Therefore, make sure you have enough memory available to hold copies of all the journals.</para>
-      <para>The file system journal options are specified to mkfs.luster using the <literal>--mkfsoptions</literal> parameter. For example:</para>
-      <screen>--mkfsoptions &quot;&lt;other options&gt; -j -J device=/dev/mdJ&quot; 
-</screen>
+      <para>In a Lustre file system, the default journal size is 400 MB. A journal size of up to 1
+        GB has shown increased performance but diminishing returns are seen for larger journals.
+        Additionally, a copy of the journal is kept in RAM. Therefore, make sure you have enough
+        memory available to hold copies of all the journals.</para>
+      <para>The file system journal options are specified to <literal>mkfs.lustre</literal> using
+        the <literal>--mkfsoptions</literal> parameter. For example:</para>
+      <screen>--mkfsoptions &quot;<replaceable>other_options</replaceable> -j -J device=/dev/mdJ&quot; </screen>
        <para>To create an external journal, perform these steps for each OST on the OSS:</para>
        <orderedlist>
          <listitem>
@@ -120,17 +138,16 @@
          </listitem>
          <listitem>
            <para>Create a journal device on the partition. Run:</para>
-          <screen>[oss#] mke2fs -b 4096 -O journal_dev /dev/sdb <emphasis>&lt;journal_size&gt;</emphasis></screen>
-          <para>The value of <emphasis>
-              <literal>&lt;journal_size&gt;</literal>
-            </emphasis> is specified in units of 4096-byte blocks. For example, 262144 for a 1 GB journal size.</para>
+          <screen>oss# mke2fs -b 4096 -O journal_dev /dev/sdb <replaceable>journal_size</replaceable></screen>
+          <para>The value of
+              <literal><replaceable>journal_size</replaceable></literal>
+            is specified in units of 4096-byte blocks. For example, use 262144 for a 1 GB journal (1 GB / 4096 bytes = 262144 blocks).</para>
          </listitem>
          <listitem>
            <para>Create the OST.</para>
            <para>In this example, <literal>/dev/sdc</literal> is the RAID 6 device to be used as the OST, run:</para>
-          <screen>[oss#] mkfs.lustre --ost --mgsnode=mds@osib --mkfsoptions=&quot;-J device=/dev/sd\
-b1&quot; /dev/sdc
-</screen>
+          <screen>oss# mkfs.lustre --ost ... \
+--mkfsoptions=&quot;-J device=/dev/sdb1&quot; /dev/sdc</screen>
          </listitem>
          <listitem>
            <para>Mount the OST as usual.</para>
@@ -139,14 +156,18 @@ b1&quot; /dev/sdc
      </section>
    </section>
    <section xml:id="dbdoclet.50438208_88516">
-    <title>6.5 Connecting a SAN to a Lustre File System</title>
+    <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>SAN</tertiary></indexterm>Connecting a SAN to a Lustre File System</title>
      <para>Depending on your cluster size and workload, you may want to connect a SAN to a Lustre file system. Before making this connection, consider the following:</para>
      <itemizedlist>
        <listitem>
-        <para>In many SAN file systems without Lustre, clients allocate and lock blocks or inodes individually as they are updated. The Lustre design avoids the high contention that some of these blocks and inodes may have.</para>
+        <para>In many SAN file systems, clients allocate and lock blocks or inodes individually as
+          they are updated. The design of the Lustre file system avoids the high contention that
+          some of these blocks and inodes may have.</para>
        </listitem>
        <listitem>
-        <para>Lustre is highly scalable and can have a very large number of clients. SAN switches do not scale to a large number of nodes, and the cost per port of a SAN is generally higher than other networking.</para>
+        <para>The Lustre file system is highly scalable and can have a very large number of clients.
+          SAN switches do not scale to a large number of nodes, and the cost per port of a SAN is
+          generally higher than that of other networking technologies.</para>
        </listitem>
        <listitem>
          <para>File systems that allow direct-to-SAN access from the clients have a security risk because clients can potentially read any data on the SAN disks, and misbehaving clients can corrupt the file system for many reasons like improper file system, network, or other kernel software, bad cabling, bad memory, and so on. The risk increases with increase in the number of clients directly accessing the storage.</para>