ConfiguringStorage.xml

   1 <?xml version='1.0' encoding='UTF-8'?>
   2 <chapter xmlns="http://docbook.org/ns/docbook"
   3  xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
   4  xml:id="configuringstorage">
   5   <title xml:id="configuringstorage.title">Configuring Storage on a Lustre File System</title>
   6   <para>This chapter describes best practices for storage selection and file system options to optimize performance on RAID, and includes the following sections:</para>
   7   <itemizedlist>
   8     <listitem>
   9       <para>
  10             <xref linkend="dbdoclet.50438208_60972"/>
  11         </para>
  12     </listitem>
  13     <listitem>
  14       <para>
  15             <xref linkend="dbdoclet.50438208_23285"/>
  16         </para>
  17     </listitem>
  18     <listitem>
  19       <para>
  20             <xref linkend="dbdoclet.50438208_40705"/>
  21         </para>
  22     </listitem>
  23     <listitem>
  24       <para>
  25             <xref linkend="dbdoclet.ldiskfs_raid_opts"/>
  26         </para>
  27     </listitem>
  28     <listitem>
  29       <para>
  30             <xref linkend="dbdoclet.50438208_88516"/>
  31         </para>
  32     </listitem>
  33   </itemizedlist>
  34   <note>
  35     <para><emphasis role="bold">It is strongly recommended that storage used in a Lustre file system
  36         be configured with hardware RAID.</emphasis> The Lustre software does not support redundancy
  37       at the file system level and RAID is required to protect against disk failure.</para>
  38   </note>
  39   <section xml:id="dbdoclet.50438208_60972">
  40       <title>
  41           <indexterm><primary>storage</primary><secondary>configuring</secondary></indexterm>
  42           Selecting Storage for the MDT and OSTs</title>
  43     <para>The Lustre architecture allows the use of any kind of block device as backend storage. The characteristics of such devices, particularly in the case of failures, vary significantly and have an impact on configuration choices.</para>
  44     <para>This section describes issues and recommendations regarding backend storage.</para>
  45     <section remap="h3">
  46         <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>MDT</tertiary></indexterm>Metadata Target (MDT)</title>
  47       <para>I/O on the MDT typically consists mostly of small reads and writes. For this reason, we recommend that you use RAID 1 for MDT storage. If you require more capacity for an MDT than one disk provides, we recommend RAID 1+0 (also known as RAID 10).</para>
  48     </section>
  49     <section remap="h3">
  50       <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>OST</tertiary></indexterm>Object Storage Target (OST)</title>
  51       <para>A quick calculation makes it clear that without further redundancy, RAID 6 is required for large clusters and RAID 5 is not acceptable:</para>
  52       <blockquote>
  53         <para>For a 2 PB file system (2,000 disks of 1 TB capacity), assume the mean time to failure (MTTF) of a disk is about 1,000 days. This means that the expected failure rate is 2,000/1,000 = 2 disks per day. Repair time at 10% of disk bandwidth is 1,000 GB at 10 MB/sec = 100,000 sec, or about 1 day.</para>
  54         <para>For a RAID 5 stripe that is 10 disks wide, during 1 day of rebuilding, the chance that a second disk in the same array will fail is about 9/1000 or about 1% per day. After 50 days, you have a 50% chance of a double failure in a RAID 5 array leading to data loss.</para>
  55         <para>Therefore, RAID 6 or another double parity algorithm is needed to provide sufficient redundancy for OST storage.</para>
  56       </blockquote>
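      <para>The estimate above can be reproduced with integer shell arithmetic. This is only a sketch of the reasoning, using the figures quoted in the text; the variable names are illustrative and the per-thousand scaling is an assumption made to avoid floating point.</para>
      <programlisting language="shell">
```shell
# Back-of-envelope RAID 5 rebuild risk, using the figures quoted above.
disks=2000             # disks in the file system
mttf_days=1000         # assumed mean time to failure per disk, in days
disk_mb=1000000        # one 1 TB disk expressed in MB (decimal units)
rebuild_mb_per_sec=10  # rebuild rate at 10% of disk bandwidth
raid_width=10          # disks per RAID 5 array

failures_per_day=$((disks / mttf_days))
rebuild_sec=$((disk_mb / rebuild_mb_per_sec))
rebuild_hours=$((rebuild_sec / 3600))
# Chance (per thousand) that one of the remaining disks in the
# degraded array fails during a one-day rebuild.
second_failure_per_mille=$(( (raid_width - 1) * 1000 / mttf_days ))

echo "expected disk failures per day: $failures_per_day"
echo "rebuild time: $rebuild_sec sec (about $rebuild_hours hours)"
echo "second failure during rebuild: $second_failure_per_mille in 1000"
```
      </programlisting>
      <para>With these inputs the rebuild takes a little over one day, and the chance of a second failure in the same array is just under 1% per rebuild day, which matches the figures used in the argument.</para>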
  57       <para>For better performance, we recommend that you create RAID sets with 4 or 8 data disks plus one or two parity disks. Using larger RAID sets will negatively impact performance compared to having multiple independent RAID sets.</para>
  58       <para>To maximize performance for small I/O request sizes, storage configured as RAID 1+0 can yield much better results but will increase cost or reduce capacity.</para>
  59     </section>
  60   </section>
  61   <section xml:id="dbdoclet.50438208_23285">
  62     <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>for best practice</tertiary></indexterm>Reliability Best Practices</title>
  63     <para>RAID monitoring software is recommended to quickly detect faulty disks so that they can be replaced before a double failure causes data loss. Hot spare disks are recommended so that rebuilds happen without delays.</para>
  64     <para>Backups of the metadata file systems are recommended. For details, see <xref linkend="backupandrestore"/>.</para>
  65   </section>
  66   <section xml:id="dbdoclet.50438208_40705">
  67     <title><indexterm><primary>storage</primary><secondary>performance tradeoffs</secondary></indexterm>Performance Tradeoffs</title>
  68     <para>A writeback cache in a RAID storage controller can dramatically
  69     increase write performance on many types of RAID arrays if the writes
  70     are not done at full stripe width. Unfortunately, unless the RAID array
  71     has battery-backed cache (a feature only found in some higher-priced
  72     hardware RAID arrays), interrupting the power to the array may result in
  73     out-of-sequence or lost writes, and corruption of RAID parity and/or
  74     filesystem metadata, resulting in data loss.
  75     </para>
  76     <para>Having a read or writeback cache onboard a PCI adapter card installed
  77     in an MDS or OSS is <emphasis>NOT SAFE</emphasis> in a high-availability
  78     (HA) failover configuration, as this will result in inconsistencies between
  79     nodes and immediate or eventual filesystem corruption.  Such devices should
  80     not be used, or should have the onboard cache disabled.</para>
  81     <para>If writeback cache is enabled, a file system check is required
  82     after the array loses power. Data may also be lost because of this.</para>
  83     <para>Therefore, we recommend against the use of writeback cache when
  84     data integrity is critical. You should carefully consider whether the
  85     benefits of using writeback cache outweigh the risks.</para>
  86   </section>
  87   <section xml:id="dbdoclet.ldiskfs_raid_opts">
  88     <title>
  89       <indexterm>
  90         <primary>storage</primary>
  91         <secondary>configuring</secondary>
  92         <tertiary>RAID options</tertiary>
  93       </indexterm>Formatting Options for ldiskfs RAID Devices</title>
  94     <para>When formatting an ldiskfs file system on a RAID device, it can be
  95     beneficial to ensure that I/O requests are aligned with the underlying
  96     RAID geometry. This ensures that Lustre RPCs do not generate unnecessary
  97     disk operations which may reduce performance dramatically. Use the
  98     <literal>--mkfsoptions</literal> parameter to specify additional parameters
  99     when formatting the OST or MDT.</para>
 100     <para>For RAID 5, RAID 6, or RAID 1+0 storage, passing the following
 101     option in the <literal>--mkfsoptions</literal> parameter improves
 102     the layout of the file system metadata, ensuring that no single disk
 103     contains all of the allocation bitmaps:</para>
 104     <screen>-E stride=<replaceable>chunk_blocks</replaceable> </screen>
 105     <para>The <literal><replaceable>chunk_blocks</replaceable></literal>
 106     variable is in units of 4096-byte blocks and represents the amount of
 107     contiguous data written to a single disk before moving to the next disk.
 108     This is alternately referred to as the RAID stripe size. This is
 109     applicable to both MDT and OST file systems.</para>
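    <para>As a minimal sketch of the unit conversion, the following turns a RAID chunk size into the <literal>stride</literal> value. The 64 KiB chunk size is a hypothetical value chosen for illustration; substitute the chunk size reported by your RAID controller.</para>
    <programlisting language="shell">
```shell
# Convert a RAID chunk size in KiB into the stride value,
# which is expressed in 4096-byte blocks.
chunk_kib=64       # hypothetical chunk size, for illustration only
block_bytes=4096   # ldiskfs block size
stride=$((chunk_kib * 1024 / block_bytes))
echo "-E stride=$stride"
```
    </programlisting>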
 110     <para>For more information on how to override the defaults while formatting
 111     MDT or OST file systems, see <xref linkend="dbdoclet.ldiskfs_mkfs_opts"/>.</para>
 112     <section remap="h3">
 113       <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>for mkfs</tertiary></indexterm>Computing file system parameters for mkfs</title>
 114       <para>For best results, use RAID 5 with 5 or 9 disks or RAID 6 with 6 or 10 disks, each on a different controller. The stripe width is the optimal minimum I/O size. Ideally, the RAID configuration should allow 1 MB Lustre RPCs to fit evenly on a single RAID stripe without an expensive read-modify-write cycle. Use this formula to determine the
 115           <literal><replaceable>stripe_width</replaceable></literal>, where
 116           <literal><replaceable>number_of_data_disks</replaceable></literal>
 117         does <emphasis>not</emphasis> include the RAID parity disks (1 for RAID 5 and 2 for RAID 6):</para>
 118       <screen><replaceable>stripe_width_blocks = chunk_blocks * number_of_data_disks</replaceable> = 1 MB </screen>
 119       <para>If the RAID configuration does not allow
 120           <literal><replaceable>chunk_blocks</replaceable></literal>
 121         to fit evenly into 1 MB, select
 122           <literal><replaceable>stripe_width_blocks</replaceable></literal>
 123         such that it is as close to 1 MB as possible, but not larger.</para>
 124       <para>The
 125           <literal><replaceable>stripe_width_blocks</replaceable></literal>
 126         value must equal
 127           <literal><replaceable>chunk_blocks</replaceable> * <replaceable>number_of_data_disks</replaceable></literal>.
 128         Specifying the
 129           <literal><replaceable>stripe_width_blocks</replaceable></literal>
 130         parameter is only relevant for RAID 5 or RAID 6, and is not needed for RAID 1+0.</para>
 131       <para>Run <literal>mkfs.lustre</literal> with the <literal>--reformat</literal> option on the file system device (<literal>/dev/sdc</literal>), specifying the RAID geometry to the underlying ldiskfs file system as follows:</para>
 132       <screen>--mkfsoptions &quot;<replaceable>other_options</replaceable> -E stride=<replaceable>chunk_blocks</replaceable>,stripe_width=<replaceable>stripe_width_blocks</replaceable>&quot;</screen>
 133       <informalexample>
 134         <para>A RAID 6 configuration with 6 disks has 4 data and 2 parity disks. The chunk size
 135           must be &lt;= 1024 KB/4 = 256 KB, so <literal><replaceable>chunk_blocks</replaceable></literal>
 136           is at most 256 KB / 4096 bytes = 64.</para>
 137       </informalexample>
 138       <para>Because the number of data disks is a power of 2, the stripe width is exactly 1 MB.</para>
 139       <screen>--mkfsoptions &quot;<replaceable>other_options</replaceable> -E stride=<replaceable>chunk_blocks</replaceable>,stripe_width=<replaceable>stripe_width_blocks</replaceable>&quot;...</screen>
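      <para>The arithmetic for the 6-disk RAID 6 example can be sketched in shell as follows. The variable names are illustrative, and the integer division simply rounds the chunk size down; a real RAID controller may only offer certain chunk sizes, so treat the result as an upper bound rather than a setting to apply blindly.</para>
      <programlisting language="shell">
```shell
# Derive the ldiskfs RAID options for the 6-disk RAID 6 example above.
total_disks=6
parity_disks=2   # RAID 6 uses two parity disks
rpc_kib=1024     # 1 MB Lustre RPC
block_kib=4      # ldiskfs block size of 4096 bytes

data_disks=$((total_disks - parity_disks))
chunk_kib=$((rpc_kib / data_disks))     # largest chunk keeping a full stripe within 1 MB
stride=$((chunk_kib / block_kib))       # chunk_blocks
stripe_width=$((stride * data_disks))   # stripe_width_blocks

mkfsoptions="-E stride=${stride},stripe_width=${stripe_width}"
echo "--mkfsoptions \"$mkfsoptions\""
```
      </programlisting>
      <para>For this geometry the sketch yields a stride of 64 blocks (256 KB) and a stripe width of 256 blocks (1 MB).</para>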
 140     </section>
 141     <section remap="h3">
 142       <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>external journal</tertiary></indexterm>Choosing Parameters for an External Journal</title>
 143       <para>If you have configured a RAID array and use it directly as an OST,
 144         it contains both data and metadata. For better performance, we
 145         recommend putting the OST journal on a separate device, by creating a
 146         small RAID 1 array and using it as an external journal for the OST.
 147       </para>
 148       <para>In a typical Lustre file system, the default OST journal size is
 149         up to 1 GB, and the default MDT journal size is up to 4 GB, in order to
 150         handle a high transaction rate without blocking on journal flushes.
 151         Additionally, a copy of the journal is kept in RAM. Therefore, make
 152         sure you have enough RAM on the servers to hold copies of all journals.
 153         </para>
 154       <para>The file system journal options are specified to <literal>mkfs.lustre</literal> using
 155         the <literal>--mkfsoptions</literal> parameter. For example:</para>
 156       <screen>--mkfsoptions &quot;<replaceable>other_options</replaceable> -j -J device=/dev/mdJ&quot; </screen>
 157       <para>To create an external journal, perform these steps for each OST on the OSS:</para>
 158       <orderedlist>
 159         <listitem>
 160           <para>Create a 400 MB (or larger) journal partition (RAID 1 is recommended).</para>
 161           <para>In this example, <literal>/dev/sdb</literal> is a RAID 1 device.</para>
 162         </listitem>
 163         <listitem>
 164           <para>Create a journal device on the partition. Run:</para>
 165           <screen>oss# mke2fs -b 4096 -O journal_dev /dev/sdb1 <replaceable>journal_size</replaceable></screen>
 166           <para>The value of
 167               <literal><replaceable>journal_size</replaceable></literal>
 168             is specified in units of 4096-byte blocks. For example, 262144 for a 1 GB journal size.</para>
 169         </listitem>
 170         <listitem>
 171           <para>Create the OST.</para>
 172           <para>In this example, <literal>/dev/sdc</literal> is the RAID 6 device to be used as the OST. Run:</para>
 173           <screen>oss# mkfs.lustre --ost ... \
 174 --mkfsoptions=&quot;-J device=/dev/sdb1&quot; /dev/sdc</screen>
 175         </listitem>
 176         <listitem>
 177           <para>Mount the OST as usual.</para>
 178         </listitem>
 179       </orderedlist>
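      <para>The <literal><replaceable>journal_size</replaceable></literal> value used in the steps above can be computed from a size in megabytes. The helper name <literal>journal_blocks</literal> is hypothetical and only illustrates the conversion to 4096-byte blocks.</para>
      <programlisting language="shell">
```shell
# Convert a journal size in MiB into the 4096-byte block count
# passed to mke2fs as journal_size.
journal_blocks() {
  # $1 is the journal size in MiB
  echo $(( $1 * 1024 * 1024 / 4096 ))
}

journal_blocks 400    # 400 MB journal
journal_blocks 1024   # 1 GB journal
```
      </programlisting>
      <para>The second call prints 262144, the value quoted above for a 1 GB journal; the first prints 102400 for the 400 MB minimum.</para>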
 180     </section>
 181   </section>
 182   <section xml:id="dbdoclet.50438208_88516">
 183     <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>SAN</tertiary></indexterm>Connecting a SAN to a Lustre File System</title>
 184     <para>Depending on your cluster size and workload, you may want to connect a SAN to a Lustre file system. Before making this connection, consider the following:</para>
 185     <itemizedlist>
 186       <listitem>
 187         <para>In many SAN file systems, clients allocate and lock blocks or inodes individually as
 188           they are updated. The design of the Lustre file system avoids the high contention that
 189           some of these blocks and inodes may have.</para>
 190       </listitem>
 191       <listitem>
 192         <para>The Lustre file system is highly scalable and can have a very large number of clients.
 193           SAN switches do not scale to a large number of nodes, and the cost per port of a SAN is
 194           generally higher than other networking.</para>
 195       </listitem>
 196       <listitem>
 197         <para>File systems that allow direct-to-SAN access from the clients pose a security risk because clients can potentially read any data on the SAN disks. Misbehaving clients can also corrupt the file system for many reasons, such as faulty file system, network, or other kernel software, bad cabling, or bad memory. The risk increases with the number of clients directly accessing the storage.</para>
 198       </listitem>
 199     </itemizedlist>
 200   </section>
 201 </chapter>
 202 <!--vim:expandtab:shiftwidth=2:tabstop=8:-->