1 <?xml version='1.0' encoding='UTF-8'?>
2 <chapter xmlns="http://docbook.org/ns/docbook"
3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
4 xml:id="configuringstorage">
5 <title xml:id="configuringstorage.title">Configuring Storage on a Lustre File System</title>
6 <para>This chapter describes best practices for storage selection and file system options to optimize performance on RAID, and includes the following sections:</para>
10 <xref linkend="dbdoclet.50438208_60972"/>
15 <xref linkend="dbdoclet.50438208_23285"/>
20 <xref linkend="dbdoclet.50438208_40705"/>
25 <xref linkend="dbdoclet.ldiskfs_raid_opts"/>
30 <xref linkend="dbdoclet.50438208_88516"/>
<para><emphasis role="bold">It is strongly recommended that storage used in a Lustre file system
be configured with hardware RAID.</emphasis> The Lustre software does not support redundancy
at the file system level, so RAID is required to protect against disk failure.</para>
39 <section xml:id="dbdoclet.50438208_60972">
<title>
<indexterm><primary>storage</primary><secondary>configuring</secondary></indexterm>
Selecting Storage for the MDT and OSTs</title>
43 <para>The Lustre architecture allows the use of any kind of block device as backend storage. The characteristics of such devices, particularly in the case of failures, vary significantly and have an impact on configuration choices.</para>
44 <para>This section describes issues and recommendations regarding backend storage.</para>
46 <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>MDT</tertiary></indexterm>Metadata Target (MDT)</title>
<para>I/O on the MDT typically consists of small reads and writes. For this reason, we recommend that you use RAID 1 for MDT storage. If you require more capacity for an MDT than one disk provides, we recommend RAID 1+0 (also known as RAID 10).</para>
<title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>OST</tertiary></indexterm>Object Storage Target (OST)</title>
51 <para>A quick calculation makes it clear that without further redundancy, RAID 6 is required for large clusters and RAID 5 is not acceptable:</para>
<para>For a 2 PB file system (2,000 disks of 1 TB capacity), assume the mean time to failure (MTTF) of a disk is about 1,000 days. This means that the expected failure rate is 2,000/1,000 = 2 disks per day. Repair time at 10% of disk bandwidth is 1,000 GB at 10 MB/sec = 100,000 sec, or about 1 day.</para>
54 <para>For a RAID 5 stripe that is 10 disks wide, during 1 day of rebuilding, the chance that a second disk in the same array will fail is about 9/1000 or about 1% per day. After 50 days, you have a 50% chance of a double failure in a RAID 5 array leading to data loss.</para>
55 <para>Therefore, RAID 6 or another double parity algorithm is needed to provide sufficient redundancy for OST storage.</para>
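<para>The failure-rate and rebuild-time figures above can be reproduced with a few lines of shell arithmetic. This is only a sketch using the numbers from the example (2,000 disks, a 1,000-day MTTF, 1 TB per disk, and a 10 MB/sec rebuild rate). The variable names are illustrative, and integer division rounds down.</para>
<screen>disks=2000             # disks in the example file system
mttf_days=1000         # assumed mean time to failure per disk, in days
disk_mb=1000000        # 1 TB expressed in MB (decimal units)
rebuild_mb_per_sec=10  # rebuild rate assumed in the example

failures_per_day=$((disks / mttf_days))
rebuild_sec=$((disk_mb / rebuild_mb_per_sec))
rebuild_hours=$((rebuild_sec / 3600))

echo "failures/day=${failures_per_day} rebuild=${rebuild_sec}s (about ${rebuild_hours}h)"</screen>
<para>With these inputs, a rebuild takes roughly a day, which is the window during which a second failure in the same RAID 5 array would cause data loss.</para>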
57 <para>For better performance, we recommend that you create RAID sets with 4 or 8 data disks plus one or two parity disks. Using larger RAID sets will negatively impact performance compared to having multiple independent RAID sets.</para>
58 <para>To maximize performance for small I/O request sizes, storage configured as RAID 1+0 can yield much better results but will increase cost or reduce capacity.</para>
61 <section xml:id="dbdoclet.50438208_23285">
62 <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>for best practice</tertiary></indexterm>Reliability Best Practices</title>
63 <para>RAID monitoring software is recommended to quickly detect faulty disks and allow them to be replaced to avoid double failures and data loss. Hot spare disks are recommended so that rebuilds happen without delays.</para>
64 <para>Backups of the metadata file systems are recommended. For details, see <xref linkend="backupandrestore"/>.</para>
66 <section xml:id="dbdoclet.50438208_40705">
67 <title><indexterm><primary>storage</primary><secondary>performance tradeoffs</secondary></indexterm>Performance Tradeoffs</title>
<para>A writeback cache in a RAID storage controller can dramatically
increase write performance on many types of RAID arrays if the writes
are not done at full stripe width. Unfortunately, unless the RAID array
has battery-backed cache (a feature only found in some higher-priced
hardware RAID arrays), interrupting the power to the array may result in
out-of-sequence or lost writes, and corruption of RAID parity and/or
filesystem metadata, resulting in data loss.</para>
76 <para>Having a read or writeback cache onboard a PCI adapter card installed
77 in an MDS or OSS is <emphasis>NOT SAFE</emphasis> in a high-availability
78 (HA) failover configuration, as this will result in inconsistencies between
79 nodes and immediate or eventual filesystem corruption. Such devices should
80 not be used, or should have the onboard cache disabled.</para>
<para>If writeback cache is enabled, a file system check is required
after the array loses power. Data that was held in the cache but not yet
written to disk may also be lost.</para>
83 <para>Therefore, we recommend against the use of writeback cache when
84 data integrity is critical. You should carefully consider whether the
85 benefits of using writeback cache outweigh the risks.</para>
87 <section xml:id="dbdoclet.ldiskfs_raid_opts">
<title>
<indexterm>
<primary>storage</primary>
<secondary>configuring</secondary>
<tertiary>RAID options</tertiary>
</indexterm>Formatting Options for ldiskfs RAID Devices</title>
94 <para>When formatting an ldiskfs file system on a RAID device, it can be
95 beneficial to ensure that I/O requests are aligned with the underlying
RAID geometry. This ensures that Lustre RPCs do not generate unnecessary
disk operations, which may reduce performance dramatically. Use the
98 <literal>--mkfsoptions</literal> parameter to specify additional parameters
99 when formatting the OST or MDT.</para>
<para>For RAID 5, RAID 6, or RAID 1+0 storage, passing the following
option in <literal>--mkfsoptions</literal> improves
the layout of the file system metadata, ensuring that no single disk
contains all of the allocation bitmaps:</para>
<screen>-E stride=<replaceable>chunk_blocks</replaceable></screen>
105 <para>The <literal><replaceable>chunk_blocks</replaceable></literal>
106 variable is in units of 4096-byte blocks and represents the amount of
107 contiguous data written to a single disk before moving to the next disk.
This is also referred to as the RAID stripe size. The option is
applicable to both MDT and OST file systems.</para>
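<para>As an illustration, a chunk size given in KB can be converted to
<literal><replaceable>chunk_blocks</replaceable></literal> by dividing it by the 4 KB
block size. The chunk sizes in this sketch are assumed example values, not
recommendations:</para>
<screen>block_kb=4                      # ldiskfs block size: 4096 bytes
for chunk_kb in 64 128 256; do  # assumed example chunk sizes, in KB
  echo "chunk ${chunk_kb} KB: -E stride=$((chunk_kb / block_kb))"
done</screen>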
110 <para>For more information on how to override the defaults while formatting
111 MDT or OST file systems, see <xref linkend="dbdoclet.ldiskfs_mkfs_opts"/>.</para>
113 <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>for mkfs</tertiary></indexterm>Computing file system parameters for mkfs</title>
114 <para>For best results, use RAID 5 with 5 or 9 disks or RAID 6 with 6 or 10 disks, each on a different controller. The stripe width is the optimal minimum I/O size. Ideally, the RAID configuration should allow 1 MB Lustre RPCs to fit evenly on a single RAID stripe without an expensive read-modify-write cycle. Use this formula to determine the
115 <literal><replaceable>stripe_width</replaceable></literal>, where
116 <literal><replaceable>number_of_data_disks</replaceable></literal>
117 does <emphasis>not</emphasis> include the RAID parity disks (1 for RAID 5 and 2 for RAID 6):</para>
<screen><replaceable>stripe_width_blocks</replaceable> = <replaceable>chunk_blocks</replaceable> * <replaceable>number_of_data_disks</replaceable> = 1 MB</screen>
119 <para>If the RAID configuration does not allow
120 <literal><replaceable>chunk_blocks</replaceable></literal>
121 to fit evenly into 1 MB, select
122 <literal><replaceable>stripe_width_blocks</replaceable></literal>,
such that it is close to 1 MB, but not larger.</para>
<para>The
<literal><replaceable>stripe_width_blocks</replaceable></literal>
value must equal
<literal><replaceable>chunk_blocks</replaceable> * <replaceable>number_of_data_disks</replaceable></literal>.
The
<literal><replaceable>stripe_width_blocks</replaceable></literal>
parameter is only relevant for RAID 5 or RAID 6, and is not needed for RAID 1+0.</para>
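<para>When the number of data disks is not a power of 2, the chunk size cannot divide
1 MB evenly. The following sketch picks the largest chunk size whose stripe width does
not exceed 1 MB. It assumes a hypothetical array with 5 data disks and a controller
that only offers power-of-two chunk sizes; both are illustrative assumptions.</para>
<screen>data_disks=5   # assumed example: 5 data disks (parity disks excluded)
block_kb=4     # ldiskfs block size: 4096 bytes
chunk_kb=1024  # start at 1 MB and halve until the stripe fits

while [ $((chunk_kb * data_disks)) -gt 1024 ]; do
  chunk_kb=$((chunk_kb / 2))
done

stride=$((chunk_kb / block_kb))
stripe_width=$((stride * data_disks))
echo "chunk=${chunk_kb}KB stride=${stride} stripe_width=${stripe_width}"</screen>
<para>For this assumed geometry, the result is a 128 KB chunk and a 640 KB stripe width,
which is the closest value to 1 MB that does not exceed it.</para>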
<para>Run <literal>mkfs.lustre --reformat</literal> on the file system device (<literal>/dev/sdc</literal>), passing the RAID geometry to the underlying ldiskfs file system as follows:</para>
<screen>--mkfsoptions "<replaceable>other_options</replaceable> -E stride=<replaceable>chunk_blocks</replaceable>,stripe_width=<replaceable>stripe_width_blocks</replaceable>"</screen>
<para>A RAID 6 configuration with 6 disks has 4 data and 2 parity disks. The
<literal><replaceable>chunk_blocks</replaceable></literal>
&lt;= 1024 KB/4 = 256 KB (that is, 64 blocks of 4096 bytes).</para>
<para>Because the number of data disks is a power of 2, the stripe width is equal to 1 MB.</para>
<screen>--mkfsoptions "<replaceable>other_options</replaceable> -E stride=<replaceable>chunk_blocks</replaceable>,stripe_width=<replaceable>stripe_width_blocks</replaceable>"...</screen>
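<para>With concrete numbers, the RAID 6 example above (4 data disks, 256 KB chunks,
4096-byte blocks) works out as follows. The snippet is a minimal sketch that only
prints the resulting option string; the variable names are illustrative.</para>
<screen>data_disks=4   # 6-disk RAID 6: 4 data + 2 parity
chunk_kb=256   # 1024 KB / 4 data disks
block_kb=4     # ldiskfs block size: 4096 bytes

stride=$((chunk_kb / block_kb))        # chunk_blocks
stripe_width=$((stride * data_disks))  # stripe_width_blocks

echo "--mkfsoptions \"-E stride=${stride},stripe_width=${stripe_width}\""</screen>
<para>This prints <literal>--mkfsoptions "-E stride=64,stripe_width=256"</literal>.
A stripe width of 256 blocks of 4096 bytes is exactly 1 MB.</para>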
142 <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>external journal</tertiary></indexterm>Choosing Parameters for an External Journal</title>
<para>If you have configured a RAID array and use it directly as an OST,
it contains both data and metadata. For better performance, we
recommend putting the OST journal on a separate device, by creating a
small RAID 1 array and using it as an external journal for the OST.</para>
<para>In a typical Lustre file system, the default OST journal size is
up to 1 GB, and the default MDT journal size is up to 4 GB, in order to
handle a high transaction rate without blocking on journal flushes.
Additionally, a copy of the journal is kept in RAM. Therefore, make
sure you have enough RAM on the servers to hold copies of all journals.</para>
154 <para>The file system journal options are specified to <literal>mkfs.lustre</literal> using
155 the <literal>--mkfsoptions</literal> parameter. For example:</para>
156 <screen>--mkfsoptions "<replaceable>other_options</replaceable> -j -J device=/dev/mdJ" </screen>
157 <para>To create an external journal, perform these steps for each OST on the OSS:</para>
160 <para>Create a 400 MB (or larger) journal partition (RAID 1 is recommended).</para>
161 <para>In this example, <literal>/dev/sdb</literal> is a RAID 1 device.</para>
164 <para>Create a journal device on the partition. Run:</para>
165 <screen>oss# mke2fs -b 4096 -O journal_dev /dev/sdb <replaceable>journal_size</replaceable></screen>
<para>The
<literal><replaceable>journal_size</replaceable></literal>
is specified in units of 4096-byte blocks. For example, use 262144 for a 1 GB journal.</para>
171 <para>Create the OST.</para>
<para>In this example, <literal>/dev/sdc</literal> is the RAID 6 device to be used as the OST. Run:</para>
<screen>oss# mkfs.lustre --ost ... \
--mkfsoptions="-J device=/dev/sdb" /dev/sdc</screen>
177 <para>Mount the OST as usual.</para>
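<para>The <literal><replaceable>journal_size</replaceable></literal> argument used in
the steps above can be derived from the desired journal size. The following sketch
assumes a 1 GB journal and only prints the resulting
<literal>mke2fs</literal> command rather than running it. The device name is the one
used in the example steps.</para>
<screen>block_bytes=4096  # journal block size passed to mke2fs -b
journal_mb=1024   # assumed target: a 1 GB journal

journal_size=$((journal_mb * 1024 * 1024 / block_bytes))

echo "mke2fs -b ${block_bytes} -O journal_dev /dev/sdb ${journal_size}"</screen>
<para>For the minimum 400 MB journal mentioned above, the same calculation gives
102400 blocks.</para>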
182 <section xml:id="dbdoclet.50438208_88516">
183 <title><indexterm><primary>storage</primary><secondary>configuring</secondary><tertiary>SAN</tertiary></indexterm>Connecting a SAN to a Lustre File System</title>
184 <para>Depending on your cluster size and workload, you may want to connect a SAN to a Lustre file system. Before making this connection, consider the following:</para>
187 <para>In many SAN file systems, clients allocate and lock blocks or inodes individually as
188 they are updated. The design of the Lustre file system avoids the high contention that
189 some of these blocks and inodes may have.</para>
192 <para>The Lustre file system is highly scalable and can have a very large number of clients.
SAN switches do not scale to a large number of nodes, and the cost per port of a SAN is
generally higher than that of other networking technologies.</para>
<para>File systems that allow direct-to-SAN access from the clients carry a security risk because clients can potentially read any data on the SAN disks. Misbehaving clients can also corrupt the file system for many reasons, such as faulty file system, network, or other kernel software, bad cabling, or bad memory. The risk increases with the number of clients directly accessing the storage.</para>
202 <!--vim:expandtab:shiftwidth=2:tabstop=8:-->