-
-
-
- <section xml:id="dbdoclet.50438256_49017">
- <title>5.1 Hardware Considerations</title>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1292812" xreflabel=""/>Lustre can work with any kind of block storage device such as single disks, software RAID, hardware RAID, or a logical volume manager. In contrast to some networked file systems, the block devices are only attached to the MDS and OSS nodes in Lustre and are not accessed by the clients directly.</para>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1292817" xreflabel=""/>Since the block devices are accessed by only one or two server nodes, a storage area network (SAN) that is accessible from all the servers is not required. Expensive switches are not needed because point-to-point connections between the servers and the storage arrays normally provide the simplest and best attachments. (If failover capability is desired, the storage must be attached to multiple servers.)</para>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1290297" xreflabel=""/>For a production environment, it is preferable that the MGS have separate storage to allow future expansion to multiple file systems. However, it is possible to run the MDS and MGS on the same machine and have them share the same storage device.</para>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1292437" xreflabel=""/>For best performance in a production environment, dedicated clients are required. For a non-production Lustre environment or for testing, a Lustre client and server can run on the same machine. However, dedicated clients are the only supported configuration.</para>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1292435" xreflabel=""/>Performance and other issues can occur when an MDS or OSS and a client are running on the same machine:</para>
- <itemizedlist><listitem>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1292457" xreflabel=""/> Running the MDS and a client on the same machine can cause recovery and deadlock issues and impact the performance of other Lustre clients.</para>
- </listitem>
-<listitem>
- <para> </para>
- </listitem>
-<listitem>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1292461" xreflabel=""/> Running the OSS and a client on the same machine can cause issues with low memory and memory pressure. If the client consumes all the memory and then tries to write data to the file system, the OSS will need to allocate pages to receive data from the client but will not be able to perform this operation due to low memory. This can cause the client to hang.</para>
- </listitem>
-<listitem>
- <para> </para>
- </listitem>
-</itemizedlist>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1292132" xreflabel=""/>Only servers running on 64-bit CPUs are tested and supported. 64-bit CPU clients are typically used for testing to match expected customer usage and avoid limitations due to the 4 GB limit for RAM size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit CPUs. Also, due to kernel API limitations, performing backups of Lustre 2.x. filesystems on 32-bit clients may cause backup tools to confuse files that have the same 32-bit inode number.</para>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1292472" xreflabel=""/>The storage attached to the servers typically uses RAID to provide fault tolerance and can optionally be organized with logical volume management (LVM). It is then formatted by Lustre as a file system. Lustre OSS and MDS servers read, write and modify data in the format imposed by the file system.</para>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1292545" xreflabel=""/>Lustre uses journaling file system technology on both the MDTs and OSTs. For a MDT, as much as a 20 percent performance gain can be obtained by placing the journal on a separate device.</para>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1292546" xreflabel=""/>The MDS can effectively utilize a lot of CPU cycles. A minimium of four processor cores are recommended. More are advisable for files systems with many clients.</para>
- <note>
- <para>Lustre clients running on architectures with different endianness are supported. One limitation is that the PAGE_SIZE kernel macro on the client must be as large as the PAGE_SIZE of the server. In particular, ia64 or PPC clients with large pages (up to 64kB pages) can run with x86 servers (4kB pages). If you are running x86 clients with ia64 or PPC servers, you must compile the ia64 kernel with a 4kB PAGE_SIZE (so the server page size is not larger than the client page size). <anchor xml:id="dbdoclet.50438256_51943" xreflabel=""/></para>
- </note>
-
- <section remap="h3">
- <title><anchor xml:id="dbdoclet.50438256_pgfId-1293428" xreflabel=""/>5.1.1 MDT Storage Hardware Considerations</title>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1293438" xreflabel=""/>The data access pattern for MDS storage is a database-like access pattern with many seeks and read-and-writes of small amounts of data. High throughput to MDS storage is not important. Storage types that provide much lower seek times, such as high-RPM SAS or SSD drives can be used for the MDT.</para>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1295314" xreflabel=""/>For maximum performance, the MDT should be configured as RAID1 with an internal journal and two disks from different controllers.</para>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1295315" xreflabel=""/>If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and then make a RAID0 array of the RAID1 devices. This ensures maximum reliability because multiple disk failures only have a small chance of hitting both disks in the same RAID1 device.</para>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1295316" xreflabel=""/>Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even two disk failures can cause the loss of the whole MDT device. The first failure disables an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror.</para>
- </section>
- <section remap="h3">
- <title><anchor xml:id="dbdoclet.50438256_pgfId-1295312" xreflabel=""/>5.1.2 OST Storage Hardware Considerations</title>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1293429" xreflabel=""/>The data access pattern for the OSS storage is a streaming I/O pattern that is dependent on the access patterns of applications being used. Each OSS can manage multiple object storage targets (OSTs), one for each volume with I/O traffic load-balanced between servers and targets. An OSS should be configured to have a balance between the network bandwidth and the attached storage bandwidth to prevent bottlenecks in the I/O path. Depending on the server hardware, an OSS typically serves between 2 and 8 targets, with each target up to 16 terabytes (TBs) in size.</para>
- <para><anchor xml:id="dbdoclet.50438256_pgfId-1293431" xreflabel=""/>Lustre file system capacity is the sum of the capacities provided by the targets. For example, 64 OSSs, each with two 8 TB targets, provide a file system with a capacity of nearly 1 PB. If each OST uses ten 1 TB SATA disks (8 data disks plus 2 parity disks in a RAID 6 configuration), it may be possible to get 50 MB/sec from each drive, providing up to 400 MB/sec of disk bandwidth per OST. If this system is used as storage backend with a system network like InfiniBand that provides a similar bandwidth, then each OSS could provide 800 MB/sec of end-to-end I/O throughput. (Although the architectural constraints described here are simple, in practice it takes careful hardware selection, benchmarking and integration to obtain such results.)</para>
- </section>
+ <section xml:id="dbdoclet.50438256_49017">
+ <title><indexterm><primary>setup</primary></indexterm>
+ <indexterm><primary>setup</primary><secondary>hardware</secondary></indexterm>
+ <indexterm><primary>design</primary><see>setup</see></indexterm>
+ Hardware Considerations</title>
+ <para>A Lustre file system can utilize any kind of block storage device such as single disks,
+ software RAID, hardware RAID, or a logical volume manager. In contrast to some networked file
+ systems, the block devices are only attached to the MDS and OSS nodes in a Lustre file system
+ and are not accessed by the clients directly.</para>
+ <para>Since the block devices are accessed by only one or two server nodes, a storage area network (SAN) that is accessible from all the servers is not required. Expensive switches are not needed because point-to-point connections between the servers and the storage arrays normally provide the simplest and best attachments. (If failover capability is desired, the storage must be attached to multiple servers.)</para>
+ <para>For a production environment, it is preferable that the MGS have separate storage to allow future expansion to multiple file systems. However, it is possible to run the MDS and MGS on the same machine and have them share the same storage device.</para>
+ <para>For best performance in a production environment, dedicated clients are required. For a non-production Lustre environment or for testing, a Lustre client and server can run on the same machine. However, dedicated clients are the only supported configuration.</para>
+ <warning><para>Performance and recovery issues can occur if you put a client on an MDS or OSS:</para>
+ <itemizedlist>
+ <listitem>
+ <para>Running the OSS and a client on the same machine can cause issues with low memory and memory pressure. If the client consumes all the memory and then tries to write data to the file system, the OSS will need to allocate pages to receive data from the client but will not be able to perform this operation due to low memory. This can cause the client to hang.</para>
+ </listitem>
+ <listitem>
+ <para>Running the MDS and a client on the same machine can cause recovery and deadlock issues and impact the performance of other Lustre clients.</para>
+ </listitem>
+ </itemizedlist>
+ </warning>
+ <para>Only servers running on 64-bit CPUs are tested and supported. 64-bit CPU clients are
+ typically used for testing to match expected customer usage and avoid limitations due to the 4
+ GB limit for RAM size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit CPUs.
+ Also, due to kernel API limitations, performing backups of Lustre software release 2.x file
+ systems on 32-bit clients may cause backup tools to confuse files that have the same 32-bit
+ inode number.</para>
+ <para>The storage attached to the servers typically uses RAID to provide fault tolerance and can
+ optionally be organized with logical volume management (LVM), which is then formatted as a
+ Lustre file system. Lustre OSS and MDS servers read, write and modify data in the format
+ imposed by the file system.</para>
+ <para>The Lustre file system uses journaling file system technology on both the MDTs and OSTs.
+ For an MDT, as much as a 20 percent performance gain can be obtained by placing the journal on
+ a separate device.</para>
+ <para>The MDS can effectively utilize a lot of CPU cycles. A minimum of four processor cores is recommended; more are advisable for file systems with many clients.</para>
+ <note>
+ <para>Lustre clients running on architectures with different endianness are supported. One limitation is that the PAGE_SIZE kernel macro on the client must be at least as large as the PAGE_SIZE of the server. In particular, ia64 or PPC clients with large pages (up to 64kB pages) can run with x86 servers (4kB pages). If you are running x86 clients with ia64 or PPC servers, you must compile the ia64 kernel with a 4kB PAGE_SIZE (so the server page size is not larger than the client page size).</para>
+ </note>
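+ <para>A quick way to check the page size on any node (assuming a standard
+ Linux userland) is the POSIX <literal>getconf</literal> utility, which
+ reports 4096 on x86 systems and 65536 on systems built with 64kB
+ pages:</para>
+ <screen>client# getconf PAGESIZE
+ 4096</screen>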
+ <section remap="h3">
+ <title><indexterm>
+ <primary>setup</primary>
+ <secondary>MDT</secondary>
+ </indexterm> MGT and MDT Storage Hardware Considerations</title>
+ <para>MGT storage requirements are small (less than 100 MB even in the
+ largest Lustre file systems), and the data on an MGT is only accessed
+ on a server/client mount, so disk performance is not a consideration.
+ However, this data is vital for file system access, so
+ the MGT should be reliable storage, preferably mirrored RAID1.</para>
+ <para>MDS storage is accessed in a database-like access pattern with
+ many seeks and read-and-writes of small amounts of data.
+ Storage types that provide much lower seek times, such as SSD or NVMe
+ devices, are strongly preferred for the MDT; high-RPM SAS drives are
+ acceptable.</para>
+ <para>For maximum performance, the MDT should be configured as RAID1 with
+ an internal journal and two disks from different controllers.</para>
+ <para>If you need a larger MDT, create multiple RAID1 devices from pairs
+ of disks, and then make a RAID0 array of the RAID1 devices. For ZFS,
+ use <literal>mirror</literal> VDEVs for the MDT. This ensures
+ maximum reliability because multiple disk failures only have a small
+ chance of hitting both disks in the same RAID1 device.</para>
+ <para>Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50%
+ chance that even two disk failures can cause the loss of the whole MDT
+ device. The first failure disables an entire half of the mirror and the
+ second failure has a 50% chance of disabling the remaining mirror.</para>
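+ <para>As an illustrative sketch only (all device and pool names below are
+ hypothetical and will differ on your system), the RAID1+0 layout
+ described above could be assembled with <literal>mdadm</literal>, or the
+ equivalent ZFS layout created with mirror VDEVs:</para>
+ <screen>mds# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
+ mds# mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
+ mds# mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/md0 /dev/md1
+ mds# zpool create mdt0pool mirror sda sdb mirror sdc sdd</screen>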
+ <para condition='l24'>If multiple MDTs are going to be present in the
+ system, each MDT should be sized for the anticipated usage and load.
+ For details on how to add additional MDTs to the filesystem, see
+ <xref linkend="lustremaint.adding_new_mdt"/>.</para>
+ <warning condition='l24'><para>MDT0 contains the root of the Lustre file
+ system. If MDT0 is unavailable for any reason, the file system cannot be
+ used.</para></warning>
+ <note condition='l24'><para>Using the DNE feature it is possible to
+ dedicate additional MDTs to sub-directories off the file system root
+ directory stored on MDT0, or arbitrarily for lower-level subdirectories,
+ using the <literal>lfs mkdir -i <replaceable>mdt_index</replaceable></literal> command.
+ If an MDT serving a subdirectory becomes unavailable, any subdirectories
+ on that MDT and all directories beneath it will also become inaccessible.
+ Configuring multiple levels of MDTs is an experimental feature for the
+ 2.4 release, and is fully functional in the 2.8 release. This is
+ typically useful for top-level directories to assign different users
+ or projects to separate MDTs, or to distribute other large working sets
+ of files to multiple MDTs.</para></note>
+ <note condition='l28'><para>Starting in the 2.8 release it is possible
+ to spread a single large directory across multiple MDTs using the DNE
+ striped directory feature by specifying multiple stripes (or shards)
+ at creation time using the
+ <literal>lfs mkdir -c <replaceable>stripe_count</replaceable></literal>
+ command, where <replaceable>stripe_count</replaceable> is often the
+ number of MDTs in the filesystem. Striped directories should typically
+ not be used for all directories in the filesystem, since striping incurs
+ extra overhead compared to non-striped directories; it is useful for
+ larger directories (over 50k entries) where many output files are being
+ created at one time.
+ </para></note>
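+ <para condition='l28'>For example (the mount point, MDT index, and stripe
+ count shown are illustrative only), a remote directory on a specific MDT
+ and a striped directory can be created with:</para>
+ <screen condition='l28'>client# lfs mkdir -i 1 /mnt/lustre/remote_dir
+ client# lfs mkdir -c 4 /mnt/lustre/striped_dir</screen>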
+ </section>
+ <section remap="h3">
+ <title><indexterm><primary>setup</primary><secondary>OST</secondary></indexterm>OST Storage Hardware Considerations</title>
+ <para>The data access pattern for the OSS storage is a streaming I/O
+ pattern that is dependent on the access patterns of applications being
+ used. Each OSS can manage multiple object storage targets (OSTs), one
+ for each volume with I/O traffic load-balanced between servers and
+ targets. An OSS should be configured to have a balance between the
+ network bandwidth and the attached storage bandwidth to prevent
+ bottlenecks in the I/O path. Depending on the server hardware, an OSS
+ typically serves between 2 and 8 targets, with each target commonly
+ between 24 TB and 48 TB, though a target may be up to 256 TB in size.</para>
+ <para>Lustre file system capacity is the sum of the capacities provided
+ by the targets. For example, 64 OSSs, each with two 8 TB OSTs,
+ provide a file system with a capacity of nearly 1 PB. If each OST uses
+ ten 1 TB SATA disks (8 data disks plus 2 parity disks in a RAID-6
+ configuration), it may be possible to get 50 MB/sec from each drive,
+ providing up to 400 MB/sec of disk bandwidth per OST. If this system
+ is used as a storage backend with a system network, such as InfiniBand,
+ that provides a similar bandwidth, then each OSS could provide
+ 800 MB/sec of end-to-end I/O throughput. (Although the architectural
+ constraints described here are simple, in practice it takes careful
+ hardware selection, benchmarking and integration to obtain such
+ results.)</para>
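+ <para>Spelling out the arithmetic in this example: 64 OSSs x 2 OSTs x
+ 8 TB = 1024 TB, or about 1 PB of capacity; 8 data disks x 50 MB/sec =
+ 400 MB/sec of disk bandwidth per OST; and 2 OSTs x 400 MB/sec =
+ 800 MB/sec per OSS, provided the network can deliver data at that
+ rate.</para>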
+ </section>
+ </section>
+ <section xml:id="dbdoclet.space_requirements">
+ <title><indexterm><primary>setup</primary><secondary>space</secondary></indexterm>
+ <indexterm><primary>space</primary><secondary>determining requirements</secondary></indexterm>
+ Determining Space Requirements</title>
+ <para>The desired performance characteristics of the backing file systems
+ on the MDT and OSTs are independent of one another. The size of the MDT
+ backing file system depends on the number of inodes needed in the total
+ Lustre file system, while the aggregate OST space depends on the total
+ amount of data stored on the file system. If MGS data is to be stored
+ on the MDT device (co-located MGT and MDT), add 100 MB to the required
+ size estimate for the MDT.</para>
+ <para>Each time a file is created on a Lustre file system, it consumes
+ one inode on the MDT and one object on each OST over which the file is striped.
+ Normally, each file's stripe count is based on the system-wide
+ default stripe count. However, this can be changed for individual files
+ using the <literal>lfs setstripe</literal> option. For more details,
+ see <xref linkend="managingstripingfreespace"/>.</para>
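+ <para>For example (the mount point and stripe count shown are
+ illustrative only), the default striping for new files under a directory
+ can be set and then verified with:</para>
+ <screen>client# lfs setstripe -c 4 /mnt/lustre/dir1
+ client# lfs getstripe /mnt/lustre/dir1</screen>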
+ <para>In a Lustre ldiskfs file system, all the MDT inodes and OST
+ objects are allocated when the file system is first formatted. When
+ the file system is in use and a file is created, metadata associated
+ with that file is stored in one of the pre-allocated inodes and does
+ not consume any of the free space used to store file data. The total
+ number of inodes on a formatted ldiskfs MDT or OST cannot be easily
+ changed. Thus, the number of inodes created at format time should be
+ generous enough to anticipate near-term expected usage, with some room
+ for growth without the effort of adding storage later.</para>
+ <para>By default, the ldiskfs file system used by Lustre servers to store
+ user-data objects and system data reserves 5% of space that cannot be used
+ by the Lustre file system. Additionally, an ldiskfs Lustre file system
+ reserves up to 400 MB on each OST, and up to 4GB on each MDT for journal
+ use and a small amount of space outside the journal to store accounting
+ data. This reserved space is unusable for general storage. Thus, at least
+ this much space will be used per OST before any file object data is saved.
+ </para>
+ <para condition="l24">With a ZFS backing filesystem for the MDT or OST,
+ the space allocation for inodes and file data is dynamic, and inodes are
+ allocated as needed. A minimum of 4kB of usable space (before mirroring)
+ is needed for each inode, exclusive of other overhead such as directories,
+ internal log files, extended attributes, ACLs, etc. ZFS also reserves
+ approximately 3% of the total storage space for internal and redundant
+ metadata, which is not usable by Lustre.
+ Since the size of extended attributes and ACLs is highly dependent on
+ kernel versions and site-specific policies, it is best to over-estimate
+ the amount of space needed for the desired number of inodes, and any
+ excess space will be utilized to store more inodes.
+ </para>
+ <section>
+ <title><indexterm>
+ <primary>setup</primary>
+ <secondary>MGT</secondary>
+ </indexterm>
+ <indexterm>
+ <primary>space</primary>
+ <secondary>determining MGT requirements</secondary>
+ </indexterm> Determining MGT Space Requirements</title>
+ <para>Less than 100 MB of space is typically required for the MGT.
+ The size is determined by the total number of servers in the Lustre
+ file system cluster(s) that are managed by the MGS.</para>