<primary>setup</primary>
<secondary>MDT</secondary>
</indexterm> MGT and MDT Storage Hardware Considerations</title>
- <para>MGT storage requirements are small (less than 100 MB even in the largest Lustre file
- systems), and the data on an MGT is only accessed on a server/client mount, so disk
- performance is not a consideration. However, this data is vital for file system access, so
+ <para>MGT storage requirements are small (less than 100 MB even in the
+ largest Lustre file systems), and the data on an MGT is only accessed
+ on a server/client mount, so disk performance is not a consideration.
+ However, this data is vital for file system access, so
the MGT should be reliable storage, preferably mirrored RAID1.</para>
- <para>MDS storage is accessed in a database-like access pattern with many seeks and
- read-and-writes of small amounts of data. High throughput to MDS storage is not important.
- Storage types that provide much lower seek times, such as high-RPM SAS or SSD drives can be
- used for the MDT.</para>
- <para>For maximum performance, the MDT should be configured as RAID1 with an internal journal and two disks from different controllers.</para>
- <para>If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and then make a RAID0 array of the RAID1 devices. This ensures maximum reliability because multiple disk failures only have a small chance of hitting both disks in the same RAID1 device.</para>
- <para>Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even two disk failures can cause the loss of the whole MDT device. The first failure disables an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror.</para>
+ <para>MDS storage is accessed in a database-like access pattern, with
+ many seeks and small reads and writes.
+ Storage types that provide low seek times, such as SSD or NVMe
+ devices, are strongly preferred for the MDT; high-RPM SAS drives are
+ also acceptable.</para>
+ <para>For maximum performance, the MDT should be configured as RAID1 with
+ an internal journal and two disks from different controllers.</para>
+ <para>If you need a larger MDT, create multiple RAID1 devices from pairs
+ of disks, and then make a RAID0 array of the RAID1 devices (illustrated
+ below); for ZFS, use multiple <literal>mirror</literal> VDEVs in the
+ MDT pool. This ensures maximum reliability because multiple disk
+ failures only have a small chance of hitting both disks in the same
+ RAID1 device.</para>
+ <para>Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50%
+ chance that even two disk failures can cause the loss of the whole MDT
+ device. The first failure disables an entire half of the mirror and the
+ second failure has a 50% chance of disabling the remaining mirror.</para>
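+ <para>As an illustration of the recommended layout, the striped mirror
+ could be assembled with <literal>mdadm</literal> for ldiskfs, or as a
+ pool of mirror VDEVs for ZFS. The device names below are examples
+ only:</para>
+ <screen># two RAID1 mirrors, each pairing disks from different controllers
+ mds# mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdc
+ mds# mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdb /dev/sdd
+ # RAID0 stripe over the RAID1 mirrors to form the larger MDT device
+ mds# mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/md1 /dev/md2
+ # ZFS equivalent: a pool striped across two mirror VDEVs
+ mds# zpool create mdt0pool mirror sda sdc mirror sdb sdd</screen>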
- <para condition='l24'>If multiple MDTs are going to be present in the
- system, each MDT should be specified for the anticipated usage and load.
+ <para condition='l24'>If multiple MDTs are going to be present in the
+ system, each MDT should be sized for its anticipated usage and load.
For details on how to add additional MDTs to the filesystem, see
of files per directory, the number of stripes per file, whether files
have ACLs or user xattrs, and the number of hard links per file. The
storage required for Lustre file system metadata is typically 1-2
- percent of the total file system capacity depending upon file size.</para>
+ percent of the total file system capacity depending upon file size.
+ If the <xref linkend="dataonmdt.title"/> feature (Lustre 2.11 and
+ later) is in use, the MDT should typically be sized at 5 percent of
+ the total space, depending on the distribution of small files within
+ the filesystem.</para>
+ <para>For ZFS-based MDT filesystems, the number of inodes created on
+ the MDT and OSTs is dynamic, so there is less need to determine the
+ number of inodes in advance, though some thought should still be given
+ to the total MDT space relative to the total filesystem size.</para>
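+ <para>Inode and space usage on a mounted filesystem can be checked at
+ any time with <literal>lfs df -i</literal> (the mount point here is an
+ example):</para>
+ <screen>client# lfs df -i /mnt/testfs</screen>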
<para>For example, if the average file size is 5 MiB and you have
100 TiB of usable OST space, then you can calculate the minimum total
number of inodes each for MDTs and OSTs as follows:</para>
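+ <para>One way to work this out, using the 5 MiB average file size and
+ 100 TiB of usable OST space given above:</para>
+ <screen>(100 TiB * 1024 GiB/TiB * 1024 MiB/GiB) / 5 MiB/inode = 20,971,520 (about 20 million) inodes</screen>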
</informalexample>
<note>
<para>If the average file size is very small, 4 KB for example, the
- MDT will use as much space for each file as the space used on the OST.
- However, this is an uncommon usage for a Lustre filesystem.</para>
+ MDT will use as much space for each file as is used on the OST,
+ so the use of Data-on-MDT is strongly recommended.</para>
</note>
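+ <para>For reference, a DoM file layout can be requested with
+ <literal>lfs setstripe</literal>; in this sketch (the path and sizes
+ are examples only) the first 64 KiB of each new file under the
+ directory is stored on the MDT, with anything beyond that going to
+ the OSTs:</para>
+ <screen>client# lfs setstripe -E 64K -L mdt -E -1 /mnt/testfs/smallfiles</screen>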
<note>
<para>If the MDT has too few inodes, this can cause the space on the
- OSTs to be inaccessible since no new files can be created. Be sure to
- determine the appropriate size of the MDT needed to support the file
- system before formatting the file system. It is possible to increase the
+ OSTs to be inaccessible since no new files can be created. In this
+ case, the <literal>lfs df -i</literal> and <literal>df -i</literal>
+ commands will limit the number of available inodes reported for the
+ filesystem to match the total number of available objects on the OSTs.
+ Be sure to determine the appropriate MDT size needed to support the
+ filesystem before formatting. It is possible to increase the
number of inodes after the file system is formatted, depending on the
storage. For ldiskfs MDT filesystems the <literal>resize2fs</literal>
tool can be used if the underlying block device is on a LVM logical
<para>The number of inodes on the MDT is determined at format time
based on the total size of the file system to be created. The default
<emphasis role="italic">bytes-per-inode</emphasis> ratio ("inode ratio")
- for an MDT is optimized at one inode for every 2048 bytes of file
- system space. It is recommended that this value not be changed for
- MDTs.</para>
+ for an ldiskfs MDT is optimized at one inode for every 2048 bytes of file
+ system space.</para>
<para>This setting takes into account the space needed for additional
ldiskfs filesystem-wide metadata, such as the journal (up to 4 GB),
bitmaps, and directories, as well as files that Lustre uses internally
to maintain cluster consistency. There is additional per-file metadata
such as file layout for files with a large number of stripes, Access
Control Lists (ACLs), and user extended attributes.</para>
- <para>It is possible to reserve less than the recommended 2048 bytes
+ <para condition="l2B"> Starting in Lustre 2.11, the <xref linkend=
+ "dataonmdt.title"/> feature allows storing small files on the MDT
+ to take advantage of high-performance flash storage, as well as reduce
+ space and network overhead. If you are planning to use the DoM feature
+ with an ldiskfs MDT, it is recommended to <emphasis>increase</emphasis>
+ the inode ratio to have enough space on the MDT for small files.</para>
+ <para>It is possible to change the recommended 2048 bytes
per inode for an ldiskfs MDT when it is first formatted by adding the
<literal>--mkfsoptions="-i bytes-per-inode"</literal> option to
<literal>mkfs.lustre</literal>. Decreasing the inode ratio tunable
<literal>bytes-per-inode</literal> will create more inodes for a given
- MDT size, but will leave less space for extra per-file metadata. The
- inode ratio must always be strictly larger than the MDT inode size,
- which is 512 bytes by default. It is recommended to use an inode ratio
- at least 512 bytes larger than the inode size to ensure the MDT does
- not run out of space.</para>
+ MDT size, but will leave less space for extra per-file metadata and is
+ not recommended. The inode ratio must always be strictly larger than
+ the MDT inode size, which is 1024 bytes by default. It is recommended
+ to use an inode ratio at least 1024 bytes larger than the inode size to
+ ensure the MDT does not run out of space. For DoM, it is recommended
+ to increase the inode ratio enough to hold the most common small file
+ size (e.g. 5120 or 66560 bytes if 4 KB or 64 KB files are widely
+ used).</para>
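+ <para>A sketch of such a format command, using a 5120-byte inode ratio
+ for a DoM filesystem dominated by 4 KB files (the device, fsname,
+ index, and MGS NID are examples only):</para>
+ <screen>mds# mkfs.lustre --mdt --mgsnode=mgs@tcp0 --fsname=testfs --index=0 \
+       --mkfsoptions="-i 5120" /dev/mdt0_device</screen>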
<para>The size of the inode may be changed by adding the
<literal>--stripe-count-hint=N</literal> to have
<literal>mkfs.lustre</literal> automatically calculate a reasonable