From 5fd65770942c45e10f88d8d414ce799937931392 Mon Sep 17 00:00:00 2001
From: Andreas Dilger
Date: Thu, 26 Jul 2018 12:56:21 -0600
Subject: [PATCH] LU-11181 mdt: include DoM in inode ratio discussion

Reference Data-on-MDT in the discussion about ldiskfs inode ratios.
We default to 2KB/inode on the MDT, but with DoM we may want to
change this to 64KB/inode or higher, depending on the average size
of small files stored on the MDT.

Also mention that "lfs df -i" and "df -i" will limit the reported
available inode count to the total number of available inodes on
the OSTs.

Signed-off-by: Andreas Dilger
Change-Id: I53c4d35129ffb5c7732771a6ddab3aa723758872
Reviewed-on: https://review.whamcloud.com/32887
Tested-by: Jenkins
Reviewed-by: Joseph Gmitter
---
 SettingUpLustreSystem.xml | 76 +++++++++++++++++++++++++++++++----------------
 1 file changed, 51 insertions(+), 25 deletions(-)

diff --git a/SettingUpLustreSystem.xml b/SettingUpLustreSystem.xml
index 49bd756..5ca5eb7 100644
--- a/SettingUpLustreSystem.xml
+++ b/SettingUpLustreSystem.xml
@@ -75,17 +75,26 @@ setup MDT MGT and MDT Storage Hardware Considerations
- MGT storage requirements are small (less than 100 MB even in the largest Lustre file
- systems), and the data on an MGT is only accessed on a server/client mount, so disk
- performance is not a consideration. However, this data is vital for file system access, so
+ MGT storage requirements are small (less than 100 MB even in the
+ largest Lustre file systems), and the data on an MGT is only accessed
+ on a server/client mount, so disk performance is not a consideration.
+ However, this data is vital for file system access, so
  the MGT should be reliable storage, preferably mirrored RAID1.
- MDS storage is accessed in a database-like access pattern with many seeks and
- read-and-writes of small amounts of data. High throughput to MDS storage is not important.
- Storage types that provide much lower seek times, such as high-RPM SAS or SSD drives can be
- used for the MDT.
- For maximum performance, the MDT should be configured as RAID1 with an internal journal and two disks from different controllers.
- If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and then make a RAID0 array of the RAID1 devices. This ensures maximum reliability because multiple disk failures only have a small chance of hitting both disks in the same RAID1 device.
- Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even two disk failures can cause the loss of the whole MDT device. The first failure disables an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror.
+ MDS storage is accessed in a database-like access pattern with
+ many seeks and read-and-writes of small amounts of data.
+ Storage types that provide much lower seek times, such as SSD or NVMe,
+ are strongly preferred for the MDT, and high-RPM SAS is acceptable.
+ For maximum performance, the MDT should be configured as RAID1 with
+ an internal journal and two disks from different controllers.
+ If you need a larger MDT, create multiple RAID1 devices from pairs
+ of disks, and then make a RAID0 array of the RAID1 devices. For ZFS,
+ use mirror VDEVs for the MDT. This ensures
+ maximum reliability because multiple disk failures only have a small
+ chance of hitting both disks in the same RAID1 device.
+ Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50%
+ chance that even two disk failures can cause the loss of the whole MDT
+ device. The first failure disables an entire half of the mirror and the
+ second failure has a 50% chance of disabling the remaining mirror.
  If multiple MDTs are going to be present in the system, each MDT should be specified for the anticipated usage and load.
  For details on how to add additional MDTs to the filesystem, see
@@ -219,7 +228,14 @@
  of files per directory, the number of stripes per file, whether files
  have ACLs or user xattrs, and the number of hard links per file. The
  storage required for Lustre file system metadata is typically 1-2
- percent of the total file system capacity depending upon file size.
+ percent of the total file system capacity depending upon file size.
+ If the Data-on-MDT feature is in use for Lustre
+ 2.11 or later, MDT space should typically be 5 percent of the total space,
+ depending on the distribution of small files within the filesystem.
+ For ZFS-based MDT filesystems, the number of inodes created on
+ the MDT and OST is dynamic, so there is less need to determine the
+ number of inodes in advance, though there still needs to be some thought
+ given to the total MDT space compared to the total filesystem size.
  For example, if the average file size is 5 MiB and you have
  100 TiB of usable OST space, then you can calculate the minimum total
  number of inodes each for MDTs and OSTs as follows:
@@ -238,14 +254,17 @@
  If the average file size is very small, 4 KB for example, the
- MDT will use as much space for each file as the space used on the OST.
- However, this is an uncommon usage for a Lustre filesystem.
+ MDT will use as much space for each file as the space used on the OST,
+ so the use of Data-on-MDT is strongly recommended.
  If the MDT has too few inodes, this can cause the space on the
- OSTs to be inaccessible since no new files can be created. Be sure to
- determine the appropriate size of the MDT needed to support the file
- system before formatting the file system. It is possible to increase the
+ OSTs to be inaccessible since no new files can be created. In this
+ case, the lfs df -i and df -i
+ commands will limit the number of available inodes reported for the
+ filesystem to match the total number of available objects on the OSTs.
+ Be sure to determine the appropriate MDT size needed to support the
+ filesystem before formatting. It is possible to increase the
  number of inodes after the file system is formatted, depending on the
  storage. For ldiskfs MDT filesystems the resize2fs tool can be used if
  the underlying block device is on a LVM logical
@@ -354,25 +373,32 @@
  The number of inodes on the MDT is determined at format time
  based on the total size of the file system to be created. The default
  bytes-per-inode ratio ("inode ratio")
- for an MDT is optimized at one inode for every 2048 bytes of file
- system space. It is recommended that this value not be changed for
- MDTs.
+ for an ldiskfs MDT is optimized at one inode for every 2048 bytes of file
+ system space.
  This setting takes into account the space needed for additional
  ldiskfs filesystem-wide metadata, such as the journal (up to 4 GB),
  bitmaps, and directories, as well as files that Lustre uses internally
  to maintain cluster consistency. There is additional per-file metadata
  such as file layout for files with a large number of stripes, Access
  Control Lists (ACLs), and user extended attributes.
- It is possible to reserve less than the recommended 2048 bytes
+ Starting in Lustre 2.11, the Data-on-MDT (DoM) feature allows storing small files on the MDT
+ to take advantage of high-performance flash storage, as well as reduce
+ space and network overhead. If you are planning to use the DoM feature
+ with an ldiskfs MDT, it is recommended to increase
+ the inode ratio to have enough space on the MDT for small files.
+ It is possible to change the recommended 2048 bytes
  per inode for an ldiskfs MDT when it is first formatted by adding the
  --mkfsoptions="-i bytes-per-inode" option to mkfs.lustre. Decreasing
  the inode ratio tunable bytes-per-inode will create more inodes for a given
- MDT size, but will leave less space for extra per-file metadata. The
- inode ratio must always be strictly larger than the MDT inode size,
- which is 512 bytes by default.
It is recommended to use an inode ratio
- at least 512 bytes larger than the inode size to ensure the MDT does
- not run out of space.
+ MDT size, but will leave less space for extra per-file metadata and is
+ not recommended. The inode ratio must always be strictly larger than
+ the MDT inode size, which is 1024 bytes by default. It is recommended
+ to use an inode ratio at least 1024 bytes larger than the inode size to
+ ensure the MDT does not run out of space. Increasing the inode ratio
+ to at least hold the most common file size (e.g. 5120 or 66560 bytes if
+ 4KB or 64KB files are widely used) is recommended for DoM.
  The size of the inode may be changed by adding the --stripe-count-hint=N
  to have mkfs.lustre automatically calculate a reasonable
--
1.8.3.1
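
The inode ratio arithmetic in the last hunk can be sketched in a few lines of Python. This is an illustration, not part of the patch or of any Lustre tool. The helper names `dom_inode_ratio` and `approx_inode_count` are hypothetical. The sketch assumes that the example values in the text (5120 and 66560 bytes) are the common file size plus the default 1024-byte ldiskfs MDT inode size, and that the recommended floor is the inode size plus 1024 bytes. The inode count is a rough upper bound, since it ignores the journal and other filesystem-wide metadata.

```python
# Illustrative sketch only: these helpers are hypothetical, not Lustre APIs.

INODE_SIZE = 1024   # default ldiskfs MDT inode size, per the patch text
MIN_MARGIN = 1024   # recommended minimum gap between inode ratio and inode size


def dom_inode_ratio(common_file_size, inode_size=INODE_SIZE):
    """Return a bytes-per-inode value that leaves room for one inode
    plus one file of the most common size stored on the MDT.

    Assumption: ratio = file size + inode size, which reproduces the
    5120 and 66560 byte examples given for 4KB and 64KB files.
    """
    ratio = common_file_size + inode_size
    # Never go below the recommended floor (inode size + 1024 bytes).
    return max(ratio, inode_size + MIN_MARGIN)


def approx_inode_count(mdt_bytes, inode_ratio):
    """Rough upper bound on inodes created for an MDT of the given size.

    Ignores the journal, bitmaps, and other filesystem-wide metadata.
    """
    return mdt_bytes // inode_ratio


if __name__ == "__main__":
    for size in (4 * 1024, 64 * 1024):
        ratio = dom_inode_ratio(size)
        count = approx_inode_count(1 << 40, ratio)  # hypothetical 1 TiB MDT
        print(f"{size} byte files: ratio {ratio}, about {count} inodes per TiB")
```

A ratio computed this way would be the value substituted for bytes-per-inode in the --mkfsoptions="-i bytes-per-inode" option described in the patch text.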