From 19bb368be1efb6faa537c3d6792b9d34b4de62a7 Mon Sep 17 00:00:00 2001 From: Andreas Dilger Date: Mon, 28 Mar 2016 15:35:16 -0600 Subject: [PATCH] LUDOC-306 dne: add more description for DNE usage Add some more references to "DNE" in the text, instead of only in the glossary, so that it is more easily found when searching the manual. Add some more examples and explanation of how and when to use multiple MDTs and striped vs. remote directories. Definitely more could be written on this topic, but it is a start. Signed-off-by: Andreas Dilger Change-Id: Iaec344afe2d909ece0a1935eb33d8a58ad992370 Reviewed-on: http://review.whamcloud.com/19177 Tested-by: Jenkins Reviewed-by: Richard Henwood --- BackupAndRestore.xml | 4 +- Glossary.xml | 10 +-- LustreMaintenance.xml | 50 +++++++------ LustreOperations.xml | 21 ++++-- LustreRecovery.xml | 26 ++++--- ManagingFileSystemIO.xml | 183 +++++++++++++++++++++++----------------------- SettingUpLustreSystem.xml | 137 +++++++++++++++++++++++++--------- UnderstandingLustre.xml | 63 ++++++++-------- 8 files changed, 290 insertions(+), 204 deletions(-) diff --git a/BackupAndRestore.xml b/BackupAndRestore.xml index c8a955f..77f3fb1 100644 --- a/BackupAndRestore.xml +++ b/BackupAndRestore.xml @@ -390,8 +390,8 @@ Changelog records consumed: 42 <indexterm> <primary>backup</primary> - <secondary>MDS/OST device level</secondary> - </indexterm>Backing Up and Restoring an MDS or OST (Device Level) + MDT/OST device level + Backing Up and Restoring an MDT or OST (Device Level) In some cases, it is useful to do a full device-level backup of an individual device (MDT or OST), before replacing hardware, performing maintenance, etc. Doing full device-level backups ensures that all of the diff --git a/Glossary.xml b/Glossary.xml index a7d104d..172f7d7 100644 --- a/Glossary.xml +++ b/Glossary.xml @@ -82,17 +82,17 @@ xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"> Distributed namespace (DNE) - A collection of metadata targets serving a single file + A collection of metadata targets serving a single file system namespace. Prior to DNE, Lustre file systems were limited to a single metadata target for the entire name space. Without the ability to distribute metadata load over multiple targets, Lustre file system performance is limited. Lustre was enhanced with DNE functionality in two development phases. After completing the first phase of development in Lustre software version 2.4, Remote Directories - allowed the metadata for sub-directories to be serviced by an + allow the metadata for sub-directories to be serviced by independent MDT(s). After completing the second phase of development in Lustre software version 2.8, Striped Directories - allowed files in a single directory to be serviced by multiple MDTs. + allow files in a single directory to be serviced by multiple MDTs. @@ -666,7 +666,7 @@ xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"> Remote directory - A remote directory describes a feature of + A remote directory describes a feature of Lustre where metadata for files in a given directory may be stored on a different MDT than the metadata for the parent directory. Remote directories only became possible with the @@ -746,7 +746,7 @@ xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"> Striped Directory - A striped directory is a feature of Lustre + A striped directory is a feature of Lustre software where metadata for files in a given directory are distributed evenly over multiple MDTs.
Striped directories are only available in Lustre software version 2.8 or later. diff --git a/LustreMaintenance.xml b/LustreMaintenance.xml index 841bd5f..fc46c29 100644 --- a/LustreMaintenance.xml +++ b/LustreMaintenance.xml @@ -268,48 +268,56 @@ Changing a Server NID where devicename is the Lustre target name, e.g. testfs-OST0013 - - If the MGS and MDS share a partition, stop the MGS: - umount mount_point - + + If the MGS and MDS share a partition, stop the MGS: + umount mount_point + - The replace_nids command also cleans all old, invalidated records out of the configuration log, while preserving all other current settings. - The previous configuration log is backed up on the MGS disk with the suffix '.bak'. + The replace_nids command also cleans + all old, invalidated records out of the configuration log, while + preserving all other current settings. + The previous configuration log is backed up on the MGS + disk with the suffix '.bak'.
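To tie the steps above together, a minimal sketch of the NID change; it assumes the target is stopped and the MGS is mounted, and the target name testfs-OST0013 and the new NID shown are hypothetical values:

mgs# lctl replace_nids testfs-OST0013 192.168.0.22@tcp

Once the command completes, remount the target (and the MGS, if it was stopped) so the target reconnects using the updated configuration log.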
<indexterm> <primary>maintenance</primary> <secondary>adding an MDT</secondary> </indexterm>Adding a New MDT to a Lustre File System - Additional MDTs can be added to serve one or more remote sub-directories within the - file system. It is possible to have multiple remote sub-directories reference the same MDT. - However, the root directory will always be located on MDT0. To add a new MDT into the file - system: + Additional MDTs can be added using the DNE feature to serve one + or more remote sub-directories within a filesystem, in order to + increase the total number of files that can be created in the + filesystem, to increase aggregate metadata performance, or to isolate + user or application workloads from other users of the filesystem. It + is possible to have multiple remote sub-directories reference the + same MDT. However, the root directory will always be located on + MDT0. To add a new MDT into the file system: - Discover the maximum MDT index. Each MDTs must have unique index. - + Discover the maximum MDT index. Each MDT must have a unique index. + client$ lctl dl | grep mdc 36 UP mdc testfs-MDT0000-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5 37 UP mdc testfs-MDT0001-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5 38 UP mdc testfs-MDT0002-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5 39 UP mdc testfs-MDT0003-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5 - + - Add the new block device as a new MDT at the next available index. In this example, the next available index is 4. - -mds# mkfs.lustre --reformat --fsname=filesystem_name --mdt --mgsnode=mgsnode --index 4 /dev/mdt4_device - + Add the new block device as a new MDT at the next available + index. In this example, the next available index is 4. + +mds# mkfs.lustre --reformat --fsname=testfs --mdt --mgsnode=mgsnode --index 4 /dev/mdt4_device + - Mount the MDTs. - + Mount the MDTs. + mds# mount -t lustre /dev/mdt4_blockdevice /mnt/mdt4 - + -
+
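A newly added MDT is not used for new files until a directory is placed on it. As a sketch, assuming a client mount point of /mnt/testfs and the index 4 formatted above (the directory name is hypothetical), a remote directory can be created on the new MDT and its location verified:

client# lfs mkdir -i 4 /mnt/testfs/remote_dir4
client# lfs getdirstripe -M /mnt/testfs/remote_dir4
4

Files created under remote_dir4 will then have their metadata served by the new MDT.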
<indexterm><primary>maintenance</primary><secondary>adding an OST</secondary></indexterm> Adding a New OST to a Lustre File System diff --git a/LustreOperations.xml b/LustreOperations.xml index aa46def..cbe0f63 100644 --- a/LustreOperations.xml +++ b/LustreOperations.xml @@ -376,9 +376,9 @@ client# lfs mkdir -i This command will allocate the sub-directory remote_dir onto the MDT of index - mdtindex. For more information on adding additional MDTs and - mdtindex see + mdt_index. For more information on adding additional MDTs + and + mdt_index see . An administrator can allocate remote sub-directories to separate @@ -420,9 +420,11 @@ client# lfs mkdir -i striping metadata Creating a directory striped across multiple MDTs - Lustre 2.8 enables individual files in a given directory to - record their metadata on separate MDTs (a striped - directory). The result of this is that metadata requests for + The Lustre 2.8 DNE feature enables individual files in a given + directory to store their metadata on separate MDTs (a striped + directory) once additional MDTs have been added to the + filesystem, see . + The result of this is that metadata requests for files in a striped directory are serviced by multiple MDTs and metadata service load is distributed over all the MDTs that service a given directory. By distributing metadata service load over multiple MDTs, aggregate metadata performance is improved. Prior to the development of this feature, all files in a directory had to record their metadata on a single MDT. The command to stripe a directory over - mdt_count MDTs is: - + mdt_count MDTs is: + -client# lfs setdirstripe -c +client# lfs mkdir -c mdt_count /mount_point/new_directory + The striped directory feature is most useful for distributing + single large directories (50k entries or more) across multiple MDTs; + since it incurs more overhead than non-striped directories, it should + not be used for smaller directories.
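As a concrete sketch of the command above, assuming a filesystem mounted at /mnt/testfs with four MDTs (the directory name is hypothetical):

client# lfs mkdir -c 4 /mnt/testfs/big_dir
client# lfs getdirstripe /mnt/testfs/big_dir

The lfs getdirstripe output lists the MDT index holding each stripe of the directory, confirming that entries created in big_dir are distributed over all four MDTs.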
diff --git a/LustreRecovery.xml b/LustreRecovery.xml index 67a1832..f9c4210 100644 --- a/LustreRecovery.xml +++ b/LustreRecovery.xml @@ -112,17 +112,21 @@ recovery will take as long as is needed for the single MDS to be restarted.</para> <para>When <xref linkend="imperativerecovery"/> is enabled, clients are notified of an MDS restart (either the backup or a restored primary). Clients may always detect an MDS failure either by timeouts of in-flight requests or idle-time ping messages. In either case the clients then connect to the new backup MDS and use the Metadata Replay protocol. Metadata Replay is responsible for ensuring that the backup MDS re-acquires state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk.</para> <para>The reconnection to a new (or restarted) MDS is managed by the file system configuration loaded by the client when the file system is first mounted. If a failover MDS has been configured (using the <literal>--failnode=</literal> option to <literal>mkfs.lustre</literal> or <literal>tunefs.lustre</literal>), the client tries to reconnect to both the primary and backup MDS until one of them responds that the failed MDT is again available. At that point, the client begins recovery. For more information, see <xref linkend="metadatereplay"/>.</para> - <para>Transaction numbers are used to ensure that operations are replayed in the order they - were originally performed, so that they are guaranteed to succeed and present the same file - system state as before the failure. In addition, clients inform the new server of their - existing lock state (including locks that have not yet been granted). All metadata and lock - replay must complete before new, non-recovery operations are permitted. In addition, only - clients that were connected at the time of MDS failure are permitted to reconnect during the - recovery window, to avoid the introduction of state changes that might conflict with what is - being replayed by previously-connected clients.</para> - <para condition="l24">Lustre software release 2.4 introduces multiple metadata targets. If - multiple metadata targets are in use, active-active failover is possible. See <xref - linkend="dbdoclet.mdtactiveactive"/> for more information.</para> + <para>Transaction numbers are used to ensure that operations are + replayed in the order they were originally performed, so that they + are guaranteed to succeed and present the same file system state as + before the failure. In addition, clients inform the new server of their + existing lock state (including locks that have not yet been granted). + All metadata and lock replay must complete before new, non-recovery + operations are permitted. In addition, only clients that were connected + at the time of MDS failure are permitted to reconnect during the recovery + window, to avoid the introduction of state changes that might conflict + with what is being replayed by previously-connected clients.</para> + <para condition="l24">Lustre software release 2.4 introduces multiple + metadata targets. If multiple MDTs are in use, active-active failover + is possible (e.g. two MDS nodes, each actively serving one or more + different MDTs for the same filesystem).
See + <xref linkend="dbdoclet.mdtactiveactive"/> for more information.</para> </section> <section remap="h3"> <title><indexterm><primary>recovery</primary><secondary>OST failure</secondary></indexterm>OST Failure (Failover) diff --git a/ManagingFileSystemIO.xml b/ManagingFileSystemIO.xml index 3f3142b..16e8bc8 100644 --- a/ManagingFileSystemIO.xml +++ b/ManagingFileSystemIO.xml @@ -33,31 +33,31 @@ xml:id="managingfilesystemio"> client# lfs df -h UUID bytes Used Available \ Use% Mounted on -lustre-MDT0000_UUID 4.4G 214.5M 3.9G \ -4% /mnt/lustre[MDT:0] -lustre-OST0000_UUID 2.0G 751.3M 1.1G \ -37% /mnt/lustre[OST:0] -lustre-OST0001_UUID 2.0G 755.3M 1.1G \ -37% /mnt/lustre[OST:1] -lustre-OST0002_UUID 2.0G 1.7G 155.1M \ -86% /mnt/lustre[OST:2] <- -lustre-OST0003_UUID 2.0G 751.3M 1.1G \ -37% /mnt/lustre[OST:3] -lustre-OST0004_UUID 2.0G 747.3M 1.1G \ -37% /mnt/lustre[OST:4] -lustre-OST0005_UUID 2.0G 743.3M 1.1G \ -36% /mnt/lustre[OST:5] +testfs-MDT0000_UUID 4.4G 214.5M 3.9G \ +4% /mnt/testfs[MDT:0] +testfs-OST0000_UUID 2.0G 751.3M 1.1G \ +37% /mnt/testfs[OST:0] +testfs-OST0001_UUID 2.0G 755.3M 1.1G \ +37% /mnt/testfs[OST:1] +testfs-OST0002_UUID 2.0G 1.7G 155.1M \ +86% /mnt/testfs[OST:2] **** +testfs-OST0003_UUID 2.0G 751.3M 1.1G \ +37% /mnt/testfs[OST:3] +testfs-OST0004_UUID 2.0G 747.3M 1.1G \ +37% /mnt/testfs[OST:4] +testfs-OST0005_UUID 2.0G 743.3M 1.1G \ +36% /mnt/testfs[OST:5] filesystem summary: 11.8G 5.4G 5.8G \ -45% /mnt/lustre +45% /mnt/testfs - In this case, OST:2 is almost full and when an attempt is made to + In this case, OST0002 is almost full and when an attempt is made to write additional information to the file system (even with uniform striping over all the OSTs), the write command fails as follows: -client# lfs setstripe /mnt/lustre 4M 0 -1 -client# dd if=/dev/zero of=/mnt/lustre/test_3 bs=10M count=100 -dd: writing '/mnt/lustre/test_3': No space left on device +client# lfs setstripe /mnt/testfs 4M 0 -1 +client# dd if=/dev/zero of=/mnt/testfs/test_3 bs=10M count=100 +dd: writing '/mnt/testfs/test_3': No space left on device 98+0 records in 97+0 records out 1017192448 bytes (1.0 GB) copied, 23.2411 seconds, 43.8 MB/s @@ -92,14 +92,14 @@ mds# lctl dl 0 UP mgs MGS MGS 9 1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5 2 UP mdt MDS MDS_uuid 3 -3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4 -4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 5 -5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5 -6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5 -7 UP osc lustre-OST0002-osc lustre-mdtlov_UUID 5 -8 UP osc lustre-OST0003-osc lustre-mdtlov_UUID 5 -9 UP osc lustre-OST0004-osc lustre-mdtlov_UUID 5 -10 UP osc lustre-OST0005-osc lustre-mdtlov_UUID 5 +3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4 +4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 5 +5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5 +6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5 +7 UP osc testfs-OST0002-osc testfs-mdtlov_UUID 5 +8 UP osc testfs-OST0003-osc testfs-mdtlov_UUID 5 +9 UP osc testfs-OST0004-osc testfs-mdtlov_UUID 5 +10 UP osc testfs-OST0005-osc testfs-mdtlov_UUID 5 @@ -117,14 +117,14 @@ mds# lctl dl 0 UP mgs MGS MGS 9 1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5 2 UP mdt MDS MDS_uuid 3 -3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4 -4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 5 -5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5 -6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5 -7 IN osc lustre-OST0002-osc lustre-mdtlov_UUID 5 -8 UP osc lustre-OST0003-osc lustre-mdtlov_UUID 5 -9 UP osc 
lustre-OST0004-osc lustre-mdtlov_UUID 5 -10 UP osc lustre-OST0005-osc lustre-mdtlov_UUID 5 +3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4 +4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 5 +5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5 +6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5 +7 IN osc testfs-OST0002-osc testfs-mdtlov_UUID 5 +8 UP osc testfs-OST0003-osc testfs-mdtlov_UUID 5 +9 UP osc testfs-OST0004-osc testfs-mdtlov_UUID 5 +10 UP osc testfs-OST0005-osc testfs-mdtlov_UUID 5 @@ -148,12 +148,13 @@ mds# lctl dl full OSTs Migrating Data within a File System - Lustre software version 2.8 includes a - feature to migrate metadata between MDTs. This migration can only be - performed on whole directories. To migrate the contents of - /lustre/testremote from the current MDT to - MDT index 0, the sequence of commands is as follows: - $ cd /lustre + Lustre software version 2.8 includes a feature + to migrate metadata (directories and inodes therein) between MDTs. + This migration can only be performed on whole directories. For example, + to migrate the contents of the /testfs/testremote + directory from the MDT it currently resides on to MDT0000, the + sequence of commands is as follows: + $ cd /testfs lfs getdirstripe -M ./testremote which MDT is dir on? 1 $ for i in $(seq 3); do touch ./testremote/${i}.txt; done create test files $ for i in $(seq 3); do lfs getstripe -M ./testremote/${i}.txt; done For more information, see man lfs - Currently, only whole directories can be migrated + Currently, only whole directories can be migrated between MDTs. During migration each file receives a new identifier - (FID). As a consequence, the file receives a new inode number. File - system tools (for example, backup and archiving tools) may behave - incorrectly with files that are unchanged except for a new inode number. + (FID). As a consequence, the file receives a new inode number. Some + system tools (for example, backup and archiving tools) may consider + the migrated files to be new, even though the contents are unchanged. - As stripes cannot be moved within the file system, data must be - migrated manually by copying and renaming the file, removing the original - file, and renaming the new file with the original file name. The simplest - way to do this is to use the + If there is a need to migrate the file data from the current + OST(s) to new OSTs, the data must be migrated (copied) to the new + location. The simplest way to do this is to use the lfs_migrate command (see ). However, the steps for migrating a file by hand are also shown here for reference. @@ -186,25 +186,26 @@ $ for i in $(seq 3); do lfs getstripe -M ./testremote/${i}.txt; done Identify the file(s) to be moved. In the example below, output from the - getstripe command indicates that the file - test_2 is located entirely on OST2: + lfs getstripe command shows that the + test_2 file is located entirely on OST0002: -client# lfs getstripe /mnt/lustre/test_2 -/mnt/lustre/test_2 +client# lfs getstripe /mnt/testfs/test_2 +/mnt/testfs/test_2 obdidx objid objid group 2 8 0x8 0 - To move single object(s), create a new copy and remove the - original.
Enter: + To move the data, create a copy and remove the original: -client# cp -a /mnt/lustre/test_2 /mnt/lustre/test_2.tmp -client# mv /mnt/lustre/test_2.tmp /mnt/lustre/test_2 + +client# cp -a /mnt/testfs/test_2 /mnt/testfs/test_2.tmp +client# mv /mnt/testfs/test_2.tmp /mnt/testfs/test_2 - To migrate large files from one or more OSTs, enter: + If the space usage of OSTs is severely imbalanced, it is + possible to find and migrate large files from their current location + onto OSTs that have more space. For example, one could run: client# lfs find --ost ost_name -size +1G | lfs_migrate -y @@ -213,31 +214,31 @@ client# lfs find --ost Check the file system balance. The - df output in the example below shows a more + lfs df output in the example below shows a more balanced system compared to the - df output in the example in + lfs df output in the example in . client# lfs df -h UUID bytes Used Available Use% \ Mounted on -lustre-MDT0000_UUID 4.4G 214.5M 3.9G 4% \ - /mnt/lustre[MDT:0] -lustre-OST0000_UUID 2.0G 1.3G 598.1M 65% \ - /mnt/lustre[OST:0] -lustre-OST0001_UUID 2.0G 1.3G 594.1M 65% \ - /mnt/lustre[OST:1] -lustre-OST0002_UUID 2.0G 913.4M 1000.0M 45% \ - /mnt/lustre[OST:2] -lustre-OST0003_UUID 2.0G 1.3G 602.1M 65% \ - /mnt/lustre[OST:3] -lustre-OST0004_UUID 2.0G 1.3G 606.1M 64% \ - /mnt/lustre[OST:4] -lustre-OST0005_UUID 2.0G 1.3G 610.1M 64% \ - /mnt/lustre[OST:5] +testfs-MDT0000_UUID 4.4G 214.5M 3.9G 4% \ + /mnt/testfs[MDT:0] +testfs-OST0000_UUID 2.0G 1.3G 598.1M 65% \ + /mnt/testfs[OST:0] +testfs-OST0001_UUID 2.0G 1.3G 594.1M 65% \ + /mnt/testfs[OST:1] +testfs-OST0002_UUID 2.0G 913.4M 1000.0M 45% \ + /mnt/testfs[OST:2] +testfs-OST0003_UUID 2.0G 1.3G 602.1M 65% \ + /mnt/testfs[OST:3] +testfs-OST0004_UUID 2.0G 1.3G 606.1M 64% \ + /mnt/testfs[OST:4] +testfs-OST0005_UUID 2.0G 1.3G 610.1M 64% \ + /mnt/testfs[OST:5] filesystem summary: 11.8G 7.3G 3.9G 61% \ -/mnt/lustre +/mnt/testfs @@ -261,14 +262,14 @@ filesystem summary: 11.8G 7.3G 3.9G 61% \ 0 UP mgs MGS MGS 9 1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-816dd1e813 5 2 UP mdt MDS MDS_uuid 3 - 3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4 - 4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 5 - 5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5 - 6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5 - 7 UP osc lustre-OST0002-osc lustre-mdtlov_UUID 5 - 8 UP osc lustre-OST0003-osc lustre-mdtlov_UUID 5 - 9 UP osc lustre-OST0004-osc lustre-mdtlov_UUID 5 - 10 UP osc lustre-OST0005-osc lustre-mdtlov_UUID + 3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4 + 4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 5 + 5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5 + 6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5 + 7 UP osc testfs-OST0002-osc testfs-mdtlov_UUID 5 + 8 UP osc testfs-OST0003-osc testfs-mdtlov_UUID 5 + 9 UP osc testfs-OST0004-osc testfs-mdtlov_UUID 5 + 10 UP osc testfs-OST0005-osc testfs-mdtlov_UUID
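Combining the steps above, a hedged sketch of draining a nearly full OST; device number 7 matches the testfs-OST0002-osc entry in the listings above, and the size threshold is arbitrary:

mds# lctl --device 7 deactivate
client# lfs find /mnt/testfs --ost testfs-OST0002_UUID -size +1G | lfs_migrate -y
mds# lctl --device 7 activate

Deactivating the OSC on the MDS prevents new objects from being allocated on the full OST while existing files remain readable; reactivating it restores normal allocation once space has been freed.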
@@ -397,12 +398,12 @@ mgs# lctl pool_add _UUID are missing, they are automatically added.
For example, to add even-numbered OSTs to pool1 on file system - lustre, run a single command ( + testfs, run a single command ( pool_add) to add many OSTs to the pool at one time: -lctl pool_add lustre.pool1 OST[0-10/2] +lctl pool_add testfs.pool1 OST[0-10/2] @@ -509,9 +510,9 @@ client# lfs setstripe [--size|-s stripe_size] [--offset|-o start_ost] Add a new OST by running the following commands: -oss# mkfs.lustre --fsname=spfs --mgsnode=mds16@tcp0 --ost --index=12 /dev/sda -oss# mkdir -p /mnt/test/ost12 -oss# mount -t lustre /dev/sda /mnt/test/ost12 +oss# mkfs.lustre --fsname=testfs --mgsnode=mds16@tcp0 --ost --index=12 /dev/sda +oss# mkdir -p /mnt/testfs/ost12 +oss# mount -t lustre /dev/sda /mnt/testfs/ost12 @@ -653,11 +654,11 @@ $ lctl set_param osc.*.checksum_type= checksum algorithm is now in use.
$ lctl get_param osc.*.checksum_type -osc.lustre-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32 [adler] +osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32 [adler] $ lctl set_param osc.*.checksum_type=crc32 -osc.lustre-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32 +osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32 $ lctl get_param osc.*.checksum_type -osc.lustre-OST0000-osc-ffff81012b2c48e0.checksum_type=[crc32] adler +osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=[crc32] adler diff --git a/SettingUpLustreSystem.xml b/SettingUpLustreSystem.xml index 6053a21..8c5b26c 100644 --- a/SettingUpLustreSystem.xml +++ b/SettingUpLustreSystem.xml @@ -52,7 +52,7 @@ Running the MDS and a client on the same machine can cause recovery and deadlock issues and impact the performance of other Lustre clients. - + Only servers running on 64-bit CPUs are tested and supported. 64-bit CPU clients are typically used for testing to match expected customer usage and avoid limitations due to the 4 GB limit for RAM size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit CPUs. @@ -86,14 +86,36 @@ For maximum performance, the MDT should be configured as RAID1 with an internal journal and two disks from different controllers. If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and then make a RAID0 array of the RAID1 devices. This ensures maximum reliability because multiple disk failures only have a small chance of hitting both disks in the same RAID1 device. Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even two disk failures can cause the loss of the whole MDT device. The first failure disables an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror. - If multiple MDTs are going to be present in the system, each MDT should be specified for the anticipated usage and load. - MDT0 contains the root of the Lustre file system. If MDT0 is unavailable for any reason, the - file system cannot be used. - Additional MDTs can be dedicated to sub-directories off the root file system provided by MDT0. - Subsequent directories may also be configured to have their own MDT. If an MDT serving a - subdirectory becomes unavailable this subdirectory and all directories beneath it will - also become unavailable. Configuring multiple levels of MDTs is an experimental feature - for the Lustre software release 2.4. + If multiple MDTs are going to be present in the + system, each MDT should be specified for the anticipated usage and load. + For details on how to add additional MDTs to the filesystem, see + . + MDT0 contains the root of the Lustre file + system. If MDT0 is unavailable for any reason, the file system cannot be + used. + Using the DNE feature it is possible to + dedicate additional MDTs to sub-directories off the file system root + directory stored on MDT0, or arbitrarily for lower-level subdirectories, + using the lfs mkdir -i mdt_index command. + If an MDT serving a subdirectory becomes unavailable, any subdirectories + on that MDT and all directories beneath it will also become inaccessible. + Configuring multiple levels of MDTs is an experimental feature for the + 2.4 release, and is fully functional in the 2.8 release. This is + typically useful for top-level directories to assign different users + or projects to separate MDTs, or to distribute other large working sets + of files to multiple MDTs.
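As an illustrative sketch of such an assignment, with a filesystem mounted at /mnt/testfs and hypothetical directory names, top-level trees can be placed on separate MDTs:

client# lfs mkdir -i 1 /mnt/testfs/home
client# lfs mkdir -i 2 /mnt/testfs/scratch

Metadata operations under each tree are then handled by the MDT given to lfs mkdir -i, isolating the metadata load of one tree from the others.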
Starting in the 2.8 release it is possible + to spread a single large directory across multiple MDTs using the DNE + striped directory feature by specifying multiple stripes (or shards) + at creation time using the + lfs mkdir -c stripe_count + command, where stripe_count is often the + number of MDTs in the filesystem. Striped directories should typically + not be used for all directories in the filesystem, since this incurs + extra overhead compared to non-striped directories, but they are useful + for larger directories (over 50k entries) where many output files are + being created at one time. +
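For example, in a sketch that assumes four MDTs and a hypothetical shared output directory, the MDT count can be checked with lfs df and then used as the stripe count:

client# lfs df /mnt/testfs | grep -c MDT
4
client# lfs mkdir -c 4 /mnt/testfs/job_output

Other, smaller directories are left unstriped to avoid the extra overhead noted above.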
<indexterm><primary>setup</primary><secondary>OST</secondary></indexterm>OST Storage Hardware Considerations @@ -159,24 +181,52 @@ space determining MDT requirements Determining MDT Space Requirements - When calculating the MDT size, the important factor to consider is the number of files - to be stored in the file system. This determines the number of inodes needed, which drives - the MDT sizing. To be on the safe side, plan for 2 KB per inode on the MDT, which is the - default value. Attached storage required for Lustre file system metadata is typically 1-2 - percent of the file system capacity depending upon file size. - For example, if the average file size is 5 MB and you have 100 TB of usable OST space, then you can calculate the minimum number of inodes as follows: + When calculating the MDT size, the important factor to consider + is the number of files to be stored in the file system. This determines + the number of inodes needed, which drives the MDT sizing. To be on the + safe side, plan for 2 KB per ldiskfs inode on the MDT, which is the + default value. Attached storage required for Lustre file system metadata + is typically 1-2 percent of the file system capacity depending upon + file size. + Starting in release 2.4, using the DNE + remote directory feature it is possible to increase the metadata + capacity of a single filesystem by configuring additional MDTs into + the filesystem, see . In order + to start creating new files and directories on the new MDT(s) they + need to be attached into the namespace at one or more subdirectories + using the lfs mkdir command. + For example, if the average file size is 5 MB and you have + 100 TB of usable OST space, then you can calculate the minimum number + of inodes as follows: (100 TB * 1024 GB/TB * 1024 MB/GB) / 5 MB/inode = 20 million inodes - We recommend that you use at least twice the minimum number of inodes to allow for future expansion and allow for an average file size smaller than expected. Thus, the required space is: + It is recommended that the MDT have at least twice the minimum + number of inodes to allow for future expansion and allow for an average + file size smaller than expected. Thus, the required space is: - 2 KB/inode * 40 million inodes = 80 GB + 2 KB/inode x 20 million inodes x 2 = 80 GB - If the average file size is small, 4 KB for example, the Lustre file system is not very - efficient as the MDT uses as much space as the OSTs. However, this is not a common - configuration for a Lustre environment. + If the average file size is small, 4 KB for example, the Lustre + file system is not very efficient as the MDT will use as much space + for each file as the space used on the OST. However, this is not a + common configuration for a Lustre environment. - If the MDT is too small, this can cause all the space on the OSTs to be unusable. Be sure to determine the appropriate size of the MDT needed to support the file system before formatting the file system. It is difficult to increase the number of inodes after the file system is formatted. + If the MDT is too small, this can cause the space on the OSTs + to be inaccessible since no new files can be created. Be sure to + determine the appropriate size of the MDT needed to support the file + system before formatting the file system. It is possible to increase the + number of inodes after the file system is formatted, depending on the + storage. For ldiskfs MDT filesystems the resize2fs + tool can be used if the underlying block device is on an LVM logical + volume.
For ZFS, new (mirrored) VDEVs can be added to the MDT pool. + Inodes will be added approximately in proportion to space added. + + It is also possible to increase the number + of inodes available, as well as to increase the aggregate metadata + performance, by adding additional MDTs using the DNE remote directory + feature available in Lustre release 2.4 and later, see + .
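A hedged sketch of such an expansion for an ldiskfs MDT on LVM; the volume and mount point names are hypothetical, and the MDT is assumed to be unmounted during the resize:

mds# umount /mnt/mdt
mds# lvextend -L +1T /dev/vg_mdt/mdt0
mds# resize2fs /dev/vg_mdt/mdt0
mds# mount -t lustre /dev/vg_mdt/mdt0 /mnt/mdt

Because ldiskfs is formatted with a fixed bytes-per-inode ratio, resize2fs grows the inode count roughly in proportion to the added space. The ZFS equivalent is zpool add of a new mirrored VDEV to the MDT pool.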
@@ -523,11 +573,17 @@ 10 million files (ldiskfs), 2^48 (ZFS) - The Lustre software uses the ldiskfs hashed directory code, which has a limit - of about 10 million files depending on the length of the file name. The limit on - subdirectories is the same as the limit on regular files. - Lustre file systems are tested with ten million files in a single - directory. + The Lustre software uses the ldiskfs hashed directory + code, which has a limit of about 10 million files, depending + on the length of the file name. The limit on subdirectories + is the same as the limit on regular files. + Starting in the 2.8 release it is + possible to exceed this limit by striping a single directory + over multiple MDTs with the lfs mkdir -c + command, which increases the single directory limit by a + factor of the number of directory stripes used. + Lustre file systems are tested with ten million files + in a single directory. @@ -536,14 +592,24 @@ 4 billion (ldiskfs), 256 trillion (ZFS) - 4096 times the per-MDT limit - - - The ldiskfs file system imposes an upper limit of 4 billion inodes. By default, the MDS file system is formatted with 2KB of space per inode, meaning 1 billion inodes per file system of 2 TB. - This can be increased initially, at the time of MDS file system creation. For more information, see . - Each additional MDT can hold up to the above maximum number of additional files, depending - on available space and the distribution directories and files in the file - system. + up to 256 times the per-MDT limit + + + The ldiskfs filesystem imposes an upper limit of + 4 billion inodes per filesystem. By default, the MDT + filesystem is formatted with one inode per 2KB of space, + meaning 512 million inodes per TB of MDT space. This can be + increased initially at the time of MDT filesystem creation. + For more information, see + . + The ZFS filesystem + dynamically allocates inodes and does not have a fixed ratio + of inodes per unit of MDT space, but consumes approximately + 4KB of space per inode, depending on the configuration. + Each additional MDT can hold up to the + above maximum number of additional files, depending on + available space and the distribution of directories and files + in the filesystem. @@ -554,7 +620,8 @@ 255 bytes (filename) - This limit is 255 bytes for a single filename, the same as the limit in the underlying file systems. + This limit is 255 bytes for a single filename, the + same as the limit in the underlying filesystems. diff --git a/UnderstandingLustre.xml b/UnderstandingLustre.xml index 5332990..a9d831f 100644 --- a/UnderstandingLustre.xml +++ b/UnderstandingLustre.xml @@ -133,7 +133,7 @@ xml:id="understandinglustre"> Aggregate: - 2.5 TB/sec I/O + 10 TB/sec I/O @@ -187,7 +187,7 @@ xml:id="understandinglustre"> Single OSS: - 5 GB/sec + 10 GB/sec Aggregate: @@ -197,7 +197,7 @@ xml:id="understandinglustre"> Single OSS: - 2.0+ GB/sec + 6.0+ GB/sec Aggregate: @@ -489,7 +489,7 @@ xml:id="understandinglustre"> - Metadata Server (MDS)- The MDS makes + Metadata Servers (MDS) - The MDS makes metadata stored in one or more MDTs available to Lustre clients. Each MDS manages the names and directories in the Lustre file system(s) and provides network request handling for one or more local @@ -497,7 +497,7 @@ xml:id="understandinglustre"> - Metadata Target (MDT) - For Lustre + Metadata Targets (MDT) - For Lustre software release 2.3 and earlier, each file system has one MDT.
The MDT stores metadata (such as filenames, directories, permissions and file layout) on storage attached to an MDS. Each file system has one @@ -506,19 +506,14 @@ xml:id="understandinglustre"> fails, a standby MDS can serve the MDT and make it available to clients. This is referred to as MDS failover. Since Lustre software release 2.4, multiple - MDTs are supported. Each file system has at least one MDT. An MDT on - a shared storage target can be available via multiple MDSs, although - only one MDS can export the MDT to the clients at one time. Two MDS - machines share storage for two or more MDTs. After the failure of one - MDS, the remaining MDS begins serving the MDT(s) of the failed - MDS. - Since Lustre software release 2.8, - multiple MDTs can be employed to share the inode records for files - contained in a single directory. A directory for which inode records - are distributed across multiple MDTs is known as a striped - directory. In the case of a Lustre filesystem the inode - records maybe also be referred to as the 'metadata' portion of the - file record. + MDTs are supported in the Distributed Namespace Environment (DNE). + In addition to the primary MDT that holds the filesystem root, it + is possible to add additional MDS nodes, each with their own MDTs, + to hold sub-directory trees of the filesystem. + Since Lustre software release 2.8, DNE also + allows the filesystem to distribute files of a single directory over + multiple MDT nodes. A directory which is distributed across multiple + MDTs is known as a striped directory. @@ -556,6 +551,13 @@ xml:id="understandinglustre"> Several clients can write to different parts of the same file simultaneously, while, at the same time, other clients can read from the file. + A logical metadata volume (LMV) aggregates the MDCs to provide + transparent access across all the MDTs in a similar manner as the LOV + does for file access. This allows the client to see the directory tree + on multiple MDTs as a single coherent namespace, and striped directories + are merged on the clients to form a single visible directory to users + and applications. + provides the requirements for attached storage for each Lustre file system component @@ -613,11 +615,11 @@ xml:id="understandinglustre"> - 1-16 TB per OST, 1-8 OSTs per OSS + 1-128 TB per OST, 1-8 OSTs per OSS Good bus bandwidth. Recommended that storage be balanced - evenly across OSSs. + evenly across OSSs and matched to network bandwidth. @@ -627,7 +629,7 @@ xml:id="understandinglustre"> - None + No local storage needed Low latency, high bandwidth network. @@ -700,7 +702,7 @@ xml:id="understandinglustre"> MDTs). This change enabled future support for multiple MDTs (introduced in Lustre software release 2.4) and ZFS (introduced in Lustre software release 2.4). - Also introduced in release 2.0 is a feature call + Also introduced in release 2.0 is an ldiskfs feature named FID-in-dirent(also known as dirdata) in which the FID is stored as part of the name of the file in the parent directory. This feature @@ -708,13 +710,12 @@ xml:id="understandinglustre"> ls command executions by reducing disk I/O. The FID-in-dirent is generated at the time the file is created. - The FID-in-dirent feature is not compatible with the Lustre - software release 1.8 format. Therefore, when an upgrade from Lustre - software release 1.8 to a Lustre software release 2.x is performed, the - FID-in-dirent feature is not automatically enabled. 
For upgrades from - Lustre software release 1.8 to Lustre software releases 2.0 through 2.3, - FID-in-dirent can be enabled manually but only takes effect for new - files. + The FID-in-dirent feature is not backward compatible with the + release 1.8 ldiskfs disk format. Therefore, when an upgrade from + release 1.8 to release 2.x is performed, the FID-in-dirent feature is + not automatically enabled. For upgrades from release 1.8 to releases + 2.0 through 2.3, FID-in-dirent can be enabled manually but only takes + effect for new files. For more information about upgrading from Lustre software release 1.8 and enabling FID-in-dirent for existing files, see if it is invalid or missing. The linkEA consists of the file name and parent FID. It is stored as an extended attribute in the file - itself. Thus, the linkEA can be used to reconstruct the full path name of - a file. + itself. Thus, the linkEA can be used to reconstruct the full path name + of a file. Information about where file data is located on the OST(s) is stored -- 1.8.3.1