From e71e97e9533c977c31992315e28ced53e4bd4b2e Mon Sep 17 00:00:00 2001 From: Richard Henwood Date: Fri, 7 Dec 2012 15:54:21 -0600 Subject: [PATCH] LUDOC-108 dne: Manual includes DNE usage instructions. The manual now includes instructions to: - add a MDT. - remove a MDT. - upgrade to multiple MDT configurations. - designing active-active MDS configurations. - warns against having chained remote directories. - added administrative documentation for lfs. Signed-off-by: Richard Henwood Change-Id: I9b4fb35d7932b9e90561e03b0b72271b9c2728cf Reviewed-on: http://review.whamcloud.com/4773 Tested-by: Hudson --- ConfiguringLustre.xml | 5 ++++ Glossary.xml | 7 +++++ LustreMaintenance.xml | 63 +++++++++++++++++++++++++++++++++++++++---- LustreOperations.xml | 16 ++++++++--- LustreRecovery.xml | 1 + LustreTuning.xml | 2 +- ManagingStripingFreeSpace.xml | 4 +++ SettingUpLustreSystem.xml | 11 ++++++-- UnderstandingFailover.xml | 21 +++++++++++++-- UnderstandingLustre.xml | 6 +++-- UpgradingLustre.xml | 23 ++++++++++++++++ UserUtilities.xml | 3 ++- 12 files changed, 146 insertions(+), 16 deletions(-) diff --git a/ConfiguringLustre.xml b/ConfiguringLustre.xml index 12e813f..a6ee935 100644 --- a/ConfiguringLustre.xml +++ b/ConfiguringLustre.xml @@ -61,6 +61,11 @@ See for more details. + + Optional for Lustre 2.4 and later. Add in additional MDTs. + mkfs.lustre --fsname=<fsname> --mgsnode=<nid> --mdt --index=1 <block device name> + Up to 4095 additional MDTs can be added. + Mount the combined MGS/MDT file system on the block device. On the MDS node, run: mount -t lustre <block device name> <mount point> diff --git a/Glossary.xml b/Glossary.xml index b2f92d4..aabe9c6 100644 --- a/Glossary.xml +++ b/Glossary.xml @@ -398,6 +398,13 @@ Metadata Target. A metadata device made available through the Lustre meta-data network protocol. + + MDT0 + + + The metadata target for the file system root. Since Lustre 2.4, multiple metadata targets are possible in the same filesystem. MDT0 is the root of the filesystem which must be available for the filesystem to be accessible. + + Metadata Write-back Cache diff --git a/LustreMaintenance.xml b/LustreMaintenance.xml index b59943b..ad61a21 100644 --- a/LustreMaintenance.xml +++ b/LustreMaintenance.xml @@ -19,12 +19,21 @@ + + + + + + + + + @@ -261,6 +270,34 @@ Changing a Server NID After the writeconf command is run, the configuration logs are re-generated as servers restart, and server NIDs in the updated list_nids file are used. +
      <indexterm><primary>maintenance</primary><secondary>adding an MDT</secondary></indexterm>Adding a New MDT to a Lustre File System
      Additional MDTs can be added to serve one or more remote sub-directories within the filesystem. It is possible to have multiple remote sub-directories reference the same MDT. However, the root directory will always be located on MDT0. To add a new MDT to the file system, complete the steps below (a worked example follows):
	
	    Discover the maximum MDT index in use. Each MDT must have a unique index.
	    
client$ lctl dl | grep mdc
36 UP mdc lustre-MDT0000-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
37 UP mdc lustre-MDT0001-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
38 UP mdc lustre-MDT0002-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
39 UP mdc lustre-MDT0003-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
	    
	    Add the new block device as a new MDT at the next available index. In this example, the next available index is 4.
	    
mkfs.lustre --reformat --fsname=<filesystemname> --mdt --mgsnode=<mgsnode> --index 4 <blockdevice>
	    
	    Mount the new MDT.
	    
mount -t lustre <blockdevice> /mnt/mdt4
	    
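      For example, a minimal sketch of the whole procedure, assuming a hypothetical file system named lustre, an MGS reachable at mgsnode@tcp0, and a spare block device /dev/sdd on the new MDS (all of these names are illustrative, not prescriptive):
mds# mkfs.lustre --reformat --fsname=lustre --mdt --mgsnode=mgsnode@tcp0 --index 4 /dev/sdd
mds# mkdir -p /mnt/mdt4
mds# mount -t lustre /dev/sdd /mnt/mdt4
client$ lctl dl | grep mdc
      After the mount completes and the clients have reconnected, the lctl dl listing on a client should show an additional lustre-MDT0004 mdc device in the UP state.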
<indexterm><primary>maintenance</primary><secondary>adding a OST</secondary></indexterm> Adding a New OST to a Lustre File System @@ -299,6 +336,25 @@ Removing and Restoring OSTs OST is nearing its space capacity +
      <indexterm><primary>maintenance</primary><secondary>removing an MDT</secondary></indexterm>Removing an MDT from the File System
      If the MDT is permanently inaccessible, lfs rmdir {directory} can be used to delete the directory entry. A normal rmdir will report an IO error because the remote MDT is inactive. After the remote directory has been removed, the administrator should mark the MDT as permanently inactive with:
lctl conf_param {MDT name}.mdc.active=0

A user can identify which MDT is serving a remote sub-directory using the lfs utility. For example:
client$ lfs getstripe -M /mnt/lustre/remote_dir1
1
client$ mkdir /mnt/lustre/local_dir0
client$ lfs getstripe -M /mnt/lustre/local_dir0
0

      The getstripe -M parameter returns the index of the MDT that is serving the given directory.
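      For example, a minimal sketch of removing a failed MDT, assuming the remote directory /mnt/lustre/remote_dir1 shown above is served by the target lustre-MDT0001 and that this MDT has failed permanently (the names are illustrative only):
client$ lfs getstripe -M /mnt/lustre/remote_dir1
1
client$ lfs rmdir /mnt/lustre/remote_dir1
mgs# lctl conf_param lustre-MDT0001.mdc.active=0
      The conf_param command is issued on the MGS and marks the MDT permanently inactive for all clients.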
+
+ + <indexterm><primary>maintenance</primary></indexterm> + <indexterm><primary>maintenance</primary><secondary>inactive MDTs</secondary></indexterm>Working with Inactive MDTs + Files located on or below an inactive MDT are inaccessible until the MDT is activated again. Clients accessing an inactive MDT will receive an EIO error.
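      As an illustration only (the device naming follows the lctl dl examples earlier in this chapter, and the exact listing shown here is an assumption, not captured output), a client can check which metadata targets it currently regards as active:
client$ lctl dl | grep mdc
36 UP mdc lustre-MDT0000-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
37 IN mdc lustre-MDT0001-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
      A device reported as IN rather than UP is inactive on that client; operations on files and directories served by that MDT (lustre-MDT0001 in this sketch) fail with EIO until the MDT is activated again.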
<indexterm><primary>maintenance</primary><secondary>removing a OST</secondary></indexterm> Removing an OST from the File System @@ -337,7 +393,7 @@ Removing and Restoring OSTs Temporarily deactivate the OSC on the MDT. On the MDT, run: - $ mdt> lctl --device >devno< deactivate + $ mdt> lctl --device <devno> deactivate For example, based on the command output in Step 1, to deactivate device 13 (the MDT’s OSC for OST-0000), the command would be: $ mdt> lctl --device 13 deactivate @@ -376,11 +432,8 @@ Removing and Restoring OSTs This setting is only temporary and will be reset if the clients or MDS are rebooted. It needs to be run on all clients. - - To permanently disable the deactivated OST, enter: [mgs]# lctl conf_param {OST name}.osc.active=0 - - If there is not expected to be a replacement for this OST in the near future, permanently deactivate the OST on all clients and the MDS: + If there is not expected to be a replacement for this OST in the near future, permanently deactivate the OST on all clients and the MDS: [mgs]# lctl conf_param {OST name}.osc.active=0 A removed OST still appears in the file system; do not create a new OST with the same name. diff --git a/LustreOperations.xml b/LustreOperations.xml index 2dfef82..bcf545d 100644 --- a/LustreOperations.xml +++ b/LustreOperations.xml @@ -25,6 +25,9 @@ + + + @@ -175,6 +178,14 @@ ossbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --ost --index=1 /dev To mount a client on file system bar at mount point /mnt/bar, run: mount -t lustre mgsnode@tcp0:/bar /mnt/bar
+
      <indexterm><primary>operations</primary><secondary>remote directory</secondary></indexterm>Creating a sub-directory on a given MDT
      Lustre 2.4 enables individual sub-directories to be serviced by unique MDTs. An administrator can allocate a sub-directory to a given MDT using the command:
      lfs mkdir -i <mdtindex> <remote_dir>
      
      This command allocates the sub-directory remote_dir to the MDT with index mdtindex. For more information on adding additional MDTs and on mdtindex, see .
      An administrator can allocate remote sub-directories to separate MDTs. Creating remote sub-directories in parent directories that are not hosted on MDT0 is not recommended, because the failure of the parent MDT will leave the namespace below it inaccessible. For this reason, by default it is only possible to create remote sub-directories off MDT0. To relax this restriction and enable remote sub-directories off any MDT, an administrator must issue the command lctl set_param mdd.*.enable_remote_dir=1.
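      For example, a brief sketch assuming a client mount point of /mnt/lustre and an additional MDT at index 2 (both values are illustrative):
client$ lfs mkdir -i 2 /mnt/lustre/remote_project
client$ lfs getstripe -M /mnt/lustre/remote_project
2
      The lfs getstripe -M call simply confirms that the new sub-directory is being served by the MDT with index 2.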
<indexterm><primary>operations</primary><secondary>parameters</secondary></indexterm>Setting and Retrieving Lustre Parameters Several options are available for setting parameters in Lustre: @@ -344,10 +355,9 @@ mds1> cat /proc/fs/lustre/mds/testfs-MDT0000/recovery_status
<indexterm><primary>operations</primary><secondary>replacing an OST or MDS</secondary></indexterm>Replacing an Existing OST or MDT - To copy the contents of an existing OST to a new OST (or an old - MDT to a new MDT), follow the process for either OST/MDT backups in + To copy the contents of an existing OST to a new OST (or an old MDT to a new MDT), follow the process for either OST/MDT backups in or - . + . For more information on removing a MDT, see .
<indexterm><primary>operations</primary><secondary>identifying OSTs</secondary></indexterm>Identifying To Which Lustre File an OST Object Belongs diff --git a/LustreRecovery.xml b/LustreRecovery.xml index cbd7095..6ff87c5 100644 --- a/LustreRecovery.xml +++ b/LustreRecovery.xml @@ -84,6 +84,7 @@ When is enabled, clients are notified of an MDS restart (either the backup or a restored primary). Clients always may detect an MDS failure either by timeouts of in-flight requests or idle-time ping messages. In either case the clients then connect to the new backup MDS and use the Metadata Replay protocol. Metadata Replay is responsible for ensuring that the backup MDS re-acquires state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk. The reconnection to a new (or restarted) MDS is managed by the file system configuration loaded by the client when the file system is first mounted. If a failover MDS has been configured (using the --failnode= option to mkfs.lustre or tunefs.lustre), the client tries to reconnect to both the primary and backup MDS until one of them responds that the failed MDT is again available. At that point, the client begins recovery. For more information, see . Transaction numbers are used to ensure that operations are replayed in the order they were originally performed, so that they are guaranteed to succeed and present the same filesystem state as before the failure. In addition, clients inform the new server of their existing lock state (including locks that have not yet been granted). All metadata and lock replay must complete before new, non-recovery operations are permitted. In addition, only clients that were connected at the time of MDS failure are permitted to reconnect during the recovery window, to avoid the introduction of state changes that might conflict with what is being replayed by previously-connected clients. + Lustre 2.4 introduces multiple metadata targets. If multiple metadata targets are in use, active-active failover is possible. See for more information.
<indexterm><primary>recovery</primary><secondary>OST failure</secondary></indexterm>OST Failure (Failover) diff --git a/LustreTuning.xml b/LustreTuning.xml index f37d5b0..37fc0f6 100644 --- a/LustreTuning.xml +++ b/LustreTuning.xml @@ -105,7 +105,7 @@ Default values for the thread counts are automatically selected. The values are chosen to best exploit the number of CPUs present in the system and to provide best overall performance for typical workloads.
-
+
      <indexterm><primary>tuning</primary><secondary>MDS binding</secondary></indexterm>Binding MDS Service Thread to CPU Partitions
      With the introduction of Node Affinity () in Lustre 2.3, MDS threads can be bound to particular CPU Partitions (CPTs). Default values for bindings are selected automatically to provide good overall performance for a given CPU count. However, an administrator can deviate from these settings if they choose.
diff --git a/ManagingStripingFreeSpace.xml b/ManagingStripingFreeSpace.xml index 1fc65be..0fbe78d 100644 --- a/ManagingStripingFreeSpace.xml +++ b/ManagingStripingFreeSpace.xml @@ -207,6 +207,10 @@ group 
     Typical output is: 
     osc.lustre-OST0002-osc.ost_conn_uuid=192.168.20.1@tcp
+
      <indexterm><primary>striping</primary><secondary>remote directories</secondary></indexterm>Locating the MDT for a remote directory
      Lustre 2.4 can be configured with multiple MDTs in the same filesystem. Each remote sub-directory can be served by a different MDT. To identify which MDT a given sub-directory is located on, use the lfs getstripe -M command. An example of this command is provided in the section .
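      For example, a short illustrative session (the mount point and path are hypothetical):
client$ lfs getstripe -M /mnt/lustre/remote_dir1
1
      The returned value is the index of the MDT serving the directory; in this sketch, remote_dir1 is served by the MDT with index 1.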
    <indexterm><primary>space</primary><secondary>free space</secondary></indexterm>Managing Free Space
    
diff --git a/SettingUpLustreSystem.xml b/SettingUpLustreSystem.xml index 0f3c398..e621ef6 100644 --- a/SettingUpLustreSystem.xml +++ b/SettingUpLustreSystem.xml @@ -61,6 +61,9 @@ 
      For maximum performance, the MDT should be configured as RAID1 with an internal journal and two disks from different controllers. 
      If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and then make a RAID0 array of the RAID1 devices. This ensures maximum reliability because multiple disk failures only have a small chance of hitting both disks in the same RAID1 device. Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even two disk failures can cause the loss of the whole MDT device. The first failure disables an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror.
+      If multiple MDTs are going to be present in the system, each MDT should be sized for its anticipated usage and load.
+      MDT0 contains the root of the Lustre filesystem. If MDT0 is unavailable for any reason, the filesystem cannot be used.
+      Additional MDTs can be dedicated to sub-directories off the root filesystem provided by MDT0. Directories further down the tree may also be configured with their own MDT. If an MDT serving a subdirectory becomes unavailable, this subdirectory and all directories beneath it will also become unavailable. Configuring multiple levels of MDTs is an experimental feature for Lustre 2.4.
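      To make the RAID layout described above concrete, here is a minimal sketch using mdadm; the device names are hypothetical and the commands are illustrative only, not a prescribed procedure:
mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdd /dev/sde
mdadm --create /dev/md20 --level=0 --raid-devices=2 /dev/md10 /dev/md11
      The resulting /dev/md20 device (a RAID0 stripe over two RAID1 mirrors) can then be formatted as the MDT.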
<indexterm><primary>setup</primary><secondary>OST</secondary></indexterm>OST Storage Hardware Considerations @@ -255,9 +258,11 @@ 1 + 4096 - Maximum of 1 MDT per file system, but a single MDS can host multiple MDTs, each one for a separate file system. + Lustre 2.3 and earlier allows a maximum of 1 MDT per file system, but a single MDS can host multiple MDTs, each one for a separate file system. + Lustre 2.4 and later requires one MDT for the root. Upto 4095 additional MDTs can be added to the filesystem and attached into the namespace with remote directories. @@ -379,10 +384,12 @@ 4 billion + 4096 * 4 billion - The ldiskfs file system imposes an upper limit of 4 billion inodes. By default, the MDS file system is formatted with 4 KB of space per inode, meaning 512 million inodes per file system of 2 TB. + The ldiskfs file system imposes an upper limit of 4 billion inodes. By default, the MDS file system is formatted with 2KB of space per inode, meaning 1 billion inodes per file system of 2 TB. This can be increased initially, at the time of MDS file system creation. For more information, see . + Each additional MDT can hold up to 4 billion additional files, depending on available inodes and the distribution directories and files in the filesystem. diff --git a/UnderstandingFailover.xml b/UnderstandingFailover.xml index c6e0cd5..f7d53ca 100644 --- a/UnderstandingFailover.xml +++ b/UnderstandingFailover.xml @@ -51,7 +51,8 @@ Active/active pair - In this configuration, both nodes are active, each providing a subset of resources. In case of a failure, the second node takes over resources from the failed node. - Typically, Lustre MDSs are configured as an active/passive pair, while OSSs are deployed in an active/active configuration that provides redundancy without extra overhead. Often the standby MDS is the active MDS for another Lustre file system or the MGS, so no nodes are idle in the cluster. + Before Lustre 2.4 MDSs are configured as an active/passive pair, while OSSs are deployed in an active/active configuration that provides redundancy without extra overhead. Often the standby MDS is the active MDS for another Lustre file system or the MGS, so no nodes are idle in the cluster. + Lustre 2.4 introduces metadata targets for individual sub-directories. Active-active failover configurations are available for MDSs that serve MDTs on shared storage.
@@ -61,6 +62,7 @@ For MDT failover, two MDSs are configured to serve the same MDT. Only one MDS node can serve an MDT at a time. + Lustre 2.4 allows multiple MDTs. By placing two or more MDT partitions on storage shared by two MDSs, one MDS can fail and the remaining MDS can begin serving the unserved MDT. This is described as an active/active failover pair. For OST failover, multiple OSS nodes are configured to be able to serve the same OST. However, only one OSS node can serve the OST at a time. An OST can be moved between OSS nodes that have access to the same storage device using umount/mount commands. @@ -82,7 +84,7 @@ In an environment with multiple file systems, the MDSs can be configured in a quasi active/active configuration, with each MDS managing metadata for a subset of the Lustre file system.
- Lustre failover configuration for an MDT + Lustre failover configuration for a active/passive MDT @@ -93,6 +95,21 @@
+
      <indexterm><primary>failover</primary><secondary>MDT</secondary></indexterm>MDT Failover Configuration (Active/Active)
      Multiple MDTs became available in Lustre 2.4. MDTs can be set up in an active/active failover configuration. A failover cluster is built from two MDSs as shown in .
      Lustre failover configuration for active/active MDTs
      Lustre failover configuration for two MDTs
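      As an illustrative sketch only (the NIDs, device paths and file system name below are hypothetical), each MDT in an active/active MDS pair is formatted with the NID of its failover partner, using the --failnode option to mkfs.lustre, so that clients know where to reconnect:
mds1# mkfs.lustre --fsname=lustre --mdt --index=1 --mgsnode=mgs@tcp0 --failnode=mds2@tcp0 /dev/mdt1_shared
mds2# mkfs.lustre --fsname=lustre --mdt --index=2 --mgsnode=mgs@tcp0 --failnode=mds1@tcp0 /dev/mdt2_shared
      In normal operation mds1 serves the first MDT and mds2 serves the second; if either MDS fails, the surviving MDS mounts the other MDT from the shared storage and serves both.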
+
<indexterm><primary>failover</primary><secondary>OST</secondary></indexterm>OST Failover Configuration (Active/Active) OSTs are usually configured in a load-balanced, active/active failover configuration. A failover cluster is built from two OSSs as shown in . diff --git a/UnderstandingLustre.xml b/UnderstandingLustre.xml index b21fefe..4f2ef12 100644 --- a/UnderstandingLustre.xml +++ b/UnderstandingLustre.xml @@ -127,6 +127,7 @@ 4 billion files MDS count: 1 primary + 1 backup + Since Lustre 2.4: up to 4096 MDSs and up to 4096 MDTs. Single MDS: @@ -180,7 +181,7 @@ High-performance heterogeneous networking: Lustre supports a variety of high performance, low latency networks and permits Remote Direct Memory Access (RDMA) for Infiniband (OFED) and other advanced networks for fast and efficient network transport. Multiple RDMA networks can be bridged using Lustre routing for maximum performance. Lustre also provides integrated network diagnostics. - High-availability: Lustre offers active/active failover using shared storage partitions for OSS targets (OSTs) and active/passive failover using a shared storage partition for the MDS target (MDT). This allows application transparent recovery. Lustre can work with a variety of high availability (HA) managers to allow automated failover and has no single point of failure (NSPF). Multiple mount protection (MMP) provides integrated protection from errors in highly-available systems that would otherwise cause file system corruption. + High-availability: Lustre offers active/active failover using shared storage partitions for OSS targets (OSTs). Lustre 2.3 and earlier offers active/passive failover using a shared storage partition for the MDS target (MDT).With Lustre 2.4 or later servers and clients it is possible to configure active/active failover of multiple MDTsThis allows application transparent recovery. Lustre can work with a variety of high availability (HA) managers to allow automated failover and has no single point of failure (NSPF). Multiple mount protection (MMP) provides integrated protection from errors in highly-available systems that would otherwise cause file system corruption. Security: By default TCP connections are only allowed from privileged ports. Unix group membership is verified on the MDS. @@ -255,7 +256,8 @@ Metadata Server (MDS) - The MDS makes metadata stored in one or more MDTs available to Lustre clients. Each MDS manages the names and directories in the Lustre file system(s) and provides network request handling for one or more local MDTs. - Metadata Target (MDT ) - The MDT stores metadata (such as filenames, directories, permissions and file layout) on storage attached to an MDS. Each file system has one MDT. An MDT on a shared storage target can be available to multiple MDSs, although only one can access it at a time. If an active MDS fails, a standby MDS can serve the MDT and make it available to clients. This is referred to as MDS failover. + Metadata Target (MDT ) - For Lustre 2.3 and earlier, each filesystem has one MDT. The MDT stores metadata (such as filenames, directories, permissions and file layout) on storage attached to an MDS. Each file system has one MDT. An MDT on a shared storage target can be available to multiple MDSs, although only one can access it at a time. If an active MDS fails, a standby MDS can serve the MDT and make it available to clients. This is referred to as MDS failover. + Since Lustre 2.4, multiple MDTs are supported. Each filesystem has at least one MDT. 
An MDT on a shared storage target can be available to multiple MDSs, although only one MDS can export the MDT to the clients at one time. Two MDS machines share storage for two or more MDTs. After the failure of one MDS, the remaining MDS begins serving the MDT(s) of the failed MDS.
      Object Storage Servers (OSS): The OSS provides file I/O service and network request handling for one or more local OSTs. Typically, an OSS serves between 2 and 8 OSTs, up to 16 TB each. A typical configuration is an MDT on a dedicated node, two or more OSTs on each OSS node, and a client on each of a large number of compute nodes.
diff --git a/UpgradingLustre.xml b/UpgradingLustre.xml index aa5019a..f9f40ba 100644 --- a/UpgradingLustre.xml +++ b/UpgradingLustre.xml @@ -9,6 +9,9 @@ 
     Upgrading Lustre 1.8 to 2.x
+    Upgrading to multiple metadata targets
<indexterm><primary>Lustre</primary><secondary>upgrading</secondary><see>upgrading</see></indexterm> @@ -22,6 +25,7 @@ <note> <para>Lustre 2.x servers are compatible with clients 1.8.6 and later, though it is strongly recommended that the clients are upgraded to the latest version of Lustre 1.8 available. If you are planning a heterogeneous environment (mixed 1.8 and 2.x servers), make sure that version 1.8.6 or later is installed on the client nodes that are not upgraded to 2.x.</para> </note> + <warning condition='l24'><para>Lustre 2.4 allows remote sub-directories to be hosted on separate MDTs. Clients prior to 2.4 can only see the namespace hosted by MDT0, and will return an IO error if a directory on a remote MDT is accessed.</para></warning> </section> <section xml:id="dbdoclet.50438205_51369"> <title><indexterm><primary>upgrading</primary><secondary>1.8 to 2.x</secondary></indexterm>Upgrading Lustre 1.8 to 2.x @@ -120,4 +124,23 @@ lustre-ldiskfs-<ver> If you have a problem upgrading Lustre, use the wc-discuss mailing list, or file a ticket at the Intel Lustre bug tracker.
+
      <indexterm><primary>upgrading</primary><secondary>multiple metadata targets</secondary></indexterm>Upgrading to multiple metadata targets
      Lustre 2.4 allows separate metadata servers to serve separate sub-directories. To upgrade a file system to Lustre 2.4 so that it supports multiple metadata targets, complete the steps below; an illustrative example follows the list.
      
          Stop the MGT, MDT and OSTs, and upgrade the servers to Lustre 2.4.
        
          Format the new MDT according to .
        
          Mount all of the targets according to .
        
          After recovery is complete, clients will be connected to MDT0.
          Clients prior to 2.4 will only have the namespace provided by MDT0 visible, and will receive an IO error if a directory hosted on a remote MDT is accessed.
        
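      As an illustrative sketch only (the device, NID and file system names are hypothetical), adding a second metadata target to a freshly upgraded file system might look like:
mds2# mkfs.lustre --fsname=lustre --mdt --mgsnode=mgs@tcp0 --index=1 /dev/sdc
mds2# mount -t lustre /dev/sdc /mnt/mdt1
client$ lfs mkdir -i 1 /mnt/lustre/remote_dir1
      The lfs mkdir -i command places the new remote directory on the newly added MDT, as described earlier in this manual.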
diff --git a/UserUtilities.xml b/UserUtilities.xml index 5587608..cd2ba01 100644 --- a/UserUtilities.xml +++ b/UserUtilities.xml @@ -45,7 +45,7 @@ lfs getname [-h]|[path...] 
 lfs getstripe [--obd|-O <uuid>] [--quiet|-q] [--verbose|-v]
               [--count|-c] [--index|-i | --offset|-o] [--size|-s] [--pool|-p] [--directory|-d]
-              [--recursive|-r] [--raw|-R] <dirname|filename> ...
+              [--recursive|-r] [--raw|-R] [-M] <dirname|filename> ...
 lfs setstripe [--size|-s stripe_size] [--count|-c stripe_cnt]
               [--index|-i|--offset|-o start_ost_index] [--pool|-p <pool>]
@@ -308,6 +308,7 @@ lfs help
    Lists striping information for a given filename or directory. By default, the stripe count, stripe size and offset are returned. If you only want specific striping information, then the options --count, --size, --index or --offset, plus various combinations of these options, can be used to retrieve specific information. If the --raw option is specified, the stripe information is printed without substituting the filesystem's default values for unspecified fields. If the striping EA is not set, 0, 0, and -1 will be printed for the stripe count, size, and offset respectively.
+    The -M option prints the index of the MDT that is serving the given directory. See .
-- 1.8.3.1